Compare commits

1323 Commits

Author SHA1 Message Date
fe39c07826 [pipelining][doc] Remove duplicated words (#128368)
"for execution" is used in both step titles

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128368
Approved by: https://github.com/wconstab
ghstack dependencies: #128361
2024-06-11 04:52:57 +00:00
cba195c8ed Support aten operations with out tensor (#124926)
This PR intends to support the aten operations with the `out` tensor.

Currently, AOT compile does **not** keep input tensor mutations in the graph. According to the code comments, this is because no use case had required it so far:
> For now there's no use case involving keeping input mutations in the graph (which we can only do in the inference case anyway). We can add this later if we need to.

However, for aten operations it is common for the `out` tensor to be an input parameter that must be mutated. This PR supports that case by adding a `keep_inference_input_mutations` flag, exposed as `aot_inductor.keep_inference_input_mutations`. The flag gives the callee the flexibility to decide whether AOT compile should keep input tensor mutations in the graph.

Take `clamp` as an example:
```python
out_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(-2.0)
inp_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(1.0)
min_tensor = inp_tensor - 0.05
max_tensor = inp_tensor + 0.05
torch.clamp(input=inp_tensor, min=min_tensor, max=max_tensor, out=out_tensor)
```

Without this PR:
```python
def forward(self):
    arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";

    arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
    clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
    clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1);  clamp_min = arg2_1 = None
    return (clamp_max, clamp_max)
```

With this PR:
```python
def forward(self):
    arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";

    arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
    clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
    clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1);  clamp_min = arg2_1 = None
    copy_: "f32[128]" = torch.ops.aten.copy_.default(arg3_1, clamp_max);  arg3_1 = clamp_max = None
    return (copy_,)
```
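
The difference between the two graphs can be sketched in plain Python, with lists standing in for tensors. This is only an illustration of the semantics, not the code AOT compile generates; the function names `clamp_functional`, `graph_without_mutation`, and `graph_with_mutation` are hypothetical and exist only for this sketch.

```python
def clamp_functional(inp, lo, hi):
    """Pure clamp: builds a new list and leaves every argument untouched."""
    return [min(max(x, l), h) for x, l, h in zip(inp, lo, hi)]


def graph_without_mutation(inp, lo, hi, out):
    # Mirrors the first graph: `out` is accepted but never written to.
    return clamp_functional(inp, lo, hi)


def graph_with_mutation(inp, lo, hi, out):
    # Mirrors the second graph: the result is copied back into `out`,
    # which plays the role of the trailing copy_ node.
    out[:] = clamp_functional(inp, lo, hi)
    return out


inp = [1.0] * 4
lo = [x - 0.05 for x in inp]
hi = [x + 0.05 for x in inp]

out_a = [-2.0] * 4
graph_without_mutation(inp, lo, hi, out_a)   # out_a keeps its old contents

out_b = [-2.0] * 4
graph_with_mutation(inp, lo, hi, out_b)      # out_b now holds the clamped values
```

In this sketch the caller only observes the clamped values in its `out` buffer when the mutation is kept, which is the behavior an `out=` argument is expected to have.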

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124926
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/angelayi
2024-06-11 04:35:27 +00:00
16e67be7f1 Also preserve unbacked SymInts when partitioning as backward inputs (#128338)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128338
Approved by: https://github.com/IvanKobzarev
2024-06-11 04:27:09 +00:00
7afffdf48b [CI] Comment hf_T5_generate, hf_GPT2 and timm_efficientnet in inductor cpu smoketest for performance unstable issue (#127588)
Fixes #126993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127588
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/desertfire
2024-06-11 03:12:11 +00:00
ca45649eb5 [easy][dynamo][inline work] Fix test with inlining inbuilt nn modules (#128254)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128254
Approved by: https://github.com/williamwen42
ghstack dependencies: #128295, #126578, #128268
2024-06-11 03:02:51 +00:00
665e568381 [inductor][inlining nn module] Skip batchnorm version check test for inlining (#128268)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128268
Approved by: https://github.com/zou3519
ghstack dependencies: #128295, #126578
2024-06-11 03:02:51 +00:00
4077cdd589 [pipelining][doc] Update arg list of pipeline API (#128361)
And document the use of `build_stage` API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128361
Approved by: https://github.com/wconstab
2024-06-11 02:55:17 +00:00
cyy
e4bd0adca5 [6/N] Remove unused functions (#128309)
Follows #127185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128309
Approved by: https://github.com/ezyang
2024-06-11 02:46:33 +00:00
793df7b7cb Prevent expansion of cat indexing to avoid int64 intermediate (#127815)
Fix for https://github.com/pytorch/pytorch/issues/127652

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127815
Approved by: https://github.com/shunting314, https://github.com/peterbell10
2024-06-11 02:41:07 +00:00
d1d9bc7aa6 init add comment (#128083)
Fixes #127898

### Description

Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function

### Checklist
- [x] The issue that is being fixed is referred to in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128083
Approved by: https://github.com/titaiwangms
2024-06-11 02:37:04 +00:00
841d87177a Make sure #126704 is BC for torch.save-ed nn.Module (#128344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128344
Approved by: https://github.com/albanD
ghstack dependencies: #126906, #126704
2024-06-11 02:26:06 +00:00
3b555ba477 Add docstring for torch.utils.data.datapipes.decoder.basicandlers (#128018)
Fixes #127912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128018
Approved by: https://github.com/andrewkho
2024-06-11 01:32:45 +00:00
734e8f6ad7 [inductor] enable fx graph cache on torchbench (#128239)
Summary: We've already enabled the FX graph cache for timm and huggingface, but we had failures saving cache entries for moco. It looks like https://github.com/pytorch/pytorch/pull/128052 has fixed that issue, so we can enable the cache for torchbench.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128239
Approved by: https://github.com/oulgen
2024-06-11 00:40:31 +00:00
cyy
99f5a85a09 [Clang Tidy] Fix misc-header-include-cycle errors in clang-tidy and ignore some files (#127233)
clang-tidy detects header include cycles in both libfmt and PyTorch, for example:
```
/home/cyy/pytorch/third_party/fmt/include/fmt/format-inl.h:25:10: error: circular header file dependency detected while including 'format.h', please check the include path [misc-header-include-cycle,-warnings-as-errors]
   25 | #include "format.h"
      |          ^
/home/cyy/pytorch/third_party/fmt/include/fmt/format.h:4530:12: note: 'format-inl.h' included from here
 4530 | #  include "format-inl.h"
```
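
As a rough illustration of what an include-cycle check looks for, the sketch below runs a depth-first search over a hand-written include graph. It is not how clang-tidy implements `misc-header-include-cycle`; the function `find_include_cycle` and the dictionary layout are assumptions made for this example.

```python
def find_include_cycle(includes):
    """Return one cycle as a list of headers, or None if the graph is acyclic.

    `includes` maps a header name to the list of headers it includes.
    """
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited / on the stack / finished
    color = {header: WHITE for header in includes}
    stack = []

    def visit(header):
        color[header] = GREY
        stack.append(header)
        for dep in includes.get(header, ()):
            state = color.get(dep, WHITE)
            if state == GREY:
                # `dep` is already on the current path, so the path loops back.
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        color[header] = BLACK
        return None

    for header in list(includes):
        if color[header] == WHITE:
            cycle = visit(header)
            if cycle:
                return cycle
    return None


# The two-file cycle reported in the log above.
fmt_includes = {"format.h": ["format-inl.h"], "format-inl.h": ["format.h"]}
cycle = find_include_cycle(fmt_includes)
```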
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127233
Approved by: https://github.com/ezyang
2024-06-10 23:49:58 +00:00
f843ccbb1a [MTIA] Add set_device support (#128040)
Summary: Support set_device API in MTIA backend.

Reviewed By: gnahzg

Differential Revision: D58089498

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128040
Approved by: https://github.com/gnahzg
2024-06-10 23:42:52 +00:00
cyy
30875953a4 [1/N] Remove inclusion of c10/util/string_utils.h (#128300)
This is a first step toward removing the header.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128300
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-06-10 23:40:47 +00:00
cyy
2126ae186e Remove caffe2/perfkernels files (#128186)
These files are not used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128186
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-06-10 23:40:18 +00:00
739aa224ec [Fix] Parameter un/lifting issues in the TorchScript to ExportedProgram converter (#127975)
This PR fixes issues related to parameter and input lifting in the converter.

#### Issue 1
```
> Graph[linear.weights, bias.weights, x.1]
%1 ...
%2 ...
%3 = CreateObject()

    > Block 0[]
    %linear.0 = GetAttr(linear)[%3]

        > Block 0.0[]
        %weight.0 = GetAttr(weights)[%linear.0]

    > Block 1[]
    ...
```
* Model parameters for the top-level module should be unlifted, while parameters used in sub-blocks should be lifted.
#### Fixes
* Use a bottom-up traversal (i.e., start from the innermost block) to figure out which parameters need to be lifted for sub-blocks.
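
A bottom-up traversal of this kind can be sketched as a post-order walk over nested blocks. The `Block` class and `params_to_lift` function below are hypothetical stand-ins written for this illustration; they are not the converter's actual data structures or code.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical stand-in for a TorchScript block (not a PyTorch type)."""
    name: str
    params_used: set = field(default_factory=set)   # parameters read directly here
    children: list = field(default_factory=list)    # nested sub-blocks


def params_to_lift(block, result=None):
    """Post-order walk: handle children first, then merge their needs upward."""
    if result is None:
        result = {}
    needed = set(block.params_used)
    for child in block.children:
        params_to_lift(child, result)
        # A block must receive every parameter that any nested block needs.
        needed |= result[child.name]
    result[block.name] = needed
    return result


inner = Block("block0.0", {"linear.weight"})
outer = Block("block0", {"linear.bias"}, [inner])
other = Block("block1")
top = Block("graph", set(), [outer, other])

lifted = params_to_lift(top)
```

In this sketch, the sets recorded for sub-blocks are the parameters that would be passed in as extra block inputs, while the set for the top-level `graph` entry corresponds to parameters that stay unlifted on the module.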

#### Test Plan
* Add test cases for nested blocks without control flow: `pytest test/export/test_converter.py -s -k test_convert_nn_module_with_nested_param`
* Add test cases for nested blocks with control flow: `pytest test/export/test_converter.py -s -k test_convert_nn_module_with_nested_if_and_param`

#### Outcome
##### TorchScript
```
graph(%x.1 : Float(3, strides=[1], requires_grad=0, device=cpu),
      %m1.m1.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %m1.m1.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu),
      %m1.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %m1.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu),
      %m1.m2.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %m1.m2.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu),
      %linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu),
      %m2.m1.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %m2.m1.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu),
      %m2.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %m2.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu),
      %m2.m2.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %m2.m2.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu)):
  %15 : __torch__.export.test_converter.___torch_mangle_14.SuperNestedM1 = prim::CreateObject()
  %16 : NoneType = prim::Constant(), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
  %17 : int = prim::Constant[value=1](), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:34
  %18 : Tensor = aten::max(%x.1), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:19
  %19 : Tensor = aten::gt(%18, %17), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:19
  %20 : bool = aten::Bool(%19), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:19
  %21 : Tensor = prim::If(%20), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:16
    block0():
      %linear.6 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%15), scope: export.test_converter.SuperNestedM1::
      %m1.1 : __torch__.export.test_converter.___torch_mangle_15.NestedM = prim::GetAttr[name="m1"](%15), scope: export.test_converter.SuperNestedM1::
      %24 : Tensor = aten::sum(%x.1, %16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19
      %25 : Tensor = aten::gt(%24, %17), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19
      %26 : bool = aten::Bool(%25), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19
      %27 : Tensor = prim::If(%26), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:16
        block0():
          %linear.10 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %m1.3 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m1"](%m1.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %linear.12 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %weight.4 : Tensor = prim::GetAttr[name="weight"](%linear.12), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %bias.4 : Tensor = prim::GetAttr[name="bias"](%linear.12), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %33 : Tensor = aten::linear(%x.1, %weight.4, %bias.4), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          %weight.6 : Tensor = prim::GetAttr[name="weight"](%linear.10), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %bias.6 : Tensor = prim::GetAttr[name="bias"](%linear.10), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %36 : Tensor = aten::linear(%33, %weight.6, %bias.6), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          -> (%36)
        block1():
          %linear.14 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %m2.3 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m2"](%m1.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %linear.16 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %weight.8 : Tensor = prim::GetAttr[name="weight"](%linear.16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %bias.8 : Tensor = prim::GetAttr[name="bias"](%linear.16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %42 : Tensor = aten::linear(%x.1, %weight.8, %bias.8), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          %weight.2 : Tensor = prim::GetAttr[name="weight"](%linear.14), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %bias.2 : Tensor = prim::GetAttr[name="bias"](%linear.14), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %45 : Tensor = aten::linear(%42, %weight.2, %bias.2), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          -> (%45)
      %weight.10 : Tensor = prim::GetAttr[name="weight"](%linear.6), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear
      %bias.10 : Tensor = prim::GetAttr[name="bias"](%linear.6), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear
      %48 : Tensor = aten::linear(%27, %weight.10, %bias.10), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
      -> (%48)
    block1():
      %linear.8 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%15), scope: export.test_converter.SuperNestedM1::
      %m2.1 : __torch__.export.test_converter.___torch_mangle_15.NestedM = prim::GetAttr[name="m2"](%15), scope: export.test_converter.SuperNestedM1::
      %51 : Tensor = aten::sum(%x.1, %16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19
      %52 : Tensor = aten::gt(%51, %17), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19
      %53 : bool = aten::Bool(%52), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19
      %54 : Tensor = prim::If(%53), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:16
        block0():
          %linear.1 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %m1 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m1"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %linear.5 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %weight.1 : Tensor = prim::GetAttr[name="weight"](%linear.5), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %bias.1 : Tensor = prim::GetAttr[name="bias"](%linear.5), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %60 : Tensor = aten::linear(%x.1, %weight.1, %bias.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          %weight.3 : Tensor = prim::GetAttr[name="weight"](%linear.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %bias.3 : Tensor = prim::GetAttr[name="bias"](%linear.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %63 : Tensor = aten::linear(%60, %weight.3, %bias.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          -> (%63)
        block1():
          %linear.3 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %m2 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m2"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %linear : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %weight.5 : Tensor = prim::GetAttr[name="weight"](%linear), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %bias.5 : Tensor = prim::GetAttr[name="bias"](%linear), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %69 : Tensor = aten::linear(%x.1, %weight.5, %bias.5), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          %weight.12 : Tensor = prim::GetAttr[name="weight"](%linear.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %bias.12 : Tensor = prim::GetAttr[name="bias"](%linear.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %72 : Tensor = aten::linear(%69, %weight.12, %bias.12), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          -> (%72)
      %weight : Tensor = prim::GetAttr[name="weight"](%linear.8), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear
      %bias : Tensor = prim::GetAttr[name="bias"](%linear.8), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear
      %75 : Tensor = aten::linear(%54, %weight, %bias), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
      -> (%75)
  return (%21)
```
##### ExportedProgram
```
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, p_linear_weight: "f32[3, 3]", p_linear_bias: "f32[3]", p_m1_linear_weight: "f32[3, 3]", p_m1_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m1_m1_linear_bias: "f32[3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]", p_m2_linear_weight: "f32[3, 3]", p_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]", p_m2_m2_linear_bias: "f32[3]", x_1: "f32[3]"):
            # No stacktrace found for following nodes
            max_1: "f32[]" = torch.ops.aten.max.default(x_1)
            gt: "b8[]" = torch.ops.aten.gt.Scalar(max_1, 1);  max_1 = None

            # File: <eval_with_key>.137:23 in forward, code: cond = torch.ops.higher_order.cond(l_args_0_, cond_true_2, cond_false_2, [l_args_3_0_, l_args_3_13_, l_args_3_5_, l_args_3_12_, l_args_3_14_, l_args_3_1_, l_args_3_3_, l_args_3_4_, l_args_3_7_, l_args_3_10_, l_args_3_11_, l_args_3_2_, l_args_3_6_, l_args_3_8_, l_args_3_9_]);  l_args_0_ = cond_true_2 = cond_false_2 = l_args_3_0_ = l_args_3_13_ = l_args_3_5_ = l_args_3_12_ = l_args_3_14_ = l_args_3_1_ = l_args_3_3_ = l_args_3_4_ = l_args_3_7_ = l_args_3_10_ = l_args_3_11_ = l_args_3_2_ = l_args_3_6_ = l_args_3_8_ = l_args_3_9_ = None
            true_graph_0 = self.true_graph_0
            false_graph_0 = self.false_graph_0
            conditional = torch.ops.higher_order.cond(gt, true_graph_0, false_graph_0, [p_linear_weight, p_linear_bias, x_1, p_m1_linear_weight, p_m1_m1_linear_bias, p_m1_linear_bias, p_m1_m2_linear_weight, p_m1_m2_linear_bias, p_m1_m1_linear_weight, p_m2_m2_linear_bias, p_m2_m1_linear_weight, p_m2_linear_weight, p_m2_m1_linear_bias, p_m2_m2_linear_weight, p_m2_linear_bias]);  gt = true_graph_0 = false_graph_0 = p_linear_weight = p_linear_bias = x_1 = p_m1_linear_weight = p_m1_m1_linear_bias = p_m1_linear_bias = p_m1_m2_linear_weight = p_m1_m2_linear_bias = p_m1_m1_linear_weight = p_m2_m2_linear_bias = p_m2_m1_linear_weight = p_m2_linear_weight = p_m2_m1_linear_bias = p_m2_m2_linear_weight = p_m2_linear_bias = None
            getitem: "f32[3]" = conditional[0];  conditional = None
            return (getitem,)

        class <lambda>(torch.nn.Module):
            def forward(self, p_linear_weight: "f32[3, 3]", p_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_linear_weight: "f32[3, 3]", p_m1_m1_linear_bias: "f32[3]", p_m1_linear_bias: "f32[3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]", p_m2_linear_bias: "f32[3]"):
                # File: <eval_with_key>.134:8 in forward, code: sum_default = torch.ops.aten.sum.default(l_args_3_5__1, dtype = None)
                sum_1: "f32[]" = torch.ops.aten.sum.default(x_1)

                # File: <eval_with_key>.134:9 in forward, code: gt_scalar = torch.ops.aten.gt.Scalar(sum_default, 1);  sum_default = None
                gt: "b8[]" = torch.ops.aten.gt.Scalar(sum_1, 1);  sum_1 = None

                # File: <eval_with_key>.134:12 in forward, code: cond = torch.ops.higher_order.cond(gt_scalar, cond_true_0, cond_false_0, [l_args_3_12__true_branch, l_args_3_1__true_branch, l_args_3_5__1, l_args_3_14__true_branch, l_args_3_7__true_branch, l_args_3_3__true_branch, l_args_3_4__true_branch]);  gt_scalar = cond_true_0 = cond_false_0 = l_args_3_12__true_branch = l_args_3_1__true_branch = l_args_3_5__1 = l_args_3_14__true_branch = l_args_3_7__true_branch = l_args_3_3__true_branch = l_args_3_4__true_branch = None
                true_graph_0 = self.true_graph_0
                false_graph_0 = self.false_graph_0
                conditional = torch.ops.higher_order.cond(gt, true_graph_0, false_graph_0, [p_m1_linear_weight, p_m1_linear_bias, x_1, p_m1_m1_linear_bias, p_m1_m1_linear_weight, p_m1_m2_linear_weight, p_m1_m2_linear_bias]);  gt = true_graph_0 = false_graph_0 = p_m1_linear_weight = p_m1_linear_bias = x_1 = p_m1_m1_linear_bias = p_m1_m1_linear_weight = p_m1_m2_linear_weight = p_m1_m2_linear_bias = None
                getitem: "f32[3]" = conditional[0];  conditional = None

                # File: <eval_with_key>.134:14 in forward, code: linear_default = torch.ops.aten.linear.default(getitem, l_args_3_0__1, l_args_3_13__1);  getitem = l_args_3_0__1 = l_args_3_13__1 = None
                linear: "f32[3]" = torch.ops.aten.linear.default(getitem, p_linear_weight, p_linear_bias);  getitem = p_linear_weight = p_linear_bias = None
                return (linear,)

            class <lambda>(torch.nn.Module):
                def forward(self, p_m1_linear_weight: "f32[3, 3]", p_m1_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_m1_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]"):
                    # File: <eval_with_key>.130:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_7__true_branch, l_args_3_14__true_branch);  l_args_3_5__1 = l_args_3_7__true_branch = l_args_3_14__true_branch = None
                    linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m1_m1_linear_weight, p_m1_m1_linear_bias);  x_1 = p_m1_m1_linear_weight = p_m1_m1_linear_bias = None

                    # File: <eval_with_key>.130:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_12__1, l_args_3_1__1);  linear_default = l_args_3_12__1 = l_args_3_1__1 = None
                    linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, p_m1_linear_weight, p_m1_linear_bias);  linear = p_m1_linear_weight = p_m1_linear_bias = None
                    return (linear_1,)

            class <lambda>(torch.nn.Module):
                def forward(self, p_m1_linear_weight: "f32[3, 3]", p_m1_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_m1_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]"):
                    # File: <eval_with_key>.131:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_3__false_branch, l_args_3_4__false_branch);  l_args_3_5__1 = l_args_3_3__false_branch = l_args_3_4__false_branch = None
                    linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m1_m2_linear_weight, p_m1_m2_linear_bias);  x_1 = p_m1_m2_linear_weight = p_m1_m2_linear_bias = None

                    # File: <eval_with_key>.131:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_12__1, l_args_3_1__1);  linear_default = l_args_3_12__1 = l_args_3_1__1 = None
                    linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, p_m1_linear_weight, p_m1_linear_bias);  linear = p_m1_linear_weight = p_m1_linear_bias = None
                    return (linear_1,)

        class <lambda>(torch.nn.Module):
            def forward(self, p_linear_weight: "f32[3, 3]", p_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_linear_weight: "f32[3, 3]", p_m1_m1_linear_bias: "f32[3]", p_m1_linear_bias: "f32[3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]", p_m2_linear_bias: "f32[3]"):
                # File: <eval_with_key>.135:8 in forward, code: sum_default = torch.ops.aten.sum.default(l_args_3_5__1, dtype = None)
                sum_1: "f32[]" = torch.ops.aten.sum.default(x_1)

                # File: <eval_with_key>.135:9 in forward, code: gt_scalar = torch.ops.aten.gt.Scalar(sum_default, 1);  sum_default = None
                gt: "b8[]" = torch.ops.aten.gt.Scalar(sum_1, 1);  sum_1 = None

                # File: <eval_with_key>.135:12 in forward, code: cond = torch.ops.higher_order.cond(gt_scalar, cond_true_1, cond_false_1, [l_args_3_2__false_branch, l_args_3_5__1, l_args_3_9__false_branch, l_args_3_11__false_branch, l_args_3_6__false_branch, l_args_3_10__false_branch, l_args_3_8__false_branch]);  gt_scalar = cond_true_1 = cond_false_1 = l_args_3_2__false_branch = l_args_3_5__1 = l_args_3_9__false_branch = l_args_3_11__false_branch = l_args_3_6__false_branch = l_args_3_10__false_branch = l_args_3_8__false_branch = None
                true_graph_0 = self.true_graph_0
                false_graph_0 = self.false_graph_0
                conditional = torch.ops.higher_order.cond(gt, true_graph_0, false_graph_0, [p_m2_linear_weight, x_1, p_m2_linear_bias, p_m2_m1_linear_weight, p_m2_m1_linear_bias, p_m2_m2_linear_bias, p_m2_m2_linear_weight]);  gt = true_graph_0 = false_graph_0 = p_m2_linear_weight = x_1 = p_m2_linear_bias = p_m2_m1_linear_weight = p_m2_m1_linear_bias = p_m2_m2_linear_bias = p_m2_m2_linear_weight = None
                getitem: "f32[3]" = conditional[0];  conditional = None

                # File: <eval_with_key>.135:14 in forward, code: linear_default = torch.ops.aten.linear.default(getitem, l_args_3_0__1, l_args_3_13__1);  getitem = l_args_3_0__1 = l_args_3_13__1 = None
                linear: "f32[3]" = torch.ops.aten.linear.default(getitem, p_linear_weight, p_linear_bias);  getitem = p_linear_weight = p_linear_bias = None
                return (linear,)

            class <lambda>(torch.nn.Module):
                def forward(self, p_m2_linear_weight: "f32[3, 3]", x_1: "f32[3]", p_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]"):
                    # File: <eval_with_key>.132:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_11__true_branch, l_args_3_6__true_branch);  l_args_3_5__1 = l_args_3_11__true_branch = l_args_3_6__true_branch = None
                    linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m2_m1_linear_weight, p_m2_m1_linear_bias);  x_1 = p_m2_m1_linear_weight = p_m2_m1_linear_bias = None

                    # File: <eval_with_key>.132:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_2__1, l_args_3_9__1);  linear_default = l_args_3_2__1 = l_args_3_9__1 = None
                    linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, p_m2_linear_weight, p_m2_linear_bias);  linear = p_m2_linear_weight = p_m2_linear_bias = None
                    return (linear_1,)

            class <lambda>(torch.nn.Module):
                def forward(self, p_m2_linear_weight: "f32[3, 3]", x_1: "f32[3]", p_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]"):
                    # File: <eval_with_key>.133:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_8__false_branch, l_args_3_10__false_branch);  l_args_3_5__1 = l_args_3_8__false_branch = l_args_3_10__false_branch = None
                    linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m2_m2_linear_weight, p_m2_m2_linear_bias);  x_1 = p_m2_m2_linear_weight = p_m2_m2_linear_bias = None

                    # File: <eval_with_key>.133:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_2__1, l_args_3_9__1);  linear_default = l_args_3_2__1 = l_args_3_9__1 = None
                    linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, p_m2_linear_weight, p_m2_linear_bias);  linear = p_m2_linear_weight = p_m2_linear_bias = None
                    return (linear_1,)

Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_linear_weight'), target='linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_linear_bias'), target='linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_linear_weight'), target='m1.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_linear_bias'), target='m1.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m1_linear_weight'), target='m1.m1.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m1_linear_bias'), target='m1.m1.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m2_linear_weight'), target='m1.m2.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m2_linear_bias'), target='m1.m2.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_linear_weight'), target='m2.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_linear_bias'), target='m2.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m1_linear_weight'), target='m2.m1.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m1_linear_bias'), target='m2.m1.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m2_linear_weight'), target='m2.m2.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m2_linear_bias'), target='m2.m2.linear.bias', persistent=None), InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x_1'), 
target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='getitem'), target=None)])
Range constraints: {}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127975
Approved by: https://github.com/angelayi, https://github.com/ydwu4
2024-06-10 23:24:16 +00:00
b2d602306a [RELAND][dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578)
Tracing through `__init__` is important because it initializes members (calls STORE_ATTR on them). By doing that, we kick in the mutation tracking for these objects, so things like mutations of `_modules` are tracked automatically.
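
To illustrate the point about STORE_ATTR: attribute assignments inside `__init__` compile to `STORE_ATTR` bytecode, which is what a bytecode-level tracer gets to observe. The following is a minimal stdlib-only sketch; `TinyModule` is a hypothetical class, not an `nn.Module`, and nothing here uses dynamo itself.

```python
import dis


class TinyModule:
    """Hypothetical stand-in for a module; not torch.nn.Module."""

    def __init__(self):
        self.weight = 1.0  # compiles to STORE_ATTR weight
        self.child = None  # compiles to STORE_ATTR child


# Collect the attribute names written by STORE_ATTR inside __init__.
stored = [
    ins.argval
    for ins in dis.get_instructions(TinyModule.__init__)
    if ins.opname == "STORE_ATTR"
]
print(stored)
```

Each listed name corresponds to one member initialization that a tracer stepping through `__init__` would see as a mutation of `self`.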

Fixes https://github.com/pytorch/pytorch/issues/111837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126578
Approved by: https://github.com/jansel
ghstack dependencies: #128295
2024-06-10 23:11:04 +00:00
05711eece9 [dynamo][inlining inbuilt modules] Ensure BC for nn_module_stack (#128295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128295
Approved by: https://github.com/ydwu4
2024-06-10 23:11:04 +00:00
a287ff75d0 Use init_torchbind_implementations in inductor torchbind tests. (#128341)
Summary: To unify how we load the torch bind libraries for testing.

Test Plan: Existing tests.

Differential Revision: D58372372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128341
Approved by: https://github.com/angelayi
2024-06-10 23:02:48 +00:00
4bbadeee8a Revert "Set simdlen based on ATEN_CPU_CAPABILITY (#123514)"
This reverts commit b66e3f0957b96b058c9b632ca60833d9717a9d8a.

Reverted https://github.com/pytorch/pytorch/pull/123514 on behalf of https://github.com/clee2000 due to broke test/inductor/test_torchinductor.py::CpuTests::test_new_cpp_build_logical_cpu on periodic test on the no gpu tests b66e3f0957 https://github.com/pytorch/pytorch/actions/runs/9453518547/job/26040077301 ([comment](https://github.com/pytorch/pytorch/pull/123514#issuecomment-2159433432))
2024-06-10 22:46:01 +00:00
2176ef7dfa [compiled autograd] support .backward(inputs=) (#128252)
Autograd already marks nodes as needed or not before calling compiled autograd, so our worklist already skips nodes not specified in the `inputs` kwarg.

For the .backward(inputs=) case, I'm keeping the grads as outputs, just like for .grad(inputs=). This is so we still guard on graph_output when we collect the nodes. This does not get DCE'd right now, and is ignored in the post graph bytecode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128252
Approved by: https://github.com/jansel
2024-06-10 22:20:51 +00:00
583a56d5a8 DOC: add docstring to construct_and_record_rdzv_event() (#128189)
Fixes #127902

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128189
Approved by: https://github.com/kurman
2024-06-10 22:17:33 +00:00
c38b3381a1 Make nn.Module state_dict load_state_dict pre-hook and state_dict post hook public (#126704)
Fixes https://github.com/pytorch/pytorch/issues/75287 and https://github.com/pytorch/pytorch/issues/117437

- `nn.Module._register_state_dict_hook` --> add public `nn.Module.register_state_dict_post_hook`
   - Add a test as this API was previously untested
- `nn.Module._register_load_state_dict_pre_hook` --> add public `nn.Module.register_load_state_dict_pre_hook` (remove the `with_module` flag, default it to `True`)
    ~- For consistency with optimizer `load_state_dict_pre_hook` raised by @janeyx99, allow the pre-hook to return a new `state_dict`~
 - Document issue pointed out by https://github.com/pytorch/pytorch/issues/117437 regarding `_register_state_dict_hook` semantic of returning a new state_dict only being respected for the root for private hook
       - Remove this for the public `register_state_dict_post_hook`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126704
Approved by: https://github.com/albanD
ghstack dependencies: #126906
2024-06-10 21:50:17 +00:00
a2d4fea872 [easy] Move state_dict hooks tests to test_module_hooks and decorate tests that call load_state_dict with swap (#126906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126906
Approved by: https://github.com/albanD
2024-06-10 21:50:17 +00:00
58083ffb10 Improve unbacked reasoning involving has internal overlap (#128332)
Fixes https://github.com/pytorch/pytorch/issues/122477
Partially addresses https://github.com/pytorch/pytorch/issues/116336

This PR is slightly overkill: not only does it disable the overlap test
when there are unbacked SymInts, it also improves the is non-overlapping
and dense test for some more unbacked situations.  We technically don't
need the latter change, but I was already deep in the sauce and just
went ahead and did it.
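
For readers unfamiliar with the overlap test, here is a rough sketch of a conservative internal-overlap check on concrete sizes and strides. It is loosely modeled on the usual three-way answer (no / yes / too hard) and is an assumption-laden simplification, not the ATen code: the function name, the string return values, and the contiguity-only fast path are all illustrative.

```python
def has_internal_overlap(sizes, strides):
    """Return "no", "yes", or "too_hard" for a strided layout (toy version)."""
    # Fast path: a row-major contiguous layout never aliases itself.
    expected = 1
    contiguous = True
    for size, stride in zip(reversed(sizes), reversed(strides)):
        if size != 1 and stride != expected:
            contiguous = False
        expected *= size
    if contiguous:
        return "no"
    # A zero stride on a dimension with more than one element means
    # several indices map to the same memory location.
    for size, stride in zip(sizes, strides):
        if size > 1 and stride == 0:
            return "yes"
    # Anything else would need a real analysis; stay conservative.
    return "too_hard"


print(has_internal_overlap((2, 3), (3, 1)))
print(has_internal_overlap((2, 3), (0, 1)))
print(has_internal_overlap((2, 2), (1, 1)))
```

Note that every branch compares sizes or strides against constants. With unbacked sizes those comparisons cannot be decided without a guard, which is presumably why the test is awkward in that setting.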

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128332
Approved by: https://github.com/lezcano
2024-06-10 21:49:38 +00:00
6630dcd53c Add docstring for the torch.serialization.default_restore_location function (#128132)
Fixes: #127887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128132
Approved by: https://github.com/mikaylagawarecki
2024-06-10 21:33:56 +00:00
3a2d0755a4 enable test_ParameterList with dynamo if nn module inlining enabled only (#128308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128308
Approved by: https://github.com/anijain2305
2024-06-10 21:25:40 +00:00
b459713ca7 [aota] compiled forward outputs requires_grad alignment with eager (#128016)
Original issue: https://github.com/pytorch/pytorch/issues/114338

We assume only two possible mutually exclusive scenarios:

1. Running compiled region for training (Any of inputs has requires_grad)
	- Produced differentiable outputs should have requires_grad.

2. Running compiled region for inference (None of inputs has requires_grad)
	- All outputs do not have requires_grad.

Even if the user runs the region under no_grad(), an input Tensor with requires_grad puts us in the training scenario (1).

With current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only that any of inputs requires_grad
2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under `with torch.enable_grad()`
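
The decision rule above can be sketched in plain Python. The function names and the boolean-list representation of inputs are hypothetical; this is only a model of the rule as stated, not the AOTAutograd code.

```python
def needs_autograd(inputs_require_grad):
    """True if any input requires grad; the global grad mode is not consulted."""
    return any(inputs_require_grad)


def pick_scenario(inputs_require_grad, grad_mode_enabled):
    # grad_mode_enabled is deliberately ignored: per the rule above, running
    # under no_grad() does not move us out of the training scenario.
    if needs_autograd(inputs_require_grad):
        return "training"   # scenario 1: trace the joint graph
    return "inference"      # scenario 2: no output requires grad


print(pick_scenario([False, True], grad_mode_enabled=False))
print(pick_scenario([False, False], grad_mode_enabled=True))
```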

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128016
Approved by: https://github.com/bdhirsh
2024-06-10 20:51:22 +00:00
4460e481bc Disable jacrev/jacfwd/hessian if compiling with dynamo (#128255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128255
Approved by: https://github.com/zou3519
2024-06-10 20:47:53 +00:00
90bb510ece Revert "Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)"
This reverts commit 348b181a97abc2e636a6c18e5880a78e5d1dab94.

Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/clee2000 due to sorry I think https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456 is still relevant, I will reach out to them to see what needs to be done in internal to get this remerged ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2159248859))
2024-06-10 20:44:42 +00:00
38e0a0440c [AMD] Default to hipblaslt in gemm (#127944)
Summary: It has been a constant pain that we have to specify env var to go with the hipblaslt path. The default path is very slow on MI300. Therefore, let's default to hipblaslt.

Differential Revision: D58150764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127944
Approved by: https://github.com/aaronenyeshi, https://github.com/houseroad
2024-06-10 19:55:21 +00:00
946f554c8f Flip default value for mypy disallow_untyped_defs [10+1/11] (#128293)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128293
Approved by: https://github.com/oulgen
2024-06-10 19:32:44 +00:00
55646554b7 [EZ] Fix typos in SECURITY.md (#128340)
permisisons -> permissions
lates -> latest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128340
Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/kit1980
2024-06-10 19:21:39 +00:00
9cab5987bd Introduce int_oo (#127693)
In a previous life, we used sympy.oo to represent the lower/upper bounds of integer ranges. Later, we changed this to be sys.maxsize - 1 for a few reasons: (1) sometimes we do tests on a value being exactly sys.maxsize, and we wanted to avoid a data dependent guard in this case, (2) sympy.oo corresponds to floating point infinity, so you get incorrect types for value ranges with oo, and (3) you can do slightly better reasoning if you assume that input sizes fall within representable 64-bit integer range.

After working in the sys.maxsize regime for a bit, I've concluded that this was actually a bad idea. Specifically, the problem is that you end up with sys.maxsize in your upper bound, and then whenever you do any sort of size-increasing computation like size * 2, you end up with 2 * sys.maxsize, and you end up doing a ton of arbitrary precision int computation that is totally unnecessary. A symbolic bound is better.

But especially after #126905, we can't go back to using sympy.oo, because that advertises that it's not an integer, and now your ValueRanges is typed incorrectly. So what do we do? We define a new numeric constant `int_oo`, which is like `sympy.oo` but it advertises `is_integer`. **test/test_sympy_utils.py** describes some basic properties of the number, and **torch/utils/_sympy/numbers.py** has the actual implementation.
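
To make the motivation concrete, here is a small stdlib-only sketch. `ToyIntInfinity` is a hypothetical stand-in written for this illustration; it is not the implementation in torch/utils/_sympy/numbers.py and supports only the handful of operations shown.

```python
import math
import sys
from functools import total_ordering


@total_ordering
class ToyIntInfinity:
    """Hypothetical integer-typed +infinity; not torch's int_oo."""

    is_integer = True

    def __eq__(self, other):
        return isinstance(other, ToyIntInfinity)

    def __gt__(self, other):
        # Larger than every finite number.
        return not isinstance(other, ToyIntInfinity)

    def __hash__(self):
        return hash("ToyIntInfinity")

    def __mul__(self, factor):
        # Scaling by a positive int stays symbolic instead of growing a bignum.
        if isinstance(factor, int) and factor > 0:
            return self
        return NotImplemented

    __rmul__ = __mul__


TOY_OO = ToyIntInfinity()

# With a concrete bound, size-increasing math leaves the machine-word range.
doubled = sys.maxsize * 2
print(doubled > sys.maxsize)

# A floating point infinity is not an integer, so it mistypes an int range.
print(isinstance(math.inf, int))

# A symbolic integer infinity absorbs the scaling and still compares sanely.
print(TOY_OO * 2 is TOY_OO, sys.maxsize < TOY_OO)
```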

The rest of the changes of the PR are working out the implications of this change. I'll give more commentary as inline comments.

Fixes https://github.com/pytorch/pytorch/issues/127396

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127693
Approved by: https://github.com/lezcano
ghstack dependencies: #126905
2024-06-10 19:09:53 +00:00
db2fa7b827 Revert "[export] FIx unflattener for preserving modules containing unused inputs (#128260)"
This reverts commit 093a4ff5f859ccbbd8ba62dd189f76e5faadfb04.

Reverted https://github.com/pytorch/pytorch/pull/128260 on behalf of https://github.com/angelayi due to breaking windows test ([comment](https://github.com/pytorch/pytorch/pull/128260#issuecomment-2159050726))
2024-06-10 18:42:33 +00:00
093a4ff5f8 [export] FIx unflattener for preserving modules containing unused inputs (#128260)
Currently the unflattener fails if the module whose signature it is preserving contains unused inputs/outputs.

This also fixes unflattener issues in D57829276.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128260
Approved by: https://github.com/pianpwk
2024-06-10 18:39:33 +00:00
fa8ec8e718 [dynamo] handle hashable exceptions in trace_rules lookup (#128078)
Summary: Found during user empathy day when attempting to hash a fractions.Fraction object before it was fully constructed. See https://github.com/pytorch/pytorch/issues/128075
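
The failure mode can be reproduced with the standard library alone: a `Fraction` whose slots have not been filled in yet raises when hashed, so a dictionary lookup keyed on it raises too. The helper `safe_lookup` below is a hypothetical name for this sketch and is not the function changed in trace_rules.

```python
from fractions import Fraction


def safe_lookup(table, key, default=None):
    """Dictionary lookup that treats a failing __hash__ as 'not found'."""
    try:
        return table.get(key, default)
    except Exception:
        # hash(key) can raise for objects that are only partially constructed.
        return default


# Allocate a Fraction without running Fraction.__new__, so its
# _numerator/_denominator slots are still unset.
half_built = object.__new__(Fraction)

rules = {Fraction(1, 2): "known"}
print(safe_lookup(rules, Fraction(1, 2)))
print(safe_lookup(rules, half_built, default="fallback"))
```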

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128078
Approved by: https://github.com/anijain2305
2024-06-10 18:23:22 +00:00
136bdb96cb Update Kineto submodule with fix to test_basic_chrome_trace (#128333)
Summary: We've updated the sort_index in Kineto chrome traces to support device ids up to 16 devices. This should make chrome trace rows be ordered in the same way as CUDA. We need to update the unit test as well.

Test Plan:
Ran the changed test locally:
```
$ buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:test_profiler_cuda -- --exact 'caffe2/test:test_profiler_cuda - test_basic_chrome_trace (profiler.test_profiler.TestProfiler)'
File changed: fbcode//caffe2/third_party/kineto.submodule.txt
Buck UI: https://www.internalfb.com/buck2/f4fd1e9a-99f1-4422-aeed-b54903c64146
Test UI: https://www.internalfb.com/intern/testinfra/testrun/16888498639845776
Network: Up: 5.4KiB  Down: 8.6KiB  (reSessionID-0329120e-7fa2-4bc0-b539-7e58058f8fce)
Jobs completed: 6. Time elapsed: 1:01.2s.
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D58362964

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128333
Approved by: https://github.com/Skylion007
2024-06-10 18:12:34 +00:00
83941482f7 Add docstring for the torch.distributed.elastic.utils.distributed.get_free_port function (#128133)
Fixes: #127914

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128133
Approved by: https://github.com/H-Huang
2024-06-10 18:10:58 +00:00
08d038f8a8 [PT2] Fix a typo and lint problem (#128258)
Summary: Titled

Test Plan: see signal

Differential Revision: D58310169

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128258
Approved by: https://github.com/dshi7, https://github.com/Yuzhen11
2024-06-10 18:03:40 +00:00
46948300a2 [c10d] integrate PMI NCCL initialization to NCCL-PG (#128243)
Summary: Move broadcastUniqueID check to NCCLUtils

Differential Revision: D58273755

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128243
Approved by: https://github.com/wconstab
2024-06-10 17:20:03 +00:00
ab3a0b192a [RFC] add per-collective timeout value in flight recorder (#128190)
Summary:
Add timeout value field on every collected record.

Test Plan:
Unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128190
Approved by: https://github.com/wconstab
2024-06-10 17:12:57 +00:00
8e482e909b Add some guard to size oblivious has_internal_overlap (#128328)
This doesn't actually help on
https://github.com/pytorch/pytorch/issues/122477 but I noticed this
modest improvement so sure, why not.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128328
Approved by: https://github.com/Skylion007
2024-06-10 17:11:26 +00:00
7b9c5e0e3f Turn on GraphTransformObserver for inductor (#127962)
The FX graphs for some PT2 models are very complicated, and Inductor usually goes through many graph optimization passes to generate the final FX graph. It's very difficult to see the change made in each pass and to check whether the optimized graph is correct and optimal.

GraphTransformObserver is an observer that listens to all add/erase node events on a GraphModule during a graph transform pass and saves the changed nodes. When the pass is done, if there was any change in the graph, GraphTransformObserver saves SVG files of the input graph and the output graph for that pass.
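
The observer idea can be sketched without torch. Everything below (`ToyGraph`, `ToyTransformObserver`, the callback names) is hypothetical and only mirrors the shape of the description: record created and erased nodes during a pass, then report whether anything changed.

```python
class ToyGraph:
    """Hypothetical miniature graph that reports node changes to observers."""

    def __init__(self):
        self.nodes = []
        self.observers = []

    def add_node(self, name):
        self.nodes.append(name)
        for obs in self.observers:
            obs.on_node_created(name)

    def erase_node(self, name):
        self.nodes.remove(name)
        for obs in self.observers:
            obs.on_node_erased(name)


class ToyTransformObserver:
    """Records add/erase events that happen while it is attached."""

    def __init__(self, graph, pass_name):
        self.pass_name = pass_name
        self.created = []
        self.erased = []
        graph.observers.append(self)

    def on_node_created(self, name):
        self.created.append(name)

    def on_node_erased(self, name):
        self.erased.append(name)

    def changed(self):
        return bool(self.created or self.erased)


graph = ToyGraph()
graph.add_node("mul")  # happens before observation starts, so not recorded
observer = ToyTransformObserver(graph, "fuse_mul_add")
graph.add_node("fused_mul_add")
graph.erase_node("mul")
print(observer.pass_name, observer.created, observer.erased, observer.changed())
```

In this toy version, a pass that leaves `changed()` false would be the case where no before/after files need to be written.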

This PR is to enable GraphTransformObserver for inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127962
Approved by: https://github.com/jansel
2024-06-10 16:49:02 +00:00
ca561d639b Revert "Fix 'get_attr' call in dynamo 'run_node' (#127696)"
This reverts commit b741819b0580204e6a6b60c62ce44dacaf7787c8.

Reverted https://github.com/pytorch/pytorch/pull/127696 on behalf of https://github.com/clee2000 due to broke (executorch?) internal tests D58295865 ([comment](https://github.com/pytorch/pytorch/pull/127696#issuecomment-2158820093))
2024-06-10 16:29:20 +00:00
d22287d1ad Revert "Fix 'get_real_value' on placeholder nodes (#127698)"
This reverts commit 19b31d899a78a6806314bcc73b88172dabf0c26e.

Reverted https://github.com/pytorch/pytorch/pull/127698 on behalf of https://github.com/clee2000 due to broke (executorch?) internal tests D58295865 ([comment](https://github.com/pytorch/pytorch/pull/127696#issuecomment-2158820093))
2024-06-10 16:29:20 +00:00
3b73f5de3a Revert "Add OpInfo entry for alias_copy (#127232) (#128142)"
This reverts commit 04da6aeb61f4d57bf73ed1054dd897abbcceca83.

Reverted https://github.com/pytorch/pytorch/pull/128142 on behalf of https://github.com/DanilBaibak due to The changes broke the test_output_match_alias_copy_cpu_complex64 test. ([comment](https://github.com/pytorch/pytorch/pull/128142#issuecomment-2158793878))
2024-06-10 16:17:16 +00:00
c993f1b37f Fix edge cases for gather in inductor (#126893)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126893
Approved by: https://github.com/peterbell10
ghstack dependencies: #126876
2024-06-10 15:31:03 +00:00
04da6aeb61 Add OpInfo entry for alias_copy (#127232) (#128142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128142
Approved by: https://github.com/lezcano
2024-06-10 15:01:53 +00:00
b66e3f0957 Set simdlen based on ATEN_CPU_CAPABILITY (#123514)
It is part of https://github.com/pytorch/pytorch/issues/123224. Set simdlen based on the environment ATEN_CPU_CAPABILITY to control CPU vec ISA like eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123514
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-06-10 09:02:14 +00:00
df43d5843e fix miss isa bool check (#128274)
The new cpp builder missed the ISA bool (dry-compile) check.
<img width="941" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/695ce911-7f6d-401d-b96b-2b9bda751b15">
@jgong5 found that this was missing, so I submitted this PR to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128274
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-06-10 02:45:46 +00:00
cyy
26f6a87ae9 [5/N] Remove unused functions (#127185)
Follows #128193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127185
Approved by: https://github.com/ezyang
2024-06-10 01:57:49 +00:00
d3817d8a60 Don't create python tuple when _maybe_handle_torch_function is called from C++ (#128187)
Marginal overhead reduction when calling through the `torch.ops` API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128187
Approved by: https://github.com/lezcano
ghstack dependencies: #128183, #128184, #128185
2024-06-10 00:16:59 +00:00
cd2ad29afe [inductor] Reduce binding overhead of _reinterpret_tensor (#128185)
Going through the dispatcher + pybind11 + torch.ops adds about 2 us overhead
per call compared to `PyArgParser`.

Note that views of inputs are reconstructed by AOTAutograd before being returned
to the python code, so dispatching for autograd's sake shouldn't be required
here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128185
Approved by: https://github.com/lezcano
ghstack dependencies: #128183, #128184
2024-06-09 23:33:03 +00:00
253fa9c711 [AOTAutograd] Remove runtime import from view replay function (#128184)
`gen_alias_from_base` spends about 0.5 us in this import statement,
which is called for each view in the graph output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128184
Approved by: https://github.com/lezcano
ghstack dependencies: #128183
2024-06-09 23:33:03 +00:00
55b2a0a002 [AOTAutograd] Use _set_grad_enabled instead of no_grad (#128183)
This saves ~1us of overhead from each inductor graph call.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128183
Approved by: https://github.com/lezcano
2024-06-09 23:33:03 +00:00
5e7377e044 [Dynamo][TVM] Make the opt_level parameter adjustable (#127876)
Fixes #127874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127876
Approved by: https://github.com/jansel
2024-06-09 21:38:00 +00:00
c7e2c9c37e [c10d][doc] add a doc page for NCCL ENVs (#128235)
Addressing issue: https://github.com/pytorch/pytorch/issues/128204

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128235
Approved by: https://github.com/wconstab
2024-06-09 16:08:38 +00:00
0bf2fe522a [RFC] Provide optional switches to _dump_nccl_trace (#127651)
Summary:
Data from PyTorch distributed is mostly useful during initial stages of model development.
Provide options to reduce data sent/dumped.
`_dump_nccl_trace` takes 3 optional switches. By default it returns everything, as before.
- `includeCollectives`: option to also include collectives: Default is True.
- `includeStacktraces`: option to include stack traces in collectives. Default is True.
- `onlyActive`: option to only send active collective work - i.e. not completed. Default is
    False (i.e. send everything)
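
The intended effect of the three switches can be modeled with a toy filter. The record layout (`state`, `frames`), the `entries` key, and the snake_case parameter names are assumptions made for this sketch; the real dump format and binding signature are not shown here.

```python
def toy_dump_trace(records, include_collectives=True,
                   include_stacktraces=True, only_active=False):
    """Hypothetical model of the switches; not the real _dump_nccl_trace."""
    result = {}
    if not include_collectives:
        return result  # collectives are omitted entirely
    entries = []
    for rec in records:
        if only_active and rec["state"] == "completed":
            continue  # drop work that has already finished
        entry = dict(rec)
        if not include_stacktraces:
            entry.pop("frames", None)  # keep the record, drop its stack trace
        entries.append(entry)
    result["entries"] = entries
    return result


records = [
    {"op": "all_reduce", "state": "completed", "frames": ["train.py:10"]},
    {"op": "all_gather", "state": "started", "frames": ["train.py:12"]},
]
print(toy_dump_trace(records))
print(toy_dump_trace(records, include_stacktraces=False, only_active=True))
```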

Test Plan:
Unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127651
Approved by: https://github.com/wconstab
2024-06-09 14:00:57 +00:00
75b0720a97 Revert "Use hidden visibility in OBJECTCXX files (#127265)"
This reverts commit 669560d51aa1e81ebd09e2aa8288d0d314407d82.

Reverted https://github.com/pytorch/pytorch/pull/127265 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I suspect that it causes this failure https://github.com/pytorch/vision/issues/8478 on vision where its C++ extension could not be loaded on macOS ([comment](https://github.com/pytorch/pytorch/pull/127265#issuecomment-2156401838))
2024-06-09 09:05:17 +00:00
eqy
4c971932e8 [cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.

What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...

Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
2024-06-09 06:53:34 +00:00
3964a3ec73 Complete revamp of float/promotion sympy handling (#126905)
At a high level, the idea behind this PR is:

* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.

The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:

* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver.
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows for us to eventually generate accurate code for Python semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53 beyond what first coercing the integer to floats and then doing true division.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal is updated to consistently only ever return a float.
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)
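
The precision point behind IntTrueDiv can be seen in plain Python, whose `int / int` rounds the exact quotient once, while converting both operands to float first rounds twice. This is only an illustration of the Python semantics being referred to; the operand values are chosen for the example and nothing here touches the sympy functions.

```python
a = 3 * (2**53 + 1)  # not exactly representable as a float
b = 3

int_style = a / b                  # Python int true division on exact ints
float_style = float(a) / float(b)  # coerce to float first, then divide

print(a // b == 2**53 + 1)       # the exact quotient is an integer above 2**53
print(int_style == float_style)  # the two strategies disagree for these inputs
```

The two results differ only in the last representable step, which is the kind of loss a dedicated integer true division is meant to avoid.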

In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations.  Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.
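
The difference from the builtins can be sketched as follows. `promoting_min` is a hypothetical helper that imitates the stated promotion rule on ordinary Python numbers; it is not `torch.sym_min`.

```python
def promoting_min(a, b):
    """min() whose result type depends only on the argument types."""
    result = min(a, b)
    # If either argument is a float, always return a float, even when the
    # smaller value happens to be the int.
    if isinstance(a, float) or isinstance(b, float):
        return float(result)
    return result


print(type(min(1, 2.0)).__name__)            # builtin: type follows the winner
print(type(promoting_min(1, 2.0)).__name__)  # promoted: type follows the inputs
```

With the builtin, the result type depends on which value is smaller at runtime; with promotion it can be determined from the argument types alone.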

We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:

* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`

These changes have consequences. First, we need to make some administrative changes:

* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
  * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
  * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency to fix tests here
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality are currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.

In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:

* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type
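
The first bullet can be illustrated with exact rationals from the standard library. The values of `x`, `y`, and `c` are arbitrary picks for this sketch, and `Fraction` merely stands in for the rationals that the simplification would produce.

```python
from fractions import Fraction

x, y, c = 1, 1, 2

whole = (x + y) // c                      # integer floor division, stays an int
parts = [Fraction(x, c), Fraction(y, c)]  # after distributing the division
split_floor = x // c + y // c             # flooring each part separately

print(whole)
print(parts)        # each part is a non-integer rational
print(split_floor)  # and the per-term floors do not add back up
```

Once `x / c` appears as a subexpression on its own, an integer-typed value range no longer describes it, which matches the stated reason for dropping the rewrite.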

The new asserts uncovered necessary bug fixes:

* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1

Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**

**Reland notes.** This requires this internal fbcode diff https://www.internalfb.com/phabricator/paste/view/P1403322587 but I cannot prepare the diff codev due to https://fb.workplace.com/groups/osssupport/posts/26343544518600814/

It also requires this Executorch PR https://github.com/pytorch/executorch/pull/3911 but the ET PR can be landed prior to this landing.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
2024-06-09 06:20:25 +00:00
31c3fa6cf5 [audio hash update] update the pinned audio hash (#128178)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128178
Approved by: https://github.com/pytorchbot
2024-06-09 04:29:04 +00:00
cyy
7bfd1db53a [4/N] Change static functions in headers to inline (#128286)
Follows #128194.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128286
Approved by: https://github.com/Skylion007, https://github.com/XuehaiPan
2024-06-09 03:08:53 +00:00
f681e3689b [dtensor][experiment] experimenting with displaying distributed model parameters and printing sharding info (#127987)
**Summary**
Example code to display distributed model parameters and verify them against ground truth. Also prints sharding information.

**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/display_sharding_example.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127987
Approved by: https://github.com/XilunWu
ghstack dependencies: #127358, #127360, #127630
2024-06-09 00:14:07 +00:00
2c2cf1d779 [dtensor][experiment] experimenting with displaying model parameters (#127630)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

**Summary**
Example code to display model parameters and verify them against ground truth. Also expanded on moduletracker to accomplish this.

**Test Plan**
python3 torch/distributed/_tensor/examples/display_sharding_example.py

* #127987
* __->__ #127630
* #127360
* #127358

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127630
Approved by: https://github.com/XilunWu
ghstack dependencies: #127358, #127360
2024-06-09 00:14:07 +00:00
d34075e0bd Add Efficient Attention support on ROCM (#124885)
This patch implements `with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):` by reusing AOTriton's accelerated SDPA implementation

Known limitations:
- Only supports MI200/MI300X GPUs
- Does not support varlen
- Does not support `CausalVariant`
- Optional arguments `causal_diagonal` and `seqlen_k` in `_efficient_attention_forward/backward` must be null
- Does not work well with inductor's SDPA rewriter. The rewriter has been updated to only use math and flash attention on ROCM.

This PR also uses a different approach of installing AOTriton binary instead of building it from source in the base docker image. More details on motivation: https://github.com/pytorch/pytorch/pull/124885#issuecomment-2153229129

`PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" python test/test_transformers.py` yields "55028 passed, 20784 skipped" results with this change.  [Previous result](https://hud.pytorch.org/pr/127528) of `test_transformers.py` was 0 error, 0 failure, 55229 skipped out of 75517 tests in total (the XML report does not contain total number of passed tests).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124885
Approved by: https://github.com/malfet
2024-06-08 22:41:05 +00:00
6e7a23475d [easy] Run autograd if any mutations on inputs that require grad (#128229)
If any inputs that require grad are mutated, we should still run autograd with a backwards graph, even if none of the outputs require grad. This fixes two tests: test_input_mutation_alias_everything and test_view_detach.
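
The rule described above can be sketched as a small predicate. This is an illustration only, not the AOTAutograd implementation: `InputInfo` and `needs_autograd` are hypothetical names, and the assumption that autograd was previously triggered only by outputs requiring grad is inferred from the description.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class InputInfo:
    """Hypothetical per-input metadata (not a PyTorch type)."""

    requires_grad: bool
    is_mutated: bool


def needs_autograd(
    inputs: Sequence[InputInfo], outputs_require_grad: Sequence[bool]
) -> bool:
    """Decide whether a backwards graph is needed."""
    # Assumed pre-existing condition: some output requires grad.
    if any(outputs_require_grad):
        return True
    # Condition described in this commit: a mutated input requires grad,
    # even though no output does.
    return any(inp.requires_grad and inp.is_mutated for inp in inputs)
```

A mutated input that requires grad is enough to return `True` here, while an input that is mutated but does not require grad is not.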

Fixes #128035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128229
Approved by: https://github.com/aorenste
2024-06-08 21:18:38 +00:00
aee154edbe [Traceable FSDP2] Make FSDPParam._unsharded_param creation traceable (#127245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127245
Approved by: https://github.com/awgu
2024-06-08 21:10:15 +00:00
0dd55ee159 Fix bug in _update_process_group API (#128262)
`local_used_map_` was undefined when `find_unused_parameters=False`, which resulted in an error when we ran `local_used_map_.fill_(0);`

Added a unit test as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128262
Approved by: https://github.com/awgu
2024-06-08 19:52:24 +00:00
3494f3f991 [dynamo] Skip inlining builtin nn modules for torch.compile inside cond (#128247)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128247
Approved by: https://github.com/ydwu4
ghstack dependencies: #128246
2024-06-08 19:20:00 +00:00
33972dfd58 [easy][inline-inbuilt-nn-modules] Fix expected graph for control flow test (#128246)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128246
Approved by: https://github.com/ydwu4
2024-06-08 19:20:00 +00:00
57536286e2 Flip default value for mypy disallow_untyped_defs [10/11] (#127847)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127847
Approved by: https://github.com/oulgen
ghstack dependencies: #127842, #127843, #127844, #127845, #127846
2024-06-08 18:50:06 +00:00
8db9dfa2d7 Flip default value for mypy disallow_untyped_defs [9/11] (#127846)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127846
Approved by: https://github.com/ezyang
ghstack dependencies: #127842, #127843, #127844, #127845
2024-06-08 18:50:06 +00:00
27f9d3b0a1 Flip default value for mypy disallow_untyped_defs [8/11] (#127845)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127845
Approved by: https://github.com/oulgen
ghstack dependencies: #127842, #127843, #127844
2024-06-08 18:49:56 +00:00
038b927590 Flip default value for mypy disallow_untyped_defs [7/11] (#127844)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127844
Approved by: https://github.com/oulgen
ghstack dependencies: #127842, #127843
2024-06-08 18:49:45 +00:00
7c12cc7ce4 Flip default value for mypy disallow_untyped_defs [6/11] (#127843)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843
Approved by: https://github.com/oulgen
ghstack dependencies: #127842
2024-06-08 18:49:29 +00:00
3a0d088517 Flip default value for mypy disallow_untyped_defs [5/11] (#127842)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127842
Approved by: https://github.com/oulgen
2024-06-08 18:49:18 +00:00
62bcdc0ac9 Flip default value for mypy disallow_untyped_defs [4/11] (#127841)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127841
Approved by: https://github.com/oulgen
2024-06-08 18:36:48 +00:00
afe15d2d2f Flip default value for mypy disallow_untyped_defs [3/11] (#127840)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127840
Approved by: https://github.com/oulgen
2024-06-08 18:28:01 +00:00
ea614fb2b1 Flip default value for mypy disallow_untyped_defs [2/11] (#127839)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127839
Approved by: https://github.com/oulgen
2024-06-08 18:23:08 +00:00
dcfa7702c3 Flip default value for mypy disallow_untyped_defs [1/11] (#127838)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127838
Approved by: https://github.com/oulgen
2024-06-08 18:16:33 +00:00
2369c719d4 [DSD][BE] Cleanup unused variables and rename variables to avoid exposure to the users (#128249)
These APIs and variables should not be exposed to users as they are designed to be used internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128249
Approved by: https://github.com/wz337
2024-06-08 17:12:17 +00:00
02a901f1e9 Revert "[RFC] Provide optional switches to _dump_nccl_trace (#127651)"
This reverts commit 0a761f0627130e739f0e2748e3f71a0c347552c4.

Reverted https://github.com/pytorch/pytorch/pull/127651 on behalf of https://github.com/atalman due to Breaks internal CI ([comment](https://github.com/pytorch/pytorch/pull/127651#issuecomment-2156076838))
2024-06-08 15:30:04 +00:00
57a24c4fdb Revert "[RFC] add per-collective timeout value in flight recorder (#128190)"
This reverts commit 09cccbc1c74c9d1157c1caca5526e79ee9b7ea01.

Reverted https://github.com/pytorch/pytorch/pull/128190 on behalf of https://github.com/atalman due to Sorry need to revert this, in conflict with https://github.com/pytorch/pytorch/pull/127651 that needs reverting ([comment](https://github.com/pytorch/pytorch/pull/128190#issuecomment-2156075318))
2024-06-08 15:25:27 +00:00
348b181a97 Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007
2024-06-08 15:25:03 +00:00
917387f66d [AOTI] fix a constant tensor device move issue (#128265)
Summary: When copying a constant tensor to another device, `.to` returns a fake tensor and causes a problem when a real tensor is expected.

Test Plan: CI

Differential Revision: D58313034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128265
Approved by: https://github.com/chenyang78
2024-06-08 13:23:49 +00:00
cyy
695502ca65 [3/N] Change static functions in headers to inline (#128194)
Follows #127764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128194
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-06-08 08:06:31 +00:00
73d6ec2db6 Increase verbosity of FX graph dumps (#128042)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128042
Approved by: https://github.com/aorenste
2024-06-08 07:24:58 +00:00
0e6c204642 [pipelining] Friendly error message when not traceable (#128276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128276
Approved by: https://github.com/H-Huang
2024-06-08 06:36:11 +00:00
44371bd432 Revert "[dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578)"
This reverts commit 7ede78f9f5d7e6c993faa1a70a5f0b0eaec5640d.

Reverted https://github.com/pytorch/pytorch/pull/126578 on behalf of https://github.com/anijain2305 due to pippy tests fail ([comment](https://github.com/pytorch/pytorch/pull/126578#issuecomment-2155836555))
2024-06-08 06:35:34 +00:00
6e13c7e874 Revert "[dynamo] Support if cond on UnspecializedNNModuleVariable and add inline tests (#128158)"
This reverts commit 747fc35ff54154ddec2a5ab5661f57c28d65c591.

Reverted https://github.com/pytorch/pytorch/pull/128158 on behalf of https://github.com/anijain2305 due to pippy tests fail ([comment](https://github.com/pytorch/pytorch/pull/128158#issuecomment-2155835787))
2024-06-08 06:32:28 +00:00
94165dba7b Revert "[dynamo] Inline the getattr of fx graph and proxy graph (#128172)"
This reverts commit 662a78f957fb89e53ebeba7deb880561e10ecaf6.

Reverted https://github.com/pytorch/pytorch/pull/128172 on behalf of https://github.com/anijain2305 due to pippy tests fail ([comment](https://github.com/pytorch/pytorch/pull/128172#issuecomment-2155835201))
2024-06-08 06:29:36 +00:00
8a0bc8c9ee [fsdp2] simplify fsdp_param logic with DTensorSpec (#128242)
as titled, we can use a single DTensorSpec to save the SPMD sharding
spec, plus the global shape/stride to simplify the FSDPParam logic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128242
Approved by: https://github.com/awgu
2024-06-08 05:56:41 +00:00
cbb7e3053f View specialization (#127641)
This PR adds specialization shortcuts for converting n-d to 1-d and 1-d to 2-d views.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127641
Approved by: https://github.com/ezyang
2024-06-08 05:52:52 +00:00
310f80995b Added memory budget to partitioner (#126320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126320
Approved by: https://github.com/shunting314
2024-06-08 05:52:40 +00:00
ffc202a1b9 Added remove_noop_ops to joint_graph_passes (#124451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124451
Approved by: https://github.com/ezyang, https://github.com/fmassa
2024-06-08 05:48:11 +00:00
c446851829 [fsdp2] update foreach_reduce accumulate_grad (#128117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128117
Approved by: https://github.com/awgu
2024-06-08 05:13:57 +00:00
613c7d270d [pipelining] Format doc (#128279)
- Should use two dots around `var`
- Wrap lines
- Add section cross ref
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128279
Approved by: https://github.com/H-Huang
ghstack dependencies: #128273, #128278
2024-06-08 04:59:04 +00:00
2e42671619 [pipelining] Rename to stage.py and schedules.py (#128278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128278
Approved by: https://github.com/H-Huang
ghstack dependencies: #128273
2024-06-08 04:42:35 +00:00
0e3fe694d1 [pipelining] Restore a stage constructor for tracer path (#128273)
If the user modified the stage module out of place, such as
mod = DDP(mod)
mod = torch.compile(mod)

they need a stage builder other than `pipe.build_stage()`.

This PR provides an API to do so:
```
stage = build_stage(
  stage_module,
  stage_index,
  pipe.info(),
  ...
)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128273
Approved by: https://github.com/wconstab
2024-06-08 04:42:35 +00:00
8a45cf4c64 [AOTI] align data_size of the constants (#127610)
https://github.com/pytorch/pytorch/pull/124272 set the alignment of `consts_o`, but if the `data_size` of any tensor in `consts_o` is not divisible by the alignment, the tensors that follow it are no longer aligned, resulting in poor performance on CPU.
This PR aligns the `data_size` as well and pads the serialized bytes. Since the `size` of the tensor, not the `data_size`, is used when creating a tensor from the serialized bytes ([link](f4d7cdc5e6/torch/csrc/inductor/aoti_runtime/model.h (L236-L259))), there is no correctness issue. `data_size` is only used to record the [bytes_read](f4d7cdc5e6/torch/csrc/inductor/aoti_runtime/model.h (L217)).

This PR will improve the performance on CPU for 4 models in HF, 7 models in TIMM and 1 model in Torchbench.

For the unit test, I added a bias value whose original `data_size` is not divisible by the alignment to test correctness:
```
constants_info_[0].dtype = static_cast<int32_t>(at::kFloat);
constants_info_[0].data_size = 64; // was 40 before this PR
constants_info_[0].shape = {10};

constants_info_[1].dtype = static_cast<int32_t>(at::kFloat);
......
```
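
The padding arithmetic can be sketched as follows. This is a minimal illustration, not the AOTInductor code: `align_up` and `pad_serialized` are hypothetical helpers, and the 64-byte alignment is an assumption inferred from the test values above (10 floats occupy 40 bytes, recorded as 64 after this PR).

```python
def align_up(nbytes: int, alignment: int = 64) -> int:
    """Round nbytes up to the next multiple of alignment (assumed 64 here)."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return (nbytes + alignment - 1) // alignment * alignment


def pad_serialized(data: bytes, alignment: int = 64) -> bytes:
    """Zero-pad serialized tensor bytes so the next tensor starts aligned."""
    padding = align_up(len(data), alignment) - len(data)
    return data + b"\x00" * padding


# A float32 tensor of shape {10} occupies 10 * 4 = 40 bytes before padding.
raw_size = 10 * 4
aligned_size = align_up(raw_size)
```

Because only the tensor's `size` is used to rebuild it, the trailing padding bytes are never interpreted as tensor data.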

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127610
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-08 04:31:00 +00:00
1d84c7e100 [DeviceMesh] Update get_group and add get_all_groups (#128097)
Fixes #121984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128097
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-06-08 04:28:56 +00:00
6e5c2a1a3b [inductor] Add missing files to torch_key (#128230)
Previously, no subdirs (like torch.inductor.codegen) were hashed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128230
Approved by: https://github.com/oulgen
2024-06-08 03:26:48 +00:00
6220602943 [torchbind] support query schema of methods (#128267)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128267
Approved by: https://github.com/angelayi
2024-06-08 03:20:44 +00:00
0ef5229569 Revert "Change lerp decomp to use aten.as_strided_copy instead of prims.copy_strided (#128030)"
This reverts commit fdf1666b20f63e4acf01798f009e478d997a7f7f.

Reverted https://github.com/pytorch/pytorch/pull/128030 on behalf of https://github.com/nWEIdia due to breaking cuda12.1 test_cuda, see HUD https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor ([comment](https://github.com/pytorch/pytorch/pull/128030#issuecomment-2155764546))
2024-06-08 02:34:06 +00:00
f9508b4c1f [pipelining] Update Pipelining Docs (#128236)
----

- Bring PipelineStage/Schedule more front-and-center
- provide details on how to manually construct PipelineStage
- move tracer example and manual example below so the high-level flow
  (e2e) is closer to the top
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128236
Approved by: https://github.com/H-Huang
ghstack dependencies: #128201, #128228
2024-06-08 02:03:46 +00:00
fe74bbd6f0 init sigmoid comments (#127983)
Fixes #127913

### Description
Add docstring to `torch/onnx/symbolic_opset9.py`:`sigmoid` function

### Checklist

- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127983
Approved by: https://github.com/xadupre
2024-06-08 01:48:00 +00:00
921aa194c7 [pipelining] Move modify_graph_op_device to _IR.py (#128241)
This part is more IR related.
Thus moving from `PipelineStage` constructor to `pipe.build_stage(..., device, ...)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128241
Approved by: https://github.com/wconstab
ghstack dependencies: #128240
2024-06-08 01:35:07 +00:00
ad96f991a5 [pipelining] Add pipe.build_stage() (#128240)
The `PipelineStage` name has been given to the manual side.
This PR therefore adds a method under `Pipe` to create a PipelineStage.
Moved `PipeInfo` to utils.py to avoid circular dependency between `_IR` and `PipelineStage`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128240
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2024-06-08 01:26:02 +00:00
5ef081031e [MPS] Include MPSGraphVenturaOps.h for complex types on macOS 12 (#127859)
Fixes this on macOS 12:

```
/Users/qqaatw/Forks/pytorch/aten/src/ATen/native/mps/operations/FastFourierTransform.mm:108:60: error: use of undeclared identifier 'MPSDataTypeComplexFloat16'; did you mean 'MPSDataTypeFloat16'?
            (inputTensor.dataType == MPSDataTypeFloat16) ? MPSDataTypeComplexFloat16 : MPSDataTypeComplexFloat32;
                                                           ^~~~~~~~~~~~~~~~~~~~~~~~~
                                                           MPSDataTypeFloat16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127859
Approved by: https://github.com/kulinseth
2024-06-08 00:54:30 +00:00
647815049e Inductor: Allow small sizes of m for mixed mm autotuning (#127663)
For mixed mm with small sizes of m, such as in the example provided in #127056, being able to set BLOCK_M to 16 leads to better performance. This PR introduces kernel configs that are specific to mixed mm by extending the mm configs with two configs that work well for the example provided in #127056.
I am excluding configs with (BLOCK_M=16, BLOCK_K=16, BLOCK_N=64) because triton crashes when this config is used.

For the example in #127056:
- Without my changes, skip_triton evaluates to true, which disables autotuning. On my machine I achieve 146 GB/s.
- If autotuning is enabled, but BLOCK_M>=32, I achieve 614 GB/s.
- With the changes in this PR (i.e. autotuning enabled and BLOCK_M=16), I achieve 772 GB/s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127663
Approved by: https://github.com/Chillee
2024-06-08 00:46:16 +00:00
cyy
ef2b5ed500 [4/N] Remove unused functions (#128193)
Follows #128179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128193
Approved by: https://github.com/ezyang
2024-06-08 00:09:26 +00:00
39dd4740e6 [inductor][dynamo-inline-nn-modules] Fix test with inlining flag (#128200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128200
Approved by: https://github.com/Skylion007
ghstack dependencies: #128001, #126578, #128158, #128172
2024-06-07 23:51:58 +00:00
bef586111a [pipelining] pipelining.rst updates (#128228)
fix some nits and add `PipelineStage` (manual)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128228
Approved by: https://github.com/wconstab
ghstack dependencies: #128201
2024-06-07 23:29:54 +00:00
09cccbc1c7 [RFC] add per-collective timeout value in flight recorder (#128190)
Summary:
Add timeout value field on every collected record.

Test Plan:
Unit tests

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128190
Approved by: https://github.com/wconstab
2024-06-07 23:29:35 +00:00
11f2d8e823 Move inductor cuda 124 jobs to a separate workflow that is not triggered by ciflow/inductor (#128250)
https://github.com/pytorch/pytorch/pull/127825

The majority of the g5 runner usage comes from inductor (it's something like 2x everything else).
In the past week, inductor ran about 1300 times on PRs and 300 times on main. Inductor-periodic ran 50 times on main, so the previous move from inductor -> inductor-periodic only results in 250 fewer runs.

I was under the impression that cu124 is currently experimental and that we'll eventually need to switch to it, so this will stay until we switch or inductor uses far fewer runners.

Are we expected to be able to handle two versions of cuda in CI? Currently we cannot, at least not comfortably.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128250
Approved by: https://github.com/huydhn
2024-06-07 23:01:52 +00:00
5b3624117a update test_issue175 to handle inline_inbuilt_nn_modules (#128026)
With inlining, the output graph has more function calls; this reflects them in the test that counts the number of function calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128026
Approved by: https://github.com/anijain2305
ghstack dependencies: #127553
2024-06-07 22:07:16 +00:00
ba81c3c290 [inductor] add cpp builder code. (take 2) (#125849)
Fully manual rebase of the code from PR: https://github.com/pytorch/pytorch/pull/124045
The old PR appears to have broken after too many commits and too many rebases. For reference: https://github.com/pytorch/pytorch/pull/124045#issuecomment-2103744588

-------
It is the first step of RFC https://github.com/pytorch/pytorch/issues/124245.
Changes:
1. Add cpp builder code; the new cpp_builder supports Windows.
2. Add a CPU ISA checker that is cross-OS and exported from the backend cpuinfo.
3. Switch the compiler ISA checker to the new cpp builder.
4. CppCodeCache uses the new ISA checker.
5. Add a temporary `test_new_cpp_build_logical` UT to help with the transfer to the new code.
<img width="1853" alt="Image" src="https://github.com/pytorch/pytorch/assets/8433590/ce6519ab-ba92-4204-b1d6-7d15d2ba2cbe">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125849
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-07 20:49:58 +00:00
3a620a0f65 bug fix of dynamo_timed in cprofile (#128203)
Fixes #ISSUE_NUMBER

fb-only: "Entire Frame" was missing before this change.

Before: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/f565966006-TrainingApplication/20240527/rank_0/5_0_1/compilation_metrics_23.html
After: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/f569854578-TrainingApplication/20240606/rank_0/0_0_0/compilation_metrics_16.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128203
Approved by: https://github.com/Chillee
2024-06-07 20:47:27 +00:00
8892ddaacc [TD] Test removal on sm86 (#127131)
Yolo

I'm excited to break CI :')
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127131
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-06-07 20:19:18 +00:00
fdf1666b20 Change lerp decomp to use aten.as_strided_copy instead of prims.copy_strided (#128030)
aten.lerp decomposition causes prims::copy_strided to appear in the graph, which is not core aten.

Internal ref: https://fb.workplace.com/groups/pytorch.edge.users/permalink/1525644288305859/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128030
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-06-07 20:12:52 +00:00
e647ea55a3 [pipelining] redirect README to document (#128205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128205
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2024-06-07 19:34:52 +00:00
dcb63fcedb [pipelining] Remove num_microbatches from stage (#128201)
This is similar to https://github.com/pytorch/pytorch/pull/127979, but instead of removing `num_microbatches` from schedule, we remove it from `PipelineStage`. This also means that during `PipelineSchedule` init we need to set up the buffers for the stage(s).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128201
Approved by: https://github.com/kwen2501
2024-06-07 18:56:44 +00:00
cafbcb6376 [BE]: Update ruff to 0.4.8 (#128214)
Updates ruff to 0.4.8. Some minor fixes, but most notably it is 10% faster on the microbenchmark and should further reduce local and CI runtime of the linter. Also includes a few bugfixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128214
Approved by: https://github.com/ezyang
2024-06-07 18:41:35 +00:00
8ca4cefc7d [C10D] Ensure gil is not released when calling toPyBytes (#128212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128212
Approved by: https://github.com/Skylion007, https://github.com/XilunWu
2024-06-07 18:24:10 +00:00
0a6df4fca6 delete inductor config.trace.compile_profile (#127143)
Fixes #ISSUE_NUMBER

https://fb.workplace.com/groups/257735836456307/posts/687858786777341/?comment_id=687861123443774&reply_comment_id=687865486776671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127143
Approved by: https://github.com/Chillee
2024-06-07 18:05:50 +00:00
82d7a36a27 Added torchao nightly workflow (#128152)
Summary:
Add torchao benchmark workflow, upload the artifacts to GHA.

X-link: https://github.com/pytorch/benchmark/pull/2273

Test Plan:
```
python run_benchmark.py torchao --ci
```

Differential Revision: D58140479

Pulled By: xuzhao9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128152
Approved by: https://github.com/jerryzh168
2024-06-07 17:52:15 +00:00
0c7f4353e5 [inductor] simplify indexing (#127661)
This is a short term fix for: https://github.com/pytorch/pytorch/issues/124002

We found that the bad perf of the int8_unpack kernel is due to sub-optimal indexing. In this PR we introduce 2 indexing optimizations:
1. Expand FloorDiv to the entire expression when feasible. E.g. `x1 * 1024 + x2 // 2` will be transformed to `(x1 * 2048 + x2) // 2`. The motivation is that we have more chance to simplify loops for `x1 * 2048 + x2`.
2. Merge ModularIndexing pairs: `ModularIndexing(ModularIndexing(x, 1, a), 1, b)` can be simplified to `ModularIndexing(x, 1, b)` if a is a multiple of b.

With both indexing optimizations, we improve int8_unpack perf by 1.54x (183us -> 119us).
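
Both rewrites can be checked numerically with plain integers. The sketch below is not inductor code: it assumes that `ModularIndexing(x, divisor, modulus)` means `(x // divisor) % modulus` and that the index variables are non-negative, and all function names are hypothetical.

```python
def floor_div_original(x1: int, x2: int) -> int:
    """Original form: FloorDiv applied only to the x2 term."""
    return x1 * 1024 + x2 // 2


def floor_div_expanded(x1: int, x2: int) -> int:
    """Rewritten form: FloorDiv applied to the whole expression."""
    return (x1 * 2048 + x2) // 2


def modular_indexing(x: int, divisor: int, modulus: int) -> int:
    """Assumed semantics of ModularIndexing: (x // divisor) % modulus."""
    return (x // divisor) % modulus


def check_identities(limit: int = 64, a: int = 12, b: int = 4) -> bool:
    """Brute-force both identities over a small range; a must be a multiple of b."""
    for x1 in range(limit):
        for x2 in range(limit):
            if floor_div_original(x1, x2) != floor_div_expanded(x1, x2):
                return False
    for x in range(limit * 4):
        nested = modular_indexing(modular_indexing(x, 1, a), 1, b)
        if nested != modular_indexing(x, 1, b):
            return False
    return True
```

The divisibility condition matters: with `a = 10` and `b = 4`, `x = 10` gives `(10 % 10) % 4 = 0` for the nested form but `10 % 4 = 2` for the merged one.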

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127661
Approved by: https://github.com/jansel
2024-06-07 17:51:30 +00:00
662a78f957 [dynamo] Inline the getattr of fx graph and proxy graph (#128172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128172
Approved by: https://github.com/yanboliang
ghstack dependencies: #128001, #126578, #128158
2024-06-07 17:14:58 +00:00
19b31d899a Fix 'get_real_value' on placeholder nodes (#127698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127698
Approved by: https://github.com/jansel
ghstack dependencies: #127695, #127696
2024-06-07 17:13:43 +00:00
b741819b05 Fix 'get_attr' call in dynamo 'run_node' (#127696)
Fixes #124858

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127696
Approved by: https://github.com/jansel
ghstack dependencies: #127695
2024-06-07 17:13:43 +00:00
3aa623d407 Fix assume_constant_result for UnspecializedNNModuleVariable methods (#127695)
Fixes #127509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127695
Approved by: https://github.com/jansel
2024-06-07 17:13:43 +00:00
754e6d4ad0 Make jobs with LF runners still pass lint (#128175)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128175
Approved by: https://github.com/huydhn
2024-06-07 17:13:04 +00:00
85758fa5ae [c10d][TCPStore] make TCPStore server use libuv by default (#127957)
**Summary**
This PR switches the default TCPStore server backend to a new implementation that utilizes [`libuv`](https://github.com/libuv/libuv) for significantly lower initialization time and better scalability:
<img width="714" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/18503011-da5d-4104-8ba9-abc456438b02">

We hope this improvement benefits users with a much shorter startup time in large-scale jobs. Eventually, we hope to fully replace the old TCPStore backend implementation with the libuv one.

**What it changes**
This PR changes the underlying TCPStore server backend to `libuv` unless users explicitly choose the old TCPStore server. This change should not be noticeable to users except for significantly faster TCPStore startup in large-scale jobs.

One thing to note: the libuv backend does not support the initialization approach where the user passes in a socket. We plan to support it as a next step, but have chosen to disable it until it is fully tested. If you initialize TCPStore this way, see the next section to keep using the old TCPStore server.

**Fallback: keep using the old TCPStore server**
For users who want to stay with the old TCPStore backend, there are 3 ways:

1. If the user is directly instantiating a TCPStore object, they can pass in the argument `use_libuv=False` to use the old TCPStore server backend, e.g. `store = torch.distributed.TCPStore(..., use_libuv=False)`.
2. Or, specify the TCPStore backend option in `init_method` when calling the default ProcessGroup init, e.g. `torch.distributed.init_process_group(..., init_method="{YOUR_RENDEZVOUS_METHOD}://{YOUR_HOSTNAME}:{YOUR_PORT}?use_libuv=0")`
3. Or, the user can set the environment variable `USE_LIBUV` to `"0"` when launching.

These 3 approaches are listed in order of precedence. That is, if the user specifies `use_libuv=0` in `init_method` and also sets the environment variable `USE_LIBUV="1"`, the former takes effect and the TCPStore backend instantiated will be the old one instead of the one using libuv.
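
The precedence order described above can be sketched in plain Python. This is an illustration only and not how PyTorch parses these options: `resolve_use_libuv` is a hypothetical helper, and treating `"1"` as the only truthy value is an assumption.

```python
import os
from typing import Mapping, Optional
from urllib.parse import parse_qs, urlparse


def resolve_use_libuv(
    explicit: Optional[bool] = None,
    init_method: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Hypothetical resolver: explicit argument > init_method query > env var."""
    # 1. An explicit use_libuv argument wins over everything else.
    if explicit is not None:
        return explicit
    # 2. Next, a use_libuv option in the init_method query string.
    if init_method is not None:
        query = parse_qs(urlparse(init_method).query)
        if "use_libuv" in query:
            return query["use_libuv"][-1] == "1"
    # 3. Then the USE_LIBUV environment variable.
    env = os.environ if env is None else env
    if "USE_LIBUV" in env:
        return env["USE_LIBUV"] == "1"
    # 4. Default after this PR: the libuv backend.
    return True
```

For example, `resolve_use_libuv(init_method="tcp://localhost:29500?use_libuv=0", env={"USE_LIBUV": "1"})` selects the old backend, matching the case described in the text.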

**Operating Systems Compatibility**
From the CI signals, we believe the new implementation has the same behavior as the old TCPStore server on all supported platforms. If you notice any behavior discrepancy, please file an issue with `oncall: distributed` label.

**Test Plan**
`pytest test/distributed/test_store.py`
<img width="2548" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/dc0aebeb-6d5a-4daa-b98c-e56bd39aa588">
note: `TestMultiThreadedWait::test_wait` is a broken test that has been there for some time.

`test/distributed/elastic/utils/distributed_test.py`
<img width="2558" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/a6a3266d-b798-41c4-94d2-152056a034f6">

**TODO**
1. Update the doc at

- https://pytorch.org/docs/stable/distributed.html#distributed-key-value-store
- https://pytorch.org/docs/stable/distributed.html#tcp-initialization

2. Make torch elastic rendezvous to use libuv TCPStore as well. See `torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py` cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @kurman
3. Test if libuv backend is okay with initialization with socket. Change `LibUvTCPStoreTest::test_take_over_listen_socket`.

Differential Revision: [D58259591](https://our.internmc.facebook.com/intern/diff/D58259591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127957
Approved by: https://github.com/kurman
ghstack dependencies: #127956
2024-06-07 16:53:01 +00:00
6c824cd9fb [BE][c10d] fix use of TORCH_ERROR in TCPStore libuv backend (#127956)
**Summary**
The use of TORCH_ERROR in the TCPStore libuv backend code needs to be updated.

Differential Revision: [D58259589](https://our.internmc.facebook.com/intern/diff/D58259589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127956
Approved by: https://github.com/shuqiangzhang, https://github.com/cyyever
2024-06-07 16:53:01 +00:00
b9b89ed638 [pipelining] fix LoopedBFS (#127796)
# Issues

Currently two issues need to be fixed with LoopedBFS:
1. The wrap-around send operation to the looped-around stage blocks will cause a hang. For some reason this doesn't surface on a single node, but it does on multihost.
<img width="1311" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/210d9d18-455f-4f65-8a11-7ce2c1ec73fd">
2. When microbatches are popped off in `backward_one_chunk`, it automatically uses the `bwd_chunk_id` starting from 0. This works for interleaved 1f1b and 1f1b, but for LoopedBFS we want to pop starting at `num_microbatches - 1`. Does the same need to be fixed for gpipe?

# Changes
- Update LoopedBFS implementation to share `_step_microbatches` with `Interleaved1F1B`
- Also share the tests between the two schedules for varying num_microbatches, local_stages, and world_sizes
- Update `backward_one_chunk` to optionally take a `bwd_chunk_id` argument.
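
The optional `bwd_chunk_id` argument can be illustrated with a toy stage. This is a sketch, not the pipelining implementation: `ToyStage` and `run` are hypothetical, the cached values are placeholders, and the descending pop order for LoopedBFS is an assumption based on "starting at `num_microbatches - 1`".

```python
from typing import Dict, List, Optional


class ToyStage:
    """Hypothetical stand-in for a pipeline stage; not the PyTorch class."""

    def __init__(self) -> None:
        self.fwd_cache: Dict[int, str] = {}
        self._next_bwd_chunk_id = 0

    def forward_one_chunk(self, chunk_id: int) -> None:
        # Save placeholder "activations" for the later backward pass.
        self.fwd_cache[chunk_id] = f"activations-{chunk_id}"

    def backward_one_chunk(self, bwd_chunk_id: Optional[int] = None) -> int:
        # Default behavior: count up from 0, as 1F1B-style schedules expect.
        if bwd_chunk_id is None:
            bwd_chunk_id = self._next_bwd_chunk_id
            self._next_bwd_chunk_id += 1
        self.fwd_cache.pop(bwd_chunk_id)
        return bwd_chunk_id


def run(num_microbatches: int, reverse: bool) -> List[int]:
    """Return the order in which microbatches are popped during backward."""
    stage = ToyStage()
    for i in range(num_microbatches):
        stage.forward_one_chunk(i)
    if reverse:
        # Assumed LoopedBFS-style order: start at num_microbatches - 1.
        ids = range(num_microbatches - 1, -1, -1)
        return [stage.backward_one_chunk(i) for i in ids]
    return [stage.backward_one_chunk() for _ in range(num_microbatches)]
```

Keeping the argument optional means schedules that rely on the counter starting from 0 do not need to change.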

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127796
Approved by: https://github.com/wconstab
2024-06-07 16:46:38 +00:00
d9696ea624 [AOTInductor] [Tooling] Update NaN and INF Checker for AOTInductor (#127574)
Summary:
1. Integrate NaN and INF checker with existing config, controllable by env var.
2. Move the injection point of the NaN & INF checker earlier; this can prevent buffers from being freed before the check.
3. Inject debugging code at the kernel level, which prevents us from trying to read buffers that are fused in place into a single kernel.
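
As a rough illustration of an env-var-gated check that runs right after a buffer is produced, here is a pure-Python sketch. It is not the AOTInductor code: `check_buffer` and `run_kernel` are hypothetical names, buffers are plain lists of floats, and treating the value `"1"` as "enabled" is an assumption. Only the variable name `TORCHINDUCTOR_NAN_ASSERTS` comes from the test plan below.

```python
import math
import os


def nan_asserts_enabled():
    # Assumption: the check counts as enabled when the variable is set to "1".
    return os.environ.get("TORCHINDUCTOR_NAN_ASSERTS") == "1"


def check_buffer(name, values):
    """Raise if any value is NaN or +/-Inf; do nothing when the check is disabled."""
    if not nan_asserts_enabled():
        return
    for i, v in enumerate(values):
        if math.isnan(v) or math.isinf(v):
            raise AssertionError(f"{name}[{i}] is {v}")


def run_kernel(inputs):
    # Toy "kernel": the output buffer is checked right after it is produced,
    # before anything downstream could free or overwrite it.
    out = [x * 2.0 for x in inputs]
    check_buffer("out", out)
    return out
```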

Test Plan:
Debugging utility.
Test and check by existing tests with env var:
```
TORCHINDUCTOR_NAN_ASSERTS=1 TORCHINDUCTOR_MAX_AUTOTUNE=0 python test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCuda.test_seq_non_abi_compatible_cuda
```

Reviewed By: ColinPeppler

Differential Revision: D57989176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127574
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-06-07 16:46:26 +00:00
fc6e3ff96d [ROCm] Update triton pin to fix libtanh issue (#125396)
There were some internal build issues related to tanh when we moved to upstream triton in ROCm. These issues were fixed by the following triton commit: https://github.com/triton-lang/triton/pull/3810 . This PR moves the triton pin to incorporate that change. Added some skips for unit tests that regressed due to the triton commit bump in this PR.

Needs https://github.com/pytorch/pytorch/pull/127968 since this PR introduces a triton dependency on llnl-hatchet, which doesn't have py3.12 wheels available currently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125396
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-06-07 16:23:04 +00:00
128952625b Revert "Added memory budget to partitioner (#126320)"
This reverts commit 2184cdd29128a924583e4702489177f83fb8270a.

Reverted https://github.com/pytorch/pytorch/pull/126320 on behalf of https://github.com/ZainRizvi due to The new test_ac.py fails on ROCm machines ([comment](https://github.com/pytorch/pytorch/pull/126320#issuecomment-2155141886))
2024-06-07 16:15:03 +00:00
cyy
c219fa5eb9 [3/N] Remove unused functions (#128179)
Following https://github.com/pytorch/pytorch/pull/128005, this PR continues to remove unused functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128179
Approved by: https://github.com/ezyang
2024-06-07 16:13:16 +00:00
8d16a73f0f Manipulate triton_hash_with_backend so that it doesn't contain any keywords (#128159)
Summary: See https://github.com/pytorch/pytorch/issues/127637 where "def" appears in the backend_hash and causes a problem.
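
To see why a hash can contain a keyword at all: a lowercase hex digest uses only `0-9` and `a-f`, so `def` can appear by chance. The sketch below only demonstrates the problem and one possible mitigation (uppercasing), which is an assumption and not necessarily what this PR does. `contains_keyword` and the sample hash are hypothetical.

```python
import keyword


def contains_keyword(text):
    """Return the Python keywords that occur as substrings of `text`."""
    return [kw for kw in keyword.kwlist if kw in text]


# A made-up hex string that happens to contain "def".
example_hash = "12abcdef34"

# Possible mitigation (assumption): uppercase the digest, since Python
# keywords are case-sensitive and none of them is all-uppercase.
safe_hash = example_hash.upper()
```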

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128159
Approved by: https://github.com/jansel
2024-06-07 16:10:44 +00:00
852b7b4c99 [inductor] Enable subprocess-based parallel compile as the default (#126817)
Differential Revision: [D58239826](https://our.internmc.facebook.com/intern/diff/D58239826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126817
Approved by: https://github.com/eellison
ghstack dependencies: #128037, #128086
2024-06-07 16:10:11 +00:00
ac51f782fe Revert "Complete revamp of float/promotion sympy handling (#126905)"
This reverts commit 2f7cfecd86009a9d396fdbdcdfb4ba7a005db16b.

Reverted https://github.com/pytorch/pytorch/pull/126905 on behalf of https://github.com/atalman due to Sorry need to revert - failing internally ([comment](https://github.com/pytorch/pytorch/pull/126905#issuecomment-2155118778))
2024-06-07 16:01:46 +00:00
23c156cd2d Revert "[inductor] simplify indexing (#127661)"
This reverts commit 901226ae837bd4629b34735c84a3481c4988bb5b.

Reverted https://github.com/pytorch/pytorch/pull/127661 on behalf of https://github.com/atalman due to Sorry reverting because in conflict with https://github.com/pytorch/pytorch/pull/126905 which needs to be reverted, will be relanding it ([comment](https://github.com/pytorch/pytorch/pull/127661#issuecomment-2155115388))
2024-06-07 15:58:36 +00:00
cyy
a1b664adeb Add default values to PyTorchMemEffAttention::AttentionKernel::Params members (#112215)
Default values were added to Params in order to eliminate CUDA warnings like
```
and the implicitly-defined constructor does not initialize ‘PyTorchMemEffAttention::AttentionKernel<float, cutlass::arch::Sm80, true, 64, 64, 64, true, true>::accum_t PyTorchMemEffAttention::AttentionKernel<float, cutlass::arch::Sm80, true, 64, 64, 64, true, true>::Params::scale’
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112215
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-06-07 15:54:07 +00:00
3090667cf9 [pipelining] pipeline() taking microbatch as example input (#128163)
Changed the API of `pipeline()` to take microbatch instead of full batch as example args.

Main purpose is to:
- make this API more atomic;
- decouple tracing frontend from runtime info like `num_chunks`.

Side effects:
- Creates opportunity for varying `num_chunks` of schedules with the same `pipe` object.
- User has to create example microbatch input.
- Chunk spec stuff is now all moved to the runtime side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128163
Approved by: https://github.com/H-Huang
2024-06-07 15:51:53 +00:00
224b4339e5 Revert "Make ValueRange repr less chatty by default (#128043)"
This reverts commit f0dd11df5534ae074ad2d090e6700576a22719d6.

Reverted https://github.com/pytorch/pytorch/pull/128043 on behalf of https://github.com/atalman due to Sorry reverting because in conflict with [#126905](https://github.com/pytorch/pytorch/pull/126905) which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/128043#issuecomment-2155091732))
2024-06-07 15:43:39 +00:00
6e75024ff0 Run TestAOTAutograd with dynamo (#128047)
My goal is to run these tests with the autograd cache on, but first I want them running with dynamo. These tests already caught an interesting issue so I thought it would be helpful to just have them.

Next up I'll have a second subclass of these tests, run them twice, and expect a cache hit the second time from autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128047
Approved by: https://github.com/ezyang
2024-06-07 15:42:28 +00:00
771be55bb0 Documenting torch.onnx.operator.shape_as_tensor (#128051)
Fixes #127890

This PR adds docstring to the `torch.onnx.operator.shape_as_tensor` function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128051
Approved by: https://github.com/xadupre
2024-06-07 15:20:18 +00:00
3f9798a4fd add docstring to masked_fill, expand, select, unsqueeze, cat fns (#128055)
Fixes #127891
Fixes #127893
Fixes #127894
Fixes #127907
Fixes #127910

## Description
Add docstring to `masked_fill`, `expand`, `select`, `unsqueeze`, and `cat` functions in torch.onnx.symbolic_opset9.py

remaining pydocstyle errors: 257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128055
Approved by: https://github.com/xadupre
2024-06-07 15:17:22 +00:00
543a870943 [pipelining] Rename ManualPipelineStage -> PipelineStage (#128157)
Renaming ManualPipelineStage to remove the "Manual" part. I needed to replace the existing `PipelineStage` which takes in the `pipe` argument, so I have renamed that to `TracerPipelineStage`. @kwen2501 will remove this entirely in favor of adding a util to `Pipe` to just create the stage directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128157
Approved by: https://github.com/wconstab
2024-06-07 09:24:16 +00:00
5f81265572 [Traceable FSDP2] Return early from _register_post_backward_hook when compile (#127864)
Dynamo doesn't support `RegisterPostBackwardFunction` very well yet. This PR skips it and relies on `root_post_backward_callback` under compile. We will improve `RegisterPostBackwardFunction` support in Q3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127864
Approved by: https://github.com/awgu
2024-06-07 09:19:07 +00:00
7efaeb1494 [AOTI] docs: add suggestion to turn on freezing on CPU (#128010)
With https://github.com/pytorch/pytorch/pull/124350 landed, it is now suggested in AOTI to turn on freezing on CPU to get better performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128010
Approved by: https://github.com/desertfire
2024-06-07 08:57:02 +00:00
0c16800b4a [pipelining] include lifted constants in input_to_state (#128173)
Previous PR only looked at state dict to determine inputs to state, missing out on lifted tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128173
Approved by: https://github.com/kwen2501
2024-06-07 08:40:54 +00:00
01601ebd41 Retire torch.distributed.pipeline (#127354)
Actually retiring the module, after it has carried a deprecation warning for a while.
The new supported module is: torch.distributed.pipelining.
Please migrate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354
Approved by: https://github.com/wconstab
2024-06-07 08:11:58 +00:00
70724bdbfe Bugfix for nondeterminstic torch_key (#128111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128111
Approved by: https://github.com/oulgen
2024-06-07 07:17:39 +00:00
00c6ca4459 [compiled autograd][cudagraphs] Inputs runtime wrapper to move cpu scalars to cuda (#125382)
CPU scalars are most commonly used for the philox random seed. Right now, any CPU input will skip cudagraphing the entire graph. We need both the traced graph and the runtime inputs to be cudaified.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125382
Approved by: https://github.com/jansel
2024-06-07 07:12:46 +00:00
190f06d468 [pipelining] Lower _configure_data_parallel_mode to stage (#127946)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127946
Approved by: https://github.com/wconstab
ghstack dependencies: #127935
2024-06-07 07:06:23 +00:00
a448b3ae95 [Traceable FSDP2] Check hasattr('fsdp_pre_all_gather') only when not compile (#127855)
Dynamo doesn't support `hasattr(inner_tensor, "fsdp_post_all_gather")` yet. We will work on this support in Q3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127855
Approved by: https://github.com/awgu
2024-06-07 06:36:40 +00:00
2ff312359c skip hf_T5_generate in dynamic shape test (#121129)
As reported in https://github.com/pytorch/pytorch/issues/119434, `hf_T5_generate` failed with dynamic shape testing. In this PR we propose to skip the dynamic batch size testing of this model.

* Error msg is
```
  File "/home/jiayisun/pytorch/torch/_dynamo/guards.py", line 705, in SHAPE_ENV
    guards = output_graph.shape_env.produce_guards(
  File "/home/jiayisun/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3253, in produce_guards
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs_tensor'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of RelaxedUnspecConstraint(L['inputs_tensor'].size()[0]) are valid because L['inputs_tensor'].size()[0] was inferred to be a constant (4).
```

* Root Cause is
This error happens while creating guard for this [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L561): `scores += position_bias_masked`
I ran it with TORCH_LOGS="+dynamic" and got the key line: `I0305 00:21:00.849974 140376923287424 torch/fx/experimental/symbolic_shapes.py:3963] [6/0_1] eval Eq(s0, 4) [guard added] at miniconda3/envs/pt2/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py:561 in forward (_refs/__init__.py:403 in _broadcast_shapes)`
The reason for this error is that the batch dimension of `inputs_tensor` in the dynamic batch size test is marked as dynamic shape `s0`, so the batch dimension of `scores`, generated by a series of operations with `inputs_tensor`, is also `s0`. However, because the function that creates `attention_mask` runs in Python rather than in Dynamo, the batch dimension of `attention_mask` is the real shape `4`, and the batch dimension of `position_bias_masked`, generated by a series of operations with `attention_mask`, is also the real shape `4`, not the dynamic shape `s0`. The line `scores += position_bias_masked` requires creating a guard that checks whether the batch dimension of `scores` is always equal to the batch dimension of `position_bias_masked`, i.e. Eq(s0, 4), and this is where the error happens.
So the root cause of this error is that the function that creates `attention_mask` runs in Python rather than in Dynamo. It is not in Dynamo because Dynamo has a graph break on this function (at the [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L476): `is_pad_token_in_inputs = (pad_token_id is not None) and (pad_token_id in inputs)`) due to the following error:
`torch._dynamo.exc.Unsupported: Tensor.item`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121129
Approved by: https://github.com/leslie-fang-intel, https://github.com/ezyang
2024-06-07 06:28:29 +00:00
d943357a21 [XPU] Add xpu support of make triton (#126513)
This PR is to add XPU support for `make triton`.

If a user wishes to use Triton with XPU support, the user needs to install the  [intel-xpu-backend-for-triton](https://github.com/intel/intel-xpu-backend-for-triton).

This PR allows the user to easily install Triton for xpu backend support:

```
# clone the pytorch repo
export USE_XPU=1
make triton
```
The XPU version of triton will always be built from the source. It will cat the commit id from `.ci/docker/ci_commit_pins/triton-xpu.txt`, for example, `b8c64f64c18d8cac598b3adb355c21e7439c21de`.

So the final call would be like:

```
pip install --force-reinstall "git+https://github.com/intel/intel-xpu-backend-for-triton@b8c64f64c18d8cac598b3adb355c21e7439c21de#subdirectory=python"
```
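
The way the final command is assembled from the pinned commit can be sketched as follows. This is an illustration only: `triton_xpu_pip_command` is a hypothetical helper, and the actual `make triton` target may build the command differently. The repository URL and the example commit id are taken from the text above.

```python
def triton_xpu_pip_command(commit):
    """Build the pip command for a given pinned commit (hypothetical helper)."""
    repo = "https://github.com/intel/intel-xpu-backend-for-triton"
    # Strip whitespace in case the pin was read from a file with a trailing newline.
    commit = commit.strip()
    return f'pip install --force-reinstall "git+{repo}@{commit}#subdirectory=python"'


command = triton_xpu_pip_command("b8c64f64c18d8cac598b3adb355c21e7439c21de\n")
```
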
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126513
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-06-07 06:25:47 +00:00
68cc63ae27 introduce skipIfNNModuleInlined and skip test_cpu_cuda_module_after_dynamo (#128023)
See https://github.com/pytorch/pytorch/issues/127636 for details. TL;DR: when inlining is enabled, we create a fake tensor while tracing in Dynamo and try to perform aten.add.Tensor between two tensors on different devices; without inlining we do not hit that operation during tracing.
```
Failed running call_function <built-in function add>(*(FakeTensor(..., size=(20, 20), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(20, 20))), **{}):
Unhandled FakeTensor Device Propagation for aten.add.Tensor, found two different devices cpu, cuda:0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128023
Approved by: https://github.com/anijain2305
ghstack dependencies: #127487, #127553
2024-06-07 06:00:33 +00:00
7e48d6a497 reset dynamo in test_do_not_skip_side_effects unit test loop to avoid dynamo cache limit hit (#127487)
fix https://github.com/pytorch/pytorch/issues/127483

When nn module inlining is enabled, all recompilations are counted against the same frame, hence we hit the cache limit for
test_do_not_skip_side_effects. Without inlining things are different: each time we hit a new Object Model we do not consider that a recompilation, as explained in https://github.com/pytorch/pytorch/issues/127483

For that test we do not really care about cache size, hence I reset dynamo in the main loop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127487
Approved by: https://github.com/anijain2305
2024-06-07 06:00:33 +00:00
dc8e3c2e90 [inductor] subproc parallel compile: initialize future before sending work to the pool (#128086)
Summary: I got reports of intermittent failures in CI and the logs show errors like this:
```
CRITICAL:concurrent.futures:Future 139789013754560 in unexpected state: FINISHED
```
I can't repro locally, but seems clear that we should initialize the future _before_ sending work to the subprocess pool since it could finish before we call set_running_or_notify_cancel()
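
The ordering problem can be reproduced with the standard library alone. This is a minimal sketch, not the inductor code: `racy_order` and `safe_order` are hypothetical names, and the work is simulated by calling `set_result` directly instead of using a subprocess pool.

```python
import logging
from concurrent.futures import Future

# Silence the CRITICAL log line that concurrent.futures emits in the racy case.
logging.getLogger("concurrent.futures").setLevel(logging.CRITICAL + 1)


def racy_order():
    """The work finishes before the future is marked as running."""
    future = Future()
    future.set_result("done")  # the worker finished first
    try:
        future.set_running_or_notify_cancel()
    except RuntimeError as exc:
        return str(exc)
    return None


def safe_order():
    """Mark the future as running before any work can complete."""
    future = Future()
    future.set_running_or_notify_cancel()
    future.set_result("done")
    return future.result()
```

Calling `racy_order()` returns the error text raised by `set_running_or_notify_cancel()`, while `safe_order()` returns the result normally.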

Differential Revision: [D58239829](https://our.internmc.facebook.com/intern/diff/D58239829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128086
Approved by: https://github.com/jansel
ghstack dependencies: #128037
2024-06-07 04:17:35 +00:00
6a2bf48cfa [inductor] subproc parallel-compile: start thread last in init (#128037)
Summary: Observed on an internal workload: the helper thread started and attempted to access member variables before they were initialized.
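
The general pattern, sketched with a toy class: initialize every attribute the thread body reads, and only then start the thread. `Worker` and its attributes are hypothetical and unrelated to the actual inductor class.

```python
import queue
import threading


class Worker:
    """Hypothetical helper-thread owner; names are illustrative only."""

    def __init__(self):
        # Initialize every attribute the thread body reads...
        self.jobs = queue.Queue()
        self.results = []
        self.done = threading.Event()
        # ...and only then start the thread, so it never sees a half-built object.
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            job = self.jobs.get()
            if job is None:  # sentinel: shut down
                self.done.set()
                return
            self.results.append(job * 2)

    def close(self):
        self.jobs.put(None)
        self.thread.join()


worker = Worker()
worker.jobs.put(21)
worker.close()
```

If `self.thread.start()` came first, `_run` could touch `self.jobs` before it exists and fail with an `AttributeError`.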

Differential Revision: [D58239827](https://our.internmc.facebook.com/intern/diff/D58239827)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128037
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-06-07 04:17:35 +00:00
e8e0bdf541 [inductor] parallel-compile: call triton_key() before forking (#127639)
Summary:
A user reported a severe slowdown on a workload when using parallel compile. The issue is that in some environments, the process affinity changes after forking such that all forked subprocesses use a single logical processor. Described here: https://github.com/pytorch/pytorch/issues/99625. That requires a separate fix, but during debugging we noticed that we can at least optimize the expensive call to triton_key() before forking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127639
Approved by: https://github.com/eellison, https://github.com/anijain2305
2024-06-07 04:12:57 +00:00
96806b1777 [pipelining][doc] Add frontend description and change tracer example (#128070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128070
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2024-06-07 04:09:36 +00:00
3df53c2a8f [dtensor] directly return local_tensor under no_grad (#128145)
as titled, skip the autograd function and directly return the
local_tensor if it's under no_grad context, this would avoid creating
views

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128145
Approved by: https://github.com/awgu
ghstack dependencies: #128112
2024-06-07 04:01:47 +00:00
747fc35ff5 [dynamo] Support if cond on UnspecializedNNModuleVariable and add inline tests (#128158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128158
Approved by: https://github.com/jansel
ghstack dependencies: #128001, #126578
2024-06-07 03:50:33 +00:00
5e5bbdb35e [DDP] Bucket handling: make first bucket size equal to bucket_cap_mb if it was set (#121640)
The first DDP bucket is always created with the size `dist._DEFAULT_FIRST_BUCKET_BYTES` (1 MiB) by default, regardless of `bucket_cap_mb`. The proposal is to use `bucket_cap_mb` as the one main bucket size if it was supplied by the user.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121640
Approved by: https://github.com/wanchaol
2024-06-07 03:33:33 +00:00
4d0ece8196 [pipelining] Consolidate chunk counting between stage and schedule (#127935)
We used to have two backward chunk id counting systems, one at schedule level, the other at stage level.
(Which makes safety dependent on the two advancing hand-in-hand.)

This PR consolidates the counting system to the schedule side only, which would pass `mb_index` to the following stage calls:
`forward_one_chunk`
`backward_one_chunk`
`get_bwd_send_ops`
...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127935
Approved by: https://github.com/H-Huang
2024-06-07 03:33:18 +00:00
476bfe6cce fix torch.compile with triton kernels under inference_mode (#124489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124489
Approved by: https://github.com/albanD
2024-06-07 03:29:37 +00:00
50155e825b [export] provide refine function for automatically accepting dynamic shapes suggested fixes (#127436)
Summary:
Part of the work helping export's automatic dynamic shapes / dynamic shapes refining based on suggested fixes.

Introduces a util function refine_dynamic_shapes_from_suggested_fixes() that takes the error message from a ConstraintViolationError containing suggested dynamic shapes fixes, along with the original dynamic shapes spec, and returns the new spec. Written so that the suggested fixes from export can be directly parsed and used.

Example usage for the automatic dynamic shapes workflow:
```
# export, fail, parse & refine suggested fixes, re-export
try:
    export(model, inps, dynamic_shapes=dynamic_shapes)
except torch._dynamo.exc.UserError as exc:
    new_shapes = refine_dynamic_shapes_from_suggested_fixes(exc.msg, dynamic_shapes)
    export(model, inps, dynamic_shapes=new_shapes)
```

For examples of behavior, see the added test and docstring. Will take suggestions for renaming the function to something else 😅

Test Plan: test_export tests

Differential Revision: D57409142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127436
Approved by: https://github.com/avikchaudhuri
2024-06-07 03:29:06 +00:00
65aa16f968 Revert "Default XLA to use swap_tensors path in nn.Module._apply (#126814)" (#128170)
https://github.com/pytorch/pytorch/issues/128165 :(

This reverts commit a7b1dd82ff3063894fc665ab0c424815231c10e6.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128170
Approved by: https://github.com/drisspg, https://github.com/albanD
2024-06-07 01:44:14 +00:00
f99409903c Documenting torch.distributions.utils.clamp_probs (#128136)
Fixes https://github.com/pytorch/pytorch/issues/127889

This PR adds docstring to the `torch.distributions.utils.clamp_probs` function.

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128136
Approved by: https://github.com/janeyx99, https://github.com/svekars, https://github.com/malfet
2024-06-07 00:49:41 +00:00
740cd0559f Filter non input symexprs from codecache guards (#128052)
Summary: Dynamo lifts all symexprs that appear in the inputs to top level which means that we do not need to look at guards that contain symexprs that do not appear in the inputs. Prune them.

Test Plan: added two new tests

Differential Revision: D58200476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128052
Approved by: https://github.com/ezyang, https://github.com/masnesral
2024-06-07 00:48:49 +00:00
117ab34891 Documenting the torch.utils.collect_env.get_pretty_env_info function (#128123)
Fixes #127888

This PR adds docstring to the `torch.utils.collect_env.get_pretty_env_info` function

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128123
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-06-07 00:43:18 +00:00
901226ae83 [inductor] simplify indexing (#127661)
This is a short term fix for: https://github.com/pytorch/pytorch/issues/124002

We found the cause of bad perf for the int8_unpack kernel is due to sub-optimal indexing. In this PR we introduce 2 indexing optimizations:
1. expand FloorDiv to the entire expression when feasible. E.g. `x1 * 1024 + x2 // 2`  will be transformed to `(x1 * 2048 + x2) // 2`. The motivation is that we have more chance to simplify loops for `x1 * 2048 + x2`.
2. merge ModularIndexing pairs: `ModularIndexing(ModularIndexing(x, 1, a), 1, b)` can be simplified to `ModularIndexing(x, 1, b)` if a is a multiple of b.

With both indexing optimizations, we improve int8_unpack perf by 1.54x (183us -> 119us).
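
Both rewrites can be checked numerically for non-negative integers. The sketch below is a brute-force sanity check, not the inductor implementation; it assumes `ModularIndexing(x, divisor, modulus)` means `(x // divisor) % modulus`, and the helper names are hypothetical.

```python
def modular_indexing(x, divisor, modulus):
    # Assumed meaning of ModularIndexing(x, divisor, modulus).
    return (x // divisor) % modulus


def check_floordiv_expansion(max_x1=8, max_x2=64):
    # x1 * 1024 + x2 // 2 == (x1 * 2048 + x2) // 2 for non-negative integers.
    return all(
        x1 * 1024 + x2 // 2 == (x1 * 2048 + x2) // 2
        for x1 in range(max_x1)
        for x2 in range(max_x2)
    )


def check_modular_merge(a, b, max_x=1000):
    # ModularIndexing(ModularIndexing(x, 1, a), 1, b) == ModularIndexing(x, 1, b)
    return all(
        modular_indexing(modular_indexing(x, 1, a), 1, b) == modular_indexing(x, 1, b)
        for x in range(max_x)
    )
```

For example, `check_modular_merge(12, 4)` holds because 12 is a multiple of 4, while `check_modular_merge(12, 5)` fails (at `x = 12`, the nested form gives 0 and the merged form gives 2), which is why the divisibility condition is needed.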

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127661
Approved by: https://github.com/jansel
2024-06-06 23:57:45 +00:00
7ede78f9f5 [dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578)
Tracing through `__init__`  is important because it initializes (calls STORE_ATTR) on members. By doing that, we kick in the mutation tracking for these objects. So, things like mutating `_modules` etc is tracked automatically.
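
That attribute assignments in `__init__` compile to `STORE_ATTR` can be seen with the standard `dis` module. `Toy` here is a plain Python class used only for illustration, not an `nn.Module`, and this sketch says nothing about how Dynamo handles the instruction internally.

```python
import dis


class Toy:
    def __init__(self):
        self.weight = 1.0
        self.children = {}


# Collect the opcode names of the compiled __init__ body.
opnames = [ins.opname for ins in dis.get_instructions(Toy.__init__)]
store_attr_count = opnames.count("STORE_ATTR")
```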

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126578
Approved by: https://github.com/jansel
ghstack dependencies: #128001
2024-06-06 23:05:49 +00:00
e5b3387166 [dynamo] Bugfix for nn parameter construction (#128001)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128001
Approved by: https://github.com/jansel
2024-06-06 23:05:49 +00:00
6dfdce92ba Fixed typos in the complex numbers portion of the autograd docs (#127948)
This PR fixes several typos in the complex numbers section of the docs for autograd. Only documentation was altered.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127948
Approved by: https://github.com/soulitzer
2024-06-06 22:47:04 +00:00
56a3d276fe Handle custom op during TorchScript to ExportedProgram conversion (#127580)
#### Description
Handle custom ops during TorchScript to ExportedProgram conversion
```python
torch.library.define(
    "mylib::foo",
    "(Tensor x) -> Tensor",
    lib=lib,
)

# PyTorch custom op implementation
@torch.library.impl(
    "mylib::foo",
    "CompositeExplicitAutograd",
    lib=lib,
)
def foo_impl(x):
    return x + x

# Meta function of the custom op.
@torch.library.impl_abstract(
    "mylib::foo",
    lib=lib,
)
def foo_meta(x):
    return x + x

class M(torch.nn.Module):
    def forward(self, x):
        return torch.ops.mylib.foo(x)
```

#### Test Plan
* Add a test case where custom op is called and converted. `pytest test/export/test_converter.py -s -k test_ts2ep_converter_custom_op`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127580
Approved by: https://github.com/angelayi
2024-06-06 22:06:51 +00:00
80fa2778ed Update types for verbose in lr_scheduler (#127943)
I'm currently locked into jsonargparse version 4.19.0, and it complains when used in combination with LightningCLI (v2.0.8). This is because it cares about the types declared in google style docstrings. This causes a problem when it tries to parse how it should cast arguments to construct an instance of an LRScheduler class because the docstrings declare the "verbose" parameter as a bool, but the defaults recently changed to a string "deprecated". This means the type should really be `bool | str`.

This PR adds a `| str` to the docstring type in each learning rate scheduler class. This will prevent jsonargparse from complaining.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127943
Approved by: https://github.com/janeyx99
2024-06-06 21:59:22 +00:00
0a761f0627 [RFC] Provide optional switches to _dump_nccl_trace (#127651)
Summary:
Data from PyTorch distributed is mostly useful during initial stages of model development.
Provide options to reduce data sent/dumped.
`_dump_nccl_trace` takes 3 optional switches. By default it returns everything, as before.
- `includeCollectives`: option to also include collectives: Default is True.
- `includeStacktraces`: option to include stack traces in collectives. Default is True.
- `onlyActive`: option to only send active collective work - i.e. not completed. Default is
    False (i.e. send everything)
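
The intended effect of the three switches can be modeled on toy data. This is a sketch under loud assumptions: `filter_trace`, the snake_case parameter names, and the record layout (`op`, `state`, `frames`) are all hypothetical and do not reflect the real dump format or the real `_dump_nccl_trace` signature.

```python
def filter_trace(entries, include_collectives=True, include_stacktraces=True, only_active=False):
    """Toy model of the three switches; `entries` is a hypothetical list of dicts."""
    if not include_collectives:
        return []
    selected = []
    for entry in entries:
        # only_active drops work that has already completed.
        if only_active and entry["state"] == "completed":
            continue
        entry = dict(entry)  # copy so the input is left untouched
        if not include_stacktraces:
            entry.pop("frames", None)
        selected.append(entry)
    return selected


entries = [
    {"op": "all_reduce", "state": "completed", "frames": ["train.py:10"]},
    {"op": "all_gather", "state": "started", "frames": ["train.py:12"]},
]
```

With the defaults the function returns every entry unchanged, matching the "default as before" behavior described above.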

Test Plan:
Unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127651
Approved by: https://github.com/wconstab
2024-06-06 21:59:09 +00:00
54fe2d0e89 [cuDNN][quantization] skip qlinear test in cuDNN v9.1.0 (#128166)
#120006 only very recently unskipped this test 3 days ago so we don't consider it a blocker for cuDNNv9 for now

CC @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128166
Approved by: https://github.com/atalman, https://github.com/nWEIdia
2024-06-06 21:43:29 +00:00
04272a0e12 Add docstring for the torch.ao.quantization.utils.get_combined_dict function (#128127)
Fixes: #127906

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128127
Approved by: https://github.com/jerryzh168
2024-06-06 21:22:09 +00:00
baaa914bf7 [small] test clean up (#128079)
remove unnecessary line: https://github.com/pytorch/pytorch/issues/123733
add main so test can be run `python ...`: https://github.com/pytorch/pytorch/issues/124906

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128079
Approved by: https://github.com/awgu
2024-06-06 21:21:40 +00:00
9554300436 [inductor][codegen] Codegen constexpr globals and constexpr annotated globals correctly. (#126195)
[Triton #3762](https://github.com/triton-lang/triton/pull/3762)
disallows access to globals which are not `tl.constexpr`

Triton has always treated captured globals this way, but they now
require it be explicit in user code.

Updated codegen to make sure these variables are defined before writing
the kernel source when compiling a user defined triton kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126195
Approved by: https://github.com/alexbaden, https://github.com/bertmaher
2024-06-06 20:50:11 +00:00
2184cdd291 Added memory budget to partitioner (#126320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126320
Approved by: https://github.com/shunting314
2024-06-06 20:32:29 +00:00
7e059b3c95 Add a call to validate docker images after build step is complete (#127768)
Adds validation to docker images. As discussed here: https://github.com/pytorch/pytorch/issues/125879
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127768
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2024-06-06 20:25:39 +00:00
e8670f6aea [Dynamo][TVM] Support macOS and Linux/aarch64 platforms (#128124)
Fixes #128122
With this fix, I've confirmed that the repro works on the platforms below.
- macOS 14.5 (arm64)
- Ubuntu 20.04.6 LTS (GNU/Linux 5.10.120-tegra aarch64)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128124
Approved by: https://github.com/malfet
2024-06-06 19:47:11 +00:00
de4f8b9946 [BE]: Update cudnn to 9.1.0.70 (#123475)
cuDNN has managed to upload cu11 and cu12 wheels for ~~9.0.0.312~~ 9.1.0.70, so trying this out...

CC @Skylion007 @malfet

Co-authored-by: Wei Wang <weiwan@nvidia.com>
Co-authored-by: atalman <atalman@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123475
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/nWEIdia, https://github.com/atalman
2024-06-06 18:45:22 +00:00
fba21edf5b [CI] Ensure inductor/test_cpu_cpp_wrapper is actually run in inductor_cpp_wrapper_abi_compatible (#126717)
`inductor/test_cpu_cpp_wrapper` is not actually being run in `inductor_cpp_wrapper_abi_compatible` test config

The cpu device type gets removed in d28868c7e8/torch/testing/_internal/common_device_type.py (L733)

so d28868c7e8/test/inductor/test_cpu_cpp_wrapper.py (L396) returns false.

Feel free to make a PR with a different way to do this (a better RUN_CPU check?)

Add a skip for a failing test.  I am not equipped to fix it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126717
Approved by: https://github.com/ZainRizvi
2024-06-06 18:23:52 +00:00
936225d7b2 [mergebot] Fix pending unstable jobs being viewed as failed (#128080)
https://github.com/pytorch/pytorch/pull/128038#issuecomment-2150802030

In the above, pending unstable jobs get put into the ok_failed_checks list, and because there are a lot of unstable jobs, it exceeds the threshold and merge fails.

I don't think unstable jobs should be considered in the ok failed checks threshold, only flaky and broken trunk jobs should be considered there.

Change looks big, but main thing is that unstable jobs don't get included in the check for how many flaky failures there are.  The other changes are mostly renames so things are clearer
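
The counting rule described above can be sketched as follows. This is a toy model, not the mergebot code: `can_merge`, the category strings, and the returned tuple are hypothetical. Only the idea that flaky and broken trunk jobs count toward the threshold while unstable jobs do not comes from the description.

```python
def can_merge(checks, ok_failed_checks_threshold):
    """Toy merge gate; the category names and return shape are hypothetical."""
    # Only flaky and broken-trunk failures count toward the threshold.
    counted = [c for c in checks if c["category"] in ("flaky", "broken_trunk")]
    # Unstable jobs are tracked separately and never count toward it.
    unstable = [c for c in checks if c["category"] == "unstable"]
    return len(counted) <= ok_failed_checks_threshold, len(counted), len(unstable)


# One flaky job plus many pending unstable jobs should not block the merge.
checks = [{"name": "job-flaky", "category": "flaky"}]
checks += [{"name": f"job-unstable-{i}", "category": "unstable"} for i in range(20)]
result = can_merge(checks, ok_failed_checks_threshold=10)
```
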
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128080
Approved by: https://github.com/huydhn
2024-06-06 18:22:20 +00:00
32fb68960e [FSDP2] Added experimental warning to unshard API (#128138)
There is still ongoing discussion on how this API should work.

Current approach:
- The pre-all-gather ops run in the default stream and the all-gather is called from the default stream with `async_op=True`.
- Pros:
    - The all-gather input and output tensors are allocated in the default stream, so there is no increased memory fragmentation across stream pools.
    - There is no need for additional CUDA synchronization. The API is self-contained.
- Cons:
    - The pre-all-gather ops (e.g. cast from fp32 -> bf16 and all-gather copy-in device copies) cannot overlap with other default stream compute. The biggest concern here is CPU offloading, where the H2D copies cannot overlap.

Alternative approach:
- Follow the default implicit prefetching approach, where the pre-all-gather ops and all-gather run in separate streams.
- Pros:
    - The pre-all-gather ops can overlap with default stream compute.
- Cons:
    - We require an API that should be called after the last optimizer step (namely, last op that modified sharded parameters) and before the first `unshard` call that has the all-gather streams wait for the default stream. The API is no longer self-contained and now has a complementary API.
    - The all-gather input and output tensors are allocated in separate streams (not the default stream), so there can be increased memory fragmentation across pools.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128138
Approved by: https://github.com/wanchaol
ghstack dependencies: #128100
2024-06-06 18:18:42 +00:00
78a6b0c479 update test_reformer_train test to handle nn module inlining (#127467)
The number of call nodes increases due to inlining.
before inlining:
```
 class GraphModule(torch.nn.Module):
        def forward(self, function_ctx, cat: "f32[1, s0, 512]"):
            # No stacktrace found for following nodes
            _set_grad_enabled = torch._C._set_grad_enabled(False)

            # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:283 in backward, code: grad_attn_output, grad_hidden_states = torch.chunk(
            chunk = torch.chunk(cat, 2, dim = -1);  cat = None
            getitem: "f32[1, s0, 256]" = chunk[0]
            getitem_1: "f32[1, s0, 256]" = chunk[1];  chunk = None

            # No stacktrace found for following nodes
            _set_grad_enabled_1 = torch._C._set_grad_enabled(True)
            return (getitem_1, None)
```

after inlining:
```
class GraphModule(torch.nn.Module):
    def forward(self, s0: "Sym(s0)", L_hidden_states_: "f32[1, s0, 256]", L_self_layers_0_weight: "f32[256, 256]", L_self_layers_0_bias: "f32[256]", L_self_layer_norm_weight: "f32[512]", L_self_layer_norm_bias: "f32[512]", L_self_layer_norm_normalized_shape_0_: "Sym(512)"):
        l_hidden_states_ = L_hidden_states_
        l_self_layers_0_weight = L_self_layers_0_weight
        l_self_layers_0_bias = L_self_layers_0_bias
        l_self_layer_norm_weight = L_self_layer_norm_weight
        l_self_layer_norm_bias = L_self_layer_norm_bias
        l_self_layer_norm_normalized_shape_0_ = L_self_layer_norm_normalized_shape_0_

        # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:332 in forward, code: hidden_states = torch.cat([hidden_states, hidden_states], dim=-1)
        hidden_states: "f32[1, s0, 512]" = torch.cat([l_hidden_states_, l_hidden_states_], dim = -1);  l_hidden_states_ = None

        # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:333 in forward, code: hidden_states = _ReversibleFunction.apply(
        function_ctx = torch.autograd.function.FunctionCtx()

        # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:258 in forward, code: hidden_states, attn_output = torch.chunk(hidden_states, 2, dim=-1)
        chunk = torch.chunk(hidden_states, 2, dim = -1);  hidden_states = None
        hidden_states_1: "f32[1, s0, 256]" = chunk[0]
        attn_output: "f32[1, s0, 256]" = chunk[1];  chunk = None

        # File: /data/users/lsakka/pytorch/pytorch/torch/nn/modules/linear.py:116 in forward, code: return F.linear(input, self.weight, self.bias)
        attn_output_1: "f32[1, s0, 256]" = torch._C._nn.linear(attn_output, l_self_layers_0_weight, l_self_layers_0_bias);  attn_output = l_self_layers_0_weight = l_self_layers_0_bias = None

        # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:272 in forward, code: ctx.save_for_backward(attn_output.detach(), hidden_states.detach())
        detach: "f32[1, s0, 256]" = attn_output_1.detach()
        detach_1: "f32[1, s0, 256]" = hidden_states_1.detach()

        # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:279 in forward, code: return torch.cat([attn_output, hidden_states], dim=-1)
        hidden_states_2: "f32[1, s0, 512]" = torch.cat([attn_output_1, hidden_states_1], dim = -1);  attn_output_1 = hidden_states_1 = None

        # File: /data/users/lsakka/pytorch/pytorch/torch/nn/modules/normalization.py:201 in forward, code: return F.layer_norm(
        hidden_states_3: "f32[1, s0, 512]" = torch.nn.functional.layer_norm(hidden_states_2, (l_self_layer_norm_normalized_shape_0_,), l_self_layer_norm_weight, l_self_layer_norm_bias, 1e-12);  hidden_states_2 = l_self_layer_norm_normalized_shape_0_ = l_self_layer_norm_weight = l_self_layer_norm_bias = None

        # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:352 in forward, code: hidden_states = torch.nn.functional.dropout(
        hidden_states_4: "f32[1, s0, 512]" = torch.nn.functional.dropout(hidden_states_3, p = 0.5, training = True);  hidden_states_3 = None
        return (hidden_states_4,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127467
Approved by: https://github.com/anijain2305
ghstack dependencies: #126444, #127146, #127424, #127440
2024-06-06 17:56:36 +00:00
304956e1fb Switch to torch.float16 on XPU AMP mode (#127741)
# Motivation
Previously, the default dtype for AMP on XPU was aligned with the CPU. To align with other GPUs, we intend to change the default dtype for AMP to `torch.float16`. This change aims to save users the effort of converting models from `torch.float16` to `torch.bfloat16`, or vice versa when they want to run the model on different types of GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127741
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-06-06 17:40:13 +00:00
1d0c1087dd Allow overriding per-dim group options via _MeshEnv.set_dim_group_options (#126599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126599
Approved by: https://github.com/wanchaol
ghstack dependencies: #126598
2024-06-06 17:18:12 +00:00
e9c5144cbc Fix bug in update_process_group DDP API (#128092)
Fix bug in `_update_process_group` DDP API where we didn't correctly reset `local_used_map_` and a few other variables. This resulted in errors like `Encountered gradient which is undefined, but still allreduced by...`

Added a unit test as well that reproduced the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128092
Approved by: https://github.com/awgu, https://github.com/fegin
2024-06-06 17:10:42 +00:00
2ffdf556ea Add back API that some people rely on in torch.cuda.amp.grad_scaler namespace (#128056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128056
Approved by: https://github.com/kit1980, https://github.com/eqy
2024-06-06 17:02:32 +00:00
2d47385f0f [BE]: Enable ruff TCH rules and autofixes for better imports (#127688)
Automated fixes to put imports that are only used in type hints into TYPE_CHECKING imports. This also enables the RUFF TCH rules, which will automatically apply autofixes to move imports in and out of TYPE_CHECKING blocks as needed in the future. This will make the initial PyTorch import faster and will reduce cyclic dependencies.
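
For illustration, the pattern these rules enforce looks roughly like the sketch below. The `double` function and the choice of `decimal.Decimal` are made up for this example and do not come from the PyTorch code base.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by static type checkers only; skipped at runtime, so the
    # import cost (and any import cycle it would create) is avoided.
    from decimal import Decimal


def double(x: "Decimal") -> "Decimal":
    # The quoted annotation is never evaluated at runtime, so the name
    # Decimal does not need to exist when this module is imported.
    return x * 2
```

The trade-off is that the name is unavailable at runtime, so this only works for imports used purely in annotations.
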

Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127688
Approved by: https://github.com/XuehaiPan, https://github.com/ezyang, https://github.com/malfet
2024-06-06 16:55:58 +00:00
4f87f47ea1 [dtensor] reuse DTensorSpec as much as possible (#128112)
As titled: given that our DTensorSpec is immutable, we can always reuse
the spec if the input/output have the same tensor metadata. This helps in two ways:
1. We don't need to re-calculate the hash every time we produce a
   DTensorSpec, which reduces runtime operator overhead.
2. It reduces the DTensor construction overhead.

A local benchmark of clip_grad_norm with 800 parameters shows that for
foreach_norm the CPU overhead drops from 11ms to 7.8ms (around a 30% improvement).
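
The reuse idea can be sketched in plain Python as follows. `Spec` and `output_spec` are hypothetical stand-ins, not the real `DTensorSpec` API; the sketch only assumes that the spec is immutable and that "same tensor metadata" means equal shape and dtype.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Spec:
    # Hypothetical, simplified stand-in for an immutable sharding spec.
    placements: Tuple[str, ...]
    shape: Tuple[int, ...]
    dtype: str


def output_spec(input_spec, out_shape, out_dtype):
    # If the output metadata matches the input, hand back the very same
    # object instead of constructing (and later re-hashing) a new one.
    if input_spec.shape == tuple(out_shape) and input_spec.dtype == out_dtype:
        return input_spec
    return Spec(input_spec.placements, tuple(out_shape), out_dtype)


spec = Spec(("Shard(0)",), (4, 4), "float32")
same = output_spec(spec, (4, 4), "float32")
print(same is spec)
```

Returning the identical object is safe only because the spec cannot be mutated afterwards.
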

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128112
Approved by: https://github.com/awgu
2024-06-06 16:55:50 +00:00
f0dd11df55 Make ValueRange repr less chatty by default (#128043)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128043
Approved by: https://github.com/lezcano
2024-06-06 16:42:48 +00:00
eqy
0de6d2427f Bump tolerances for inductor/test_efficient_conv_bn_eval.py::EfficientConvBNEvalCudaTests::test_basic_cuda attempt 2 (#128048)
CC @nWEIdia @huydhn @Skylion007

Same thing but also bump backward tolerances...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128048
Approved by: https://github.com/Skylion007
2024-06-06 16:17:43 +00:00
a5b86a1ec0 Revert "FP8 rowwise scaling (#125204)"
This reverts commit 5dc912822913b3d90f4938891c7eca722a057cf1.

Reverted https://github.com/pytorch/pytorch/pull/125204 on behalf of https://github.com/atalman due to Sorry need to revert this failing, on internal CI. I suggest to reimport this and try to land internally resolving all issues ([comment](https://github.com/pytorch/pytorch/pull/125204#issuecomment-2152905513))
2024-06-06 16:12:34 +00:00
a5ba9b2858 Fix for addcdiv contiguous problem (#124442)
Fixes issue number #118115
Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124442
Approved by: https://github.com/kulinseth
2024-06-06 16:09:18 +00:00
c58d3af3b4 Revert "Add OpInfo entry for alias_copy (#127232)"
This reverts commit 457df212e1c6e1aa4f1eb2ad6ee292052d7c07e1.

Reverted https://github.com/pytorch/pytorch/pull/127232 on behalf of https://github.com/clee2000 due to broke [onnx](https://github.com/pytorch/pytorch/actions/runs/9397057801/job/25880181144) and [mps](https://github.com/pytorch/pytorch/actions/runs/9397057805/job/25879818705) tests, [hud link](457df212e1) , base is 15 days old, the onnx test xfailed on the pr but the xfail was removed so if you rebase itll surface, mps build failed so no mps tests were run on the pr ([comment](https://github.com/pytorch/pytorch/pull/127232#issuecomment-2152848758))
2024-06-06 15:44:47 +00:00
9d849d4312 Disable py3.12 nightly wheel builds for ROCm (#127968)
Triton commit bump PR https://github.com/pytorch/pytorch/pull/125396 was reverted due to a missing llnl-hatchet dependency for triton. The workaround is to disable py3.12 binary build jobs for ROCm on PyTorch CI until llnl-hatchet publishes py3.12 wheels on [PyPI](https://pypi.org/project/llnl-hatchet/#files)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127968
Approved by: https://github.com/atalman, https://github.com/pruthvistony
2024-06-06 15:17:35 +00:00
48a54146e7 Revert "[dynamo] Support ndarray.dtype attribute access (#124490)"
This reverts commit 4adee71155bec4e419bac32be2cbc1763bc6c98f.

Reverted https://github.com/pytorch/pytorch/pull/124490 on behalf of https://github.com/atalman due to Breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/124490#issuecomment-2152664749))
2024-06-06 14:21:29 +00:00
f08fd8e9e3 Remove redundant device guard in Resize.h (#126498)
In https://github.com/pytorch/pytorch/pull/113386 a device guard was [inserted](https://github.com/pytorch/pytorch/pull/113386/files#diff-2691af3a999b3a8f4a0f635aabcd8edf0ffeda501edfa9366648e8a89de12a90R30).

The newly inserted device guard has a clearer and more confined scope.
And it's hard to tell the exact purpose and scope of the  [old device guard](78ffe49a3f/aten/src/ATen/native/cuda/Resize.h (L41)).

Removing the guard has a negligible positive performance impact and makes the code more understandable.

Thanks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126498
Approved by: https://github.com/eqy, https://github.com/lezcano
2024-06-06 13:01:42 +00:00
c97e3ebb96 Fix wrongly exposed variables in torch/__init__.py (#127795)
<img width="609" alt="image" src="https://github.com/pytorch/pytorch/assets/16078332/964c6707-1856-4c2c-8cd8-ce1d96d38d36">

This PR removes temporary variables in `torch/__init__.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127795
Approved by: https://github.com/albanD
2024-06-06 08:31:41 +00:00
457df212e1 Add OpInfo entry for alias_copy (#127232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127232
Approved by: https://github.com/lezcano
2024-06-06 07:46:26 +00:00
f5328542b5 Allow multiple cudagraph recordings per compiled graph (#126822)
### Introduction/Problem

Today when dynamo traces a builtin nn module (nn.Linear for example) it will specially handle parameters of that module by storing them as constant attributes of the graph. This requires that dynamo guard on the ID of the NNModule because if the instance of the module changes, we need to retrace and recollect the new parameters as attributes of the graph. This creates a 1:1 compiled graph to cudagraph relationship.

With hierarchical compilation, dynamo will treat builtin nn modules like any other code. This reduces complexity and, critically, if there are multiple identical layers in a model, we only need to compile one of those layers once, and reuse the same compiled artifact for each layer. This introduces a problem for the current approach to parameter handling. Since the parameters could now possibly change across calls to the compiled artifact, these need to be inputs to the graph instead of attributes. This introduces a problem for cudagraphs - previously cudagraphs was guaranteed that the parameters of builtin NN Modules would be constant across calls, but now since the compiled artifact needs to be agnostic to the actual instance of the NN module being used these parameter memory locations may vary. Previously cudagraphs simply copied varying inputs to cudagraph owned memory, but since the parameters are quite large, this is catastrophic for performance.

### Solution
To avoid this performance cliff, this PR allows cudagraphs to re-record a new cudagraph if only parameters change. Metadata about which arguments are parameters is propagated from AOT Autograd to compile_fx, and these indices are passed to cudagraphs. If these memory locations change, a new graph is recorded, whereas previously this would have been an error (because it should not have happened). This enables a 1:many compiled graph to cudagraph relationship. Across similar modules we will re-record cudagraphs and dispatch the correct graph if parameter pointers match when the cudagraph is executed.
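
The 1:many relationship can be pictured with a small pure-Python model. Everything here is hypothetical (`MultiRecordingRunner`, `record_fn`, integer "pointers") and stands in for the real cudagraph machinery, which this sketch does not touch.

```python
class MultiRecordingRunner:
    """Toy model: one compiled graph, many recordings keyed by parameter pointers."""

    def __init__(self, record_fn):
        self._record_fn = record_fn  # builds a replayable callable for a key
        self._recordings = {}
        self.num_recordings = 0

    def run(self, param_ptrs, inputs):
        key = tuple(param_ptrs)
        replay = self._recordings.get(key)
        if replay is None:
            # Parameter addresses changed: record a new graph instead of
            # copying the (large) parameters into graph-owned memory.
            replay = self._record_fn(key)
            self._recordings[key] = replay
            self.num_recordings += 1
        return replay(inputs)


def fake_record(key):
    # Stand-in for recording; the "graph" just remembers its key.
    return lambda xs: (key, sum(xs))


runner = MultiRecordingRunner(fake_record)
runner.run([100, 200], [1, 2])   # first layer: records
runner.run([100, 200], [3, 4])   # same pointers: replays
runner.run([300, 400], [1, 2])   # identical layer, new pointers: records again
print(runner.num_recordings)
```
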

### Next steps (if needed)
It is theoretically possible that a user passes Parameters that change frequently as inputs to model code - if this is a common issue this design allows for dynamo to pass metadata indicating which parameters were created in a builtin NN Module context to only permit those parameters to have the multi-cudagraph behavior, but this PR does not implement this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126822
Approved by: https://github.com/eellison
ghstack dependencies: #126820, #126821
2024-06-06 06:39:59 +00:00
5a3bea1e88 Remove unused arg to GraphLowering (#126821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126821
Approved by: https://github.com/eellison
ghstack dependencies: #126820
2024-06-06 06:39:59 +00:00
70ba6f0ab6 Collect static parameter metadata in aot (#126820)
Collect the indices of the static parameters to pass down to cudagraphs in order to re-record if necessary.
This location was chosen in order to allow us to restrict this (if needed) in the future by setting metadata in dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126820
Approved by: https://github.com/bdhirsh
2024-06-06 06:39:50 +00:00
c8ff1cd387 [FSDP2] Changed test_register_forward_method to use multiprocess test (#128100)
The test seems to be flaky due to multi-threaded process group. This PR converts the test to use normal multi-process `ProcessGroupNCCL` to fix the flakiness.

This PR closes https://github.com/pytorch/pytorch/issues/126851.

Interestingly, the original MTPG version passes for me on devgpu. Either way, the new version also passes on devgpu, so we can see in CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128100
Approved by: https://github.com/weifengpy
2024-06-06 06:34:02 +00:00
638f543ac2 Enable single nadam test (#128087)
https://github.com/pytorch/pytorch/issues/117150 has been fixed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128087
Approved by: https://github.com/xmfan
2024-06-06 06:25:00 +00:00
cd42b95047 Handle aten::__contains__ during TorchScript to ExportedProgram conversion (#127544)
#### Description
Add support for converting `prim::__contains__` from TorchScript IR to ExportedProgram, e.g.,
```python
class MIn(torch.nn.Module):
    def forward(self, x: torch.Tensor):
        return x.dtype in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```
#### Test Plan
* Add test cases to cover the contains IR resulting from both primitive types and Tensor. `pytest test/export/test_converter.py -s -k test_ts2ep_converter_contains`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127544
Approved by: https://github.com/angelayi
2024-06-06 05:00:13 +00:00
cyy
68eb771265 [2/N] Remove unused test functions (#128005)
Following #127881, this PR continues to remove unused test functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128005
Approved by: https://github.com/ezyang
2024-06-06 03:41:32 +00:00
2f7cfecd86 Complete revamp of float/promotion sympy handling (#126905)
At a high level, the idea behind this PR is:

* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.

The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:

* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver.
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows for us to eventually generate accurate code for Python semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53 beyond what first coercing the integer to floats and then doing true division.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)

In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations.  Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.
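
To make the difference from the builtins concrete, here is a minimal pure-Python sketch of the promotion rule. `promoting_min` is a hypothetical helper written for this note, not the implementation of `torch.sym_min`.

```python
def promoting_min(a, b):
    # Pick the smaller value as usual...
    result = min(a, b)
    # ...but if either argument is a float, the result type is float,
    # regardless of which argument happened to win at runtime.
    if isinstance(a, float) or isinstance(b, float):
        return float(result)
    return result


print(type(min(1, 2.0)).__name__)            # builtin: type depends on the winner
print(type(promoting_min(1, 2.0)).__name__)  # promoted: type depends on the inputs
```

Because the result type depends only on the argument types, it can be determined without knowing the runtime values.
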

We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:

* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`
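
The precision gap that motivates `int_truediv` shows up in plain Python once the numerator exceeds 2**53. The sketch below uses only built-in arithmetic. The function names are illustrative and are not Inductor op handlers.

```python
from fractions import Fraction


def python_int_truediv(a, b):
    # Python divides the exact integers and rounds once at the end.
    return a / b


def cast_then_truediv(a, b):
    # Casting first rounds the numerator to a float before dividing.
    return float(a) / float(b)


def abs_error(a, b, approx):
    # Exact distance between the true quotient and a float result.
    return abs(Fraction(a, b) - Fraction(approx))


a = 7 * 2**53 + 5  # not exactly representable as a float
b = 7
precise = python_int_truediv(a, b)
lossy = cast_then_truediv(a, b)
print(precise == lossy)
print(abs_error(a, b, precise) < abs_error(a, b, lossy))
```
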

These changes have consequences. First, we need to make some administrative changes:

* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
  * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
  * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency to fix tests here
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality are currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.

In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:

* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type

The new asserts uncovered necessary bug fixes:

* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1

Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
2024-06-06 02:29:45 +00:00
c1a43a69e4 [NestedTensor] Add error checks for unbind operator coverage when ragged_idx != 1 (#128058)
Summary:
Add the following error checks for the `unbind` operator on `NestedTensor`s when `ragged_idx != 1`:

- The current implementation allows the creation of `NestedTensor` instances from the class definition with an `offsets` tensor that applies to a dimension other than the jagged dimension. This diff ensures that `unbind` fails when the `offsets` exceed the length of the jagged dimension.

Test Plan:
Added the following unit tests:

`test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu` verifies that `unbind` fails when there is a mismatch between the offsets and the jagged dimension, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

Reviewed By: davidberard98

Differential Revision: D57989082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128058
Approved by: https://github.com/davidberard98
2024-06-06 01:56:12 +00:00
9795c4224b Revert "[DDP] Bucket handling: make first bucket size equal to bucket_cap_mb if it was set (#121640)"
This reverts commit e98662bed99df57b7d79f9fc1cbe670afc303235.

Reverted https://github.com/pytorch/pytorch/pull/121640 on behalf of https://github.com/clee2000 due to Sorry but it looks like you're failing  `distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_bucketing_coalesced_op `. The build failed so the tests didn't run, consider rebasing, there have been a couple of PRs lately related to cudnn so you probably are either based on a bad or too old of a commit e98662bed9 https://github.com/pytorch/pytorch/actions/runs/9392731942/job/25868060913 ([comment](https://github.com/pytorch/pytorch/pull/121640#issuecomment-2151258585))
2024-06-06 01:50:18 +00:00
sdp
b4a0161449 Build SYCL kernels for ATen XPU ops on Native Windows (take 2) (#127390)
Original PR https://github.com/pytorch/pytorch/pull/126725 is closed due to bad rebase.

-------
As proposed in https://github.com/pytorch/pytorch/issues/126719, we are enabling PyTorch XPU on Native Windows on Intel GPU.

This PR  enables XPU build on Windows as the first step of #126719:

- Enable `USE_XPU` build on Windows using MSVC as host compiler. The use of MSVC as host compiler seamlessly aligns with the existing PyTorch build on Windows.
- Build oneDNN GPU library on Windows.

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127390
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/ezyang
2024-06-06 01:41:06 +00:00
6adcf21b2b Documenting the torch.cuda.nccl.version function (#128022)
Fixes #127892

This PR adds docstring to the torch.cuda.nccl.version function

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128022
Approved by: https://github.com/malfet
2024-06-06 01:13:07 +00:00
bf2c05352e Make length == stop size oblivious too (#128050)
This doesn't do anything right now (need some other PRs to activate)
but since it edits a header file it would be better to land this
earlier.

Context: https://github.com/pytorch/pytorch/pull/127693

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128050
Approved by: https://github.com/Skylion007, https://github.com/lezcano
2024-06-06 01:09:37 +00:00
80d34217c6 Typo fixes: et al. (#127811)
"et al." is short for _et alia_ and should be abbreviated with a period on the second word. Noticed this typo when reading through the SGD docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127811
Approved by: https://github.com/janeyx99
2024-06-06 01:03:25 +00:00
d3ad84c38f Use pexpr, not texpr in Triton launch codegen (#128038)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128038
Approved by: https://github.com/Skylion007
2024-06-06 00:45:59 +00:00
8bcebc8dae Add runtime dependency on setuptools for cpp_extensions (#127921)
As per title, since this was removed from the builtin python binary in 3.12 and we use it in `torch.utils.cpp_extension.*`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127921
Approved by: https://github.com/Skylion007
2024-06-05 23:59:38 +00:00
cyy
2fd75667b4 [Caffe2]Remove Caffe2 scripts and benchmarks (#126747)
Due to removal of Caffe2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126747
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-06-05 23:46:31 +00:00
e98662bed9 [DDP] Bucket handling: make first bucket size equal to bucket_cap_mb if it was set (#121640)
The first DDP bucket is always created with a size of `dist._DEFAULT_FIRST_BUCKET_BYTES` (1 MiB) by default, regardless of `bucket_cap_mb`. The proposal is to use `bucket_cap_mb` as the single main bucket size if it was supplied by the user.
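
A simplified greedy model of the effect, assuming hypothetical names (`assign_buckets`, `first_bucket_bytes`) and ignoring details of the real DDP reducer such as parameter ordering: a bucket is closed once its accumulated size reaches the current limit, and only the first bucket uses the separate first-bucket limit.

```python
MIB = 1024 * 1024
DEFAULT_FIRST_BUCKET_BYTES = MIB  # mirrors the 1 MiB default named above


def assign_buckets(param_bytes, bucket_cap_mb, first_bucket_bytes=DEFAULT_FIRST_BUCKET_BYTES):
    cap = bucket_cap_mb * MIB
    buckets, current, used = [], [], 0
    limit = first_bucket_bytes  # applies to the first bucket only
    for idx, nbytes in enumerate(param_bytes):
        current.append(idx)
        used += nbytes
        if used >= limit:
            buckets.append(current)
            current, used = [], 0
            limit = cap  # every later bucket uses the main cap
    if current:
        buckets.append(current)
    return buckets


params = [512 * 1024] * 8  # eight 0.5 MiB parameters
before = assign_buckets(params, bucket_cap_mb=2)
after = assign_buckets(params, bucket_cap_mb=2, first_bucket_bytes=2 * MIB)
print(before)
print(after)
```

In this toy model the proposal removes the small leading bucket, so the same parameters fit in fewer, evenly sized buckets.
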

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121640
Approved by: https://github.com/wanchaol
2024-06-05 23:44:54 +00:00
ffaea656b5 WorkerServer: add support for binding to TCP (#127986)
This adds support for the WorkerServer binding to TCP as well as the existing unix socket support.

```py
server = _WorkerServer("", 1234)
```
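
For readers unfamiliar with the two binding modes, the standard-library sketch below shows the general idea of choosing between a unix socket and TCP. It is not the `_WorkerServer` implementation; `bind_listener` and the `port=-1` sentinel are assumptions made for this sketch.

```python
import socket


def bind_listener(host_or_file, port=-1):
    if port == -1:
        # No port given: treat the first argument as a unix socket path.
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.bind(host_or_file)
    else:
        # Port given: bind a TCP socket; an empty host means all interfaces.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((host_or_file, port))
    sock.listen()
    return sock


# Port 0 asks the OS for any free port on the loopback interface.
listener = bind_listener("127.0.0.1", 0)
host, bound_port = listener.getsockname()
listener.close()
```
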

Test plan:

Added unit test

```
python test/distributed/elastic/test_control_plane.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127986
Approved by: https://github.com/c-p-i-o
2024-06-05 22:56:32 +00:00
a7c596870d [BE][Eazy] remove torch.torch.xxx usages (#127800)
NB: `torch` is exposed in `torch/__init__.py`. So there can be `torch.torch.torch.xxx`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127800
Approved by: https://github.com/peterbell10, https://github.com/kit1980, https://github.com/malfet
2024-06-05 21:53:49 +00:00
4123323eff [ONNX] Single function for torch.onnx.export and torch.onnx.dynamo_export (#127974)
Add `dynamo: bool = True` as a switch in `torch.onnx.export` to provide users an option to try `torch.onnx.dynamo_export`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127974
Approved by: https://github.com/justinchuby
2024-06-05 21:27:46 +00:00
01694eaa56 Move cuda 12.4 jobs to periodic for both pull and inductor (#127825)
Moves 12.4 sm86/a10g jobs in pull to trunk
Moves 12.4 cuda non sm86 jobs to periodic
Moves 12.4 jobs in inductor to inductor-periodic, except inductor_timm which seems to give important signal

There has been a lot of queueing for cuda runners due to the addition of jobs for cuda 12.4, so move those jobs to other workflows that are run less often
Co-authored-by: Andrey Talman <atalman@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127825
Approved by: https://github.com/ZainRizvi, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/malfet
2024-06-05 21:01:36 +00:00
8184cd85fc [fake tensor] Set _is_param for base fake tensors for views (#127823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127823
Approved by: https://github.com/eellison, https://github.com/ezyang
ghstack dependencies: #127972
2024-06-05 20:26:52 +00:00
626dc934d1 [dynamo][pippy] Hotfix for nn_module_stack for pippy usecase (#127972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127972
Approved by: https://github.com/ydwu4
2024-06-05 20:14:50 +00:00
72e863df27 Update _learnable_fake_quantize.py (#127993)
Remove sentence "For literature references, please see the class _LearnableFakeQuantizePerTensorOp." and add "s" to "support"

(Possibly) Fixes #99107 (But not sure, sorry)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127993
Approved by: https://github.com/jerryzh168
2024-06-05 20:02:33 +00:00
6e545392cd Move nongpu workflows from trunk to periodic (#128049)
We don't need to run them on every PR. These are used to test for graceful degradation of GPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128049
Approved by: https://github.com/clee2000
2024-06-05 18:31:26 +00:00
6412c6060c [reland] Refresh OpOverloadPacket if a new OpOverload gets added (#128000)
If a user accesses an OpOverloadPacket, then creates a new OpOverload,
then uses the OpOverloadPacket, the new OpOverload never gets hit. This
is because OpOverloadPacket caches OpOverloads when it is constructed.

This PR fixes the problem by "refreshing" the OpOverloadPacket if a new
OpOverload gets constructed and the OpOverloadPacket exists.
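
The caching problem and the refresh can be illustrated with a small toy model. This is only a sketch: `Packet`, `register_overload`, and the two module-level dictionaries are hypothetical stand-ins, not the real `torch._ops` classes.

```python
# Toy model of the caching problem; none of these names are PyTorch APIs.
_REGISTRY = {}   # op name -> {overload name: callable}
_PACKETS = {}    # op name -> live Packet objects


class Packet:
    """Snapshots the overloads of one op when it is constructed."""

    def __init__(self, op_name):
        self.op_name = op_name
        self.refresh()
        _PACKETS.setdefault(op_name, []).append(self)

    def refresh(self):
        # Re-read the registry so later registrations become visible.
        self.overloads = dict(_REGISTRY.get(self.op_name, {}))

    def __call__(self, overload, *args):
        return self.overloads[overload](*args)


def register_overload(op_name, overload, fn, refresh_packets=True):
    _REGISTRY.setdefault(op_name, {})[overload] = fn
    if refresh_packets:
        for packet in _PACKETS.get(op_name, []):
            packet.refresh()


register_overload("add", "int", lambda a, b: a + b)
packet = Packet("add")

# Without the refresh, the existing packet never sees the new overload.
register_overload("add", "str", lambda a, b: f"{a}{b}", refresh_packets=False)
stale = "str" in packet.overloads

# With the refresh, the existing packet picks it up.
register_overload("add", "str", lambda a, b: f"{a}{b}")
fresh = "str" in packet.overloads
```

Here `stale` records the behavior described in the first paragraph and `fresh` the behavior after the fix.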

Test Plan:
- new tests

This is the third land attempt. The first one was reverted for breaking
internal tests, the second was reverted for being erroneously suspected
of causing a perf regression.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128000
Approved by: https://github.com/albanD
2024-06-05 17:57:09 +00:00
bb68b54be0 [BE][ptd_fb_test][1/N] Enable testslide (#127512)
This change enables Testslide, which gives us more readable output, import time, etc. This PR was previously stamped as https://github.com/pytorch/pytorch/pull/126460, but the old PR had a ghexport issue.

Differential Revision: [D57919583](https://our.internmc.facebook.com/intern/diff/D57919583/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127512
Approved by: https://github.com/wz337, https://github.com/Skylion007
2024-06-05 17:45:15 +00:00
3acbfd602e Document torch.utils.collect_env.get_env_info function (#128021)
Fixes #127911

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128021
Approved by: https://github.com/malfet
2024-06-05 17:44:47 +00:00
6454e95824 [FSDP2] enable CI for torch.compile(root Transformer) (#127832)
This CI showcases that FSDP2 works when the root model is wrapped in `torch.compile`, since FSDP1 can do the same

compiling root Transformer without AC: `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group`

compiling root Transformer with AC: `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127832
Approved by: https://github.com/awgu
2024-06-05 17:29:46 +00:00
4adee71155 [dynamo] Support ndarray.dtype attribute access (#124490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124490
Approved by: https://github.com/lezcano
ghstack dependencies: #125717
2024-06-05 17:20:01 +00:00
a9cc147fa1 [DSD][FSDP1] Deprecate FSDP.state_dict_type and redirect users to DSD (#127794)
Summary:
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127794
Approved by: https://github.com/awgu
ghstack dependencies: #127793
2024-06-05 16:55:05 +00:00
9acc19f8da [inductor] Take absolute value of strides when picking loop order (#127425)
Fixes #126860

The stride hint is found by comparing the value of the indexing expression
evaluated at `idx` set to all zeros and at `idx[dim] = 1`. This causes a problem
for padded inputs where 0 and 1 are still in the padded region.

In particular, for reflection padding this causes the stride to be negative.
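
A minimal sketch of how the negative hint arises, in plain Python. `reflect_index` and `stride_hint` are hypothetical helpers written for this illustration, not inductor functions, and the final sort assumes that loops are ordered by decreasing stride.

```python
def reflect_index(i, pad, size):
    """Source index read by output position i under 1-D reflection padding."""
    j = i - pad
    if j < 0:
        j = -j
    elif j >= size:
        j = 2 * (size - 1) - j
    return j


def stride_hint(index_fn, ndim, dim):
    """Difference of the index expression between idx = 0 and idx[dim] = 1."""
    zero = [0] * ndim
    one = list(zero)
    one[dim] = 1
    return index_fn(*one) - index_fn(*zero)


pad, size = 2, 8
# Positions 0 and 1 are both inside the left padding, where the source index
# walks backwards, so the hint comes out negative.
hint = stride_hint(lambda i: reflect_index(i, pad, size), ndim=1, dim=0)

# Hypothetical per-dimension hints: sorting on the raw values puts the
# unit-stride dimension outermost, sorting on absolute values does not.
hints = [-16, 1]
by_raw = sorted(range(len(hints)), key=lambda d: hints[d], reverse=True)
by_abs = sorted(range(len(hints)), key=lambda d: abs(hints[d]), reverse=True)
```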

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127425
Approved by: https://github.com/lezcano
2024-06-05 16:48:22 +00:00
22964d1007 [DSD] Deprecate submodules feature for DSD (#127793)
Summary:
Getting a partial state_dict and setting the state_dict with the type Dict[nn.Module, Dict[str, Any]] is too complicated and can confuse users. The same result can be achieved with simple pre-processing and post-processing by users, so this PR adds a deprecation warning to the feature.

The previous PR, https://github.com/pytorch/pytorch/pull/127070, assumed
no one was using the feature and removed it without a grace period. This
seems too aggressive and raised some concerns. This PR adds the
deprecation warning and tests.

We will remove the support in 2.5.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127793
Approved by: https://github.com/LucasLLC
2024-06-05 16:31:29 +00:00
5dc9128229 FP8 rowwise scaling (#125204)
# Summary
This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:
- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".
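
As a reference for the shape conditions above, here is a plain-Python sketch of a row-scaled matmul. It assumes the scales are applied multiplicatively to the product (`out[i][j] = x_scale[i] * y_scale[j] * sum_k x[i][k] * y[k][j]`); `rowwise_scaled_mm` is a hypothetical helper, not the `scaled_mm` kernel, and it ignores fp8 quantization and the TN layout entirely.

```python
def rowwise_scaled_mm(x, y, x_scale, y_scale):
    """Reference matmul with one scale per row of x and per column of y."""
    m, k, n = len(x), len(y), len(y[0])
    assert all(len(row) == k for row in x), "x must be [M, K]"
    assert len(x_scale) == m, "x scale must have length M"
    assert len(y_scale) == n, "y scale must have length N"
    return [
        [
            x_scale[i] * y_scale[j] * sum(x[i][p] * y[p][j] for p in range(k))
            for j in range(n)
        ]
        for i in range(m)
    ]


x = [[1.0, 2.0], [3.0, 4.0]]            # [M=2, K=2]
y = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0]]  # [K=2, N=3]
out = rowwise_scaled_mm(x, y, x_scale=[0.5, 2.0], y_scale=[1.0, 1.0, 0.25])
```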

The following two PRs were required to enable local builds:
- [PR #126185](https://github.com/pytorch/pytorch/pull/126185)
- [PR #125523](https://github.com/pytorch/pytorch/pull/125523)

### Todo
We still do not build our Python wheels with this architecture.

@ptrblck @malfet, should we replace `sm_90` with `sm_90a`?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit:
https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

#### ifdef

I tried to use `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \
    defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this, so I am not really sure of the right way to do this.

Kernel Credit:
@jwfromm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125204
Approved by: https://github.com/lw, https://github.com/malfet
2024-06-05 15:46:40 +00:00
4f9fcd7156 Handle unpacking during TorchScript to ExportedProgram conversion (#127419)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127419
Approved by: https://github.com/angelayi
2024-06-05 15:27:13 +00:00
cyy
9f2c4b9342 Replace with standard type traits in torch/csrc (#127852)
In preparation to clean up more type traits.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127852
Approved by: https://github.com/ezyang
2024-06-05 15:22:48 +00:00
cyy
3d617333e7 Simplify CMake code (#127683)
Due to the recent adoption of find(python), it is possible to further simplify some CMake code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127683
Approved by: https://github.com/ezyang
2024-06-05 15:17:31 +00:00
cyy
df75a9dc80 Remove Caffe2/onnx (#127991)
Remove Caffe2/onnx since it is not used. Other tiny fixes are also applied.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127991
Approved by: https://github.com/ezyang
2024-06-05 15:10:12 +00:00
d48c25c7d1 [BE] Fix missing-prototypes errors in Metal backend (#127994)
By declaring a bunch of functions static.
Removed `USE_PYTORCH_METAL` from the list of flags that suppress `-Werror=missing-prototypes`. This will prevent regressions like the ones reported in https://github.com/pytorch/pytorch/issues/127942 from sneaking past CI, which builds PyTorch with Metal support.
Use nested namespaces
Remove spurious semicolon after TORCH_LIBRARY declaration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127994
Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi
2024-06-05 14:58:19 +00:00
8992141dba Restore MPS testing on MacOS 13 and m2 metal (#127853)
The runners are ready now: https://github.com/organizations/pytorch/settings/actions/runners?qr=label%3Amacos-m1-13. We want to keep some MacOS 13 runners for mps coverage until MacOS 15 is out.

This also fixes the `macos-m2-14` mistake from https://github.com/pytorch/pytorch/pull/127582.

The current `macos-m2-14` runner is on 14.2 while our `macos-m1-14` has 14.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127853
Approved by: https://github.com/malfet
2024-06-05 14:44:00 +00:00
879d01afcb [dynamo][numpy] Add unsigned integer dtypes (#125717)
We should support these to whatever extent we can. The corresponding
`torch.uint<w>` types are defined, so I don't see an issue with
generating the various casting rules and allowing them to trace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125717
Approved by: https://github.com/lezcano
2024-06-05 14:33:47 +00:00
4ce5322a1f Enable UFMT on test_shape_ops.py test_show_pickle.py test_sort_and_select.py (#127165)
Fixes some files in #123062

Run lintrunner on files:
test_shape_ops.py
test_show_pickle.py
test_sort_and_select.py

```bash
$ lintrunner --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127165
Approved by: https://github.com/ezyang
2024-06-05 14:31:26 +00:00
faabda4fc9 [Inductor] Skip model_fail_to_load and eager_fail_to_run models in inductor benchmarks test (#127210)
To align with the test-infra repo, skip `model_fail_to_load` and `eager_fail_to_run` models.
See the code logic:
d3b79778f8/torchci/rockset/inductor/__sql/compilers_benchmark_performance.sql (L57-L58)
```SQL
  WHERE
    filename LIKE '%_accuracy'
    AND filename LIKE CONCAT(
      '%_', : dtypes, '_', : mode, '_', : device,
      '_%'
    )
    AND _event_time >= PARSE_DATETIME_ISO8601(:startTime)
    AND _event_time < PARSE_DATETIME_ISO8601(:stopTime)
    AND (workflow_id = :workflowId OR :workflowId = 0)
    AND accuracy != 'model_fail_to_load'
    AND accuracy != 'eager_fail_to_run'
),
```

Comp Item | Compiler | suite | Before | After fix
-- | -- | -- | -- | --
Pass Rate | Inductor | torchbench | 96%, 80/83 | 100%, 80/80

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127210
Approved by: https://github.com/jansel
2024-06-05 14:23:09 +00:00
c3949b20a1 Opt model save and load (#126374)
## save&load support for OptimizedModule

[Issue Description](https://github.com/pytorch/pytorch/pull/101651)

English is not my native language; please excuse typing errors.

This pr is based on commit b9588101c4d3411b107fdc860acfa8a72c642f91\
I'll do something with the merge conflicts later

### test result for test/dynamo

Conclusion:\
It performs the same as before as far as I can see.

ENV(CPU only):\
platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.5.0\
configfile: pytest.ini\
plugins: anyio-3.7.1, cpp-2.3.0, flakefinder-1.1.0, xdist-3.3.1, xdoctest-1.1.0, metadata-3.1.1, html-4.1.1, hypothesis-5.35.1, rerunfailures-14.0

#### before this pr:

[before](https://github.com/pytorch/pytorch/files/15329370/before.md)

#### after this pr:

[after](https://github.com/pytorch/pytorch/files/15329376/after.md)

### some changes

1. add test_save_and_load to test/dynamo/test_modules.py with & without "backend='inductor'"
2. add \_\_reduce\_\_ function to OptimizedModule and derived classes of _TorchDynamoContext for pickling & unpickling
3. change the wrappers into wrapper classes ( including convert_frame_assert, convert_frame, catch_errors_wrapper in torch/_dynamo/convert_frame.py & wrap_backend_debug in torch/_dynamo/repro/after_dynamo.py )
4. change self.output.compiler_fn into innermost_fn(self.output.compiler_fn) in torch/_dynamo/symbolic_convert.py to get the origin compiler_fn and to avoid the "compiler_fn is not eager" condition
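
Changes 2 and 3 rest on a general property of `pickle`: a nested closure cannot be pickled, while an instance of a module-level class can. A minimal sketch of that property follows; `compile_fn`, `make_wrapper`, and `Wrapper` are hypothetical stand-ins, not the Dynamo classes touched by this PR.

```python
import pickle


def compile_fn(x):
    # Stand-in for a backend compiler function; module-level so it pickles.
    return x * 2


def make_wrapper(fn):
    # Closure-style wrapper: pickle cannot locate the nested function by name.
    def wrapper(x):
        return fn(x)
    return wrapper


class Wrapper:
    """Class-style wrapper around a callable."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, x):
        return self.fn(x)

    def __reduce__(self):
        # Rebuild by calling Wrapper(self.fn) at load time.
        return (Wrapper, (self.fn,))


try:
    pickle.dumps(make_wrapper(compile_fn))
    closure_pickles = True
except (pickle.PicklingError, AttributeError):
    closure_pickles = False

restored = pickle.loads(pickle.dumps(Wrapper(compile_fn)))
```

The explicit `__reduce__` is not strictly needed for a class this simple; it is spelled out only to show where custom reconstruction logic would go.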

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126374
Approved by: https://github.com/msaroufim, https://github.com/jansel
2024-06-05 13:01:16 +00:00
9a8ab778d3 Revert "[BE]: Update cudnn to 9.1.0.70 (#123475)"
This reverts commit c490046693e77e254664e19d940e9b05a1da18ef.

Reverted https://github.com/pytorch/pytorch/pull/123475 on behalf of https://github.com/huydhn due to CUDA trunk jobs are pretty red after this change, and the forward fix https://github.com/pytorch/pytorch/pull/127984 does not look working ([comment](https://github.com/pytorch/pytorch/pull/123475#issuecomment-2149258430))
2024-06-05 08:59:53 +00:00
bb2de3b101 Fixed broken link and removed unfinished sentence from issue #126367 (#127938)
Fixes #126367.

## Description

Fixed a broken link in the pytorch/docs/source/torch.compiler_faq.rst doc and deleted a few words that were extra according to the issue tagged above.

## Checklist
- [X] The issue that is being fixed is referred in the description
- [X] Only one issue is addressed in this pull request
- [X] Labels from the issue that this PR is fixing are added to this pull request
- [X] No unnecessary issues are included into this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127938
Approved by: https://github.com/msaroufim
2024-06-05 07:37:32 +00:00
4a384d813b [SDPA/memeff] Backport changes from xFormers to PT (#127090)
Backporting a few fixes from xFormers:
* Bug fixes for local attention (which is not exposed in PT at the moment)
* Massively reduced memory usage on the BW pass (see also https://github.com/facebookresearch/xformers/pull/1028)

Essentially this will also make the xFormers build process much easier, as we will be able to use mem-eff from PyTorch (if the user has a recent enough version) rather than building it at xFormers install time
The goal is to have the source of truth for these files in PT moving forward, and remove them from xFormers eventually once our users have a recent-enough version of PT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127090
Approved by: https://github.com/drisspg
2024-06-05 07:33:27 +00:00
cyy
b054470db2 Remove unused functions (#127881)
Some unused functions detected by g++ warnings can be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127881
Approved by: https://github.com/zou3519
2024-06-05 05:21:24 +00:00
30788739f4 [c10d] add a simple test to demonstrate the user usage of collectives (#127665)
Summary:
While playing around with the UT, I thought it would be good to give a
simple example of a user function that can be used with different
subclasses of _ControlCollectives, and to test that the user function
executes correctly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127665
Approved by: https://github.com/d4l3k
2024-06-05 04:32:11 +00:00
e505132797 [export] track TORCH_DYNAMO_DO_NOT_EMIT_RUNTIME_ASSERTS for export runtime asserts (#127554)
Track TORCH_DYNAMO_DO_NOT_EMIT_RUNTIME_ASSERTS=1 in export so it doesn't omit runtime asserts.

Differential Revision: D57978699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127554
Approved by: https://github.com/tugsbayasgalan
2024-06-05 04:16:54 +00:00
d5cb5d623a Revert "Complete revamp of float/promotion sympy handling (#126905)"
This reverts commit fb696ef3aa34e20c0fef1c0210a397abd3ea5885.

Reverted https://github.com/pytorch/pytorch/pull/126905 on behalf of https://github.com/ezyang due to internal user reported ceiling equality simplification problem, I have a plan ([comment](https://github.com/pytorch/pytorch/pull/126905#issuecomment-2148805840))
2024-06-05 03:57:58 +00:00
55a4ef80c4 [pipelining] test pipeline_order in schedule (#127559)
Add a unit test that validates the pipeline order for different `num_stages`, `num_microbatches`, `num_world_size` combinations. It doesn't actually run the schedule; it only checks that the order in which microbatches are processed is valid, so it doesn't require GPUs or multiple processes.

Will add more combinations and negative tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127559
Approved by: https://github.com/wconstab
ghstack dependencies: #127084, #127332
2024-06-05 03:51:27 +00:00
71e684bfae [BE][Mac] Add missing prototypes (#127988)
Really confused how CI did not catch this one, but this triggers missing prototype errors if compiled from scratch on MacOS Sonoma using clang-15

Fixes https://github.com/pytorch/pytorch/issues/127942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127988
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2024-06-05 02:16:50 +00:00
cyy
ce4436944c Fix IOS builds (#127985)
IOS builds have been failing recently; this fixes them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127985
Approved by: https://github.com/ezyang
2024-06-05 02:14:43 +00:00
a135776307 Remove tensor subclass detection logic from weights_only unpickler (#127808)
Remove logic to auto-detect and allow subclasses that did not override certain methods from the weights_only unpickler from https://github.com/pytorch/pytorch/pull/124331 for 2.4 release

Subclasses should be loadable using `torch.serialization.add_safe_globals`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127808
Approved by: https://github.com/malfet
2024-06-05 02:14:30 +00:00
8e496046e5 Update torch-xpu-ops pin (ATen XPU implementation) (#127879)
Support AMP GradScaler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127879
Approved by: https://github.com/EikanWang
2024-06-05 02:13:46 +00:00
6c07e2c930 fix redundant tensor (#127850)
As title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127850
Approved by: https://github.com/mikaylagawarecki
2024-06-05 02:03:02 +00:00
8830b81208 [c10d] Add commCreateFromRanks to c10d (#127421) (#127982)
This is a duplicate of https://github.com/pytorch/pytorch/pull/127421, which we can't merge. It has already landed internally.

Summary:

`ncclCommCreateFromRanks` - described in this [document](https://docs.google.com/document/d/1QIRkAO4SAQ6eFBpxE51JmRKRAH2bwAHn8OIj69XuFqQ/edit#heading=h.5g71oqe3soez), replaces `ncclCommSplit` in NCCLX versions 2.21.5+.  The difference is that `ncclCommCreateFromRanks` is given a list of active ranks and is collective only over those ranks as opposed to `ncclCommSplit` for which you give it a color for every rank including NO_COLOR for inactive ranks and the collective is over the entire world.
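
The difference in who participates can be sketched without NCCL at all. The helpers below are hypothetical and only model group membership; they do not call or mirror the real `ncclCommSplit` / `ncclCommCreateFromRanks` signatures, and `NO_COLOR = -1` is a placeholder value chosen for this sketch.

```python
NO_COLOR = -1  # placeholder for "this rank is not in any new group"


def groups_from_colors(colors):
    """Split style: every rank in the world reports a color."""
    groups = {}
    for rank, color in enumerate(colors):
        if color != NO_COLOR:
            groups.setdefault(color, []).append(rank)
    participants = list(range(len(colors)))  # the whole world takes part
    return groups, participants


def group_from_ranks(ranks):
    """Create-from-ranks style: only the listed ranks take part."""
    members = sorted(ranks)
    return members, list(members)


split_groups, split_participants = groups_from_colors([0, NO_COLOR, 0, NO_COLOR])
new_group, new_participants = group_from_ranks([0, 2])
```

Both calls end up with the same group `[0, 2]`, but only the split-style call requires ranks 1 and 3 to join the collective.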

This diff connects `ncclCommCreateFromRanks` to `c10d`

`ncclCommSplit` will still be available in the NCCL API but, in this diff, it is not used starting at version 2.21.5.

Split the python test and implementation of `split()` for internal FB and external OSS builds.

The diff defines `"USE_C10D_NCCL_FBCODE"` as a compiler option. When defined, we use the version of split in the newly created `NCCLUtils.cpp` in the `fb` directory.  The `fb` directory is not *shipit*-ed to *github*.

The same API is used for `split()` in both the `ncclx` and `nccl` versions adding `ranks` to the API.  This argument is not used in the `nccl` version nor in the 2.18 `ncclx` version where `ncclCommSplit()` is used instead of `ncclCommCreateFromRanks()` in `ncclx`

This diff was squashed with D57343946 - see D57343946 for additional review comments.

Test Plan:
for 2.18.3-1 and 2.21.5-1 versions:
```
buck2 run fbcode//mode/opt -c param.use_nccl=True -c fbcode.nvcc_arch=a100 -c hpc_comms.use_ncclx="$VERSION" -c fbcode.enable_gpu_sections=true  fbcode//caffe2/test/distributed/fb:test_comm_split_subgroup_x
```

```
BUILD SUCCEEDED
...
ok

----------------------------------------------------------------------
Ran 1 test in 10.210s

OK
~/scripts
```

OSS build:
`[cmodlin@devgpu003.vll5 ~/fbsource/third-party/ncclx/v2.21.5-1 (e56338cfa)]$ ./maint/oss_build.sh`

OSS build output:
```
...
ncclCommHash 197dce9b413e2775
nccl commDesc example_pg
Dump from comm 0x4708aa0 rings: [[0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0]]
Dump from comm 0x4708aa0 commDesc: example_pg
Dump from comm 0x4708aa0 nRanks: 1
Dump from comm 0x4708aa0 nNodes: 1
Dump from comm 0x4708aa0 node: 0
Dump from comm 0x4708aa0 localRanks: 1
Dump from comm 0x4708aa0 localRank: 0
Dump from comm 0x4708aa0 rank: 0
Dump from comm 0x4708aa0 commHash: "197dce9b413e2775"

2024-05-24T09:02:54.385543 devgpu003:3040664:3040744 [0][AsyncJob]ctran/backends/ib/CtranIb.cc:143 NCCL WARN CTRAN-IB : No active device found.

2024-05-24T09:02:54.385607 devgpu003:3040664:3040744 [0][AsyncJob]ctran/mapper/CtranMapper.cc:187 NCCL WARN CTRAN: IB backend not enabled
Created NCCL_SPLIT_TYPE_NODE type splitComm 0x11c76d0, rank 0
~/fbsource/third-party/ncclx/v2.21.5-1
```

Reviewed By: wconstab, wesbland

Differential Revision: D56907877

Co-authored-by: Cory Modlin <cmodlin@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127982
Approved by: https://github.com/izaitsevfb
2024-06-05 00:19:52 +00:00
7fdfb88f03 [pipelining] rewrite interleaved 1f1b (#127332)
## Context

Interleaved 1F1B has multiple points in the schedule where communication is criss-crossed across ranks, leading to hangs. This is due to (1) the looped nature of the schedules and (2) the batched nature of forward + backward in the 1f1b phase.

<img width="1370" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/a07c2b1d-8a99-420b-9ba3-32a0115d228b">

In the current implementation, it is difficult to fix these hangs since it requires `dist.recv` from a prior point in time, but each rank operates on its own step schedule and does not have knowledge of other ranks' operations to perform the `recv` prior to their own `send`.

## New implementation

The new implementation is split into 2 parts:

1. Creating the pipeline order.

Each rank will create the timestep normalized ordering of all schedule actions across all ranks. This is created once during the initialization of the schedule class. The timestep between each rank is normalized as each rank can only have 1 computation action (forward or backward) during that timestep.

<img width="1065" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/196f2347-7ff4-49cf-903b-d8db97d1156f">

2. Executing the pipeline order.

Once the pipeline order is determined, execution is simple: each rank performs its send to its peer (based on whether it did a forward or a backward). Now that each rank has a global understanding of the schedule, it can check its previous and next neighbor ranks to see whether it needs to recv any activations/gradients from them. Therefore, during execution, all ranks are aligned and executing the same time step.

## Benefits

- Implementation is faster since 1f1b computation can now be split up in two time steps, 1 for forward and 1 for backward.
- Debugging is easier since we can now determine which timestep each rank is hung on
- Testing is easier since we can just validate the pipeline order, without running the schedule. This allows us to test on a large number of ranks without actually needing the GPUs.
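
The last benefit can be sketched as a pure-Python check over a timestep table. This is an illustration under assumptions, not the schedule class from this PR: the `{rank: [action or None per timestep]}` layout, the `(kind, stage, microbatch)` tuples, and `validate_pipeline_order` are all hypothetical.

```python
def validate_pipeline_order(order, num_stages):
    """Check data dependencies in a {rank: [action or None per timestep]} table.

    Each action is a (kind, stage, microbatch) tuple with kind "F" or "B".
    """
    when = {}
    for actions in order.values():
        for t, action in enumerate(actions):
            if action is not None:
                if action in when:
                    return False  # the same action was scheduled twice
                when[action] = t
    for (kind, stage, mb), t in when.items():
        if kind == "F" and stage > 0:
            dep = ("F", stage - 1, mb)   # needs activations from the previous stage
        elif kind == "B" and stage < num_stages - 1:
            dep = ("B", stage + 1, mb)   # needs gradients from the next stage
        elif kind == "B":
            dep = ("F", stage, mb)       # last stage: backward follows its forward
        else:
            continue                     # forward of stage 0 has no dependency
        if dep not in when or when[dep] >= t:
            return False
    return True


# Two ranks, one stage each, two microbatches.
good_order = {
    0: [("F", 0, 0), ("F", 0, 1), None, ("B", 0, 0), None, ("B", 0, 1)],
    1: [None, ("F", 1, 0), ("B", 1, 0), ("F", 1, 1), ("B", 1, 1), None],
}
# Stage 1 runs its forward in the same timestep as stage 0: invalid.
bad_order = {0: [("F", 0, 0)], 1: [("F", 1, 0)]}

results = (
    validate_pipeline_order(good_order, num_stages=2),
    validate_pipeline_order(bad_order, num_stages=2),
)
```

Because the check only looks at the table, it runs on a single CPU process regardless of how many ranks the table describes.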

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127332
Approved by: https://github.com/wconstab
ghstack dependencies: #127084
2024-06-04 23:46:05 +00:00
1f67cfd437 [inductor] raise tolerance for cspdarknet (#127949)
cspdarknet was previously flaky, but after https://github.com/pytorch/pytorch/pull/127367 it fails quite consistently. This is probably due to a small numerical change from that PR, which makes inductor generate different code because of different loop orders.

Raise tolerance to pass CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127949
Approved by: https://github.com/atalman, https://github.com/nWEIdia, https://github.com/eqy
2024-06-04 23:28:20 +00:00
907cb28f67 Revert "Inductor: Allow small sizes of m for mixed mm autotuning (#127663)"
This reverts commit d8d0bf264a736c7fb3cd17799a1c1aba4addf8d9.

Reverted https://github.com/pytorch/pytorch/pull/127663 on behalf of https://github.com/soulitzer due to breaks torch ao CI, see: https://github.com/pytorch/pytorch/issues/127924 ([comment](https://github.com/pytorch/pytorch/pull/127663#issuecomment-2148554128))
2024-06-04 23:06:43 +00:00
f4b05ce683 Add registry for TorchScript to ExportedProgram conversion (#127464)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127464
Approved by: https://github.com/ydwu4, https://github.com/angelayi
2024-06-04 22:53:00 +00:00
0eb9ec958a Revert "Inductor respects strides for custom ops by default (#126986)" (#127923)
This reverts commit dd64ca2a02434944ecbc8f3e186d44ba81e3cb26.

There's a silent incorrectness bug with needs_fixed_stride_order=True and
mutable custom ops, so it's better to flip the default back to avoid
silent incorrectness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127923
Approved by: https://github.com/williamwen42
2024-06-04 22:25:45 +00:00
20f966a8e0 Ignore undocumented PipelineSchedule.step (#127955)
Ignore undocumented PipelineSchedule.step to fix doc build:

https://github.com/pytorch/pytorch/actions/runs/9372492435/job/25805861083?pr=127938#step:11:1284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127955
Approved by: https://github.com/kit1980
2024-06-04 22:11:09 +00:00
a7b1dd82ff Default XLA to use swap_tensors path in nn.Module._apply (#126814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126814
Approved by: https://github.com/JackCaoG, https://github.com/albanD
ghstack dependencies: #127313
2024-06-04 21:40:49 +00:00
1b704a160f Add linker script optimization flag to CMAKE rule for CUDA ARM wheel (#127514)
Original PR - https://github.com/pytorch/pytorch/pull/127220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127514
Approved by: https://github.com/Aidyn-A, https://github.com/atalman
2024-06-04 20:51:44 +00:00
6dc0a291b9 Revert "[dynamo] Bugfix for nn parameter construction (#127806)"
This reverts commit f27c4dd862bf79f37019ef277957cd577d57b66f.

Reverted https://github.com/pytorch/pytorch/pull/127806 on behalf of https://github.com/PaliC due to causing nn tests to fail ([comment](https://github.com/pytorch/pytorch/pull/127806#issuecomment-2148393903))
2024-06-04 20:51:41 +00:00
597922ba21 Reapply "distributed debug handlers (#126601)" (#127805)
This reverts commit 7646825c3eb687030c4f873b01312be0eed80174.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127805
Approved by: https://github.com/PaliC
2024-06-04 19:44:30 +00:00
e76b28c765 [dtensor][debug] added c10d alltoall_ and alltoall_base_ to CommDebugMode (#127360)
**Summary**
Added c10d alltoall_ and alltoall_base tracing to CommDebugMode and edited test case in test_comm_mode to include added features.

**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127360
Approved by: https://github.com/wz337, https://github.com/XilunWu, https://github.com/yifuwang
ghstack dependencies: #127358
2024-06-04 18:29:48 +00:00
01e6d1cae4 [dtensor][debug] added c10d reduce_scatter_ and reduce_scatter_tensor_coalesced tracing_ to CommDebugMode (#127358)
**Summary**
Added c10d reduce_scatter_ and reduce_scatter_tensor_coalesced tracing to CommDebugMode and edited test case in test_comm_mode to include added features.

**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127358
Approved by: https://github.com/wz337, https://github.com/XilunWu, https://github.com/yifuwang
2024-06-04 18:29:48 +00:00
9a25ff77af Revert "[inductor] Enable subprocess-based parallel compile as the default (#126817)"
This reverts commit cf77e7dd9770caf65e898ac2ee82045aa0408e30.

Reverted https://github.com/pytorch/pytorch/pull/126817 on behalf of https://github.com/huydhn due to There are lots of flaky inductor failure showing up in trunk after this commit cf77e7dd97, so I am trying to revert this to see if this helps ([comment](https://github.com/pytorch/pytorch/pull/126817#issuecomment-2148143502))
2024-06-04 18:26:12 +00:00
f27c4dd862 [dynamo] Bugfix for nn parameter construction (#127806)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127806
Approved by: https://github.com/jansel
ghstack dependencies: #127785, #127802
2024-06-04 18:25:46 +00:00
569c5e72e7 [dynamo] Unspec nn module when global backward hooks are present (#127802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127802
Approved by: https://github.com/jansel
ghstack dependencies: #127785
2024-06-04 18:25:46 +00:00
c7e936a56a [dynamo] Tensorvariable - track grad with _grad field (#127785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127785
Approved by: https://github.com/jansel
2024-06-04 18:25:46 +00:00
3bcc3cddb5 Using scalarType instead string in function _group_tensors_by_device_and_dtype. (#127869)
Now that torch.dtype can pass through pybind11, modify the function _group_tensors_by_device_and_dtype to use the scalar type, without converting between torch.dtype and string on the Python and C++ sides.
@ezyang @bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127869
Approved by: https://github.com/ezyang
2024-06-04 18:19:33 +00:00
0ff60236ab Revert "Retire torch.distributed.pipeline (#127354)"
This reverts commit b9c058c203ee38032594f898f27cd8404f113a63.

Reverted https://github.com/pytorch/pytorch/pull/127354 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the doc build failure looks legit b9c058c203 ([comment](https://github.com/pytorch/pytorch/pull/127354#issuecomment-2148133982))
2024-06-04 18:19:31 +00:00
627d2cd87d [CI] disable td for xpu ci test by default (#127611)
Because td has been enabled by default for the xpu ci tests, a lot of test cases (75%) have been skipped in CI. This has let some failures escape the ci tests, for example issue #127539. This PR depends on PR #127595 having landed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127611
Approved by: https://github.com/etaf, https://github.com/atalman
2024-06-04 17:15:10 +00:00
36e9b71613 Enable UFMT on test/test_jit_fuser_te.py (#127759)
Part of #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127759
Approved by: https://github.com/ezyang
2024-06-04 16:56:03 +00:00
ff32f6c93b Use freshly traced jit-traced module to be used in export analysis (#127577)
Summary: When we export an already traced module, it seems to modify some global state, causing the traced modules to fail to run. For now, we are only logging for test cases, so it is probably OK to trace a fresh copy to be used in export.

Test Plan: CI

Differential Revision: D57983518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127577
Approved by: https://github.com/pianpwk
2024-06-04 16:54:23 +00:00
c490046693 [BE]: Update cudnn to 9.1.0.70 (#123475)
cuDNN has managed to upload cu11 and cu12 wheels for ~~9.0.0.312~~ 9.1.0.70, so trying this out...

CC @Skylion007 @malfet

Co-authored-by: Wei Wang <weiwan@nvidia.com>
Co-authored-by: atalman <atalman@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123475
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/nWEIdia
2024-06-04 16:33:06 +00:00
97ea2b5d83 documentation for pattern_matcher.py (#127459)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127459
Approved by: https://github.com/oulgen
ghstack dependencies: #127457, #127458
2024-06-04 15:24:47 +00:00
7a60a75256 Add typing annotations to pattern_matcher.py (#127458)
Turn on `mypy: disallow-untyped-defs` in pattern_matcher.py and fix the fallout.

There are still a bunch of `type: ignore` annotations which should eventually be ironed out.

In the process, found a bug: #127457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127458
Approved by: https://github.com/Skylion007
ghstack dependencies: #127457
2024-06-04 15:24:47 +00:00
9adfa143d7 fix post_grad pattern (#127457)
The lowering pattern built by cuda_and_enabled_mixed_mm_and_not_int8() was using ListOf() incorrectly. ListOf() is meant to represent a single repeating pattern, but cuda_and_enabled_mixed_mm_and_not_int8() was passing two patterns. Based on the comment, I think it was trying to build a sequence, which would be represented by an actual list, not ListOf().

The behavior of the existing pattern would be to pass the second pattern as the `partial` parameter of `ListOf`, which is meant to be a boolean, so it's almost certainly not what was intended.

I tried changing it to be what I thought was the intended behavior, but then the resnet152 test failed accuracy, so I'm just preserving the existing behavior with the correct parameter types.

Found when adding annotations to pattern_matcher.py (#127458)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127457
Approved by: https://github.com/oulgen
2024-06-04 15:24:41 +00:00
cyy
f8c6d43524 Concat namespaces and other fixes in torch/csrc/utils (#127833)
It contains formatting and other minor fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127833
Approved by: https://github.com/ezyang
2024-06-04 15:12:45 +00:00
91461601b6 [TORCH_FA2_flash_api] Update total_q to the reshaped query 0th dimension (#127524)
There is a difference (and a bug) between TORCH_FA2_flash_api:**mha_varlen_fwd** and FA2_flash_api:**mha_varlen_fwd** at the query transposition (GQA) step.

```
at::Tensor temp_q = q;
if (seqlenq_ngroups_swapped) {
        temp_q = q.reshape( ...
 ...
}
const int total_q = q.sizes()[0];
CHECK_SHAPE(temp_q, total_q, num_heads, head_size_og);
```

When doing query transposition, we need to update total_q to the 0th dimension of the reshaped query, i.e.:
```
const int total_q = temp_q.sizes()[0];
```

In the original FA2_flash_api:**mha_varlen_fwd**, they don't introduce a new variable temp_q but overwrite the value of q directly.
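
Why the stale value fails the shape check can be seen with shape arithmetic alone. This is a minimal sketch in Python, assuming a hypothetical reshape that folds a group factor into dimension 0; the actual reshape is elided in the snippet above, so `fold_groups`, `ngroups`, and the sizes below are illustrative only:

```python
def fold_groups(shape, ngroups):
    """Hypothetical reshape: move a factor of `ngroups` from the heads
    dimension into dimension 0."""
    total, num_heads, head_size = shape
    assert num_heads % ngroups == 0
    return (total * ngroups, num_heads // ngroups, head_size)


q_shape = (4, 8, 64)                       # (total_q, num_heads, head_size)
temp_q_shape = fold_groups(q_shape, ngroups=4)

stale_total_q = q_shape[0]                 # taken from q, as before the fix
fixed_total_q = temp_q_shape[0]            # taken from temp_q, as in the fix

print(temp_q_shape[0] == stale_total_q)    # False: checking temp_q against q's size fails
print(temp_q_shape[0] == fixed_total_q)    # True
```
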
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127524
Approved by: https://github.com/drisspg
2024-06-04 14:44:45 +00:00
c209fbdc53 [inductor] Fix missing unbacked def for unbacked in input expr (#127770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127770
Approved by: https://github.com/ezyang
2024-06-04 14:43:01 +00:00
cyy
059cae6176 [Caffe2] Remove Caffe2 proto and other files (#127655)
Remove Caffe2 proto files altogether.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127655
Approved by: https://github.com/ezyang
2024-06-04 14:22:21 +00:00
4c074a9b8b Revert "[torchbind] always fakify script object by default in non-strict export (#127116)"
This reverts commit c27882ffa8c1c7e4cf8ebc6c2f879e5b6c8814ad.

Reverted https://github.com/pytorch/pytorch/pull/127116 on behalf of https://github.com/atalman due to Failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/127116#issuecomment-2147459339))
2024-06-04 12:53:19 +00:00
fb696ef3aa Complete revamp of float/promotion sympy handling (#126905)
At a high level, the idea behind this PR is:

* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.
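
The strict-typing rule in the second bullet can be illustrated in isolation. This is a minimal sketch, assuming a hypothetical `TypedRange` class; it is not the real `ValueRanges` implementation, only the invariant that both bounds share one numeric type:

```python
from dataclasses import dataclass
from typing import Union

Number = Union[int, float]


@dataclass(frozen=True)
class TypedRange:
    """Hypothetical range whose two bounds must share one numeric type."""

    lower: Number
    upper: Number

    def __post_init__(self):
        # Compare exact types so that an int bound never pairs with a float bound.
        if type(self.lower) is not type(self.upper):
            raise TypeError(
                f"mixed bounds: {type(self.lower).__name__} and "
                f"{type(self.upper).__name__}"
            )
        if self.lower > self.upper:
            raise ValueError("lower bound exceeds upper bound")


int_range = TypedRange(0, 10)          # fine: both ints
float_range = TypedRange(0.0, 10.0)    # fine: both floats

try:
    TypedRange(0, 10.0)                # rejected: int lower, float upper
except TypeError as exc:
    print(exc)
```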

The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:

* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows us to eventually generate accurate code for Python-semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53, beyond what you would get by first coercing the integers to floats and then doing true division.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)
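
The precision point behind IntTrueDiv can be checked with plain Python integers. This is a small illustration of the language semantics only, not of the Inductor codegen; the operands are chosen so that the numerator is not exactly representable as a float:

```python
a = 3 * (2**53 + 1)   # exactly divisible by 3, but too large to be exact as a float
b = 3

exact = a / b                     # Python int / int: rounded once, from the exact quotient
via_floats = float(a) / float(b)  # rounds a to a float first, then divides

print(exact == via_floats)            # False
print(int(via_floats) - int(exact))   # 2
```

The two results differ because converting `a` to a float introduces a rounding error before the division even starts.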

In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations.  Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.
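
The difference between promotion semantics and the builtin can be shown on plain numbers. This is a minimal sketch, assuming a hypothetical `promoting_min` helper; it mirrors the rule stated above and is not the implementation of `torch.sym_min`:

```python
def promoting_min(a, b):
    """Hypothetical min with promotion: any float argument makes the result a float."""
    result = min(a, b)
    if isinstance(a, float) or isinstance(b, float):
        return float(result)
    return result


print(type(min(1, 2.0)).__name__)            # int: the builtin returns the winning argument as-is
print(type(promoting_min(1, 2.0)).__name__)  # float: result type depends only on argument types
print(type(promoting_min(1, 2)).__name__)    # int
```

With promotion, the result type is known from the argument types alone, which is what makes type analysis possible without knowing which argument is smaller at runtime.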

We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:

* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`

These changes have consequences. First, we need to make some administrative changes:

* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
  * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
  * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency this introduces to fix tests here
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality are currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.

In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:

* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type

The new asserts uncovered necessary bug fixes:

* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1

Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
2024-06-04 11:47:32 +00:00
db515b6ac7 [ROCm] Fix error in torch.cuda initialisation if amdsmi is not available (#127528)
Reported in https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/15874

When nvml_count is set via 9f73c65b8f/torch/cuda/__init__.py (L834)

If amdsmi is not available this will throw an error
```
File "python3.10/site-packages/torch/cuda/__init__.py", line 634, in _raw_device_count_amdsmi
    except amdsmi.AmdSmiException as e:
NameError: name 'amdsmi' is not defined
```
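
The traceback above comes from an `except` clause that names a module which was never imported. This is a minimal sketch of the failure and of one possible guard, assuming a hypothetical optional module `hypothetical_smi` that is not installed; it is not necessarily the fix applied in this PR:

```python
try:
    import hypothetical_smi  # placeholder for an optional dependency; assumed absent here
    HAS_SMI = True
except ImportError:
    HAS_SMI = False


def device_count_unguarded():
    try:
        hypothetical_smi.init()          # NameError: the import above failed
    except hypothetical_smi.SmiError:    # evaluating this name raises NameError again
        return -1
    return 0


def device_count_guarded():
    if not HAS_SMI:
        return -1                        # report "unknown" instead of crashing
    try:
        hypothetical_smi.init()
    except hypothetical_smi.SmiError:
        return -1
    return 0


try:
    device_count_unguarded()
except NameError as exc:
    print("unguarded:", exc)

print("guarded:", device_count_guarded())
```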

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127528
Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/pruthvistony, https://github.com/atalman
2024-06-04 11:16:02 +00:00
49048e7f26 [FSDP2] Fixed variable shadowing of module (#127776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127776
Approved by: https://github.com/wanchaol
ghstack dependencies: #127771
2024-06-04 10:27:34 +00:00
f325b39303 Introduce Inductor passes to micro-pipeline all-gather-matmul and matmul-reduce-scatter in certain cases (#126598)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126598
Approved by: https://github.com/wanchaol
2024-06-04 09:06:56 +00:00
cf77e7dd97 [inductor] Enable subprocess-based parallel compile as the default (#126817)
Differential Revision: [D58056502](https://our.internmc.facebook.com/intern/diff/D58056502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126817
Approved by: https://github.com/eellison
2024-06-04 07:48:32 +00:00
b9c058c203 Retire torch.distributed.pipeline (#127354)
Actually retiring the module after it has carried a deprecation warning for a while.
The new supported module is: torch.distributed.pipelining.
Please migrate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354
Approved by: https://github.com/wconstab
2024-06-04 07:03:26 +00:00
6abca6a564 [export][unflatten] More strictly respect scope when removing inputs (#127607)
Code snippet from TorchTitan (LLaMa):
```
for layer in self.layers.values():
    h = layer(h, self.freqs_cis)
```
`self.freqs_cis` is a buffer of root module (`self`).
It is also an explicit arg in the call signature of original `layer` modules.
If not respecting scope -- `freqs_cis`'s scope only corresponds to root -- `_sink_param` can remove `freqs_cis` from `layer`'s call signature, resulting in runtime error.

There are two fixes in this PR:
1. We filter out the `inputs_to_state` corresponding to the current scope, using existing code that does prefix matching.
2. We delay the removal of param inputs from `call_module` nodes' `args`, till `_sink_param` call on that submodule returns. The return now returns information on which input is actually removed by the submodule, thus more accurate than just doing:
```
    for node in call_module_nodes:
        node.args = tuple(filter(lambda n: n.name not in inputs_to_state, node.args))
```
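
The prefix matching in the first fix can be sketched on its own. This is a minimal illustration, assuming a hypothetical `filter_to_scope` helper and a simplified `inputs_to_state` mapping from placeholder name to a single fully qualified state name; the real data structures in the unflattener may differ:

```python
def filter_to_scope(inputs_to_state, scope):
    """Keep only entries whose state name lives under `scope` (prefix match).

    Both the shape of `inputs_to_state` and this helper are assumptions
    made for illustration.
    """
    prefix = ".".join(scope) + "." if scope else ""
    return {
        name: fqn
        for name, fqn in inputs_to_state.items()
        if fqn.startswith(prefix)
    }


inputs_to_state = {
    "freqs_cis": "freqs_cis",                  # buffer owned by the root module
    "w0": "layers.0.attention.wq.weight",      # parameter owned by layers.0
}

print(filter_to_scope(inputs_to_state, []))               # root scope: both entries
print(filter_to_scope(inputs_to_state, ["layers", "0"]))  # only the layers.0 parameter
```

Under the `layers.0` scope, `freqs_cis` is filtered out, so it would stay in that submodule's call signature.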

Before the PR:
![Screenshot 2024-05-31 at 1 40 24 AM](https://github.com/pytorch/pytorch/assets/6676466/a2e06b18-44d5-40ca-b242-0edab45075b7)

After the PR:
![Screenshot 2024-05-31 at 1 43 41 AM](https://github.com/pytorch/pytorch/assets/6676466/b72afb94-cdfa-420d-b88b-29a92bf2a0c0)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127607
Approved by: https://github.com/pianpwk
2024-06-04 06:43:54 +00:00
e216df48c8 [Dynamo][TVM] Fix ignored trials argument for MetaSchedule (#127747)
Fixes #127746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127747
Approved by: https://github.com/jansel
2024-06-04 06:13:02 +00:00
2122c9e2a9 [BE] Enabled lintrunner on torch/distributed/utils.py (#127771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127771
Approved by: https://github.com/wanchaol, https://github.com/Skylion007
2024-06-04 06:10:33 +00:00
ef77f2ca4a [pipelining] Simple 1F1B schedule (#127673)
![Screenshot 2024-05-31 at 9 13 18 PM](https://github.com/pytorch/pytorch/assets/6676466/ecf3ca24-33a6-4188-9f7c-df6e96311caa)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127673
Approved by: https://github.com/wconstab
2024-06-04 06:09:51 +00:00
f4b77ce8e2 Masked scale meta function registration #119984 (#127389)
Fixes #119984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127389
Approved by: https://github.com/cpuhrsch
2024-06-04 06:09:17 +00:00
cyy
e7cb43a2d2 Check unused variables in tests (#127498)
Enables unused variable checks in CMake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127498
Approved by: https://github.com/ezyang
2024-06-04 05:35:25 +00:00
2ad0e4197d [ts-migration] support aten::__is__, aten::__isnot__, aten::__not__, profiler::_record_function_enter_new, profiler::_record_function_exit (#127656)
Support more ops in ts converter and add unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127656
Approved by: https://github.com/SherlockNoMad
2024-06-04 04:51:29 +00:00
8d153e0bab [Inductor] Add FlexAttention backward kernel dynamic shape tests (#127728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127728
Approved by: https://github.com/Chillee
2024-06-04 04:32:03 +00:00
e793ae220f [Inductor][Flex-attention] Support different sequence lengths for Query and Key/Value (#127678)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127678
Approved by: https://github.com/Chillee
2024-06-04 04:27:24 +00:00
dae757c971 Specify supported OS matrix (#127816)
Windows-10 or newer
manylinux-2014
MacOS-11 or newer (but only on Apple Silicon)

Fixes https://github.com/pytorch/pytorch/issues/126679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127816
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-06-04 04:25:41 +00:00
22368eac10 [FSDP2] Fix submesh slicing to enable 3D parallelism (#127585)
Ensures that sharded parameters are created on a
submesh that excludes the Pipeline Parallelism dimension.

Also cleans up the logic for storing placements to no longer consider the outer / global dims.  Since we store an 'spmd' submesh, we can avoid this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127585
Approved by: https://github.com/wanchaol
2024-06-04 04:24:09 +00:00
69f5b66132 [Inductor] FlexAttention backward kernel optimization (#127208)
BWD Speedups (before this PR):
```
| Type    |   Speedup | shape             | score_mod     | dtype          |
|---------|-----------|-------------------|---------------|----------------|
| Average |     0.211 |                   |               |                |
| Max     |     0.364 | (16, 16, 512, 64) | relative_bias | torch.bfloat16 |
| Min     |     0.044 | (2, 16, 4096, 64) | causal_mask   | torch.bfloat16 |
```
BWD Speedups (after this PR, though not optimizing block size yet):
```
| Type    |   Speedup | shape              | score_mod     | dtype          |
|---------|-----------|--------------------|---------------|----------------|
| Average |     0.484 |                    |               |                |
| Max     |     0.626 | (2, 16, 512, 256)  | head_bias     | torch.bfloat16 |
| Min     |     0.355 | (8, 16, 4096, 128) | relative_bias | torch.bfloat16 |
```

There are a few things to do as follow-ups:
* Optimized default block size on A100/H100.
* Support different seqlen for Q and K/V.
* Support dynamic shapes for backward.
* Enhance unit tests to check there is no ```nan``` value in any grad. I think we should make some changes to ```test_padded_dense_causal``` because it has invalid inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127208
Approved by: https://github.com/Chillee
2024-06-04 04:22:41 +00:00
2498ef7490 Fix scheduler typehints (#127769)
Fixes scheduler typehints

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127769
Approved by: https://github.com/jansel
2024-06-04 04:19:06 +00:00
6580a18f86 [c10d][BE] fix test_init_pg_and_rpc_with_same_socket (#127654)
**Summary**
Fix `test_init_pg_and_rpc_with_same_socket` in `test/distributed/test_store.py`, which was missing a call to destroy the created ProcessGroup before exiting the test function. This led to an "init PG twice" error in the test.

**Test Plan**
`pytest test/distributed/test_store.py -s -k test_init_pg_and_rpc_with_same_socket`
`ciflow/periodic` since this test is included in `.ci/pytorch/multigpu-test.sh`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127654
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-06-04 04:00:28 +00:00
7e906ec9e5 [PT2][Optimus] Improve group batch fusion with same parent/users fusion enablement (#127648)
Summary:
Currently, we fuse the ops in a random place. Here we enable fusion of ops with the same parent/users, to allow potential follow-up split-cat elimination.

Context

https://docs.google.com/document/d/1MSZY23wKD2keW2Z-DfAI1DscDERHKjOJAnuB5bxa06I/edit

Test Plan:
# local reproduce

```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "pm_cmf" --flow_id 559694026
```
P1386889671

Differential Revision: D58037636

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127648
Approved by: https://github.com/jackiexu1992
2024-06-04 03:41:44 +00:00
c32fe6b279 [FSDP] keep paras in torch.distributed.checkpoint.state_dict.set_optimizer_state_dict (#127644)
Fixes https://github.com/pytorch/pytorch/issues/126948
In the previous code, in the `_load_optim_state_dict` function under the `info.broadcast_from_rank0` condition, `optim_state_dict` holds the parameters based on `optim`.
The changes here aim to synchronize the parameters that differ.
Unit tests are in `test_state_dict.py`, in `test_optim_state_dict_para_matching`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127644
Approved by: https://github.com/fegin
2024-06-04 03:32:22 +00:00
4d0386ce1c [torch/jit-runtime] Add explicit include of <chrono> to torch/jit/run… (#127779)
Added an explicit include to `<chrono>` in `jit/runtime/logging.h` since `std::chrono::time_point<std::chrono::high_resolution_clock>` is directly referenced in the header.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127779
Approved by: https://github.com/albanD
2024-06-04 02:12:17 +00:00
ddef7c350f Add comments about runner labels (#127827)
To distinguish between org-wide and repo-specific runners, as well as highlight where they are hosted (by DevInfra, LF or various partners).

Delete unused `bm-runner`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127827
Approved by: https://github.com/huydhn
2024-06-04 02:06:43 +00:00
1208347d09 [inductor][ez] fix loop ordering test (#127807)
I didn't realize that the main block is not being run when inductor tests are being run in FBCode via remote GPUs. This is a quick fix. I've tested it in both OSS and FBCode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127807
Approved by: https://github.com/eellison, https://github.com/jansel
2024-06-04 01:14:34 +00:00
41033a4274 PyPI: fix link to images to be rendered (#127798)
This addresses a long-pending issue on PyPI. The [package description](https://pypi.org/project/torch/2.3.0/) is the repo's README, but unlike GitHub's rendering, PyPI only renders images that are linked as raw images via Markdown image syntax.
![image](https://github.com/pytorch/pytorch/assets/6035284/1d8e51d5-c8c1-4f92-b323-f7684879adb4)
This minor link edit turns the images into raw image links so that they render correctly on PyPI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127798
Approved by: https://github.com/albanD
2024-06-04 00:59:58 +00:00
cyy
05fa05cbae [2/N] Change static functions in headers to inline (#127764)
Follows #127727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127764
Approved by: https://github.com/Skylion007
2024-06-04 00:49:04 +00:00
dbf39a6e63 [inductor] fix linear_add_bias path (#127597)
Previously, the `linear_add_bias` path did not work.
This PR fixes it and adds more unit tests for it.

**TestPlan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_add_bias
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127597
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-04 00:39:01 +00:00
b42cfcabc4 Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946)
PyTorch can't take `fbgemm_gpu` as a dependency because `fbgemm_gpu` already depends on PyTorch. So this PR copies the following kernels from `fbgemm_gpu`:
* `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()`
* `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()`
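
For readers unfamiliar with the layout, the two conversions can be described with a pure-Python reference. This is a sketch of the data layout only, assuming a flat `values` list plus `offsets`, and assuming that rows longer than `max_len` are truncated; the function names and signatures are illustrative and do not match the ATen ops or the CUDA kernels:

```python
def jagged_to_padded_dense(values, offsets, max_len, padding_value=0.0):
    """Reference conversion in pure Python.

    values:  flat list holding every row's elements back to back
    offsets: row i occupies values[offsets[i]:offsets[i + 1]]
    """
    dense = []
    for start, end in zip(offsets, offsets[1:]):
        row = values[start:end][:max_len]              # truncate long rows (assumption)
        dense.append(row + [padding_value] * (max_len - len(row)))
    return dense


def padded_dense_to_jagged(dense, offsets):
    """Inverse direction: drop the padding using the same offsets."""
    values = []
    for row, (start, end) in zip(dense, zip(offsets, offsets[1:])):
        values.extend(row[: end - start])
    return values


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
offsets = [0, 2, 3, 6]                                  # rows of length 2, 1, 3

dense = jagged_to_padded_dense(values, offsets, max_len=3)
print(dense)
print(padded_dense_to_jagged(dense, offsets) == values)
```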

CPU impls for these new ATen ops will be added in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946
Approved by: https://github.com/davidberard98
2024-06-03 23:41:54 +00:00
eqy
ac568fc007 [CUDNN] Remove defunct cuDNN V8 API build flag (#120006)
The flag basically does nothing following #95722

Let's see if the quantization tests break

CC @malfet @atalmanagement

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120006
Approved by: https://github.com/malfet
2024-06-03 22:42:05 +00:00
0e7bd7fedd [ROCm] TunableOp improvements (#124362)
- use less memory; smaller default hipblaslt workspace size
- options to avoid cache effects
  - icache flush option
  - rotating buffers during tuning
- python APIs
- unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124362
Approved by: https://github.com/xw285cornell
2024-06-03 22:30:11 +00:00
0f1f0d3015 Onboard ARM bfloat16 to gemv fast path (#127484)
Summary: Used bfloat16 dot support from #127477 to write a bfloat16 transposed fast path and integrated it.

Test Plan: Ran https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py before and after on my Apple M1 Pro.
Before:
```
mv_nt    torch.float32    6.77 usec
mv_nt    torch.float16    8.24 usec
mv_nt   torch.bfloat16  184.74 usec
mv_ta    torch.float32    5.71 usec
mv_ta    torch.float16   27.95 usec
mv_ta   torch.bfloat16   98.06 usec
notrans  torch.float32    5.55 usec
notrans  torch.float16   25.11 usec
notrans torch.bfloat16   63.55 usec
trans_a  torch.float32    5.62 usec
trans_a  torch.float16   74.48 usec
trans_a torch.bfloat16  313.19 usec
trans_b  torch.float32    5.68 usec
trans_b  torch.float16    8.18 usec
trans_b torch.bfloat16   14.96 usec
```

After:
```
mv_nt    torch.float32    5.40 usec
mv_nt    torch.float16    8.25 usec
mv_nt   torch.bfloat16   12.81 usec
mv_ta    torch.float32    5.69 usec
mv_ta    torch.float16   27.94 usec
mv_ta   torch.bfloat16   98.18 usec
notrans  torch.float32    5.60 usec
notrans  torch.float16   25.17 usec
notrans torch.bfloat16   63.22 usec
trans_a  torch.float32    5.61 usec
trans_a  torch.float16   69.32 usec
trans_a torch.bfloat16  316.62 usec
trans_b  torch.float32    5.60 usec
trans_b  torch.float16    8.09 usec
trans_b torch.bfloat16   14.61 usec
```

Note the large improvement in the mv_nt torch.bfloat16 case.
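
The size of that improvement follows directly from the two tables above. A quick calculation, using only the mv_nt torch.bfloat16 timings already listed:

```python
before_usec = 184.74   # mv_nt torch.bfloat16, "Before" table
after_usec = 12.81     # mv_nt torch.bfloat16, "After" table

speedup = before_usec / after_usec
print(f"mv_nt bfloat16 speedup: {speedup:.1f}x")   # about 14.4x
```
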
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127484
Approved by: https://github.com/malfet
ghstack dependencies: #127477, #127478
2024-06-03 22:14:16 +00:00
f6ca822366 Patch ARM Half use_gemv_fast_path gate to avoid kernel duplication (#127478)
Summary: The existing code didn't gate the fast path, so the fast path had to duplicate the stock kernel. Now we gate it and delete the duplicate kernel.

Test Plan: Existing tests. Flipped the TORCH_INTERNAL_ASSERT_DEBUG_ONLY to non-debug and forced to fail (locally) to make sure we had test coverage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127478
Approved by: https://github.com/malfet
ghstack dependencies: #127477
2024-06-03 22:14:16 +00:00
6faa3d5f18 Onboard ARM bfloat16 to gemm-by-dot-product-for-gemm_transa_ infrastructure (#127477)
Summary: This gets us a baseline level of reasonable performance for
bfloat16 matrix-vector and matrix-matrix multiplication on my Apple
M1. I've intentionally left using intrinsics for future work.

Test Plan: Used
https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py
(modified to run larger sizes) to benchmark a range of LLM-interesting
matrix-vector and matrix-matrix sizes on my Apple M1 Pro. bfloat16 performance is
improved across the board (except possibly for very small cases) and
now exceeds float32 performance (as it should) for the matrix-vector
cases.

Before:
```
Matrix-vector:
m=8, n=128, k=1
====================
trans_b  torch.float32    0.75 usec
trans_b  torch.float16    0.71 usec
trans_b torch.bfloat16    0.81 usec
m=128, n=8, k=1
====================
trans_b  torch.float32    0.75 usec
trans_b  torch.float16    0.93 usec
trans_b torch.bfloat16    0.98 usec
m=4096, n=4096, k=1
====================
trans_b  torch.float32 2194.31 usec
trans_b  torch.float16  661.27 usec
trans_b torch.bfloat16 3758.42 usec
m=11008, n=4096, k=1
====================
trans_b  torch.float32 5792.04 usec
trans_b  torch.float16 1789.98 usec
trans_b torch.bfloat16 10120.67 usec
m=4096, n=11008, k=1
====================
trans_b  torch.float32 6101.22 usec
trans_b  torch.float16 1927.34 usec
trans_b torch.bfloat16 10469.47 usec
m=32000, n=4096, k=1
====================
trans_b  torch.float32 18353.20 usec
trans_b  torch.float16 5161.06 usec
trans_b torch.bfloat16 29601.69 usec

Matrix-matrix (prompt len 4:
m=8, n=128, k=4
====================
trans_b  torch.float32    2.14 usec
trans_b  torch.float16    0.85 usec
trans_b torch.bfloat16    1.19 usec
m=128, n=8, k=4
====================
trans_b  torch.float32    1.47 usec
trans_b  torch.float16    1.85 usec
trans_b torch.bfloat16    1.75 usec
m=4096, n=4096, k=4
====================
trans_b  torch.float32 4416.40 usec
trans_b  torch.float16 2688.36 usec
trans_b torch.bfloat16 14987.33 usec
m=11008, n=4096, k=4
====================
trans_b  torch.float32 6140.24 usec
trans_b  torch.float16 7467.26 usec
trans_b torch.bfloat16 40295.52 usec
m=4096, n=11008, k=4
====================
trans_b  torch.float32 6143.10 usec
trans_b  torch.float16 7298.04 usec
trans_b torch.bfloat16 41393.43 usec
m=32000, n=4096, k=4
====================
trans_b  torch.float32 17650.72 usec
trans_b  torch.float16 21346.63 usec
trans_b torch.bfloat16 116849.98 usec

Matrix-matrix (prompt len 8:
m=8, n=128, k=8
====================
trans_b  torch.float32    1.05 usec
trans_b  torch.float16    1.03 usec
trans_b torch.bfloat16    1.69 usec
m=128, n=8, k=8
====================
trans_b  torch.float32    2.05 usec
trans_b  torch.float16    3.08 usec
trans_b torch.bfloat16    2.95 usec
m=4096, n=4096, k=8
====================
trans_b  torch.float32 2323.99 usec
trans_b  torch.float16 5265.45 usec
trans_b torch.bfloat16 29942.40 usec
m=11008, n=4096, k=8
====================
trans_b  torch.float32 6202.01 usec
trans_b  torch.float16 14677.90 usec
trans_b torch.bfloat16 80625.18 usec
m=4096, n=11008, k=8
====================
trans_b  torch.float32 6112.05 usec
trans_b  torch.float16 14340.52 usec
trans_b torch.bfloat16 82799.99 usec
m=32000, n=4096, k=8
====================
trans_b  torch.float32 17650.65 usec
trans_b  torch.float16 42551.43 usec
trans_b torch.bfloat16 236081.08 usec

Matrix-matrix (prompt len 16:
m=8, n=128, k=16
====================
trans_b  torch.float32    1.26 usec
trans_b  torch.float16    1.34 usec
trans_b torch.bfloat16    2.69 usec
m=128, n=8, k=16
====================
trans_b  torch.float32    1.60 usec
trans_b  torch.float16    5.81 usec
trans_b torch.bfloat16    5.34 usec
m=4096, n=4096, k=16
====================
trans_b  torch.float32 2328.05 usec
trans_b  torch.float16 10526.58 usec
trans_b torch.bfloat16 60028.28 usec
m=11008, n=4096, k=16
====================
trans_b  torch.float32 6243.35 usec
trans_b  torch.float16 28505.08 usec
trans_b torch.bfloat16 163670.15 usec
m=4096, n=11008, k=16
====================
trans_b  torch.float32 5870.11 usec
trans_b  torch.float16 28597.89 usec
trans_b torch.bfloat16 165404.88 usec
m=32000, n=4096, k=16
====================
trans_b  torch.float32 17746.27 usec
trans_b  torch.float16 83393.87 usec
trans_b torch.bfloat16 472313.13 usec

Matrix-matrix (prompt len 32:
m=8, n=128, k=32
====================
trans_b  torch.float32    1.35 usec
trans_b  torch.float16    2.01 usec
trans_b torch.bfloat16    4.68 usec
m=128, n=8, k=32
====================
trans_b  torch.float32    1.19 usec
trans_b  torch.float16   10.98 usec
trans_b torch.bfloat16   10.13 usec
m=4096, n=4096, k=32
====================
trans_b  torch.float32 2525.29 usec
trans_b  torch.float16 23106.71 usec
trans_b torch.bfloat16 122987.04 usec
m=11008, n=4096, k=32
====================
trans_b  torch.float32 6131.34 usec
trans_b  torch.float16 57537.41 usec
trans_b torch.bfloat16 327825.00 usec
m=4096, n=11008, k=32
====================
trans_b  torch.float32 6395.01 usec
trans_b  torch.float16 57456.33 usec
trans_b torch.bfloat16 331325.58 usec
m=32000, n=4096, k=32
====================
trans_b  torch.float32 19078.68 usec
trans_b  torch.float16 167735.08 usec
trans_b torch.bfloat16 975736.88 usec

Matrix-matrix (prompt len 128:
m=8, n=128, k=128
====================
trans_b  torch.float32    2.40 usec
trans_b  torch.float16    6.07 usec
trans_b torch.bfloat16   16.83 usec
m=128, n=8, k=128
====================
trans_b  torch.float32    1.78 usec
trans_b  torch.float16   40.35 usec
trans_b torch.bfloat16   37.21 usec
m=4096, n=4096, k=128
====================
trans_b  torch.float32 4827.60 usec
trans_b  torch.float16 84341.24 usec
trans_b torch.bfloat16 478917.75 usec
m=11008, n=4096, k=128
====================
trans_b  torch.float32 11879.96 usec
trans_b  torch.float16 226484.33 usec
trans_b torch.bfloat16 1289465.50 usec
m=4096, n=11008, k=128
====================
trans_b  torch.float32 10707.75 usec
trans_b  torch.float16 229200.58 usec
trans_b torch.bfloat16 1327416.67 usec
m=32000, n=4096, k=128
====================
trans_b  torch.float32 33306.32 usec
trans_b  torch.float16 662898.21 usec
trans_b torch.bfloat16 3815866.63 usec
```

After:
```
Matrix-vector:
m=8, n=128, k=1
====================
trans_b  torch.float32    0.77 usec
trans_b  torch.float16    0.72 usec
trans_b torch.bfloat16    0.77 usec
m=128, n=8, k=1
====================
trans_b  torch.float32    0.73 usec
trans_b  torch.float16    0.93 usec
trans_b torch.bfloat16    1.56 usec
m=4096, n=4096, k=1
====================
trans_b  torch.float32 2195.22 usec
trans_b  torch.float16  675.40 usec
trans_b torch.bfloat16 1038.29 usec
m=11008, n=4096, k=1
====================
trans_b  torch.float32 5980.27 usec
trans_b  torch.float16 1806.08 usec
trans_b torch.bfloat16 2756.46 usec
m=4096, n=11008, k=1
====================
trans_b  torch.float32 6339.95 usec
trans_b  torch.float16 1844.71 usec
trans_b torch.bfloat16 2726.52 usec
m=32000, n=4096, k=1
====================
trans_b  torch.float32 18137.17 usec
trans_b  torch.float16 6020.75 usec
trans_b torch.bfloat16 8612.89 usec

Matrix-matrix (prompt len 4:
m=8, n=128, k=4
====================
trans_b  torch.float32    2.24 usec
trans_b  torch.float16    0.91 usec
trans_b torch.bfloat16    1.07 usec
m=128, n=8, k=4
====================
trans_b  torch.float32    1.58 usec
trans_b  torch.float16    1.96 usec
trans_b torch.bfloat16    2.11 usec
m=4096, n=4096, k=4
====================
trans_b  torch.float32 4583.43 usec
trans_b  torch.float16 3014.04 usec
trans_b torch.bfloat16 4434.04 usec
m=11008, n=4096, k=4
====================
trans_b  torch.float32 6245.55 usec
trans_b  torch.float16 7513.82 usec
trans_b torch.bfloat16 11207.80 usec
m=4096, n=11008, k=4
====================
trans_b  torch.float32 6096.22 usec
trans_b  torch.float16 7688.82 usec
trans_b torch.bfloat16 11143.72 usec
m=32000, n=4096, k=4
====================
trans_b  torch.float32 17982.88 usec
trans_b  torch.float16 22001.28 usec
trans_b torch.bfloat16 32470.62 usec

Matrix-matrix (prompt len 8:
m=8, n=128, k=8
====================
trans_b  torch.float32    1.05 usec
trans_b  torch.float16    1.02 usec
trans_b torch.bfloat16    1.44 usec
m=128, n=8, k=8
====================
trans_b  torch.float32    2.07 usec
trans_b  torch.float16    3.10 usec
trans_b torch.bfloat16    3.38 usec
m=4096, n=4096, k=8
====================
trans_b  torch.float32 2245.43 usec
trans_b  torch.float16 5597.87 usec
trans_b torch.bfloat16 8775.08 usec
m=11008, n=4096, k=8
====================
trans_b  torch.float32 6227.68 usec
trans_b  torch.float16 15102.41 usec
trans_b torch.bfloat16 22457.37 usec
m=4096, n=11008, k=8
====================
trans_b  torch.float32 6082.16 usec
trans_b  torch.float16 15131.57 usec
trans_b torch.bfloat16 21860.15 usec
m=32000, n=4096, k=8
====================
trans_b  torch.float32 19659.00 usec
trans_b  torch.float16 45075.64 usec
trans_b torch.bfloat16 67746.75 usec

Matrix-matrix (prompt len 16:
m=8, n=128, k=16
====================
trans_b  torch.float32    1.31 usec
trans_b  torch.float16    1.41 usec
trans_b torch.bfloat16    2.04 usec
m=128, n=8, k=16
====================
trans_b  torch.float32    1.66 usec
trans_b  torch.float16    5.76 usec
trans_b torch.bfloat16    6.37 usec
m=4096, n=4096, k=16
====================
trans_b  torch.float32 2271.34 usec
trans_b  torch.float16 11198.46 usec
trans_b torch.bfloat16 16893.54 usec
m=11008, n=4096, k=16
====================
trans_b  torch.float32 6266.85 usec
trans_b  torch.float16 29342.49 usec
trans_b torch.bfloat16 45159.22 usec
m=4096, n=11008, k=16
====================
trans_b  torch.float32 5999.16 usec
trans_b  torch.float16 29157.43 usec
trans_b torch.bfloat16 43295.81 usec
m=32000, n=4096, k=16
====================
trans_b  torch.float32 18028.83 usec
trans_b  torch.float16 89626.88 usec
trans_b torch.bfloat16 128164.62 usec

Matrix-matrix (prompt len 32:
m=8, n=128, k=32
====================
trans_b  torch.float32    1.38 usec
trans_b  torch.float16    2.03 usec
trans_b torch.bfloat16    3.29 usec
m=128, n=8, k=32
====================
trans_b  torch.float32    1.24 usec
trans_b  torch.float16   10.58 usec
trans_b torch.bfloat16   11.97 usec
m=4096, n=4096, k=32
====================
trans_b  torch.float32 2591.56 usec
trans_b  torch.float16 21683.62 usec
trans_b torch.bfloat16 32657.68 usec
m=11008, n=4096, k=32
====================
trans_b  torch.float32 6468.43 usec
trans_b  torch.float16 57811.33 usec
trans_b torch.bfloat16 89263.21 usec
m=4096, n=11008, k=32
====================
trans_b  torch.float32 6034.74 usec
trans_b  torch.float16 59372.56 usec
trans_b torch.bfloat16 88107.85 usec
m=32000, n=4096, k=32
====================
trans_b  torch.float32 18609.27 usec
trans_b  torch.float16 167298.00 usec
trans_b torch.bfloat16 255116.37 usec

Matrix-matrix (prompt len 128:
m=8, n=128, k=128
====================
trans_b  torch.float32    2.44 usec
trans_b  torch.float16    6.11 usec
trans_b torch.bfloat16   10.92 usec
m=128, n=8, k=128
====================
trans_b  torch.float32    1.80 usec
trans_b  torch.float16   40.26 usec
trans_b torch.bfloat16   44.82 usec
m=4096, n=4096, k=128
====================
trans_b  torch.float32 4773.29 usec
trans_b  torch.float16 84458.54 usec
trans_b torch.bfloat16 131248.58 usec
m=11008, n=4096, k=128
====================
trans_b  torch.float32 12249.16 usec
trans_b  torch.float16 234411.87 usec
trans_b torch.bfloat16 351970.71 usec
m=4096, n=11008, k=128
====================
trans_b  torch.float32 11439.24 usec
trans_b  torch.float16 233347.04 usec
trans_b torch.bfloat16 354475.96 usec
m=32000, n=4096, k=128
====================
trans_b  torch.float32 33803.03 usec
trans_b  torch.float16 688157.54 usec
trans_b torch.bfloat16 1048221.42 usec
```
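To compare the two tables, the k=128 rows can be divided row by row. The sketch below assumes that the three "before" groups directly preceding `After:` are the prompt-length-128 results (they share k=128); the dictionary names are illustrative only.

```python
# Timings in usec, copied from the k=128 rows of the benchmark output.
# Assumption: the "before" rows directly preceding "After:" belong to the
# prompt-length-128 run, since they use the same m, n, k values.
before = {
    ("11008x4096", "float32"): 11879.96,
    ("11008x4096", "float16"): 226484.33,
    ("11008x4096", "bfloat16"): 1289465.50,
    ("4096x11008", "float32"): 10707.75,
    ("4096x11008", "float16"): 229200.58,
    ("4096x11008", "bfloat16"): 1327416.67,
    ("32000x4096", "float32"): 33306.32,
    ("32000x4096", "float16"): 662898.21,
    ("32000x4096", "bfloat16"): 3815866.63,
}
after = {
    ("11008x4096", "float32"): 12249.16,
    ("11008x4096", "float16"): 234411.87,
    ("11008x4096", "bfloat16"): 351970.71,
    ("4096x11008", "float32"): 11439.24,
    ("4096x11008", "float16"): 233347.04,
    ("4096x11008", "bfloat16"): 354475.96,
    ("32000x4096", "float32"): 33803.03,
    ("32000x4096", "float16"): 688157.54,
    ("32000x4096", "bfloat16"): 1048221.42,
}

# Ratio > 1 means the "after" build is faster for that shape and dtype.
speedup = {key: before[key] / after[key] for key in before}

for (shape, dtype), ratio in sorted(speedup.items()):
    print(f"m x n = {shape:<11} {dtype:<9} {ratio:5.2f}x")
```

On these rows the bfloat16 ratio is above 3.5x for every shape, while float32 and float16 stay close to 1x.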

Also ran the stock configuration; it was unchanged, indicating that we need to integrate this path with torch.mv separately, which will come in a follow-up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127477
Approved by: https://github.com/malfet
2024-06-03 22:14:10 +00:00
01fc22056a [BE] enable UFMT for torch/masked/ (#127715)
Part of #123062

- #123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127715
Approved by: https://github.com/cpuhrsch
2024-06-03 22:01:49 +00:00
406532f864 [AMD] Fix power_draw api (#127729)
Summary: `average_socket_power` only returns NA, so we need to change it to `current_socket_power`.

Test Plan: Before, `torch.cuda.power_draw` gives NA; after, it gives the right power reading (e.g. 441).

Differential Revision: D58047484

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127729
Approved by: https://github.com/nmacchioni, https://github.com/eqy
2024-06-03 21:46:50 +00:00
c27882ffa8 [torchbind] always fakify script object by default in non-strict export (#127116)
This diff can be risky for internal tests: any torchbind class that hasn't registered a fake class will fail and we should fix them. We've gained some confidence that this can work e2e by implementing FakeTensorQueue for TBE models in sigmoid with [D54210823](https://www.internalfb.com/diff/D54210823).

Differential Revision: [D57991002](https://our.internmc.facebook.com/intern/diff/D57991002)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127116
Approved by: https://github.com/zou3519
ghstack dependencies: #127113, #127114
2024-06-03 21:38:57 +00:00
3efac92888 [torchbind] support torch.compile with aot_eager backend (#127114)
Differential Revision: [D57991001](https://our.internmc.facebook.com/intern/diff/D57991001)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127114
Approved by: https://github.com/zou3519
ghstack dependencies: #127113
2024-06-03 21:38:57 +00:00
c6dc624690 [torchbind] remove test cases that don't fakify script objects (#127113)
As titled.

Differential Revision: [D57991003](https://our.internmc.facebook.com/intern/diff/D57991003)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127113
Approved by: https://github.com/zou3519
2024-06-03 21:38:50 +00:00
6d4ec9b2ec [RFC] Introduce Checkpointable for DCP (#127540) (#127628)
Summary:
# Introduce Checkpointable interface for DCP to support arbitrary tensor subclasses for checkpointing

**Authors:**
* zainhuda

## **Summary**
This diff adds a CheckpointableTensor interface to allow for future compatibility for any tensor subclass with DCP in a clean and maintainable way.

## **Motivation**
For TorchRec sharding migration from ShardedTensor to DTensor, we create a tensor subclass that is stored by DTensor to support TorchRec's sharding schemes (e.g., empty shards, multiple shards on a rank).

## **Proposed Implementation**
View the CheckpointableTensor interface implementation, in which we introduce the minimal set of methods needed to be compatible with DCP. These methods are expected to be implemented by any tensor subclass, which then becomes checkpointable by DCP.

## **Drawbacks**
No drawbacks, it extends functionality in a clean and maintainable way.

## **Alternatives**
The alternative design was to add code paths that check for certain attributes in tensor subclasses, which can get messy and make it hard to maintain or understand why they were there in the first place.

Test Plan:
Sandcastle

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k LucasLLC

Differential Revision: D57970603

Pulled By: iamzainhuda

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127628
Approved by: https://github.com/wz337, https://github.com/XilunWu, https://github.com/fegin
2024-06-03 21:21:55 +00:00
a4064da8ca Always simplify sympy expressions before printing. (#127543)
This is important because if a replacement has happened during inductor lowering, we may have stale symbols in sympy expressions that we need to replace away.  Do this at the very end.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127543
Approved by: https://github.com/lezcano
2024-06-03 20:36:14 +00:00
ef9451ac8d Move the build of AOTriton to base ROCM docker image. (#127012)
Mitigates #126111

AOTriton, as a math library, takes a long time to build. However, the library itself does not move as fast as PyTorch, so it is not cost-efficient to build it for every CI check.

This PR moves the build of AOTriton from PyTorch to its base docker image, avoiding duplicated and long build times.

Pre-this-PR:
* PyTorch base docker build job duration: 1.1-1.3h
* PyTorch build job duration: 1.4-1.5hr (includes AOTriton build time of 1hr6min on a linux.2xlarge node)

Post-this-PR:
* PyTorch base docker build job duration: 1.3h (includes AOTriton build time of 20min on a linux.12xlarge node)
* PyTorch build job duration: <20 min

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127012
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/huydhn
2024-06-03 20:35:22 +00:00
941316f821 [pipelining] Stress test schedules with multi iters (#127475)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127475
Approved by: https://github.com/wconstab
2024-06-03 20:24:07 +00:00
db9d457a3f Use sleef on macOS Apple silicon by default (#126509)
Use sleef ~~for aarch64~~ on macOS Apple silicon by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126509
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-06-03 19:33:06 +00:00
2fc907971a Revert "[Inductor] FlexAttention backward kernel optimization (#127208)"
This reverts commit f7171313abf14d9501a330457140b2f8a01c9985.

Reverted https://github.com/pytorch/pytorch/pull/127208 on behalf of https://github.com/yanboliang due to test_flex_attention is failing internally ([comment](https://github.com/pytorch/pytorch/pull/127208#issuecomment-2145830810))
2024-06-03 18:13:27 +00:00
3f45fa63f2 Revert "[Inductor] Add FlexAttention backward kernel dynamic shape tests (#127728)"
This reverts commit 10e3406ea5d115a54a7d753d33110762eb6c07ff.

Reverted https://github.com/pytorch/pytorch/pull/127728 on behalf of https://github.com/yanboliang due to Internal breakage of https://github.com/pytorch/pytorch/pull/127208 hence reverting ([comment](https://github.com/pytorch/pytorch/pull/127728#issuecomment-2145822667))
2024-06-03 18:10:46 +00:00
c35b65715c Revert "[Inductor][Flex-attention] Support different sequence lengths for Query and Key/Value (#127678)"
This reverts commit e2e3ca94ccce1c0abbfd75ac0368793e1756c268.

Reverted https://github.com/pytorch/pytorch/pull/127678 on behalf of https://github.com/atalman due to Internal breakage of https://github.com/pytorch/pytorch/pull/127208 hence reverting ([comment](https://github.com/pytorch/pytorch/pull/127678#issuecomment-2145821489))
2024-06-03 18:07:57 +00:00
3437177e2b Quick Fix on #126854, deepcopy lr and other possible base_parameters (#127190)
* Apply `deepcopy` to every base parameter (`initial_lr`, `max_lr`) when instantiating `LRScheduler`.

Fixes #126854
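The aliasing problem that a `deepcopy` avoids can be illustrated without PyTorch. This is a minimal sketch, assuming the learning rate is a mutable object (for example a tensor) that gets updated in place; `MutableLR` and both `init_group_*` helpers are hypothetical names, not the scheduler's actual code.

```python
import copy


class MutableLR:
    """Hypothetical stand-in for a mutable learning-rate value (e.g. a tensor)."""

    def __init__(self, value):
        self.value = value

    def scale_(self, factor):
        # In-place update, like an in-place tensor op.
        self.value *= factor
        return self


def init_group_aliased(group):
    # Stores a reference: initial_lr and lr are the same object.
    group.setdefault("initial_lr", group["lr"])
    return group


def init_group_copied(group):
    # Stores an independent copy, so later in-place updates to lr
    # cannot change the recorded initial value.
    group.setdefault("initial_lr", copy.deepcopy(group["lr"]))
    return group


aliased = init_group_aliased({"lr": MutableLR(1.0)})
aliased["lr"].scale_(0.5)

copied = init_group_copied({"lr": MutableLR(1.0)})
copied["lr"].scale_(0.5)

print(aliased["initial_lr"].value, copied["initial_lr"].value)
```

In the aliased case the recorded initial value drifts together with the live learning rate; in the copied case it stays fixed.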

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127190
Approved by: https://github.com/janeyx99
2024-06-03 18:06:31 +00:00
d8d0bf264a Inductor: Allow small sizes of m for mixed mm autotuning (#127663)
For mixed mm with small sizes of m, such as in the example provided in #127056, being able to set BLOCK_M to 16 leads to better performance. This PR introduces kernel configs that are specific to mixed mm by extending the mm configs with two configs that work well for the example provided in #127056.
I am excluding configs with (BLOCK_M=16, BLOCK_K=16, BLOCK_N=64) because triton crashes when this config is used.

For the example in #127056:
- Without my changes, skip_triton is evaluated to true which disables autotuning. On my machine I achieve 146GB/s.
- If autotuning is enabled, but BLOCK_M>=32, I achieve 614 GB/s.
- With the changes in this PR (i.e. autotuning enabled and BLOCK_M=16), I achieve 772 GB/s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127663
Approved by: https://github.com/Chillee
2024-06-03 17:53:48 +00:00
7c3740d388 [NestedTensor] Extend coverage for unbind when ragged_idx != 1 (#127493)
Summary:
Extend coverage for the `NestedTensor` `unbind` operator to cases in which `ragged_idx != 1`.

Currently, the `unbind` operator in the `NestedTensor` class splits a tensor along the 0-th dimension, where the `ragged_idx` property, which controls the jagged dimension upon which `unbind` splits, is 1. This diff extends support for `ragged_idx != 1` in `NestedTensor`s, allowing `unbind` to split a tensor along a jagged dimension greater than 0 for `NestedTensor`s with and without the `lengths` property.
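The splitting rule can be sketched in plain Python. This is an illustration only, assuming a packed 2-D buffer without the batch dimension, an `offsets` list marking component boundaries, and an optional `lengths` list; `unbind_jagged` is a hypothetical helper, not the `NestedTensor` implementation.

```python
def unbind_jagged(values, offsets, ragged_idx=1, lengths=None):
    """Split a packed 2-D nested list into per-component pieces.

    The packed buffer has no batch dimension, so the jagged dimension
    inside `values` is assumed to be `ragged_idx - 1`.
    """
    dim = ragged_idx - 1
    if dim not in (0, 1):
        raise ValueError("ragged_idx must be 1 or 2 for a 2-D buffer")
    pieces = []
    for i in range(len(offsets) - 1):
        start = offsets[i]
        stop = offsets[i + 1] if lengths is None else start + lengths[i]
        if stop > offsets[i + 1]:
            raise ValueError("lengths exceed the range given by offsets")
        if dim == 0:
            pieces.append(values[start:stop])  # slice rows
        else:
            pieces.append([row[start:stop] for row in values])  # slice columns
    return pieces


rows = unbind_jagged([[1, 2], [3, 4], [5, 6]], offsets=[0, 1, 3], ragged_idx=1)
cols = unbind_jagged([[1, 2, 3], [4, 5, 6]], offsets=[0, 1, 3], ragged_idx=2)
print(rows)
print(cols)
```

The two error branches mirror the failure cases that the tests below exercise: a jagged dimension of 0 and lengths that run past the offsets.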

Test Plan:
Added the following unit tests:

`test_unbind_ragged_idx_equals_2_cpu`, `test_unbind_ragged_idx_equals_3_cpu`, and `test_unbind_ragged_idx_equals_last_dim_cpu` verify that `unbind` works for all jagged dimensions greater than 1, for `NestedTensor`s without `lengths`.
```
test_unbind_ragged_idx_equals_2_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_ragged_idx_equals_3_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_ragged_idx_equals_last_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

`test_unbind_with_lengths_cpu` and `test_unbind_with_lengths_ragged_idx_equals_1_cpu` verify that `unbind` works when the jagged dimension is 1, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_with_lengths_ragged_idx_equals_1_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

`test_unbind_with_lengths_ragged_idx_equals_2_cpu` and `test_unbind_with_lengths_ragged_idx_equals_3_cpu` verify that `unbind` works when the jagged dimension is greater than 1, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_2_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_with_lengths_ragged_idx_equals_3_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

`test_unbind_with_lengths_ragged_idx_equals_0_cpu` verifies that `unbind` fails when the jagged dimension is 0 (the batch dimension), for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_0_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

`test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu` verifies that `unbind` fails when there is a mismatch between the offsets and the jagged dimension, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

`test_unbind_with_wrong_lengths_cpu` verifies that `unbind` fails when the lengths exceed the limitations set by offsets, for `NestedTensor`s with `lengths`.

```
test_unbind_with_wrong_lengths_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

Differential Revision: D57942686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127493
Approved by: https://github.com/davidberard98
2024-06-03 17:46:12 +00:00
4d32de14b6 [export] Handle serializing duplicate getitem nodes (#127633)
We ran into a graph that looks something like the following, where we have 2 getitem calls to the same index (%getitem, %getitem_2 both query topk[0]):
```
graph():
    %x : [num_users=1] = placeholder[target=x]
    %topk : [num_users=3] = call_function[target=torch.ops.aten.topk.default](args = (%x, 2), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%topk, 0), kwargs = {})
    %getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%topk, 1), kwargs = {})
    %getitem_2 : [num_users=1] = call_function[target=operator.getitem](args = (%topk, 0), kwargs = {})
    %mul_tensor : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%getitem, %getitem_2), kwargs = {})
    %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_tensor, 2), kwargs = {})
    return (mul, getitem_1)
```

The duplicate getitem call gets created during a pass, so there are a couple of solutions:

1. Change serializer to support the case of duplicate getitem calls
2. Change the pass so that it doesn’t produce duplicate getitem calls
3. Add a pass which dedups the getitem calls

As a framework, we should do 1 and 3 (through a CSE pass).

This PR implements solution 1. However, the serializer currently does some special handling for getitem nodes: instead of directly serializing the getitem nodes, we serialize the output of the node that outputs a list of tensors (the %topk node in this example) into a list of nodes, one for each output ([%getitem, %getitem_1]). This fails when we have duplicate getitem nodes for the same index (%getitem_2), since we do not record that duplicate getitem node anywhere. So, the solution this PR takes is that the serializer will deduplicate the getitem nodes (%getitem_2 will be replaced with %getitem). This results in a semantically correct graph, but not necessarily one that is node-to-node identical to the original fx graph.
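The deduplication idea can be sketched on a toy graph. This is a simplified illustration, assuming nodes are `(name, op, args)` tuples in topological order; `dedup_getitems` is a hypothetical helper and does not reflect the serializer's real data structures.

```python
def dedup_getitems(nodes):
    """Drop duplicate getitem nodes and rewire their users.

    nodes: list of (name, op, args) tuples in topological order.
    Returns the filtered node list and the rename map that was applied.
    """
    first_seen = {}  # (source, index) -> name of the first getitem
    rename = {}      # duplicate getitem name -> canonical name
    kept = []
    for name, op, args in nodes:
        # Rewire arguments that point at an already-removed duplicate.
        args = tuple(rename.get(a, a) for a in args)
        if op == "getitem":
            key = (args[0], args[1])
            if key in first_seen:
                rename[name] = first_seen[key]
                continue  # skip the duplicate node entirely
            first_seen[key] = name
        kept.append((name, op, args))
    return kept, rename


graph = [
    ("x", "placeholder", ()),
    ("topk", "topk", ("x", 2)),
    ("getitem", "getitem", ("topk", 0)),
    ("getitem_1", "getitem", ("topk", 1)),
    ("getitem_2", "getitem", ("topk", 0)),
    ("mul_tensor", "mul", ("getitem", "getitem_2")),
    ("mul", "mul", ("mul_tensor", 2)),
]
deduped, renamed = dedup_getitems(graph)
for node in deduped:
    print(node)
```

After the pass, `mul_tensor` reads `getitem` twice and `getitem_2` no longer appears, which is the "semantically correct but not node-identical" outcome described above.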
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127633
Approved by: https://github.com/ydwu4
2024-06-03 17:25:51 +00:00
12c4a2c297 [BE]: Apply PLR1736 fixes (unnecessary index lookup) (#127716)
Applies the PLR1736 preview rule with some more autofixes to cut down on unnecessary accesses. Added a noqa since that test is actually testing the dunder method.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127716
Approved by: https://github.com/ezyang
2024-06-03 17:22:13 +00:00
21144ce570 [dtensor] implement scatter op with simple replication (#126713)
As titled, implement the torch.scatter op with a simple replication strategy.
We need to follow up and see if we could actually support any sharding
pattern.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126713
Approved by: https://github.com/tianyu-l
ghstack dependencies: #126712
2024-06-03 16:16:28 +00:00
ded580a594 [dtensor] standardize multi mesh-dim strategy with utils (#126712)
This PR standardizes the multi mesh-dim strategy generation by unifying a
util to expand from a single mesh-dim strategy to a multi mesh-dim
strategy, to make strategy generation simpler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126712
Approved by: https://github.com/tianyu-l
2024-06-03 16:16:28 +00:00
d1fad416a8 Revert "Add aten._unsafe_masked_index (#116491)"
This reverts commit f03f8bc901a6c9038308a6353e8d280f4b5628f5.

Reverted https://github.com/pytorch/pytorch/pull/116491 on behalf of https://github.com/PaliC due to breaking onnx tests ([comment](https://github.com/pytorch/pytorch/pull/116491#issuecomment-2145557724))
2024-06-03 15:51:50 +00:00
53f001c599 Revert "correct BLAS input (#126200)" (#127762)
This reverts commit ea13e9a097aaa875a2b404822579b7f8b62ea291.

Looks like this could have caused: https://github.com/pytorch/pytorch/actions/runs/9346105069/job/25722431775#step:17:984

Aarch64 tests failures:
```
+ echo 'Checking that MKLDNN is available on aarch64'
Checking that MKLDNN is available on aarch64
+ pushd /tmp
/tmp /
+ python -c 'import torch; exit(0 if torch.backends.mkldnn.is_available() else 1)'
Error: Process completed with exit code 1.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127762
Approved by: https://github.com/PaliC, https://github.com/malfet
2024-06-03 15:49:48 +00:00
8677508167 [c10d] guard gpu context during abort (#127363)
This is a mitigation for an internal out-of-memory issue on GPU0 that happened during comms abort. This PR was tested internally and fixed the out-of-memory issue.

Note: this is supposed to be a mitigation only. The ideal fix should be within the NCCL comm libs, which should set the right CUDA context before any CUDA call and restore it to its exact previous state.

ncclCommDestroy/ncclCommAbort -> commReclaim -> commDestroySync (https://fburl.com/code/pori1tka)

In commDestroySync, it thinks that the "current device context" is not the same as the comm's device context. It tries to:
1) save the current context
2) set the comm's device context
3) clean things up
4) restore the previously stored context with another cudaSetDevice.
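The save/set/restore sequence above is the usual device-guard pattern. The sketch below shows that pattern in isolation, assuming a fake runtime object in place of CUDA; `FakeRuntime`, `device_guard`, and `abort_comm` are hypothetical names and this is not the c10d change itself.

```python
from contextlib import contextmanager


class FakeRuntime:
    """Hypothetical stand-in for the runtime's per-thread current device."""

    def __init__(self):
        self.current = 0
        self.touched = []  # devices on which work was issued

    def set_device(self, idx):
        self.current = idx

    def do_work(self):
        self.touched.append(self.current)


@contextmanager
def device_guard(runtime, idx):
    previous = runtime.current      # 1) save the current device
    runtime.set_device(idx)         # 2) switch to the target device
    try:
        yield                       # 3) caller does its cleanup here
    finally:
        runtime.set_device(previous)  # 4) restore, even if cleanup raised


def abort_comm(runtime, comm_device):
    # With the guard, cleanup work lands on the communicator's device
    # rather than on whatever device happened to be current (e.g. GPU0).
    with device_guard(runtime, comm_device):
        runtime.do_work()


runtime = FakeRuntime()
abort_comm(runtime, comm_device=3)
print(runtime.touched, runtime.current)
```

Using `try`/`finally` keeps the restore step from being skipped when the guarded cleanup fails.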

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127363
Approved by: https://github.com/wconstab
2024-06-03 15:41:11 +00:00
430cdfc0ac [ATen][Native] fixes sparse SPMV on aarch64 (#127642)
Fixes #127491
In #127491 result was allocated as `result = at::empty(...)`, which does not guarantee that `result` is filled with zeros; therefore `torch.mv` was producing non-finite values. This happened mainly because the corner case (`beta = 0`) of `addmv` was not taken care of, as it should be in any `addmv`/`addmm`:
923edef31c/aten/src/ATen/native/mkl/SparseBlasImpl.cpp (L307-L311)
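Why `beta = 0` needs its own branch can be seen in a few lines of plain Python. This is a sketch of the general `y = beta * y + alpha * (A @ x)` contract, assuming list-of-lists matrices; `addmv_naive` and `addmv_safe` are illustrative names, not the ATen kernels.

```python
import math


def _row_dot(row, x):
    return sum(a * b for a, b in zip(row, x))


def addmv_naive(y, mat, x, alpha=1.0, beta=1.0):
    # Always multiplies the old contents of y by beta.
    # If y is uninitialized memory holding NaN or inf, 0 * NaN is still NaN.
    return [beta * y[i] + alpha * _row_dot(row, x) for i, row in enumerate(mat)]


def addmv_safe(y, mat, x, alpha=1.0, beta=1.0):
    # beta == 0 means "ignore y entirely", so y is never read.
    out = []
    for i, row in enumerate(mat):
        acc = alpha * _row_dot(row, x)
        out.append(acc if beta == 0 else beta * y[i] + acc)
    return out


garbage = [math.nan, math.inf]  # stands in for an at::empty-style buffer
mat = [[1.0, 2.0], [3.0, 4.0]]
vec = [1.0, 1.0]

print(addmv_naive(garbage, mat, vec, beta=0.0))
print(addmv_safe(garbage, mat, vec, beta=0.0))
```

The naive version propagates the garbage because `0 * nan` and `0 * inf` are both `nan`; the safe version returns the plain matrix-vector product.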
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127642
Approved by: https://github.com/malfet
2024-06-03 15:38:27 +00:00
badf898df2 Remove unstable ARC jobs (#127563)
Disable these jobs since we're no longer trying to enable ARC
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127563
Approved by: https://github.com/huydhn
2024-06-03 15:30:06 +00:00
63d7ffe121 Retry of D58015187 Move AsyncCompile to a different file (#127691)
Summary:
This is a retry of https://github.com/pytorch/pytorch/pull/127545/files
and
D58015187, fixing the internal test that also imported codecache

Test Plan: Same tests as CI in github, plus sandcastle for internal unit tests should pass now

Differential Revision: D58054611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127691
Approved by: https://github.com/oulgen
2024-06-03 15:29:41 +00:00
3f8b8f08c8 [Split Build] Make libtorch_global_deps accessible from libtorch wheel (#127570)
Title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127570
Approved by: https://github.com/atalman, https://github.com/malfet
2024-06-03 15:14:29 +00:00
d05cddfe23 Revert "FP8 rowwise scaling (#125204)"
This reverts commit 923edef31c7f3e98a14625724f2019b1422dcb26.

Reverted https://github.com/pytorch/pytorch/pull/125204 on behalf of https://github.com/atalman due to Broke nightlies and internal tests ([comment](https://github.com/pytorch/pytorch/pull/125204#issuecomment-2145422196))
2024-06-03 15:00:21 +00:00
f03f8bc901 Add aten._unsafe_masked_index (#116491)
To generate masked indexing operations that would generate
masked loads in triton code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-06-03 14:44:03 +00:00
d6963e769c Force Inductor output code to be dumped even if it fails to compile (#127700)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127700
Approved by: https://github.com/oulgen
2024-06-03 14:06:53 +00:00
f343f98710 [jit] Validate mobile module fields parsed by flatbuffer loader (#127437)
Fixes an error in the `torch.jit.load` Python API function that causes a crash in the C++ backend of PyTorch.
The mobile module is successfully parsed from the flatbuffer format, but its fields are used without any validation.

Fixes #127434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127437
Approved by: https://github.com/davidberard98
2024-06-03 08:48:12 +00:00
e017b56c0c [dtensor] local_map UX change: keep func signature and be compatible with Tensor input (#126924)
**Summary**
This PR has 2 parts of change in `local_map`:

1. Regulates how the user can access `DeviceMesh` inside the `func` argument of `local_map`. `local_map` now strictly follows the `func` signature without implicitly passing any argument to `func`. If the user wants to use `DeviceMesh` inside `func`, the mesh must be explicitly passed to `func` as an argument by the user. For example,

```
def user_function(device_mesh, /, *args, **kwargs):
    USER CODE HERE

local_func = local_map(func=user_function, ...)
dtensor_out = local_func(device_mesh, dtensor_input, ...)
```

Before this PR, user code was like:
```
def user_function(device_mesh, /, *args, **kwargs):
    USER CODE HERE

local_func = local_map(func=user_function, ...)
dtensor_out = local_func(dtensor_input, ...)  # local_map passes mesh implicitly for user
```

2. `local_map` now supports mixed use of `torch.Tensor` and `DTensor` in its arguments:

- Pure torch.Tensor case: no `DTensor` argument is passed in, all tensor arguments are `torch.Tensor`. Bypass the `in_placements` check and unwrapping steps. The output will not be wrapped into `DTensor` but directly returned.
- Pure DTensor case: no `torch.Tensor` argument is passed in, all tensor arguments are `DTensor`. This follows the default rule: `in_placements` check, unwrapping arguments, pass into `func`, wrapping the `torch.Tensor` output into `DTensor` if the `out_placements` is not `None`.
- Mix of the above two: some arguments are `torch.Tensor` while some are `DTensor`. Only perform `in_placements` check and unwrapping on `DTensor` arguments. For output processing, it's the same as Pure DTensor case.
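The three cases can be summarized as one dispatch rule. The sketch below is a loose illustration that runs without PyTorch: `FakeDTensor` and `local_map_sketch` are hypothetical stand-ins, plain Python values play the role of `torch.Tensor`, and the real `local_map` does considerably more.

```python
class FakeDTensor:
    """Hypothetical stand-in for DTensor: a local value plus its placements."""

    def __init__(self, local, placements):
        self.local = local
        self.placements = placements


def local_map_sketch(func, out_placements, in_placements):
    def wrapped(*args):
        saw_dtensor = False
        local_args = []
        for arg, expected in zip(args, in_placements):
            if isinstance(arg, FakeDTensor):
                saw_dtensor = True
                # Placement check and unwrapping apply to DTensor args only.
                if expected is not None and arg.placements != expected:
                    raise ValueError("placement mismatch")
                local_args.append(arg.local)
            else:
                local_args.append(arg)  # plain values pass through untouched
        out = func(*local_args)
        # Wrap the output only if a DTensor came in and out_placements is set.
        if saw_dtensor and out_placements is not None:
            return FakeDTensor(out, out_placements)
        return out

    return wrapped


add = local_map_sketch(lambda a, b: a + b, out_placements="replicate",
                       in_placements=("shard0", "shard0"))

plain = add(1, 2)                                 # pure plain-value case
mixed = add(FakeDTensor(1, "shard0"), 2)          # mixed case
print(plain, type(mixed).__name__, mixed.local)
```

The pure plain-value call returns the raw result, while any call involving a wrapped argument goes through the check, unwrap, and rewrap steps.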

**Test**
`pytest test/distributed/_tensor/experimental/test_local_map.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126924
Approved by: https://github.com/wanchaol
2024-06-03 08:41:59 +00:00
2d1ad0c31a [CI] Add freezing for cpu inductor accuracy test in inductor CI (#124715)
This PR enables `--freezing` when running the dynamo accuracy check in CI.
Background:
Issue [#124286](https://github.com/pytorch/pytorch/issues/124286) was not caught by CI because freezing is not enabled for cpu-inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124715
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman, https://github.com/desertfire
2024-06-03 07:37:30 +00:00
10e3406ea5 [Inductor] Add FlexAttention backward kernel dynamic shape tests (#127728)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127728
Approved by: https://github.com/Chillee
2024-06-03 07:15:46 +00:00
6d21685b45 [DSD] Fixes various bugs for broadcast_from_rank0 (#127635)
Fixes https://github.com/pytorch/pytorch/issues/126285

Summary:
1. Fixes https://github.com/pytorch/pytorch/issues/126285
2. Broadcast one tensor at a time to avoid OOM.
3. Add some docstrings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127635
Approved by: https://github.com/weifengpy
2024-06-03 06:35:21 +00:00
48846cd164 Update torch-xpu-ops pin (ATen XPU implementation) (#127730)
Regular bi-weekly pin update.
1. Porting operator-related PyTorch unit tests. The existing operators in torch-xpu-ops are covered by 1) operator-specific tests, like test_binary_ufuncs.py, and 2) common operator tests, like test_ops.py.
2. Bug fixes under the latest PyTorch unit test scope, https://github.com/intel/torch-xpu-ops/tree/release/2.4/test/xpu.

In total, 297 ATen operators are implemented in torch-xpu-ops. https://github.com/intel/torch-xpu-ops/blob/release/2.4/yaml/xpu_functions.yaml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127730
Approved by: https://github.com/EikanWang
2024-06-03 05:55:00 +00:00
e2e3ca94cc [Inductor][Flex-attention] Support different sequence lengths for Query and Key/Value (#127678)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127678
Approved by: https://github.com/Chillee
2024-06-03 04:35:50 +00:00
cyy
288df042c5 [1/N] Change static functions in headers to inline (#127727)
So that it may fix some tricky linking issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127727
Approved by: https://github.com/ezyang
2024-06-03 04:34:36 +00:00
cyy
1b182ea0d2 Remove c10::guts::{conjunction,disjunction} (#127726)
They are not used in PyTorch OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127726
Approved by: https://github.com/ezyang
2024-06-03 04:06:21 +00:00
3399ad8d9d [Inductor][CPP] Add UT for bitwise right shift (#127731)
**Summary**
Per the discussion in https://github.com/pytorch/pytorch/issues/127310, `bitwise_right_shift` failed in Torch 2.1 but passes with the latest PyTorch. This PR adds a UT to ensure correctness.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_bitwise_right_shift
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127731
Approved by: https://github.com/Skylion007
2024-06-03 04:05:41 +00:00
7e97b33fbb [Dynamo] Log backward graph compilation metrics (#126629)
Fixes #125313

Compilation metric logs for the code example at #125313:
```
%s CompilationMetrics(compile_id='0/0', frame_key='1', co_name='forward', co_filename='/data/users/ybliang/debug/debug2.py', co_firstlineno=10, cache_size=0, accumulated_cache_size=0, guard_count=11, shape_env_guard_count=0, graph_op_count=1, graph_node_count=3, graph_input_count=1, start_time=1716247236.6165977, entire_frame_compile_time_s=7.926939964294434, backend_compile_time_s=7.887059926986694, inductor_compile_time_s=4.108498811721802, code_gen_time_s=3.97833514213562, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=set(), compliant_custom_ops=set(), restart_reasons={"'skip function graph_break in file /home/ybliang/local/pytorch/torch/_dynamo/decorators.py'"}, dynamo_time_before_restart_s=0.025330543518066406, has_guarded_code=True, is_fwd=True)
%s CompilationMetrics(compile_id='1/0', frame_key='2', co_name='torch_dynamo_resume_in_forward_at_12', co_filename='/data/users/ybliang/debug/debug2.py', co_firstlineno=12, cache_size=0, accumulated_cache_size=0, guard_count=10, shape_env_guard_count=0, graph_op_count=2, graph_node_count=5, graph_input_count=1, start_time=1716247244.544928, entire_frame_compile_time_s=0.10148310661315918, backend_compile_time_s=0.08753013610839844, inductor_compile_time_s=0.03691983222961426, code_gen_time_s=0.022417306900024414, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=set(), compliant_custom_ops=set(), restart_reasons=set(), dynamo_time_before_restart_s=0.0, has_guarded_code=True, is_fwd=True)
tensor([[-0.1622, -0.0000, -0.0000,  0.5643, -0.0000,  0.0000, -0.5087,  0.0914,
         -0.0000, -0.0421]], grad_fn=<CompiledFunctionBackward>)
%s CompilationMetrics(compile_id='1/0', frame_key=None, co_name=None, co_filename=None, co_firstlineno=None, cache_size=None, accumulated_cache_size=None, guard_count=None, shape_env_guard_count=None, graph_op_count=None, graph_node_count=None, graph_input_count=None, start_time=None, entire_frame_compile_time_s=None, backend_compile_time_s=None, inductor_compile_time_s=0.026738643646240234, code_gen_time_s=0.016446352005004883, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=None, compliant_custom_ops=None, restart_reasons=None, dynamo_time_before_restart_s=None, has_guarded_code=None, is_fwd=False)
%s CompilationMetrics(compile_id='0/0', frame_key=None, co_name=None, co_filename=None, co_firstlineno=None, cache_size=None, accumulated_cache_size=None, guard_count=None, shape_env_guard_count=None, graph_op_count=None, graph_node_count=None, graph_input_count=None, start_time=None, entire_frame_compile_time_s=None, backend_compile_time_s=None, inductor_compile_time_s=0.14563536643981934, code_gen_time_s=0.08652091026306152, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=None, compliant_custom_ops=None, restart_reasons=None, dynamo_time_before_restart_s=None, has_guarded_code=None, is_fwd=False)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126629
Approved by: https://github.com/ezyang
2024-06-03 03:55:33 +00:00
84776d7597 Revert "[BE]: Update mypy to 1.10.0 (#127717)"
This reverts commit 30213ab0a7b27277e76ea9dd707ce629a63d91ee.

Reverted https://github.com/pytorch/pytorch/pull/127717 on behalf of https://github.com/huydhn due to I am not sure why but the failures look legit and they are showing up in trunk 30213ab0a7 ([comment](https://github.com/pytorch/pytorch/pull/127717#issuecomment-2144183347))
2024-06-03 02:52:47 +00:00
e57f51b80f Update _dedup_save_plans.py (#126569)
To resolve https://github.com/pytorch/pytorch/issues/125740, save each tensor on the lowest rank.

Fixes #125740
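
A minimal sketch of the "lowest rank wins" idea, assuming each rank's plan is reduced to a plain list of item keys. The helper name `dedup_lowest_rank` and the sample keys are hypothetical and are not taken from `_dedup_save_plans.py`:

```python
from typing import Dict, List


def dedup_lowest_rank(plans: List[List[str]]) -> List[List[str]]:
    """Keep each key only in the plan of the lowest rank that lists it.

    `plans[rank]` is a (hypothetical) list of item keys that rank would save.
    """
    owner: Dict[str, int] = {}
    # Ranks are visited in ascending order, so the first rank seen is the lowest.
    for rank, keys in enumerate(plans):
        for key in keys:
            owner.setdefault(key, rank)
    return [[k for k in keys if owner[k] == rank] for rank, keys in enumerate(plans)]


plans = [["w", "b"], ["w", "b", "opt"], ["opt", "w"]]
deduped = dedup_lowest_rank(plans)
print(deduped)
```

A fixed tie-break like this makes the result deterministic, at the cost of concentrating writes on low ranks.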

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126569
Approved by: https://github.com/LucasLLC
2024-06-03 01:55:03 +00:00
fec8ef8c17 [Aten][BlasKernel] Add function prototype to fix compiler error (#127719)
Adds a prototype for function `fp16_dot_with_fp32_arith()` in `aten/src/ATen/native/BlasKernel.cpp`.

Without this patch the build fails on Apple silicon/macOS (CPU) with the error `no previous prototype for function 'fp16_dot_with_fp32_arith' [-Werror,-Wmissing-prototypes]`.

The function cannot be marked `static` because its use is not limited to this file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127719
Approved by: https://github.com/Skylion007
2024-06-02 23:41:43 +00:00
8b08b0f340 [BE] enable ruff rule Q from flake8-quotes (#127713)
Enable [ruff rule `Q`](https://docs.astral.sh/ruff/rules/#flake8-quotes-q) from flake8-quotes. Fixes:

- [avoidable-escaped-quote (Q003)](https://docs.astral.sh/ruff/rules/avoidable-escaped-quote/#avoidable-escaped-quote-q003)
- [unnecessary-escaped-quote (Q004)](https://docs.astral.sh/ruff/rules/unnecessary-escaped-quote/#unnecessary-escaped-quote-q004)
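
For illustration, the two fixed rules correspond to rewrites like the following (the sample strings are made up, and double quotes are assumed to be the preferred inline style):

```python
# Q003 (avoidable-escaped-quote): the escape can be avoided by switching delimiters.
before_q003 = 'It\'s a test'
after_q003 = "It's a test"

# Q004 (unnecessary-escaped-quote): the escaped quote is not the delimiter,
# so the backslash is redundant.
before_q004 = "It\'s a test"
after_q004 = "It's a test"

# Both rewrites leave the string value unchanged.
same_q003 = before_q003 == after_q003
same_q004 = before_q004 == after_q004
print(same_q003, same_q004)
```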

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127713
Approved by: https://github.com/ezyang
2024-06-02 23:25:26 +00:00
139b9c6529 Avoid reference cycle in inner closure (#127711)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127711
Approved by: https://github.com/Skylion007, https://github.com/izaitsevfb
2024-06-02 21:28:46 +00:00
30213ab0a7 [BE]: Update mypy to 1.10.0 (#127717)
Updates mypy to the latest and greatest.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127717
Approved by: https://github.com/ezyang
2024-06-02 21:07:23 +00:00
fb53cd6497 [aten_cuda/flash_attn] Add typename to template argument Kernel_trait… (#127634)
Adds the `typename` keyword to the template argument `Kernel_traits::TiledMma` and `Kernel_traits::TiledMmaSdP` (which are dependent type names) when calling the template function `pytorch_flash::convert_layout_acc_Aregs`.

Without `typename`, the flash_attention kernels do not compile with Clang under C++20, since Clang compiles the entire .cu file in a single pass, whereas NVCC split-compiles the host and device code. Adding `typename` appears to be fine under NVCC, based on the CI CUDA builds succeeding.

Below is the excerpt of the compilation error:

```
third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/static_switch.h:46:24: note: expanded from macro 'ALIBI_SWITCH'
   46 |   #define ALIBI_SWITCH BOOL_SWITCH
      |                        ^
third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_launch_template.h:132:5: note: in instantiation of function template specialization 'pytorch_flash::run_flash_bwd_seqk_parallel<pytorch_flash::Flash_bwd_kernel_traits<160, 64, 64, 8, 4, 4, 4, false, true>, true>' requested here
  132 |     run_flash_bwd_seqk_parallel<Kernel_traits, Is_dropout>(params, stream);
      |     ^
third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_launch_template.h:280:13: note: in instantiation of function template specialization 'pytorch_flash::run_flash_bwd<pytorch_flash::Flash_bwd_kernel_traits<160, 64, 64, 8, 4, 4, 4, false, true>, true>' requested here
  280 |             run_flash_bwd<Flash_bwd_kernel_traits<Headdim, 64, 64, 8, 4, 4, 4, false, true, T>, Is_dropout>(params, stream);
      |             ^
third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/static_switch.h:36:26: note: expanded from macro 'DROPOUT_SWITCH'
   36 |   #define DROPOUT_SWITCH BOOL_SWITCH
      |                          ^
third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim160_fp16_sm80.cu:12:5: note: in instantiation of function template specialization 'pytorch_flash::run_mha_bwd_hdim160<cutlass::half_t>' requested here
   12 |     run_mha_bwd_hdim160<cutlass::half_t>(params, stream);
      |     ^
In file included from third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim160_fp16_sm80.cu:7:
In file included from third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_launch_template.h:12:
third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_kernel.h:543:86: error: missing 'typename' prior to dependent type name 'Flash_bwd_kernel_traits<160, 64, 64, 8, 4, 4, 4, false, true>::TiledMmaSdP'
  543 |         Tensor tPrP = make_tensor(rP.data(), pytorch_flash::convert_layout_acc_Aregs<Kernel_traits::TiledMmaSdP>(rP.layout()));
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127634
Approved by: https://github.com/Skylion007
2024-06-02 16:25:02 +00:00
08653fe355 Beef up the allow_in_graph docs (#127117)
We make the following changes:
- most of the time when someone uses allow_in_graph, they actually
  wanted to make a custom op. We add a link to the custom ops landing
  page and explain the differences between allow_in_graph and custom
  ops.
- we warn people against using allow_in_graph footguns and document
  them.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127117
Approved by: https://github.com/jansel, https://github.com/albanD
2024-06-02 15:00:46 +00:00
e24a87ed8d [BE][Ez]: Apply PYI059 - Generic always come last (#127685)
Generic baseclass should always be last or unexpected issues can occur, especially in non-stub files (such as with MRO). Applies autofixes from the preview PYI059 rule to fix the issues in the codebase.
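
A small sketch of what the rule looks at, using made-up class names (`Base`, `Flagged`, `Preferred`). It only shows how the position of `Generic[...]` in the base list changes the MRO; it does not reproduce any specific issue from the codebase:

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class Base:
    def describe(self) -> str:
        return "base"


# Flagged by PYI059: Generic[...] is listed before another base class.
class Flagged(Generic[T], Base):
    pass


# Preferred: Generic[...] comes last in the base list.
class Preferred(Base, Generic[T]):
    pass


flagged_mro = [c.__name__ for c in Flagged.__mro__]
preferred_mro = [c.__name__ for c in Preferred.__mro__]
print(flagged_mro)
print(preferred_mro)
```

With `Generic` last, the concrete base classes are resolved before `Generic` during attribute lookup.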

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127685
Approved by: https://github.com/ezyang
2024-06-02 13:38:58 +00:00
c2547dfcc3 [BE][Ez]: Enable ruff PYI019 (#127684)
Tells pytorch to use typing_extensions.Self when it's able to.
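
As a rough illustration (the `Builder` classes are hypothetical), PYI019 targets methods that annotate `self` with a custom `TypeVar` and suggests `Self` instead. The import is guarded by `TYPE_CHECKING` and the annotation is quoted, so this sketch runs even where `typing_extensions` is not installed:

```python
from typing import TYPE_CHECKING, List

if TYPE_CHECKING:
    # Only the type checker needs this import; the annotation below is a string.
    from typing_extensions import Self


class Builder:
    def __init__(self) -> None:
        self.parts: List[str] = []

    # Before: def add(self: _T, part: str) -> _T, with _T = TypeVar("_T", bound="Builder")
    def add(self, part: str) -> "Self":
        self.parts.append(part)
        return self


class FancyBuilder(Builder):
    pass


fb = FancyBuilder().add("a").add("b")
print(type(fb).__name__, fb.parts)
```

A type checker infers `FancyBuilder` for `fb` here, the same result the bound `TypeVar` pattern gives, with less boilerplate.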
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127684
Approved by: https://github.com/ezyang
2024-06-02 13:38:33 +00:00
67ef2683d9 [BE] wrap deprecated function/class with typing_extensions.deprecated (#127689)
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.
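
The second case can be sketched with the standard library alone (the function names and message are made up). When no category is passed, `warnings.warn` defaults to `UserWarning`, which is why the explicit `category=FutureWarning` matters:

```python
import warnings


def old_api_without_category() -> None:
    # No category given: Python defaults to UserWarning.
    warnings.warn("old_api is deprecated, use new_api instead")


def old_api_with_category() -> None:
    # Explicit category, as added when it was missing.
    warnings.warn("old_api is deprecated, use new_api instead", category=FutureWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_api_without_category()
    old_api_with_category()

categories = [w.category for w in caught]
print(categories)
```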

Note that this PR only updates warnings whose messages match `[Dd]eprecat(ed|ion)`.

Resolves #126888

- #126888

This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689
Approved by: https://github.com/Skylion007
2024-06-02 12:30:43 +00:00
c1dd3a615f Implement Graph Transform Observer (#127427)
Summary: Implement Graph Transform Observer

Differential Revision: D57887518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127427
Approved by: https://github.com/angelayi
2024-06-02 06:49:47 +00:00
cyy
4e7f497bb3 [Submodule] Remove ios-cmake (#127694)
It has not been updated for a long time and CI iOS builds don't rely on it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127694
Approved by: https://github.com/ezyang
2024-06-02 04:40:21 +00:00
2129903aa3 Properly detect nested torch function args (#127496)
Dynamo was not detecting nested torch function classes in containers. This was due to pytree compatibility for variable trackers being removed.
Fixes https://github.com/pytorch/pytorch/issues/127174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127496
Approved by: https://github.com/anijain2305
2024-06-02 03:43:22 +00:00
16578e8584 [symbolic shapes] if symbol not in var_ranges default to unknown range (#127681)
Purpose of this PR is to get around this error: https://github.com/pytorch/pytorch/issues/127677

Differential Revision: D58048558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127681
Approved by: https://github.com/lezcano
2024-06-02 02:28:40 +00:00
4fd777ed59 [ONNX] Add quantized layer norm op to opset 17 (#127640)
Fixes #126160
Continue #126555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127640
Approved by: https://github.com/justinchuby
2024-06-02 02:10:02 +00:00
c19ad112f6 [Inductor UT][Intel GPU] Skip test case which doesn't currently work on the XPU stack but newly re-enabled by community. (#127629)
The Inductor UT test/inductor/test_triton_heuristics.py:test_artificial_zgrid, which was previously skipped, was recently re-enabled by PR https://github.com/pytorch/pytorch/pull/127448. However, the test doesn't currently work on the XPU stack: it hangs on the GPU. This PR therefore skips the test for Intel GPU instead of marking it as an expected failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127629
Approved by: https://github.com/EikanWang, https://github.com/peterbell10
2024-06-02 01:00:33 +00:00
2cef2fc2b4 [ts migration] support aten::dim, aten::len, aten::__getitem__ (#127593)
- Add support for aten::dim, aten::len, aten::__getitem__ for torchscript to export converter.
- Add unit tests
Co-authored-by: cyy <cyyever@outlook.com>
Co-authored-by: Menglu Yu <mengluy@meta.com>
Co-authored-by: Animesh Jain <anijain@umich.edu>
Co-authored-by: Simon Fan <xmfan@meta.com>
Co-authored-by: Zain Rizvi <ZainR@meta.com>
Co-authored-by: Tugsbayasgalan (Tugsuu) Manlaibaatar <tmanlaibaatar@meta.com>
Co-authored-by: titaiwangms <titaiwang@microsoft.com>
Co-authored-by: Yueming Hao <yhao@meta.com>
Co-authored-by: IvanKobzarev <ivan.kobzarev@gmail.com>
Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Bin Bao <binbao@meta.com>
Co-authored-by: Feny Patel <fenypatel@meta.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: xinan.lin <xinan.lin@intel.com>
Co-authored-by: Zain Huda <zainhuda@meta.com>
Co-authored-by: Chien-Chin Huang <chienchin@fb.com>
Co-authored-by: Wei Wang <weiwan@nvidia.com>
Co-authored-by: Jason Ansel <jansel@meta.com>
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Iris Z <31293777+wz337@users.noreply.github.com>
Co-authored-by: Wang, Eikan <eikan.wang@intel.com>
Co-authored-by: angelayi <yiangela7@gmail.com>
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Co-authored-by: Yanbo Liang <ybliang8@gmail.com>
Co-authored-by: Catherine Lee <csl@fb.com>
Co-authored-by: Kwanghoon An <kwanghoon@meta.com>
Co-authored-by: Brian Hirsh <hirsheybar@fb.com>
Co-authored-by: Robert Mast <rmast@live.nl>
Co-authored-by: drisspg <drisspguessous@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127593
Approved by: https://github.com/SherlockNoMad, https://github.com/malfet
2024-06-02 00:36:33 +00:00
0d9e527c4d Remove tensor storage_offset/storage_bytes from the cache key (#127319)
Summary: We observed differences in these fields and inductor does not specialize on them so it is safe to remove them from the key.

Test Plan: CI

Reviewed By: masnesral

Differential Revision: D57871276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127319
Approved by: https://github.com/masnesral
2024-06-02 00:28:43 +00:00
eqy
2e779166eb [Functorch][cuDNN] Bump tolerances for test_vmapjvpvjp (#127355)
cuDNN can select a winograd kernel for this case which slightly affects tolerances...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127355
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2024-06-01 21:22:55 +00:00
6e2e09f6cc [inductor] fix redis-related env vars in remote_cache.py (#127583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127583
Approved by: https://github.com/oulgen
2024-06-01 19:55:25 +00:00
b505e86475 [Inductor][CI][CUDA 12.4] Update dynamic_inductor_timm_training.csv - change gluon_inception_v3 from fail_accuracy to pass (#127672)
From the HUD, most of the time the "X" is due to "improved_accuracy" for gluon_inception_v3.

![image](https://github.com/pytorch/pytorch/assets/143543872/d4f70377-2756-4921-872d-587426f00302)

https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor_timm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127672
Approved by: https://github.com/eqy, https://github.com/Skylion007
2024-06-01 19:12:43 +00:00
17dea09b15 Revert "Default XLA to use swap_tensors path in nn.Module._apply (#126814)"
This reverts commit bfdec93395f675a0e5a59e95aef9104ac8f5081a.

Reverted https://github.com/pytorch/pytorch/pull/126814 on behalf of https://github.com/izaitsevfb due to suspicious build instructions count regression, see [D58015016](https://www.internalfb.com/diff/D58015016) ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2143545818))
2024-06-01 18:46:16 +00:00
82cd7a7dab Revert "Default meta device to use swap_tensors in nn.Module._apply (.to_empty and .to('meta')) (#126819)"
This reverts commit fa426b096b3635daab6ce26b44d50f3baab5a4e5.

Reverted https://github.com/pytorch/pytorch/pull/126819 on behalf of https://github.com/izaitsevfb due to suspicious build instructions count regression, see [D58015016](https://www.internalfb.com/diff/D58015016) ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2143545818))
2024-06-01 18:46:16 +00:00
42312a52b3 [DSD] Adds type_check param to copy state dict utils (#127417)
[DSD] Adds type_check param to copy state dict utils.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127417
Approved by: https://github.com/fegin
2024-06-01 17:50:52 +00:00
edffb28d39 [BE][Ez]: Enable B019 - flags memory leaks through LRU cache on method (#127686)
Flags potential mem leaks through LRUCache and will hopefully make future contributors rethink this pattern which can cause memleaks. noqas the violations we currently have (should be fixed later)
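
A small demonstration of the pattern B019 warns about, using a made-up `Model` class. The cache belongs to the function object and stores `self` in each key, so instances stay alive until the cache is cleared:

```python
import functools
import gc
import weakref


class Model:
    def __init__(self, scale: int) -> None:
        self.scale = scale

    # B019 flags this: the cache holds a strong reference to every `self` it has seen.
    @functools.lru_cache(maxsize=None)  # noqa: B019
    def compute(self, x: int) -> int:
        return self.scale * x


m = Model(3)
result = m.compute(2)
ref = weakref.ref(m)

del m
gc.collect()
still_alive = ref() is not None  # the cache still references the instance

Model.compute.cache_clear()
gc.collect()
freed_after_clear = ref() is None
print(result, still_alive, freed_after_clear)
```

Common alternatives are caching at module level on hashable arguments, or keeping a per-instance cache that is released together with the instance.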

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127686
Approved by: https://github.com/c-p-i-o
2024-06-01 17:19:24 +00:00
22f392ba40 Revert "[easy?] Move AsyncCompile to a different file (#127235)"
This reverts commit f58fc16e8f059232f452a333f32e14ff681e12af.

Reverted https://github.com/pytorch/pytorch/pull/127235 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see [D58015187](https://www.internalfb.com/diff/D58015187) ([comment](https://github.com/pytorch/pytorch/pull/127235#issuecomment-2143518610))
2024-06-01 17:16:16 +00:00
d49dc8f4b8 Revert "Add noqa to prevent lint warnings (#127545)"
This reverts commit f9937afd4f87fbb4844642ae2f587b13b5caa08c.

Reverted https://github.com/pytorch/pytorch/pull/127545 on behalf of https://github.com/izaitsevfb due to reverting to unblock the revert of #127545 ([comment](https://github.com/pytorch/pytorch/pull/127545#issuecomment-2143517711))
2024-06-01 17:12:46 +00:00
114c752b14 Revert "Improve MAGMA conditional macro in BatchLinearAlgebra.cpp (#127495)"
This reverts commit ee08cf57924a4230edad3101666890d8fe050c75.

Reverted https://github.com/pytorch/pytorch/pull/127495 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/127495#issuecomment-2143508218))
2024-06-01 16:39:06 +00:00
efcea2d2fd [dynamo] Support __getitem__ on NNModuleVariable __dict__ (#126956)
Moves further along (but still fails) for the testcase in https://github.com/pytorch/pytorch/pull/126875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126956
Approved by: https://github.com/jansel
ghstack dependencies: #126923
2024-06-01 15:22:45 +00:00
4129c3e596 Let us find out why we wrote foreach meta regs (#127623)
Turns out it was for no reason!...well, after realizing that these ops are all CompositeExplicit, their meta impls come for free.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127623
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #127412
2024-06-01 13:58:18 +00:00
ac60bdaf01 Allow slow foreach to run for any backend, not just CPU (#127412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127412
Approved by: https://github.com/albanD
2024-06-01 13:58:18 +00:00
4aa7a1efcf [dynamo] Initial exception handling support (#126923)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126923
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-06-01 13:00:32 +00:00
25994a7ed1 [AOTI] Fix a bug when mutated buffer meets .to (#127671)
Summary: Before this change, the added unit test triggers `AssertionError: Can not find the original value for L__self____tensor_constant0_cuda0`. The reason is that GraphLowering.constant_name can rename a constant with a device suffix, but AOTI requires the new name to be registered properly.

Differential Revision: [D58047165](https://our.internmc.facebook.com/intern/diff/D58047165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127671
Approved by: https://github.com/ColinPeppler, https://github.com/22quinn
2024-06-01 12:30:56 +00:00
c3be459f26 [inductor] fix mkldnn linear binary fusion check ut (#127296)
In this PR:

(1) Fix the unary fusion for bf16 conv/linear.
    Previously we registered the same fusion pattern for both `bf16` and `fp16`, and we did not check the dtype while matching the pattern. As a result, the `fp16` case matched the `bf16` pattern, but in the later replacement we found an unexpected float16 and therefore did not fuse. We fix this by checking dtypes so that the `fp16` case no longer matches the `bf16` pattern.

```
  def _is_valid_computation_unary_fusion(computation_op, lowp_dtype=None):
      def fn(match):
          matched = _is_single_computation_op(computation_op, lowp_dtype)(match)  # previously lowp_dtype was not checked here

```

This was not exposed before because we only checked the match count, which was correct anyway because the pattern did match. To address this, we add a check on the number of `generated_kernel`. If the op is not fused, there is an additional kernel to compute the post op.

(2) Previously, the UT
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_binary
```
did not check the fusion status; this PR fixes that.

(3) Extend `test_conv_binary` to also test low-precision (lp) dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127296
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-06-01 11:10:29 +00:00
e62925930f Clear dest impl extra_meta_ info when shallow_copy_from src impl to dest impl. (#127616)
`tensorA.data = tensorB` calls the shallow_copy_from function to copy tensorB's metadata and storage to tensorA. If tensorB's extra_meta_ is nullptr, tensorA's old extra_meta_ is kept, which contaminates the new metadata in tensorA. This PR clears the destination's extra_meta_ in that case.
@ezyang  @bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127616
Approved by: https://github.com/ezyang
2024-06-01 06:54:32 +00:00
554265d450 [Inductor]: Use new device-agnostic libdevice import from triton.language (#127348)
Triton refactored `libdevice` in 5e6952d8c5

While both imports still appear to work under CUDA, this change is required to pull the correct libdevice variants under the Intel XPU backend. I am working on developing a test that catches this behavior. The easiest path would be to enable `test/inductor/test_triton_kernels.py` under the XPU backend, but a different group at Intel manages that test and I need to see if they already have an enabling plan.

I am not sure the double `libdevice` import (see line 22 where I have the nolint flag) is really necessary but have yet to find a conclusive test case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127348
Approved by: https://github.com/etaf, https://github.com/peterbell10
2024-06-01 06:15:33 +00:00
7ef7c265d4 Ack codecvt_utf8_utf16 as a deprecated func in C++17 (#127659)
https://en.cppreference.com/w/cpp/header/codecvt.  The build started to fail on macOS after migrating to macOS 14 with a newer toolchain; see, for example, 57baae9c9b.

As there is no clear alternative to the deprecated function yet, I just ack the warning to fix the build and complete the migration https://github.com/pytorch/pytorch/issues/127490
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127659
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-06-01 04:31:39 +00:00
3c1cf03fde Add fake impl for aten.unique_dim (#126561)
Follow-up to #113118 and #124306.

Developed in coordination with the solution to https://github.com/microsoft/onnxscript/pull/1547

This PR adds the missing fake tensor implementation for `aten.unique_dim`, thus enabling tracing and compilation of `torch.unique` when `dim` is not None.

Local testing has proceeded with the following simple script (provided that one has checked out the changes in https://github.com/microsoft/onnxscript/pull/1547):

```python
    import torch
    import onnx
    import onnxruntime as ort
    import logging
    import numpy as np
    onnx_program = torch.onnx.dynamo_export(
        lambda x: torch.unique(x,
                               dim=0,
                               return_inverse=True),
        torch.arange(10),
        export_options=torch.onnx.ExportOptions(
            dynamic_shapes=True,
            diagnostic_options=torch.onnx.DiagnosticOptions(
                verbosity_level=logging.DEBUG)))
    onnx_program.save("torch_unique.onnx")
    onnx_inputs = onnx_program.adapt_torch_inputs_to_onnx(torch.arange(10))
    onnx_outputs = onnx_program(*onnx_inputs)
    loaded_onnx_program = onnx.load("torch_unique.onnx")
    onnx.checker.check_model(loaded_onnx_program)
    ort_session = ort.InferenceSession("torch_unique.onnx")
    inputs = np.random.randint(0, 10, 10)
    print(f"Inputs: {inputs}")
    outputs = ort_session.run(None,
                              {
                                  "l_x_": inputs
                              })
    print(f"Outputs: {outputs}")
    print("Success")
```

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126561
Approved by: https://github.com/ezyang
2024-06-01 04:03:10 +00:00
25447ba241 Always Link libtorch and libtorch_cpu to ensure the functionality for AOT mode (#127381)
Fix #126763: The root cause is that the produced library does not link any torch library because the vec ISA is invalid, and then it cannot run into another path without linking `libtorch` and `libtorch_cpu`.

https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codecache.py#L1637-L1642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127381
Approved by: https://github.com/desertfire
2024-06-01 01:47:41 +00:00
df53cc7114 [reland] "[reland] _foreach_copy with different src/dst dtypes" (#127186)
Fixes #115171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127186
Approved by: https://github.com/ezyang
2024-06-01 01:25:10 +00:00
ff8042bcfb Enable AOTI shim v2 build and add into libtorch (#125211)
Summary:
Follow up of https://github.com/pytorch/pytorch/pull/125087

This diff creates the shim v2 header and cpp file and the corresponding build.

Differential Revision: D56617546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125211
Approved by: https://github.com/desertfire
2024-05-31 23:56:11 +00:00
a8c9b26534 [BE] Fix dependabot security errors (#127567)
Fixes https://github.com/pytorch/pytorch/security/dependabot/36 and https://github.com/pytorch/pytorch/security/dependabot/37 by deleting spurious dependency

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127567
Approved by: https://github.com/malfet
2024-05-31 23:00:07 +00:00
f7171313ab [Inductor] FlexAttention backward kernel optimization (#127208)
BWD Speedups (before this PR):
```
| Type    |   Speedup | shape             | score_mod     | dtype          |
|---------|-----------|-------------------|---------------|----------------|
| Average |     0.211 |                   |               |                |
| Max     |     0.364 | (16, 16, 512, 64) | relative_bias | torch.bfloat16 |
| Min     |     0.044 | (2, 16, 4096, 64) | causal_mask   | torch.bfloat16 |
```
BWD Speedups (after this PR, though not optimizing block size yet):
```
| Type    |   Speedup | shape              | score_mod     | dtype          |
|---------|-----------|--------------------|---------------|----------------|
| Average |     0.484 |                    |               |                |
| Max     |     0.626 | (2, 16, 512, 256)  | head_bias     | torch.bfloat16 |
| Min     |     0.355 | (8, 16, 4096, 128) | relative_bias | torch.bfloat16 |
```

A few follow-ups remain:
* Optimize the default block size on A100/H100.
* Support different seqlen for Q and K/V.
* Support dynamic shapes for backward.
* Enhance unit tests to check that there is no `nan` value in any grad. I think we should make some changes to `test_padded_dense_causal` because it has invalid inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127208
Approved by: https://github.com/Chillee
2024-05-31 22:56:10 +00:00
57baae9c9b Migrating CI/CD jobs to macOS 14 (#127582)
Half the fleet is already on macOS 14 and has been running fine so far (https://github.com/pytorch/pytorch/issues/127490), so I'm preparing the final push to replace the rest. This also switches the release build from macOS 13 to 14 (GitHub runners).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127582
Approved by: https://github.com/atalman
2024-05-31 22:30:59 +00:00
02248b73eb [EZ] Port over all test-infra scale configs to lf runners (#127645)
Follow up to https://github.com/pytorch/pytorch/pull/127578

Since GPU builds seem to be working correctly, porting over all remaining scale configs from [the org-wide scale config file](https://github.com/pytorch/test-infra/blob/main/.github/scale-config.yml)

The naming convention here is all temporary. We'll figure out something better before completing the migration
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127645
Approved by: https://github.com/malfet
2024-05-31 22:24:41 +00:00
bb1468d506 Updates state dict in state dict loader (#127617)
Fixes #125096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127617
Approved by: https://github.com/Skylion007, https://github.com/fegin
2024-05-31 21:59:10 +00:00
f33beb767d [NestedTensor] Use maybe_mark_dynamic instead of mark_dynamic (#127453)
Fixes #127097

**TL;DR**: dimensions marked with mark_dynamic can result in assertion failures if the marked-dynamic dimensions get specialized. In NJT, we don't care _that_ much that a dimension is marked as dynamic. So instead, mark with `maybe_mark_dynamic` which suggests that a dimension should be dynamic, but doesn't fail if the dimension gets specialized.

**Background**:
NJT marks the values tensor as dynamic:

49ad90349d/torch/nested/_internal/nested_tensor.py (L122)

It does this for two reasons:
1. **Conceptual**: We know that this dimension _should_ be dynamic; it's a nested tensor, so the sequence lengths will _probably_ vary between batches in the common case. Therefore, we should compile it as dynamic to prevent needing a recompile to trigger automatic dynamic shapes.
2. **Implementation detail**: Right now we run into issues with torch.compile / tensor_unflatten / other details when the dimensions are not marked as dynamic. We have some attempts to remove this (e.g. https://github.com/pytorch/pytorch/pull/126563) but while testing this I wasn't able to get all tests to pass, so there could be potential regressions here if we removed the mark_dynamic.

**Justification for this change**

1. **Conceptual**: AFAIK, we don't care enough about the dynamism of this dimension to error out if we specialize. We'd prefer that we don't have to recompile to get automatic dynamic shapes, but it's also better to not have this issue (and not to force the user to go hunt down all the other equivalent shapes to mark them as dynamic as well). This solution allows us to suggest the dynamism but not force it.
2. **Implementation detail**: This still marks the dimension as symbolic at the beginning of dynamo tracing, so we will (probably) avoid a lot of the issues we run into when we completely remove the `mark_dynamic` decorators.

Differential Revision: [D57933779](https://our.internmc.facebook.com/intern/diff/D57933779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127453
Approved by: https://github.com/soulitzer, https://github.com/YuqingJ
2024-05-31 21:32:12 +00:00
6bfc6e0875 Add back private function torch.cuda.amp.autocast_mode._cast (#127433)
This is unfortunately used in a few places in the wild: https://github.com/search?q=torch.cuda.amp.autocast_mode._cast&type=code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127433
Approved by: https://github.com/zou3519, https://github.com/guangyey
2024-05-31 20:48:15 +00:00
923edef31c FP8 rowwise scaling (#125204)
# Summary
This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:
- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".
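
The shape conditions above can be illustrated with a plain-Python reference, written as a sketch only. It assumes the scales act as multiplicative factors, so that `out[i][j] = x_scale[i] * y_scale[j] * sum_k x[i][k] * y[k][j]`; the helper name `rowwise_scaled_mm` is hypothetical, and nothing here reflects the kernel's memory layout or fp8 arithmetic:

```python
from typing import List

Matrix = List[List[float]]


def rowwise_scaled_mm(x: Matrix, y: Matrix, x_scale: List[float], y_scale: List[float]) -> Matrix:
    """Reference semantics only (assumed): scale row i of x and column j of y."""
    m, k, n = len(x), len(y), len(y[0])
    assert all(len(row) == k for row in x), "x must be [M, K]"
    assert len(x_scale) == m, "x's scale must have length M"
    assert len(y_scale) == n, "y's scale must have length N"
    return [
        [x_scale[i] * y_scale[j] * sum(x[i][p] * y[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


x = [[1.0, 2.0], [3.0, 4.0]]  # M=2, K=2
y = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0]]  # K=2, N=3
out = rowwise_scaled_mm(x, y, x_scale=[1.0, 0.5], y_scale=[1.0, 2.0, 0.5])
print(out)
```

Under that assumption, each output element gets one factor from `x`'s row and one from `y`'s column, which is why the scale lengths must be `M` and `N`.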

The following two PRs were required to enable local builds:
- [PR #126185](https://github.com/pytorch/pytorch/pull/126185)
- [PR #125523](https://github.com/pytorch/pytorch/pull/125523)

### Todo
We still do not build our Python wheels with this architecture.

@ptrblck @malfet, should we replace `sm_90` with `sm_90a`?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit:
https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

#### ifdef

I tried to use `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \
    defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this, so I am not really sure of the right way to do it.

Kernel Credit:
@jwfromm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125204
Approved by: https://github.com/lw
2024-05-31 20:09:08 +00:00
806e6257f3 Unconditionally assign symbolic shapes as locals (#127486)
Internal xref: https://fb.workplace.com/groups/1405155842844877/posts/8493858177307906

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127486
Approved by: https://github.com/albanD
2024-05-31 20:01:44 +00:00
033e733021 Revert "[BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)"
This reverts commit 749a132fb0a8325cbad4734a563aa459ca611991.

Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))
2024-05-31 19:47:24 +00:00
ea13e9a097 correct BLAS input (#126200)
Fixes #32407

With this small correction to Dependencies.cmake, it is possible to build an MKL-free version of PyTorch from v2.0.0 onward by explicitly choosing another, MKL-free BLAS.

This pull request fulfills the "if not already present" part of the original comment in Dependencies.cmake:
"setting default preferred BLAS options if not already present."

It was tested with this GitHub Actions workflow (.yml):
```
name: Build PyTorch v2.0.0 without AVX

on:
  push:
    branches:
      - v2.0.0
  pull_request:
    branches:
      - v2.0.0

jobs:
  build:
    runs-on: ubuntu-20.04
    defaults:
      run:
        shell: bash -el {0}
    steps:

    - name: Checkout repository
      uses: actions/checkout@v4
      with:
        #repository: 'pytorch/pytorch'
        #ref: 'v2.3.0'
        submodules: 'recursive'

    - uses: conda-incubator/setup-miniconda@v3
      with:
        auto-activate-base: true
        activate-environment: true
        python-version: 3.10.13

    - name: Install Dependencies - Common - Linux 2
      run: |
        conda info
        conda list
        conda install nomkl
        conda install astunparse numpy ninja pyyaml setuptools cmake cffi typing_extensions future six requests dataclasses
        export PYTORCH_CPU_CAPABILITY=cpu
        export ATEN_CPU_CAPABILITY_DEFAULT=cpu
        export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
        export ATEN_CPU_CAPABILITY=default
        export USE_NNPACK=0
        export MAX_JOBS=4
        export USE_CUDA=0
        export USE_ROCM=0
        export BLAS=OpenBLAS
        export CMAKE_ARGS="-D CMAKE_BUILD_TYPE=Release -D USE_AVX=OFF -D USE_NNPACK=OFF -D C_HAS_AVX_2=OFF -D C_HAS_AVX2_2=OFF -D CXX_HAS_AVX_2=OFF -D CXX_HAS_AVX2_2=OFF -D CAFFE2_COMPILER_SUPPORTS_AVX512_EXTENSIONS=OFF -DPYTHON_INCLUDE_DIR=$(python -c "import sysconfig; print(sysconfig.get_path('include'))") -DPYTHON_LIBRARY=$(python -c "import sysconfig; print(sysconfig.get_config_var('LIBDIR'))") -DPYTHON_EXECUTABLE:FILEPATH=`which python`"
        pip install build wheel typing_extensions
        python setup.py bdist_wheel
    - name: Archive production artifacts
      uses: actions/upload-artifact@v4
      with:
        name: dist-without-markdown
        path: |
          dist
          !dist/**/*.md
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126200
Approved by: https://github.com/jgong5, https://github.com/kit1980
2024-05-31 19:38:42 +00:00
bbf892dd58 Revert "Add back private function torch.cuda.amp.autocast_mode._cast (#127433)"
This reverts commit 6e0eeecc7cd4dc389683e35d1f2e34738e09e597.

Reverted https://github.com/pytorch/pytorch/pull/127433 on behalf of https://github.com/fbgheith due to depends on https://github.com/pytorch/pytorch/pull/126898 which is failing internally and needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/127433#issuecomment-2142869610))
2024-05-31 19:35:15 +00:00
1103444870 [AOTI] Add back include_pytorch for specifying link paths (#126802)
Summary: Running the dashboard with the cpp wrapper mode sometimes hits errors like "undefined symbol: aoti_torch_empty_stride", although this cannot be reproduced locally and seems to happen only on the dashboard CI.

Differential Revision: [D57911442](https://our.internmc.facebook.com/intern/diff/D57911442)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126802
Approved by: https://github.com/chenyang78
ghstack dependencies: #126916, #127037
2024-05-31 19:32:52 +00:00
8af1c655e5 improve eager overhead of _disable_dynamo (#127325)
It seems like `_disable_dynamo` actually has a fair amount of overhead (especially when it was added to `DTensor.__new__`). This change speeds up @wanchaol's repro from 0.380s to 0.312s: P1378202570 (that repro runs a vanilla MLP using 2D parallelism and calls the DTensor constructor 1280 times).

It looks like most of the slowdown comes from the fact that we repeatedly run `import torch._dynamo` and construct an instance of `torch._dynamo.disable(fn, recursive)` on every call to the constructor. This PR caches it on the first invocation.
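The caching idea can be sketched roughly as follows. This is a standalone illustration, not the actual PyTorch code: `make_disabled` and `fake_disable` are hypothetical names, and `fake_disable` stands in for the import plus `torch._dynamo.disable(fn, recursive)` call.

```python
import functools


def make_disabled(fn, build_wrapper):
    """Build the expensive wrapper once, on first call, then reuse it."""
    cached = None

    @functools.wraps(fn)
    def inner(*args, **kwargs):
        nonlocal cached
        if cached is None:
            # Stand-in for `import torch._dynamo` followed by
            # `torch._dynamo.disable(fn, recursive)`; runs only once.
            cached = build_wrapper(fn)
        return cached(*args, **kwargs)

    return inner


build_calls = []


def fake_disable(fn):
    # Hypothetical stand-in that records how often the wrapper is built.
    build_calls.append(fn.__name__)
    return fn


def construct(x):
    return x * 2


fast_construct = make_disabled(construct, fake_disable)
results = [fast_construct(i) for i in range(1280)]
```

With this shape, 1280 constructor calls pay for the wrapper construction only once instead of on every call.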

~~Update: I realized I cannot use `torch.compiler.is_compiling` to know when to fast-path, because when we hit a graph break, cpython will be running so it will return False.~~

~~As a test / potential fix, I added a new config, `torch._dynamo.config._is_compiling` that is set to True **always** inside a compiled region (even on frames that are run by cpython). This definitely seems to do what I want in terms of knowing when to fastpath and avoid overhead - although interested in feedback on how reasonable this is~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127325
Approved by: https://github.com/wanchaol, https://github.com/anijain2305
2024-05-31 19:30:47 +00:00
b704c7cf0f Re trying Support min/max carry over for eager mode from_float method (#127576)
Summary:
Original commit changeset: 2605900516c8

Original Phabricator Diff: D57977896

Test Plan: Re enabling due to prod failure

Reviewed By: jerryzh168

Differential Revision: D57978925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127576
Approved by: https://github.com/jerryzh168
2024-05-31 19:08:07 +00:00
121c55d8d1 Old branch deletion script to also delete old ciflow tags (#127625)
Change branch deletion script to also delete left over ciflow tags that the bot doesn't get to, as well as the one created by triggering a workflow on HUD

Example run https://github.com/pytorch/pytorch/actions/runs/9322082915/job/25662376463?pr=127625
(didn't actually delete the tag, but lists what tags it would delete)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127625
Approved by: https://github.com/huydhn
2024-05-31 18:54:54 +00:00
0be06b08fc [GPT-fast benchmark] Merge GPT-fast and micro benchmark output as one CSV file (#127586)
Consolidate GPT-fast models benchmark with micro-benchmark, and save output as one CSV file with the same format as https://github.com/pytorch/pytorch/pull/126754#issue-2307296847.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127586
Approved by: https://github.com/Chillee
2024-05-31 18:50:49 +00:00
4a0d96e496 Add a GH action to autolabel docathon PRs (#127569)
To ease oncall burden for the docathon PR reviewers and ensure all PRs are correctly labeled, adding this GH action that will look for the issue number in the PR and if that issue has a docathon-h1-2024 label, then it would propagate the labels from the issues into the PR. It should not conflict with the existing labelers because we use ``pull_request.add_to_labels`` - credit @kit1980.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127569
Approved by: https://github.com/kit1980
2024-05-31 17:57:07 +00:00
b2f5fd8efb [ts_converter] Basic support for prim::If conversion (#127336)
Script module:
```
graph(%self : __torch__.M,
      %x.1 : Tensor,
      %y.1 : Tensor):
  %11 : int = prim::Constant[value=1]()
  %5 : bool = aten::Bool(%x.1) # /data/users/angelayi/pytorch2/test/export/test_converter.py:27:19
  %21 : Tensor = prim::If(%5) # /data/users/angelayi/pytorch2/test/export/test_converter.py:27:16
    block0():
      %8 : Tensor = aten::mul(%y.1, %y.1) # /data/users/angelayi/pytorch2/test/export/test_converter.py:28:27
      -> (%8)
    block1():
      %12 : Tensor = aten::add(%y.1, %y.1, %11) # /data/users/angelayi/pytorch2/test/export/test_converter.py:30:27
      -> (%12)
  return (%21)
```
ExportedProgram:
```
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, x_1: "b8[]", y_1: "i64[]"):
            # File: <eval_with_key>.23:9 in forward, code: cond = torch.ops.higher_order.cond(l_args_0_, cond_true_0, cond_false_0, [l_args_3_0_]);  l_args_0_ = cond_true_0 = cond_false_0 = l_args_3_0_ = None
            true_graph_0 = self.true_graph_0
            false_graph_0 = self.false_graph_0
            conditional = torch.ops.higher_order.cond(x_1, true_graph_0, false_graph_0, [y_1]);  x_1 = true_graph_0 = false_graph_0 = y_1 = None
            return (conditional,)

        class <lambda>(torch.nn.Module):
            def forward(self, y_1: "i64[]"):
                # File: <eval_with_key>.20:6 in forward, code: mul_tensor = torch.ops.aten.mul.Tensor(l_args_3_0__1, l_args_3_0__1);  l_args_3_0__1 = None
                mul: "i64[]" = torch.ops.aten.mul.Tensor(y_1, y_1);  y_1 = None
                return mul

        class <lambda>(torch.nn.Module):
            def forward(self, y_1: "i64[]"):
                # File: <eval_with_key>.21:6 in forward, code: add_tensor = torch.ops.aten.add.Tensor(l_args_3_0__1, l_args_3_0__1, alpha = 1);  l_args_3_0__1 = None
                add: "i64[]" = torch.ops.aten.add.Tensor(y_1, y_1);  y_1 = None
                return add
```

This PR also adds support for TupleIndex and incorporates some changes from https://github.com/pytorch/pytorch/pull/127341
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127336
Approved by: https://github.com/BoyuanFeng
2024-05-31 17:46:16 +00:00
cyy
3e66052e16 Improve python3 discovery code in CMake (#127600)
The improvement is based on my comments in #124613 and it also fixes the current linux-s390x-binary-manywheel  CI failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127600
Approved by: https://github.com/Skylion007
2024-05-31 17:29:06 +00:00
8d7393cb5e Update triton-xpu commit pin merge rules for XPU (#127203)
Add the ".ci/docker/ci_commit_pins/triton-xpu.txt" to the XPU merge rules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127203
Approved by: https://github.com/atalman
2024-05-31 17:19:19 +00:00
1699edaabb [DeviceMesh] Adding nD slicing support back (#127465)
Fixes #126530

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127465
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-05-31 17:06:36 +00:00
8bf2c0a203 [BE][Ez]: Update ruff to 0.4.6 (#127614)
Update ruff linter to 0.4.6. Uneventful PR that fixes bugs and reduces false positives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127614
Approved by: https://github.com/albanD
2024-05-31 17:01:50 +00:00
58b461d57a Revert "[ROCm] Update triton pin to fix libtanh issue (#125396)"
This reverts commit 19333d1eb9b8965edd6c8a52fd59b5c67b4fb523.

Reverted https://github.com/pytorch/pytorch/pull/125396 on behalf of https://github.com/atalman due to Broke nightly builds ([comment](https://github.com/pytorch/pytorch/pull/125396#issuecomment-2142638237))
2024-05-31 16:51:39 +00:00
225ec08e35 Fix typo in .ci/docker/ubuntu-cuda/Dockerfile (#127503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127503
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
2024-05-31 16:50:35 +00:00
67f0807042 [Inductor] [CI] [CUDA] Skip the failed models and tests the better way (#127150)
Address subtasks in https://github.com/pytorch/pytorch/issues/126692

After enabling the disabled shards, the following two models regressed (for cu124 configuration):
dynamic_inductor_timm_training.csv
cspdarknet53,pass,7   (expected)                                        | cspdarknet53,fail_accuracy,7           (actual)
eca_botnext26ts_256,pass,7        (expected)                            | eca_botnext26ts_256,fail_accuracy,7 (actual)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127150
Approved by: https://github.com/huydhn, https://github.com/eqy, https://github.com/atalman
2024-05-31 16:35:57 +00:00
64c581a1d4 [DSD] Make distributed state_dict support torch.distributed is not initialized case (#127385)
Fixes https://github.com/pytorch/pytorch/issues/124942

Summary:
Allow DSD to support loading a regular optimizer state_dict and to be used when torch.distributed.is_initialized() is False.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127385
Approved by: https://github.com/wz337
ghstack dependencies: #127070, #127071, #127384
2024-05-31 16:28:16 +00:00
8b4ad3a8d9 [DSD] Unify the API signatures of set_model_state_dict and set_optimizer_state_dict (#127384)
Summary:
Allow the optim_state_dict argument to be a positional argument. This makes sense since it is a required argument, and it makes the function signature consistent with set_model_state_dict without causing BC issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127384
Approved by: https://github.com/wz337
ghstack dependencies: #127070, #127071
2024-05-31 16:24:51 +00:00
bd868eeb28 [DSD] Support flattening the optimizer state_dict when saving and unflattening when loading (#127071)
Fixes https://github.com/pytorch/pytorch/issues/126595

**What does this PR do?**
This PR flattens the optimizer state_dict when saving and unflattens it when loading, similar to what TorchRec does. The current `get_optimizer_state_dict()` converts the parameter IDs to FQNs in order to avoid any conflict with different optimizers on different ranks. The currently returned optimizer state_dict looks like the following:
```
{
    "state": {
          "layer1.weight": {"step": 10, "exp_avg": SomeTensor, "exp_avg_sq": SomeTensor},
          "layer2.weight": {"step": 10, "exp_avg": SomeTensor, "exp_avg_sq": SomeTensor},
    },
    "param_group": [
         {"lr": 0.0, "betas": (0.9, 0.95), ..., "params": ["layer1.weight", "layer2.weight"]}
    ]
}
```
While this avoids the conflict and supports the use case of merging multiple optimizers (e.g., optimizer in backward), the current optimizer state_dict still cannot support MPMD (e.g., pipeline parallelism). The root cause is `param_group`. `param_group` cannot generate unique keys during saving: DCP will flatten the dict, but for `param_group`, DCP will get keys like `param_group.lr` or `param_group.params`. These keys will conflict when using pipeline parallelism.

This PR flattens the optimizer state_dict into the following form:
```
{
    "state.layer1.weight.step": 10,
    "state.layer2.weight.step": 10,
    "state.layer1.weight.exp_avg": SomeTensor,
    "state.layer2.weight.exp_avg": SomeTensor,
    "state.layer1.weight.exp_avg_sq": SomeTensor,
    "state.layer2.weight.exp_avg_sq": SomeTensor,
    "param_group.layer1.weight.lr" : 0.1,
    "param_group.layer2.weight.lr" : 0.1,
    "param_group.layer1.weight.betas" : (0.9, 0.95),
    "param_group.layer2.weight.betas" : (0.9, 0.95),
}
```
This allows distributed state_dict (DSD) to support MPMD (e.g., pipeline parallelism).
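The flattening described above can be sketched in plain Python. This is an illustration only, not the DSD implementation: the helper names are hypothetical, tensors are replaced by placeholder strings, and the sketch assumes that state entry names (such as `exp_avg`) contain no dots so that the FQN can be recovered with `rsplit`.

```python
def flatten_optim_state_dict(osd):
    """Flatten a nested optimizer state_dict into FQN-prefixed keys."""
    flat = {}
    for fqn, state in osd["state"].items():
        for name, value in state.items():
            flat[f"state.{fqn}.{name}"] = value
    for group in osd["param_group"]:
        # Every hyperparameter is duplicated per parameter, so keys stay unique.
        hyperparams = {k: v for k, v in group.items() if k != "params"}
        for fqn in group["params"]:
            for name, value in hyperparams.items():
                flat[f"param_group.{fqn}.{name}"] = value
    return flat


def unflatten_state(flat):
    """Rebuild only the nested "state" part from the flattened keys."""
    state = {}
    for key, value in flat.items():
        section, rest = key.split(".", 1)
        if section != "state":
            continue
        fqn, name = rest.rsplit(".", 1)  # assumes `name` has no dots
        state.setdefault(fqn, {})[name] = value
    return state


osd = {
    "state": {
        "layer1.weight": {"step": 10, "exp_avg": "T1", "exp_avg_sq": "T2"},
        "layer2.weight": {"step": 10, "exp_avg": "T3", "exp_avg_sq": "T4"},
    },
    "param_group": [
        {"lr": 0.1, "betas": (0.9, 0.95), "params": ["layer1.weight", "layer2.weight"]}
    ],
}
flat = flatten_optim_state_dict(osd)
```

Because each key now carries the parameter FQN, two pipeline stages that own different parameters no longer produce colliding keys such as `param_group.lr`.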

**Pros and Cons**
*Pros*
1. Can support optimizer resharding (e.g., changing the parallelisms from 3D to 2D or changing the number of workers).
2. Users don't need to manually add prefixes to different optimizers.
3. Allow users to merge the optimizer states easily. One use case is loop-based pipeline parallelism.

*Cons*
1. The implementation makes a strong assumption about the structure of `param_groups` and its values. If the assumption changes or some customized optimizers do not meet it, the implementation will break.
2. There will be extra values saved in the checkpoints. The assumption here is `param_group` generally contains scalars which are cheap to save.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127071
Approved by: https://github.com/wconstab, https://github.com/wz337
ghstack dependencies: #127070
2024-05-31 16:20:36 +00:00
6b1b8d0193 [DSD] Remove the support of Dict[nn.Module, Dict[str, Any]] state_dict (#127070)
Summary:
This is a very complicated signature that is hard for users to reason about. Remove support for this feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127070
Approved by: https://github.com/wz337
2024-05-31 16:16:05 +00:00
a010fa9e24 [DCP] Fix variable spelling (#127565)
Summary: tsia

Test Plan: sandcastle

Differential Revision: D57983752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127565
Approved by: https://github.com/wz337, https://github.com/fegin
2024-05-31 15:32:08 +00:00
75e7588f47 [Inductor UT] Fix expected failure but pass for test case on Intel GPU. (#127595)
The XPU expected-failure test case `TritonCodeGenTests.test_codegen_config_option_dont_assume_alignment` should have been marked as expected to pass after PR #126261 merged, but because the test was flaky, this case was skipped when landing the PR. The "expected failure but passed" error was then exposed in the periodic test: https://github.com/pytorch/pytorch/actions/runs/9302864965/job/25605549183#step:14:2082.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127595
Approved by: https://github.com/EikanWang, https://github.com/chuanqi129, https://github.com/peterbell10, https://github.com/atalman
2024-05-31 15:32:00 +00:00
4644def434 Update docstring for weights_only (#127575)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127575
Approved by: https://github.com/malfet
2024-05-31 14:27:31 +00:00
cddb8dbebe add workloadd events to pytorch (#127415)
Summary: add workloadd events to pytorch

Test Plan: CIs

Differential Revision: D57914472

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127415
Approved by: https://github.com/sraikund16
2024-05-31 14:25:44 +00:00
10a92b5f84 [AOTI] Fix a bool value codegen issue when calling custom ops (#127398)
Summary: fixes https://github.com/pytorch/pytorch/issues/127392

Differential Revision: [D57911527](https://our.internmc.facebook.com/intern/diff/D57911527)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127398
Approved by: https://github.com/angelayi, https://github.com/chenyang78
ghstack dependencies: #126916, #127037
2024-05-31 14:01:36 +00:00
17c5b6508b [AOTI] Support _CollectiveKernel in the cpp wrapper mode (#127037)
Summary: _CollectiveKernel appears in TorchBench moco training. It's a special Fallback op that requires extra care.

Differential Revision: [D57911441](https://our.internmc.facebook.com/intern/diff/D57911441)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127037
Approved by: https://github.com/malfet
ghstack dependencies: #126916
2024-05-31 13:58:50 +00:00
413b81789f [AOTI][refactor] Unify val_to_arg_str and val_to_cpp_arg_str (#126916)
Summary: Now fallback argument type information has been passed, so time to unify val_to_arg_str and val_to_cpp_arg_str

Differential Revision: [D57907751](https://our.internmc.facebook.com/intern/diff/D57907751)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126916
Approved by: https://github.com/chenyang78
2024-05-31 13:56:11 +00:00
aaef7b29e9 Only register _inductor_test ops if not running with deploy (#127557)
Internal xref: https://fb.workplace.com/groups/1405155842844877/posts/8498194410207616

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127557
Approved by: https://github.com/zou3519
2024-05-31 13:34:23 +00:00
029b3ec775 Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)"
This reverts commit dae33a4961addb5847dbb362e7bb907bbfc64929.

Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/PaliC due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/126068#issuecomment-2141992307))
2024-05-31 12:33:25 +00:00
cyy
a6bae1f6db Remove more caffe2 files (#127511)
Remove more caffe2 files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127511
Approved by: https://github.com/r-barnes
2024-05-31 11:26:27 +00:00
df0c69f32d [inductor] Add fallback for collectives size estimation for unbacked (#127562)
Differential Revision: [D57982928](https://our.internmc.facebook.com/intern/diff/D57982928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127562
Approved by: https://github.com/yifuwang
2024-05-31 11:15:46 +00:00
f4d7cdc5e6 [dynamo] Add current instruction to BlockStackEntry (#127482)
Will be used by exception handling in later PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127482
Approved by: https://github.com/jansel
2024-05-31 08:58:53 +00:00
2a03bf5a14 [inductor] fix grid z bug for large grid (#127448)
Fixes #123210

2f3d3ddd70/torch/_inductor/runtime/triton_heuristics.py (L1733-L1753)

If a kernel's y_grid is larger than 65535, it will be split into multiple z grids. The grid function above does this split before the kernel launch; however, the computations for yoffset and y_grid are incorrect. For example, if we have xy numel of `(1*XBLOCK, 65537*YBLOCK)`, this function will return an [xyz]_grid of (1, 32768, 2). XBLOCK and YBLOCK here are used by the following `get_grid_dim`. Let's use their default values (4, 1024).

2f3d3ddd70/torch/_inductor/runtime/triton_heuristics.py (L1734)

[xyz]_grid = (1, 32768, 2) means the workload is divided into two z grids. Because the triton kernel generation still follows the xy dimensions, one example of a generated kernel is shown below.

```python
@triton.jit
def triton_(in_ptr0, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    ynumel = 65537*1024
    xnumel = 1*4
    yoffset = tl.program_id(1) * (tl.program_id(2) + 1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    x2 = xindex
    y0 = yindex % 128
    y1 = (yindex // 128)
    y3 = yindex
    tmp0 = tl.load(in_ptr0 + (y0 + (128*x2) + (512*y1)), xmask, eviction_policy='evict_last')
    tl.store(out_ptr0 + (x2 + (4*y3)), tmp0, xmask)
```

For a triton block with xyz index (0, 0, 1), its yoffset and xoffset are both 0 based on the computations `yoffset = tl.program_id(1) * (tl.program_id(2) + 1) * YBLOCK` and `xoffset = tl.program_id(0) * XBLOCK`. So, this triton block will access the very first elements of the input. However, the correct yoffset should be `(y_index + z_index * y_grid) * YBLOCK`, which is the starting position of the 2nd z grid.

At the same time, because we used `y_grid = y_grid // div` to compute the maximum number of elements in the y dimension, y_grid is 32768. The total number of y grids is 32768*2 = 65536, which is less than the actual 65537. So, we should use `y_grid = ceildiv(y_grid, div)` to compute the y grid so that the remaining grids are kept.
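The arithmetic above can be checked with a small pure-Python model. This is a sketch of the described behavior, not the inductor code: `split_y_grid`, `y_block_index`, and `covered_tiles` are hypothetical helpers, and a "tile" here means one YBLOCK-sized chunk of the y dimension.

```python
MAX_Y_GRID = 65535  # limit quoted in the description above


def ceildiv(a, b):
    return -(-a // b)


def split_y_grid(y_grid, fixed=True):
    """Return (y_grid, z_grid) after splitting an oversized y grid."""
    div = ceildiv(y_grid, MAX_Y_GRID)
    per_z = ceildiv(y_grid, div) if fixed else y_grid // div
    return per_z, div


def y_block_index(y_id, z_id, y_grid, fixed=True):
    """Tile at which program (y_id, z_id) starts, i.e. yoffset // YBLOCK."""
    if fixed:
        return y_id + z_id * y_grid
    return y_id * (z_id + 1)  # the old, incorrect formula


def covered_tiles(total, fixed):
    """Set of valid tiles touched by at least one program."""
    y_grid, z_grid = split_y_grid(total, fixed)
    tiles = {
        y_block_index(y, z, y_grid, fixed)
        for z in range(z_grid)
        for y in range(y_grid)
    }
    return {t for t in tiles if t < total}


total_tiles = 65537
old = covered_tiles(total_tiles, fixed=False)
new = covered_tiles(total_tiles, fixed=True)
```

In this model the old formulas leave many tiles unvisited, while the `ceildiv` split together with the `y_index + z_index * y_grid` offset visits all 65537 tiles. The out-of-range tile is filtered out here, which plays the role of the bounds mask in the kernel.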

#123210 is not about AOTInductor; the root cause is the triton kernel generated by torchinductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127448
Approved by: https://github.com/eellison
2024-05-31 08:01:34 +00:00
4935a019e4 [ONNX] Update decomposition table to core ATen ops (#127353)
Fixes #125894

Prior to this PR, some ATen core ops were missing from the decomposition table because we thought they might be decomposed into prim ops, as they are under _refs. This PR adds them back according to f6ef832e87/torch/_decomp/__init__.py (L253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127353
Approved by: https://github.com/justinchuby
2024-05-31 06:35:47 +00:00
cyy
0c5faee372 Replace python::python with Python::Module (#127485)
Use found Python::Module target
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127485
Approved by: https://github.com/ezyang
2024-05-31 05:57:05 +00:00
b5e85b8ecc Add deferred_runtime_assertion pass after run_decompositions (#127305)
Summary: We also want to reinsert the deferred_runtime passes after run_decompositions.

Test Plan: CI

Reviewed By: zhxchen17

Differential Revision: D57802237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127305
Approved by: https://github.com/BoyuanFeng
2024-05-31 05:45:28 +00:00
ae47152ca8 Expand supported labels to most self-hosted linux pull.yml workflows (#127578)
Initial set of runners added in https://github.com/pytorch/pytorch/pull/127566 seem to be working.

Expanding to include more machine types, especially GPU machines
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127578
Approved by: https://github.com/huydhn
2024-05-31 05:40:16 +00:00
ec098b88b6 [compiled autograd] torch.compile API (#125880)
- enter existing compiled autograd ctx manager before entering torch.compile frames

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125880
Approved by: https://github.com/jansel
2024-05-31 04:38:20 +00:00
cyy
ee08cf5792 Improve MAGMA conditional macro in BatchLinearAlgebra.cpp (#127495)
Unnecessary TORCH_CHECK(false) are changed to macro coverage as mentioned in #127371
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127495
Approved by: https://github.com/ezyang
2024-05-31 04:27:20 +00:00
159632aecd [dynamo] Support hasattr on BuiltinVariable (#127372)
Fixes https://github.com/pytorch/pytorch/issues/127172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127372
Approved by: https://github.com/williamwen42, https://github.com/yanboliang
ghstack dependencies: #127377
2024-05-31 04:23:56 +00:00
bb6bfd9ad8 [dynamo][compile-time] Cache the child guard managers (#127377)
Reduces compile time of the MobileBertForMaskedLM model from 39 seconds to 26 seconds. This was a regression introduced by #125202. Before that PR, compile time was 24 seconds. The extra two seconds are because we are going through an enormous number of guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127377
Approved by: https://github.com/jansel
2024-05-31 04:23:56 +00:00
f264745ff1 [interformer] batch pointwise op + unbind stack pass in post grad (#126959)
Summary: Tested on H100 with single GPU, and the bs is set to 64.

Test Plan:
# local script

```
buck2 run mode/opt scripts/jackiexu0313/pt2:uniarch_perf_benchmark -- single-module-benchmark --provider interformer --enable_pt2 True --batch_size 64
```

baseline: P1370993922

| Metric             | Value        |
|:-------------------|:-------------|
| Latency            | 120.84 ms    |
| Model size         | 5.93 G bytes |
| Flops/example      | 62.22 GB     |
| TFLOPS             | 32.95        |
| MFU                | 4.12%        |
| Activation/example | 128.17 MB    |

proposal: P1371676068

config
```
torch._inductor.config.pre_grad_fusion_options = {}
torch._inductor.config.post_grad_fusion_options = {
        "batch_aten_mul": {"min_fuse_set_size": 50},
        "batch_aten_sigmoid": {"min_fuse_set_size": 50},
        "batch_aten_relu": {"min_fuse_set_size": 50},
        "batch_linear_post_grad": {"min_fuse_set_size": 50},
        "unbind_stack_aten_pass": {},
}
```

| Metric             | Value        |
|:-------------------|:-------------|
| Latency            | 117.30 ms    |
| Model size         | 5.93 G bytes |
| Flops/example      | 62.65 GB     |
| TFLOPS             | 34.18        |
| MFU                | 4.27%        |
| Activation/example | 163.12 MB    |

Differential Revision: D57595173

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126959
Approved by: https://github.com/jackiexu1992
2024-05-31 03:54:43 +00:00
cyy
8629f9b3f2 Remove more unused variables in tests (#127510)
Follows #127379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127510
Approved by: https://github.com/Skylion007, https://github.com/r-barnes
2024-05-31 03:39:45 +00:00
0aaac68c57 Add structured logging for tensor fakeification (#126879)
This adds dumps of MetaTensorDesc and MetaStorageDesc to structured logs
when they are triggered from Dynamo.  The logs look like this:

```
V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:195] {"describe_storage": {"id": 0, "describer_id": 0, "size": 32}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:220] {"describe_tensor": {"id": 0, "ndim": 1, "dtype": "torch.float32", "device": "device(type='cpu')", "size": [8], "is_leaf": true, "stride": [1], "storage": 0, "view_func": "<built-in method _view_func_unsafe of Tensor object at 0x7f882959e840>", "describer_id": 0}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
V0522 08:13:25.268000 140224882566144 torch/_subclasses/meta_utils.py:1594] {"describe_source": {"describer_id": 0, "id": 0, "source": "L['x']"}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
```

The `describer_id` is used to disambiguate ids.  We expect it to be
unique per frame id, but if there is a bug it possibly is not.  Note you will get
redundant dumps when evaluation restarts.

tlparse can use this to give a visualization of input tensors to a
model, you could also use this to generate example inputs to run graphs
on.
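As a rough illustration of how such log lines could be consumed, the sketch below splits each line at the first `"] "` and parses the remainder as JSON. The helper names are hypothetical, the second sample line is a trimmed copy of the one shown above, and the assumption that the payload always follows the first `"] "` is based only on these sample lines.

```python
import json


def parse_structured_line(line):
    """Split a log line into its text prefix and its JSON payload."""
    prefix, _, payload = line.partition("] ")
    return prefix, json.loads(payload)


def tensors_by_describer(lines):
    """Collect describe_tensor records keyed by (describer_id, id)."""
    tensors = {}
    for line in lines:
        _, record = parse_structured_line(line)
        desc = record.get("describe_tensor")
        if desc is not None:
            tensors[(desc["describer_id"], desc["id"])] = desc
    return tensors


sample = [
    'V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:195] '
    '{"describe_storage": {"id": 0, "describer_id": 0, "size": 32}, '
    '"frame_id": 0, "frame_compile_id": 0, "attempt": 0}',
    # Trimmed copy of the describe_tensor line (some fields omitted).
    'V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:220] '
    '{"describe_tensor": {"id": 0, "ndim": 1, "dtype": "torch.float32", '
    '"size": [8], "is_leaf": true, "stride": [1], "storage": 0, "describer_id": 0}, '
    '"frame_id": 0, "frame_compile_id": 0, "attempt": 0}',
]
tensors = tensors_by_describer(sample)
```

Keying on `(describer_id, id)` follows the note above that `describer_id` is what disambiguates ids.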

Some care is taken to avoid redumping the tensor metadata multiple
times, which would happen ordinarily because AOTAutograd refakifies
everything after Dynamo, to deal with metadata mutation.

Partially fixes https://github.com/pytorch/pytorch/issues/126644

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126879
Approved by: https://github.com/jamesjwu
2024-05-31 01:58:44 +00:00
b1792a622d [pipelining] handle param aliasing (#127471)
Adds support for parameter aliasing in pipelining. Does this by reading the state_dict, and creating a map of id -> valid tensor FQNs (to be used in _sink_params). Assigns additional FQN attributes that may be used, runs _sink_params(), and then deletes unused attributes. Shares some similarity with how export's unflattener does it.
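The id-to-FQN map mentioned above can be sketched as follows. This is a simplified illustration using plain Python objects in place of tensors; `alias_groups` is a hypothetical helper and not the function used in the pipelining code.

```python
from collections import defaultdict


def alias_groups(state_dict):
    """Map id(tensor) -> all FQNs referring to it, keeping only aliased entries."""
    by_id = defaultdict(list)
    for fqn, tensor in state_dict.items():
        by_id[id(tensor)].append(fqn)
    return {key: fqns for key, fqns in by_id.items() if len(fqns) > 1}


shared = object()  # stand-in for a tied weight tensor
state_dict = {
    "embed.weight": shared,
    "lm_head.weight": shared,
    "mlp.weight": object(),
}
groups = list(alias_groups(state_dict).values())
```

Each group lists the FQNs under which the same underlying object is reachable, which is the information needed to assign the additional FQN attributes before sinking parameters.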

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127471
Approved by: https://github.com/kwen2501
2024-05-31 01:52:57 +00:00
d535de1747 [inductor] remove reordering_reindex (#127367)
This fixes the loop ordering issue for avg_pool2d here (https://github.com/pytorch/pytorch/issues/126255#issuecomment-2117931529).

The reason we can not fuse the 2 kernels for avg_pool2d is due to ComputedBuffer.iter_reordering_reindex. Take a simpler example:

```
        def f(x, y):
            """
            Add a matmul since inductor may force layout for output.
            """
            return (x.sum(dim=-1) + 1) @ y

        # Make the first 2 dimension not able to merge on purpose so that
        # ComputedBuffer.iter_reordering_reindex will be updated.
        x = rand_strided([20, 20, 30], [30, 900, 1], device="cuda")
        y = torch.randn(20, 20)
```

Suppose x.sum is stored to x2. The computed buffer for x2 will remember that we have reordered its first and second dimensions (i.e. loop order [1, 0]). Later on, when we decide the loop order for x2 when computing 'x2 + 1', we pick loop order [1, 0] according to the stride analysis. We then use the saved ComputedBuffer.iter_reordering_reindex to further reorder the loop order. The net effect is that we use loop order [0, 1], which prevents the pointwise kernel from fusing with the reduction kernel.
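The "net effect" is just the composition of two permutations. A tiny sketch, with `apply_order` as a hypothetical helper:

```python
def apply_order(dims, order):
    """Reorder `dims` so that position i holds dims[order[i]]."""
    return [dims[i] for i in order]


picked = [1, 0]  # loop order chosen by the stride analysis
saved = [1, 0]   # reordering remembered on the computed buffer
net = apply_order(picked, saved)
```

Applying the swap twice cancels out, so the pointwise kernel ends up iterating in order [0, 1] while the reduction uses [1, 0].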

I feel that we don't need ComputedBuffer.iter_reordering_reindex. And test result shows removing it has neutral impact on the dashboard [link](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2022%20May%202024%2017%3A30%3A29%20GMT&stopTime=Wed%2C%2029%20May%202024%2017%3A30%3A29%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/shunting314/153/head&lCommit=195f42cf1a414d2d1a0422b8a081a85ff52b7d20&rBranch=main&rCommit=d6e3e89804c4063827ea21ffcd3d865e5fe365d9)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127367
Approved by: https://github.com/jansel
2024-05-31 01:36:43 +00:00
7646825c3e Revert "distributed debug handlers (#126601)"
This reverts commit 3d541835d509910fceca00fc5a916e9718c391d8.

Reverted https://github.com/pytorch/pytorch/pull/126601 on behalf of https://github.com/PaliC due to breaking internal typechecking tests ([comment](https://github.com/pytorch/pytorch/pull/126601#issuecomment-2141076987))
2024-05-31 01:21:24 +00:00
cyy
d44daebdbc [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-31 01:20:45 +00:00
da9fb670d2 Nadam support the flag for "maximize" (#127214)
Fixes https://github.com/pytorch/pytorch/issues/126642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127214
Approved by: https://github.com/janeyx99
2024-05-31 01:11:16 +00:00
f6e303fa47 Revert "[DeviceMesh] Adding nD slicing support back (#127465)"
This reverts commit e72232f8f032b970b74da18200678b3a4617bf95.

Reverted https://github.com/pytorch/pytorch/pull/127465 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint e72232f8f0, the error does not look like a trivial fix, so I revert the change for a forward fix ([comment](https://github.com/pytorch/pytorch/pull/127465#issuecomment-2141051630))
2024-05-31 00:43:13 +00:00
af5ed05416 Include triton in py3.12 binaries (#127547)
Additional Builder PR: https://github.com/pytorch/builder/pull/1846/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127547
Approved by: https://github.com/williamwen42
2024-05-31 00:30:10 +00:00
fc73d07e5e [c10d] Decorate methods in NCCLUtils.hpp with TORCH_API (#127550)
Summary:
User-defined PyTorch modules that use `C10D_NCCL_CHECK` run into undefined symbol errors
when loaded by `torch.library.load()`, because the symbols have not been exported. This change
exports the symbols needed to resolve those runtime errors.

Test Plan: PyTorch CI

Differential Revision: D57977944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127550
Approved by: https://github.com/Skylion007
2024-05-31 00:17:25 +00:00
a2bff4dc8c Fix lint (#127584)
Trivial fix after https://github.com/pytorch/pytorch/pull/124678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127584
Approved by: https://github.com/huydhn
2024-05-31 00:00:11 +00:00
e72232f8f0 [DeviceMesh] Adding nD slicing support back (#127465)
Fixes #126530

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127465
Approved by: https://github.com/wconstab
2024-05-30 23:55:21 +00:00
214dd44608 [c10d] add Work's numel to logger for debugging purposes (#127468)
Summary:
We have seen some cases where all ranks call into a collective but it gets
stuck, probably due to incorrect tensor sizes. This adds the size
info to the logging for debugging.

Also, taking this chance to consolidate all logger-related status
metrics into one struct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127468
Approved by: https://github.com/wconstab
2024-05-30 23:32:33 +00:00
620ec081ec Extract inner loops into separate function for ARM64 fp16_dot_with_fp32_arith (#127476)
Summary: Preparing to generalize to bf16. (This should not be committed unless the following bf16 PR is committed!)

Test Plan: Spot-checked llm_experiments benchmark result to make sure it didn't regress.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127476
Approved by: https://github.com/malfet
ghstack dependencies: #127435, #127451
2024-05-30 23:28:17 +00:00
603bde1de3 Use efficient ARM fp16 dot product for gemm_transa_ general case (#127451)
Summary: This doesn't change the overall gemm algorithm away from repeated dot products, just uses our efficient fp16 dot product developed for the gemv case. It seems to improve performance for every prompt length I tested.
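
"Repeated dot products" here means that each output element is computed as one dot product. A minimal sketch of that structure in plain Python, assuming matrices are passed as lists of columns; the names `dot` and `gemm_transa` are illustrative only, and the real kernel is native code operating on fp16 data, which this sketch does not model:

```python
def dot(x, y):
    """Dot product of two equal-length vectors, accumulated in a Python float."""
    acc = 0.0
    for xi, yi in zip(x, y):
        acc += xi * yi
    return acc

def gemm_transa(a_cols, b_cols):
    """Compute C = A^T @ B, where a_cols[i] and b_cols[j] are columns of A and B.

    Returns C as a list of rows with c[i][j] = dot(A[:, i], B[:, j]).
    """
    return [[dot(a_col, b_col) for b_col in b_cols] for a_col in a_cols]

# A is 2x2 with columns [1, 2] and [3, 4]; B is 2x1 with column [5, 6].
c = gemm_transa([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]])
print(c)
```

With this structure, a faster `dot` speeds up the whole product without touching the outer loops, which matches the scope of the change described in the summary.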

Test Plan: Use https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py , edited to test only the trans_b (really gemm_transa_) case for the sizes outlined in the output.

Before:
```
Matrix-vector:
m=8, n=128, k=1
====================
trans_b  torch.float32    1.05 usec
trans_b  torch.float16    0.97 usec
trans_b torch.bfloat16    1.06 usec
m=128, n=8, k=1
====================
trans_b  torch.float32    0.80 usec
trans_b  torch.float16    0.97 usec
trans_b torch.bfloat16    1.00 usec
m=4096, n=4096, k=1
====================
trans_b  torch.float32 2160.75 usec
trans_b  torch.float16  659.77 usec
trans_b torch.bfloat16 3800.13 usec
m=11008, n=4096, k=1
====================
trans_b  torch.float32 6343.68 usec
trans_b  torch.float16 1789.42 usec
trans_b torch.bfloat16 10098.34 usec
m=4096, n=11008, k=1
====================
trans_b  torch.float32 6217.20 usec
trans_b  torch.float16 1874.47 usec
trans_b torch.bfloat16 10490.30 usec
m=32000, n=4096, k=1
====================
trans_b  torch.float32 17934.45 usec
trans_b  torch.float16 5323.81 usec
trans_b torch.bfloat16 29320.80 usec

Matrix-matrix (prompt len 4:
m=8, n=128, k=4
====================
trans_b  torch.float32    2.40 usec
trans_b  torch.float16    1.22 usec
trans_b torch.bfloat16    1.22 usec
m=128, n=8, k=4
====================
trans_b  torch.float32    1.52 usec
trans_b  torch.float16    1.33 usec
trans_b torch.bfloat16    1.77 usec
m=4096, n=4096, k=4
====================
trans_b  torch.float32 4317.09 usec
trans_b  torch.float16 15541.04 usec
trans_b torch.bfloat16 15032.29 usec
m=11008, n=4096, k=4
====================
trans_b  torch.float32 6191.19 usec
trans_b  torch.float16 40436.29 usec
trans_b torch.bfloat16 40626.93 usec
m=4096, n=11008, k=4
====================
trans_b  torch.float32 6049.22 usec
trans_b  torch.float16 42367.16 usec
trans_b torch.bfloat16 42482.43 usec
m=32000, n=4096, k=4
====================
trans_b  torch.float32 17611.36 usec
trans_b  torch.float16 117368.54 usec
trans_b torch.bfloat16 116958.85 usec

Matrix-matrix (prompt len 8:
m=8, n=128, k=8
====================
trans_b  torch.float32    1.04 usec
trans_b  torch.float16    1.71 usec
trans_b torch.bfloat16    1.74 usec
m=128, n=8, k=8
====================
trans_b  torch.float32    2.10 usec
trans_b  torch.float16    2.01 usec
trans_b torch.bfloat16    2.91 usec
m=4096, n=4096, k=8
====================
trans_b  torch.float32 2456.23 usec
trans_b  torch.float16 30112.76 usec
trans_b torch.bfloat16 29941.58 usec
m=11008, n=4096, k=8
====================
trans_b  torch.float32 6236.12 usec
trans_b  torch.float16 80361.22 usec
trans_b torch.bfloat16 80466.64 usec
m=4096, n=11008, k=8
====================
trans_b  torch.float32 6236.10 usec
trans_b  torch.float16 82990.74 usec
trans_b torch.bfloat16 83899.80 usec
m=32000, n=4096, k=8
====================
trans_b  torch.float32 17606.43 usec
trans_b  torch.float16 234397.38 usec
trans_b torch.bfloat16 237057.29 usec

Matrix-matrix (prompt len 16:
m=8, n=128, k=16
====================
trans_b  torch.float32    1.31 usec
trans_b  torch.float16    2.67 usec
trans_b torch.bfloat16    2.72 usec
m=128, n=8, k=16
====================
trans_b  torch.float32    1.66 usec
trans_b  torch.float16    3.36 usec
trans_b torch.bfloat16    5.18 usec
m=4096, n=4096, k=16
====================
trans_b  torch.float32 2504.24 usec
trans_b  torch.float16 60896.53 usec
trans_b torch.bfloat16 59852.49 usec
m=11008, n=4096, k=16
====================
trans_b  torch.float32 6407.11 usec
trans_b  torch.float16 163294.92 usec
trans_b torch.bfloat16 161199.10 usec
m=4096, n=11008, k=16
====================
trans_b  torch.float32 6132.30 usec
trans_b  torch.float16 167244.77 usec
trans_b torch.bfloat16 170064.35 usec
m=32000, n=4096, k=16
====================
trans_b  torch.float32 17635.56 usec
trans_b  torch.float16 475020.00 usec
trans_b torch.bfloat16 476332.29 usec

Matrix-matrix (prompt len 32:
m=8, n=128, k=32
====================
trans_b  torch.float32    1.40 usec
trans_b  torch.float16    4.67 usec
trans_b torch.bfloat16    4.80 usec
m=128, n=8, k=32
====================
trans_b  torch.float32    1.24 usec
trans_b  torch.float16    6.10 usec
trans_b torch.bfloat16   10.03 usec
m=4096, n=4096, k=32
====================
trans_b  torch.float32 2660.63 usec
trans_b  torch.float16 122436.04 usec
trans_b torch.bfloat16 121687.96 usec
m=11008, n=4096, k=32
====================
trans_b  torch.float32 6405.60 usec
trans_b  torch.float16 324708.42 usec
trans_b torch.bfloat16 324866.67 usec
m=4096, n=11008, k=32
====================
trans_b  torch.float32 6566.74 usec
trans_b  torch.float16 330801.04 usec
trans_b torch.bfloat16 332561.79 usec
m=32000, n=4096, k=32
====================
trans_b  torch.float32 18610.84 usec
trans_b  torch.float16 944578.75 usec
trans_b torch.bfloat16 940674.33 usec

Matrix-matrix (prompt len 128:
m=8, n=128, k=128
====================
trans_b  torch.float32    2.48 usec
trans_b  torch.float16   16.43 usec
trans_b torch.bfloat16   17.11 usec
m=128, n=8, k=128
====================
trans_b  torch.float32    1.83 usec
trans_b  torch.float16   22.31 usec
trans_b torch.bfloat16   37.00 usec
m=4096, n=4096, k=128
====================
trans_b  torch.float32 4806.59 usec
trans_b  torch.float16 485338.83 usec
trans_b torch.bfloat16 478835.08 usec
m=11008, n=4096, k=128
====================
trans_b  torch.float32 12109.51 usec
trans_b  torch.float16 1300928.58 usec
trans_b torch.bfloat16 1293181.63 usec
m=4096, n=11008, k=128
====================
trans_b  torch.float32 11223.70 usec
trans_b  torch.float16 1326119.92 usec
trans_b torch.bfloat16 1330395.12 usec
m=32000, n=4096, k=128
====================
trans_b  torch.float32 33485.34 usec
trans_b  torch.float16 3869227.17 usec
trans_b torch.bfloat16 3792905.00 usec
```

After:
```
Matrix-vector:
m=8, n=128, k=1
====================
trans_b  torch.float32    0.75 usec
trans_b  torch.float16    0.71 usec
trans_b torch.bfloat16    0.81 usec
m=128, n=8, k=1
====================
trans_b  torch.float32    0.75 usec
trans_b  torch.float16    0.93 usec
trans_b torch.bfloat16    0.98 usec
m=4096, n=4096, k=1
====================
trans_b  torch.float32 2194.31 usec
trans_b  torch.float16  661.27 usec
trans_b torch.bfloat16 3758.42 usec
m=11008, n=4096, k=1
====================
trans_b  torch.float32 5792.04 usec
trans_b  torch.float16 1789.98 usec
trans_b torch.bfloat16 10120.67 usec
m=4096, n=11008, k=1
====================
trans_b  torch.float32 6101.22 usec
trans_b  torch.float16 1927.34 usec
trans_b torch.bfloat16 10469.47 usec
m=32000, n=4096, k=1
====================
trans_b  torch.float32 18353.20 usec
trans_b  torch.float16 5161.06 usec
trans_b torch.bfloat16 29601.69 usec

Matrix-matrix (prompt len 4:
m=8, n=128, k=4
====================
trans_b  torch.float32    2.14 usec
trans_b  torch.float16    0.85 usec
trans_b torch.bfloat16    1.19 usec
m=128, n=8, k=4
====================
trans_b  torch.float32    1.47 usec
trans_b  torch.float16    1.85 usec
trans_b torch.bfloat16    1.75 usec
m=4096, n=4096, k=4
====================
trans_b  torch.float32 4416.40 usec
trans_b  torch.float16 2688.36 usec
trans_b torch.bfloat16 14987.33 usec
m=11008, n=4096, k=4
====================
trans_b  torch.float32 6140.24 usec
trans_b  torch.float16 7467.26 usec
trans_b torch.bfloat16 40295.52 usec
m=4096, n=11008, k=4
====================
trans_b  torch.float32 6143.10 usec
trans_b  torch.float16 7298.04 usec
trans_b torch.bfloat16 41393.43 usec
m=32000, n=4096, k=4
====================
trans_b  torch.float32 17650.72 usec
trans_b  torch.float16 21346.63 usec
trans_b torch.bfloat16 116849.98 usec

Matrix-matrix (prompt len 8:
m=8, n=128, k=8
====================
trans_b  torch.float32    1.05 usec
trans_b  torch.float16    1.03 usec
trans_b torch.bfloat16    1.69 usec
m=128, n=8, k=8
====================
trans_b  torch.float32    2.05 usec
trans_b  torch.float16    3.08 usec
trans_b torch.bfloat16    2.95 usec
m=4096, n=4096, k=8
====================
trans_b  torch.float32 2323.99 usec
trans_b  torch.float16 5265.45 usec
trans_b torch.bfloat16 29942.40 usec
m=11008, n=4096, k=8
====================
trans_b  torch.float32 6202.01 usec
trans_b  torch.float16 14677.90 usec
trans_b torch.bfloat16 80625.18 usec
m=4096, n=11008, k=8
====================
trans_b  torch.float32 6112.05 usec
trans_b  torch.float16 14340.52 usec
trans_b torch.bfloat16 82799.99 usec
m=32000, n=4096, k=8
====================
trans_b  torch.float32 17650.65 usec
trans_b  torch.float16 42551.43 usec
trans_b torch.bfloat16 236081.08 usec

Matrix-matrix (prompt len 16:
m=8, n=128, k=16
====================
trans_b  torch.float32    1.26 usec
trans_b  torch.float16    1.34 usec
trans_b torch.bfloat16    2.69 usec
m=128, n=8, k=16
====================
trans_b  torch.float32    1.60 usec
trans_b  torch.float16    5.81 usec
trans_b torch.bfloat16    5.34 usec
m=4096, n=4096, k=16
====================
trans_b  torch.float32 2328.05 usec
trans_b  torch.float16 10526.58 usec
trans_b torch.bfloat16 60028.28 usec
m=11008, n=4096, k=16
====================
trans_b  torch.float32 6243.35 usec
trans_b  torch.float16 28505.08 usec
trans_b torch.bfloat16 163670.15 usec
m=4096, n=11008, k=16
====================
trans_b  torch.float32 5870.11 usec
trans_b  torch.float16 28597.89 usec
trans_b torch.bfloat16 165404.88 usec
m=32000, n=4096, k=16
====================
trans_b  torch.float32 17746.27 usec
trans_b  torch.float16 83393.87 usec
trans_b torch.bfloat16 472313.13 usec

Matrix-matrix (prompt len 32:
m=8, n=128, k=32
====================
trans_b  torch.float32    1.35 usec
trans_b  torch.float16    2.01 usec
trans_b torch.bfloat16    4.68 usec
m=128, n=8, k=32
====================
trans_b  torch.float32    1.19 usec
trans_b  torch.float16   10.98 usec
trans_b torch.bfloat16   10.13 usec
m=4096, n=4096, k=32
====================
trans_b  torch.float32 2525.29 usec
trans_b  torch.float16 23106.71 usec
trans_b torch.bfloat16 122987.04 usec
m=11008, n=4096, k=32
====================
trans_b  torch.float32 6131.34 usec
trans_b  torch.float16 57537.41 usec
trans_b torch.bfloat16 327825.00 usec
m=4096, n=11008, k=32
====================
trans_b  torch.float32 6395.01 usec
trans_b  torch.float16 57456.33 usec
trans_b torch.bfloat16 331325.58 usec
m=32000, n=4096, k=32
====================
trans_b  torch.float32 19078.68 usec
trans_b  torch.float16 167735.08 usec
trans_b torch.bfloat16 975736.88 usec

Matrix-matrix (prompt len 128:
m=8, n=128, k=128
====================
trans_b  torch.float32    2.40 usec
trans_b  torch.float16    6.07 usec
trans_b torch.bfloat16   16.83 usec
m=128, n=8, k=128
====================
trans_b  torch.float32    1.78 usec
trans_b  torch.float16   40.35 usec
trans_b torch.bfloat16   37.21 usec
m=4096, n=4096, k=128
====================
trans_b  torch.float32 4827.60 usec
trans_b  torch.float16 84341.24 usec
trans_b torch.bfloat16 478917.75 usec
m=11008, n=4096, k=128
====================
trans_b  torch.float32 11879.96 usec
trans_b  torch.float16 226484.33 usec
trans_b torch.bfloat16 1289465.50 usec
m=4096, n=11008, k=128
====================
trans_b  torch.float32 10707.75 usec
trans_b  torch.float16 229200.58 usec
trans_b torch.bfloat16 1327416.67 usec
m=32000, n=4096, k=128
====================
trans_b  torch.float32 33306.32 usec
trans_b  torch.float16 662898.21 usec
trans_b torch.bfloat16 3815866.63 usec
```

torch.float16 performance seems to be improved for all except the
m=128, n=8, k=128 case, where it is roughly neutral. This case
motivated the addition of the "first-tier tail fixup" in the dot
kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127451
Approved by: https://github.com/malfet
ghstack dependencies: #127435
2024-05-30 23:28:17 +00:00
74b89b9283 Extract dot-product functions from fp16_gemv_trans gemv kernels (#127435)
Summary: Refactoring step before we attempt to use these to implement a less bad fp16 GEMM.

Test Plan: Existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127435
Approved by: https://github.com/malfet
2024-05-30 23:28:17 +00:00
a3c00e4331 [Easy] Move V.fake_mode inside of replace_by_example (#127494)
Was writing docs and saw that we always have this duplicated usage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127494
Approved by: https://github.com/shunting314, https://github.com/aorenste
2024-05-30 23:23:42 +00:00
f9a1bc2c65 [FSDP] Remove _sync_module_states (#124678)
Remove this unused API

Differential Revision: [D56445639](https://our.internmc.facebook.com/intern/diff/D56445639/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124678
Approved by: https://github.com/awgu
2024-05-30 23:02:09 +00:00
029af29e6d support operator.index function (#127440)
Fix https://github.com/pytorch/pytorch/issues/127426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127440
Approved by: https://github.com/mlazos
ghstack dependencies: #126444, #127146, #127424
2024-05-30 22:44:18 +00:00
3b88c27c46 Mark DynamicShapesExportTests::test_retracibility_dict_container_inp_out as slow (#127558)
Same as https://github.com/pytorch/pytorch/pull/117896, another slow test, `DynamicShapesExportTests::test_retracibility_dict_container_inp_out`, recently showed up on macOS. For example, https://ossci-raw-job-status.s3.amazonaws.com/log/25585713394

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127558
Approved by: https://github.com/clee2000
2024-05-30 22:40:48 +00:00
e02971fcfb Revert "Enable UFMT on test_shape_ops.py test_show_pickle.py test_sort_and_select.py (#127165)"
This reverts commit a288b95d4e5ceed327c5bdb9696331aa87688d60.

Reverted https://github.com/pytorch/pytorch/pull/127165 on behalf of https://github.com/atalman due to lint is failing ([comment](https://github.com/pytorch/pytorch/pull/127165#issuecomment-2140930658))
2024-05-30 22:06:46 +00:00
4ee003abdf [inductor] Repeat should not return a view (#127533)
Fixes #127474

`as_strided` unwraps views and looks at the underlying storage, so it isn't
legal to lower `repeat`, which should return a new storage, into a view.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127533
Approved by: https://github.com/lezcano
2024-05-30 21:38:59 +00:00
a288b95d4e Enable UFMT on test_shape_ops.py test_show_pickle.py test_sort_and_select.py (#127165)
Fixes some files in #123062

Run lintrunner on files:
test_shape_ops.py
test_show_pickle.py
test_sort_and_select.py

```bash
$ lintrunner --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127165
Approved by: https://github.com/ezyang
2024-05-30 21:34:16 +00:00
f471482eb2 Try to include NCCL related header file with macro USE_C10D_NCCL (#127501)
Try to include NCCL related header file with macro USE_C10D_NCCL, so that third-party device compilation will not be interrupted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127501
Approved by: https://github.com/ezyang
2024-05-30 21:33:41 +00:00
6849b80411 Add ninja as dev dependency (#127380)
`ninja` is required to build C++ extensions in tests.

```pytb
ERROR: test_autograd_cpp_node (__main__.TestCompiledAutograd)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/PanXuehai/Projects/pytorch/torch/testing/_internal/common_utils.py", line 2741, in wrapper
    method(*args, **kwargs)
  File "test/inductor/test_compiled_autograd.py", line 1061, in test_autograd_cpp_node
    module = torch.utils.cpp_extension.load_inline(
  File "/home/PanXuehai/Projects/pytorch/torch/utils/cpp_extension.py", line 1643, in load_inline
    return _jit_compile(
  File "/home/PanXuehai/Projects/pytorch/torch/utils/cpp_extension.py", line 1718, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/PanXuehai/Projects/pytorch/torch/utils/cpp_extension.py", line 1800, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "/home/PanXuehai/Projects/pytorch/torch/utils/cpp_extension.py", line 1849, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions

To execute this test, run the following from the base repo dir:
     python test/inductor/test_compiled_autograd.py -k TestCompiledAutograd.test_autograd_cpp_node
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127380
Approved by: https://github.com/ezyang
2024-05-30 21:22:42 +00:00
094183dba6 [torchbench][pt2] Enable Huggingface and Timm models for internal buck runner (#127460)
Summary: Add huggingface and timm model runs to the internal pt2 benchmark runner.

Test Plan:
Testing huggingface model:

```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BlenderbotSmallForCausalLM --performance --training --device=cuda --amp

 33/ 33 +0 frames   2s 13 graphs 13 graph calls    0/ -12 =   0% ops   0% time
```

Testing timm model:

```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only coat_lite_mini --performance --training --device=cuda --amp

loading model: 0it [00:11, ?it/s]
cuda train coat_lite_mini
  8/  8 +0 frames   4s  2 graphs  2 graph calls    0/  -1 =   0% ops   0% time
```

Differential Revision: D57930582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127460
Approved by: https://github.com/HDCharles, https://github.com/huydhn
2024-05-30 21:18:28 +00:00
cyy
bf2f5e70dd Fix warnings in SmallVector (#127250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127250
Approved by: https://github.com/ezyang
2024-05-30 21:13:20 +00:00
ad1b18ab2f Add repo-specific scale config files (#127566)
Part of moving pytorch/pytorch CI infra to a Linux Foundation-run AWS account.

For self-hosted runners that can run jobs from just a single repo, the runner scalers expect the scale-config files to be stored in the repo itself.

These scale-config files define how the Linux Foundation's self-hosted runners are configured. They will apply to runners that are available only to the pytorch/pytorch and pytorch/pytorch-canary repos.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127566
Approved by: https://github.com/zxiiro, https://github.com/huydhn, https://github.com/atalman
2024-05-30 21:08:45 +00:00
846f79e61a Revert "Reduce number of samples in {svd,pca}_lowrank OpInfos (#127199)"
This reverts commit 18a3f781e6382e2222d7c30c18136267407f9953.

Reverted https://github.com/pytorch/pytorch/pull/127199 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing MacOS trunk job 18a3f781e6 (25619618844) ([comment](https://github.com/pytorch/pytorch/pull/127199#issuecomment-2140834363))
2024-05-30 20:45:31 +00:00
cce2192396 [pipelining] Support calling multiple recv fwd/bwd ops (#127084)
Currently, only a single `get_fwd_recv_ops` or `get_bwd_recv_ops` can be called before `forward_one_chunk` and `backward_one_chunk` since they both share the same chunk_id counter. This creates a separate `recv_chunk_id` counter so that recvs can be accumulated.
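
The idea of a separate counter can be sketched as follows. This is a toy model under stated assumptions: the class `StageCounters` and its string "ops" are hypothetical stand-ins, and only the method and counter names are taken from the message above.

```python
class StageCounters:
    """Hypothetical stand-in for a pipeline stage's chunk bookkeeping."""

    def __init__(self):
        self.chunk_id = 0       # advanced by compute steps
        self.recv_chunk_id = 0  # advanced by recv-op requests

    def get_fwd_recv_ops(self):
        # Label the recv with its own counter so that several recvs can be
        # queued before any compute step runs.
        ops = [f"recv_fwd_chunk_{self.recv_chunk_id}"]
        self.recv_chunk_id += 1
        return ops

    def forward_one_chunk(self):
        done = self.chunk_id
        self.chunk_id += 1
        return done

stage = StageCounters()
queued = stage.get_fwd_recv_ops() + stage.get_fwd_recv_ops()
first = stage.forward_one_chunk()
print(queued, first)
```

Because the two counters advance independently, the second recv request refers to chunk 1 even though no forward step has run yet.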

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127084
Approved by: https://github.com/wconstab
2024-05-30 20:15:52 +00:00
aa3d041830 [pipelining] Fix block comments for doc rendering (#127418)
Previous:
<img width="915" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/14626937-7d79-4a7a-9d0b-3fcfe64b4667">
<img width="926" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/58ab009c-3f93-46d7-a04f-499a2a0ba390">

New:
https://docs-preview.pytorch.org/pytorch/pytorch/127418/distributed.pipelining.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127418
Approved by: https://github.com/wconstab
2024-05-30 20:10:07 +00:00
ff23c5b7d7 [cudagraph] improve log for mutating static input tensor addresses (#127145)
Summary: This diff adds more logging for cudagraph when a static input tensor's address mutates. For each placeholder whose static input tensor address mutates, we log its name, the changed data pointer address, and the input stack trace. Since some placeholders may have an empty stack trace, we find the first user with a non-empty stack trace and print that stack trace instead.
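
The fallback to the first user's stack trace can be sketched like this. The `Node` dataclass and `trace_for_logging` helper are hypothetical and only mimic the shape of a graph node; they are not the `torch.fx` API or the code in this diff.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """Hypothetical graph node with an optional stack trace and a list of users."""
    name: str
    stack_trace: Optional[str] = None
    users: List["Node"] = field(default_factory=list)

def trace_for_logging(placeholder: Node) -> Optional[str]:
    """Return the placeholder's own trace, else its first user's non-empty trace."""
    if placeholder.stack_trace:
        return placeholder.stack_trace
    for user in placeholder.users:
        if user.stack_trace:
            return user.stack_trace
    return None

add = Node("add", stack_trace="model.py:12 in forward")
view = Node("view")  # no trace recorded
arg = Node("arg0_1", users=[view, add])
print(trace_for_logging(arg))
```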

Test Plan: buck2 run fbcode//caffe2/test/inductor:cudagraph_trees -- --r test_static_inputs_address_mutation_log

Differential Revision: D57805118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127145
Approved by: https://github.com/eellison
2024-05-30 19:57:32 +00:00
19333d1eb9 [ROCm] Update triton pin to fix libtanh issue (#125396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125396
Approved by: https://github.com/pruthvistony, https://github.com/nmacchioni
2024-05-30 19:26:58 +00:00
2cb6f20867 Warn env vars only once during program (#127046)
This avoids logs being excessively noisy in some training runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127046
Approved by: https://github.com/kwen2501, https://github.com/wconstab
2024-05-30 19:10:53 +00:00
4afc5c7bb9 [torchscript] Handle prim::device and prim::dtype (#127466)
- Support prim::device and prim::dtype during torchscript migration to export
- Add unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127466
Approved by: https://github.com/SherlockNoMad
2024-05-30 18:35:44 +00:00
fa426b096b Default meta device to use swap_tensors in nn.Module._apply (.to_empty and .to('meta')) (#126819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126819
Approved by: https://github.com/albanD
ghstack dependencies: #127313, #126814
2024-05-30 18:28:13 +00:00
bfdec93395 Default XLA to use swap_tensors path in nn.Module._apply (#126814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126814
Approved by: https://github.com/JackCaoG, https://github.com/albanD
ghstack dependencies: #127313
2024-05-30 18:28:13 +00:00
39cf2f8e66 Added sorting notes for eig/eigvals (#127492)
Fixes #58034

@lezcano , Added suggested comments for eig and eigvals in the documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127492
Approved by: https://github.com/lezcano, https://github.com/kit1980
2024-05-30 18:13:22 +00:00
7827afca14 Copy the constant folding pass to the pass under export/passes folder (#127456)
It's a generic pass and I'm trying to find a good place to host it. It's currently needed by quantization flow. See context in D55930580, it's too much effort to land a fix in the inductor folder.

Differential Revision: [D57934182](https://our.internmc.facebook.com/intern/diff/D57934182/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127456
Approved by: https://github.com/angelayi
2024-05-30 18:04:08 +00:00
f9937afd4f Add noqa to prevent lint warnings (#127545)
This is to prevent the import from being removed due to unused import. What's annoying about this is that it's not consistently running: lintrunner doesn't warn me on this PR even without the comment, but it does on other PRs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127545
Approved by: https://github.com/masnesral
2024-05-30 17:56:49 +00:00
12d6446507 Revert "[inductor] fix mkldnn linear binary fusion check ut (#127296)"
This reverts commit cdeb242fc977210e211fd77b217320205c9f4042.

Reverted https://github.com/pytorch/pytorch/pull/127296 on behalf of https://github.com/huydhn due to Sorry for reverting your change but one of the tests is failing on trunk ROCm.  Please help fix and reland the change https://github.com/pytorch/pytorch/actions/runs/9302535020/job/25606932572 ([comment](https://github.com/pytorch/pytorch/pull/127296#issuecomment-2140334323))
2024-05-30 17:18:23 +00:00
e9a6bbbf7c Revert "[CI] add xpu test in periodic workflow (#126410)"
This reverts commit 30d98611a3a35287c47ded9647f0b4c81fbdf036.

Reverted https://github.com/pytorch/pytorch/pull/126410 on behalf of https://github.com/malfet due to Let's sync up on the test strategy/policies here ([comment](https://github.com/pytorch/pytorch/pull/126410#issuecomment-2140269549))
2024-05-30 17:01:02 +00:00
cyy
8777443d73 Remove FindMatlabMex.cmake (#127414)
It is not used anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127414
Approved by: https://github.com/ezyang
2024-05-30 16:26:35 +00:00
b506d37331 Fix multiple errors while parsing NativeFunctions from YAML (#127413)
Fix multiple errors in parse_native_yaml when loading NativeFunctions from a YAML file.

Add assertions that validate the parsed data.

Fixes #127404, #127405, #127406, #127407, #127408, #127409, #127410, #127411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127413
Approved by: https://github.com/ezyang
2024-05-30 16:25:04 +00:00
ea5c17de90 Revert "Add torchao nightly testing workflow (#126885)"
This reverts commit d938170314fa89acaad6b06fbbaac6b98f1e618f.

Reverted https://github.com/pytorch/pytorch/pull/126885 on behalf of https://github.com/atalman due to Broke inductor periodic test ([comment](https://github.com/pytorch/pytorch/pull/126885#issuecomment-2140139486))
2024-05-30 16:23:06 +00:00
cyy
be7be9fa16 [Distributed] [8/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#125102)
This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following https://github.com/pytorch/pytorch/pull/124987.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125102
Approved by: https://github.com/ezyang
2024-05-30 16:19:53 +00:00
576c5ef1dd [inductor] fix some tests in test_max_autotune.py (#127472)
Fix https://github.com/pytorch/pytorch/issues/126176. We should not use torch.empty to generate input data if we are going to run any accuracy test. torch.empty may return NaN. In that case both the reference and the actual result may contain NaN at the same index, but `NaN != NaN`, so the test fails.

Also, whether torch.empty returns NaN is not deterministic. It may depend on other tests that ran earlier.

Generating random data instead of calling torch.empty fixes the problem.
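
The `NaN != NaN` behavior can be reproduced without PyTorch. A minimal sketch using plain Python floats; the `same` helper shows a NaN-aware comparison for contrast only and is not the fix applied here, which is to avoid NaN inputs altogether:

```python
import math

nan = float("nan")
reference = [1.0, nan]
actual = [1.0, nan]

# Naive elementwise equality fails at the NaN position even though
# both results hold NaN at the same index.
naive_equal = all(r == a for r, a in zip(reference, actual))

def same(r, a):
    """Treat two NaNs as equal; otherwise compare values exactly."""
    return (math.isnan(r) and math.isnan(a)) or r == a

nan_aware_equal = all(same(r, a) for r, a in zip(reference, actual))
print(naive_equal, nan_aware_equal)
```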

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127472
Approved by: https://github.com/eellison, https://github.com/jansel
2024-05-30 16:04:48 +00:00
d2df0f56a3 Fix compilation_latency regression caused by #127060 (#127326)
It seems that while #127060 improved the speed for tacotron2 it introduced a compilation_latency regression for some of the TIMM benchmarks.

The original change was to precompute the Dep metadata - but apparently some benchmarks have few enough overlaps that precomputing O(n) deps was slower than ignoring O(n^2) deps.  So change it to go back to computing the Dep metadata on demand but to then cache the result.
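
Computing on demand and then caching can be sketched with `functools.cached_property`. The `Dep` class below is a hypothetical stand-in with made-up metadata, not the Inductor class, and the actual change may cache differently:

```python
from functools import cached_property

class Dep:
    """Hypothetical dependency record used only for illustration."""
    computed = 0  # counts how many times metadata was actually computed

    def __init__(self, name):
        self.name = name

    @cached_property
    def metadata(self):
        # The work happens on first access only; the result is then
        # stored on the instance and reused.
        Dep.computed += 1
        return {"name": self.name, "size": len(self.name)}

deps = [Dep("buf0"), Dep("buf1"), Dep("buf2")]
# Only one dep is ever queried, so only one computation runs,
# and the second access reuses the cached value.
first = deps[0].metadata
again = deps[0].metadata
print(Dep.computed, first is again)
```

Deps that are never queried cost nothing, which is the case the eager precomputation handled poorly.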

`dm_nfnet_f0` was a good example because on the dashboard it showed an increase from 140s -> 154s.

```
python benchmarks/dynamo/timm_models.py --performance --cold-start-latency --training --amp --backend inductor --dynamic-shapes --dynamic-batch-only --device cuda --total-partitions 5 --partition-id 1 --output timm-0.csv --only dm_nfnet_f0
```

Looking at the compilation_latency result.

On viable (d6e3e8980):
172.777958
176.725071
177.907955

On viable with #127060 and #127061 fully backed out:
158.305166
158.688560
160.791187

On viable w/ this change:
160.094164
160.201845
161.752157

I think that's probably close enough considering the variance.
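
The comparison can be made concrete by averaging the three samples in each group. The numbers below are copied from the lists above; the variable names are illustrative:

```python
from statistics import mean

# compilation_latency samples (seconds) quoted in this commit message.
viable = [172.777958, 176.725071, 177.907955]
backed_out = [158.305166, 158.688560, 160.791187]
this_change = [160.094164, 160.201845, 161.752157]

for label, runs in [("viable", viable), ("backed out", backed_out), ("this change", this_change)]:
    print(f"{label:12s} mean={mean(runs):.2f}s spread={max(runs) - min(runs):.2f}s")

gap = mean(this_change) - mean(backed_out)
print(f"gap vs. backed out: {gap:.2f}s ({gap / mean(backed_out):.1%})")
```

The gap between this change and the fully backed-out state is about 1.4s, which is smaller than the roughly 2.5s spread among the backed-out runs themselves.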

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127326
Approved by: https://github.com/oulgen
2024-05-30 15:37:08 +00:00
ffe506e853 Better graph break msg (and warning) on Dynamo x Python C++ extension (#127301)
Dynamo graph breaks on Python C/C++ extensions (e.g. pybinded
functions). The usual way to handle this is to turn those extensions
into custom ops. This PR adds a nicer graph break message and also
changes it to unconditionally warn on this graph break (because graph
break messages are usually not visible).

Fixes https://github.com/pytorch/pytorch/issues/126799

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127301
Approved by: https://github.com/jansel
ghstack dependencies: #127291, #127292, #127400, #127423
2024-05-30 14:54:29 +00:00
c9beea13ac Rewrite existing links to custom ops gdocs with the landing page (#127423)
NB: these links will be live after the docs build happens, which is once
a day.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127423
Approved by: https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #127291, #127292, #127400
2024-05-30 14:54:29 +00:00
18a3f781e6 Reduce number of samples in {svd,pca}_lowrank OpInfos (#127199)
We don't need to generate so many samples for these very expensive ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127199
Approved by: https://github.com/peterbell10, https://github.com/zou3519
ghstack dependencies: #125580
2024-05-30 14:45:58 +00:00
48538d3d14 Implement svd_lowrank and pca_lowrank for complex numbers (#125580)
We fix a number of bugs previously present in the complex
implementation.

We also heavily simplify the implementation, using, among
other things, that we now have conjugate views.

I saw there is a comment regarding how slow some checks on this
function are. As such, I removed quite a few of the combinations of inputs
to make the OpInfo lighter. I still left a couple relevant examples to not regress
coverage though.

Fixes https://github.com/pytorch/pytorch/issues/122188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125580
Approved by: https://github.com/pearu, https://github.com/peterbell10
2024-05-30 14:45:58 +00:00
3fb8a0b627 Fix nextafter in inductor CPP codegen (#126876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126876
Approved by: https://github.com/peterbell10, https://github.com/jgong5
2024-05-30 14:08:16 +00:00
ce63b676f3 Revert "[compiled autograd] torch.compile API (#125880)"
This reverts commit e1c322112a3d7b128b42e27f68bc9a714bfd9a09.

Reverted https://github.com/pytorch/pytorch/pull/125880 on behalf of https://github.com/atalman due to sorry your PR broke lint, need to revert ([comment](https://github.com/pytorch/pytorch/pull/125880#issuecomment-2139605376))
2024-05-30 13:53:31 +00:00
6e0eeecc7c Add back private function torch.cuda.amp.autocast_mode._cast (#127433)
This is unfortunately used in a few places in the wild: https://github.com/search?q=torch.cuda.amp.autocast_mode._cast&type=code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127433
Approved by: https://github.com/zou3519, https://github.com/guangyey
2024-05-30 13:29:23 +00:00
3f5d8636aa [inductor] Copy RedisRemoteCacheBackend into pytorch (#127480)
Summary: We need an implementation of RedisRemoteCacheBackend with the same API that we're using for FbMemcacheRemoteFxGraphCacheBackend. So we'll stop using the Triton implementation and adapt a version for use by inductor. I also renamed parameters and cache entries to match our cache terminology.

Test Plan: Ran this command twice and inspected log output to ensure I got cache hits:
```
TORCH_LOGS=+torch._inductor.codecache TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=1 python benchmarks/dynamo/torchbench.py --performance --inductor --device cuda --training --amp --print-compilation-time --only dcgan
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127480
Approved by: https://github.com/oulgen
2024-05-30 13:08:10 +00:00
cdeb242fc9 [inductor] fix mkldnn linear binary fusion check ut (#127296)
In this PR:

(1) Fix the unary fusion for bf16 conv/linear.
    Previously we registered the same fusion pattern for both `bf16` and `fp16` and did not check the dtype while matching. As a result, the `fp16` case matched the `bf16` pattern, but during the later replacement we found an unexpected float16 and so did not fuse. We fix this by checking dtypes so that the `fp16` case no longer matches the `bf16` pattern.

```
  def _is_valid_computation_unary_fusion(computation_op, lowp_dtype=None):
      def fn(match):
          matched = _is_single_computation_op(computation_op, lowp_dtype)(match)  # previously lowp_dtype was not checked here

```

This was not exposed before because we only checked the match count, which was correct anyway since the pattern did match. To address this, we add a check on the number of `generated_kernel`. If the ops are not fused, there is an additional kernel to compute the post op.

(2) Previously, the unit test
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_binary
```
did not check the fusion status; this PR fixes that.

(3) Extend `test_conv_binary` to also test low-precision (lp) dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127296
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-05-30 12:29:36 +00:00
9f73c65b8f xpu: pass MAX_JOBS building xpu_mkldnn_proj (#126562)
mkldnn is quite a big project, and MAX_JOBS support is essential when building on a system with a large number of CPUs and limited memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126562
Approved by: https://github.com/jgong5, https://github.com/guangyey, https://github.com/albanD
2024-05-30 12:10:33 +00:00
30d98611a3 [CI] add xpu test in periodic workflow (#126410)
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126410
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-05-30 12:10:15 +00:00
1071437169 Introduce cuda_p2p based fused_all_gather_matmul and fused_matmul_reduce_scatter (#126634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126634
Approved by: https://github.com/Chillee, https://github.com/wanchaol
2024-05-30 12:10:11 +00:00
705346bf8d [ONNX] Skip optimizer when it fails (#127349)
Continues #127039.

(1) Skip optimizer when it fails
(2) Update onnx, ort, and onnx-script
(3) The update to onnx-script is what actually enables the optimizer and rewriter in this PR; https://github.com/pytorch/pytorch/pull/123379 did not update onnx-script.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127349
Approved by: https://github.com/justinchuby
2024-05-30 07:08:45 +00:00
cd06ae0cb8 Relax use_count constraints for swap_tensors when AccumulateGrad holds a reference (#127313)
### Before this PR:
`torch.utils.swap_tensors(a, b)` required the `use_count` of `a` and `b` to be 1

```python
a = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, 4)
out = a * 2
out.sum().backward()
# Calling swap_tensors here would fail due to the reference held by AccumulateGrad node, which is not cleaned up after backward
# torch.utils.swap_tensors(a, b)
del out
# Calling swap_tensors here would pass
torch.utils.swap_tensors(a, b)
```
### After this PR:
`torch.utils.swap_tensors(a, b)` requires the `use_count` of `a` and `b` to be 1 or 2 IF the second reference is held by `AccumulateGrad`

A pre-hook will be registered on the `AccumulateGrad` node so that it will fail if it is called (i.e. if user attempts to backward through the graph).

```python
a = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, 4)
out = a * 2
out.sum().backward()
# Calling swap_tensors here is ok
torch.utils.swap_tensors(a, b)
# If we ever backward to the AccumulateGrad node it will error that it was poisoned by swap_tensors
```

### Application to `nn.Module`

This issue is especially pertinent in the context of `nn.Module`, where parameters will have `AccumulateGrad` nodes initialized after forward. Specifically, this is intended to address https://github.com/pytorch/pytorch/pull/126814#issuecomment-2127777866. Previously, the following would fail at `m.cpu()`. We want users to be able to do this, and to instead raise an error only if the user ever attempts to backward through the poisoned `AccumulateGrad` node.

```python
import torch
import torch.nn as nn
m = nn.Linear(3, 5)
inp = torch.randn(2, 3)
out = m(inp)
out.sum().backward()
m.cpu()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127313
Approved by: https://github.com/soulitzer
2024-05-30 07:06:55 +00:00
d44ab8ba6d [dynamo] utility to generate bytecode from template function (#127359)
This will be helpful in reducing some of the hardcoded and python-version-dependent bytecode generation in various places in dynamo - e.g. resume function generation and object reconstruction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127359
Approved by: https://github.com/jansel
ghstack dependencies: #127329
2024-05-30 06:37:32 +00:00
5d316c81be [Inductor] Add 0 initialization to Triton masked loads (#127311)
For a masked `tl.load` operation, the Triton language specifies that values masked out (i.e. where the mask evaluates to false) are undefined in the output of the load. Triton provides an optional `other` parameter which, when included, provides an explicit value to use for masked out values from the load. If the output from a masked load without the `other` parameter is used in a conditional, unexpected behavior can occur.

Despite the language specification, all Triton backends currently in use by PyTorch Inductor (NVIDIA, AMD, and Intel) 0-initialize masked loads if `other` is not present (we recently changed the Intel backend behavior to match NVIDIA and AMD because that's what our users expect, even if we are not following the Triton spec to a T). This PR attempts to "future-proof" Inductor against new backends (or perhaps changes in the current backends: we did not see any performance change from 0-initializing in the Intel XPU backend, but one could imagine compiler optimizations that remove paths depending on undefined values) by adding an explicit `other` in instances where later conditionals depend on the `tl.load` output. I also removed an exception to `other` behavior for boolean loads, which was put in place for a Triton bug that should be fixed. I added `other` to the getting started documentation as a clue that masked load behavior requires explicit initialization, even though I don't expect `undef` values to cause the example code to fail if the underlying output is not 0-initialized. Finally, I added `other` to the `make_load` function in `select_algorithm.py`, though I wasn't able to determine if that function was actually being called.
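
To make the hazard concrete, here is a plain-Python model of a masked load. It is only an emulation written for this description: `masked_load` and `UNDEFINED` are hypothetical names, not Triton APIs, and the sentinel merely stands in for "a value the spec does not define".

```python
UNDEFINED = object()  # stands in for a lane whose contents the spec leaves undefined


def masked_load(buffer, offsets, mask, other=UNDEFINED):
    # Lanes with a false mask never touch the buffer; they take `other` instead.
    return [buffer[off] if keep else other for off, keep in zip(offsets, mask)]


buffer = [5.0, -1.0, 3.0]
offsets = [0, 1, 2, 3]  # lane 3 is out of bounds
mask = [off < len(buffer) for off in offsets]

# With an explicit `other`, a later conditional on the loaded values is well defined.
loaded = masked_load(buffer, offsets, mask, other=0.0)
positive = [x > 0 for x in loaded]
```

Without `other`, lane 3 holds the sentinel and a comparison such as `x > 0` has no meaningful answer, which is the situation the explicit `other` avoids.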

Fixes #126535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127311
Approved by: https://github.com/jansel
2024-05-30 04:50:54 +00:00
3947731887 enable test_parameter_free_dynamic_shapes test when nn module inlining is on (#127424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127424
Approved by: https://github.com/mlazos
ghstack dependencies: #126444, #127146
2024-05-30 04:20:07 +00:00
15cc9f2e7e [dtensor][be] added checksAssert function and refactored test cases (#127356)
**Summary**
Added c10d checksAsserts functions to reduce written lines of code and refactored test cases. Merged one test case into another.

**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127356
Approved by: https://github.com/XilunWu
ghstack dependencies: #127025, #127029, #127040, #127134, #127334
2024-05-30 03:48:17 +00:00
998f38814c [dtensor][debug] added c10d allgather, allgather_coalesced, and allgather_into_tensor_coalesced tracing to CommDebugMode (#127334)
**Summary**
Added c10d allgather, allgather_coalesced, and allgather_into_tensor_coalesced tracing to CommDebugMode and edited test case in test_comm_mode to include added features.

**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127334
Approved by: https://github.com/XilunWu, https://github.com/yifuwang
ghstack dependencies: #127025, #127029, #127040, #127134
2024-05-30 03:48:17 +00:00
f58fc16e8f [easy?] Move AsyncCompile to a different file (#127235)
By moving AsyncCompile to its own file, we can import codecache without running the side effects of AsyncCompile. This will be important for AOTAutogradCaching, where we want to share some implementation details with codecache.py without spawning new processes.

To conservatively maintain the same behavior elsewhere, every time we import codecache, I've added an import to torch._inductor.async_compile (except in autograd_cache.py, where the explicit goal is to not do this)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127235
Approved by: https://github.com/aorenste, https://github.com/oulgen, https://github.com/masnesral
2024-05-30 02:43:02 +00:00
e0fc1ab625 Forward fix for templates + views (#127446)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127446
Approved by: https://github.com/eellison
2024-05-30 02:34:35 +00:00
3d541835d5 distributed debug handlers (#126601)
This adds debug handlers as described in:
* https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy)
* https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy)

This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow-up PR.

This adds 2 handlers out of the box:

* `/handler/ping` for testing purposes
* `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder

Test plan:

```
python test/distributed/elastic/test_control_plane.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126601
Approved by: https://github.com/kurman, https://github.com/c-p-i-o
2024-05-30 02:21:08 +00:00
e1c322112a [compiled autograd] torch.compile API (#125880)
- enter existing compiled autograd ctx manager before entering torch.compile frames

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125880
Approved by: https://github.com/jansel
2024-05-30 02:10:06 +00:00
da39461d61 [optim] Move test_grad_scaling_autocast_fused_optimizers to test_cuda.py (#126418)
This PR addresses the review comments on #124904.

- Move test_grad_scaling_autocast_fused_optimizers to test_cuda.py
- Combine _grad_scaling_autocast_fused_optimizers into test_grad_scaling_autocast_fused_optimizers
- Move to OptimizerInfo framework.
- For the failing tests test_grad_scaling_autocast_fused_optimizers AdamW_cuda_float32 and Adam_cuda_float32:
    - Added toleranceOverride in this PR
    - Created an issue, #127000

```
> (c2env) [sandish@devgpu166.ash6 ~/pytorch (refactoroptimizers)]$ python test/test_cuda.py -k test_grad_scaling_autocast_fused_optimizers -v
/home/sandish/pytorch/torch/backends/cudnn/__init__.py:106: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
  warnings.warn(
/home/sandish/pytorch/torch/backends/cudnn/__init__.py:106: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
  warnings.warn(
test_grad_scaling_autocast_fused_optimizers_Adagrad_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True}
{'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'lr': 0.1, 'fused': True}
{'lr': 0.1, 'fused': True}
{'initial_accumulator_value': 0.1, 'weight_decay': 0.1, 'fused': True}
{'initial_accumulator_value': 0.1, 'weight_decay': 0.1, 'fused': True}
{'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.1, 'fused': True}
{'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.1, 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_AdamW_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_Adam_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_SGD_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'momentum': 0.9, 'fused': True}
{'momentum': 0.9, 'fused': True}
{'momentum': 0.9, 'dampening': 0.5, 'fused': True}
{'momentum': 0.9, 'dampening': 0.5, 'fused': True}
{'momentum': 0.9, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_Adagrad_cuda_float32 (__main__.TestCudaOptimsCUDA) ... skipped 'cuda is not supported for fused on Adagrad'
test_grad_scaling_autocast_fused_optimizers_AdamW_cuda_float32 (__main__.TestCudaOptimsCUDA) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'capturable': True, 'fused': True}
{'capturable': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True}
{'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True}
{'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_Adam_cuda_float32 (__main__.TestCudaOptimsCUDA) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'capturable': True, 'fused': True}
{'capturable': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True}
{'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True}
{'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_SGD_cuda_float32 (__main__.TestCudaOptimsCUDA) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'momentum': 0.9, 'fused': True}
{'momentum': 0.9, 'fused': True}
{'momentum': 0.9, 'dampening': 0.5, 'fused': True}
{'momentum': 0.9, 'dampening': 0.5, 'fused': True}
{'momentum': 0.9, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
ok

----------------------------------------------------------------------
Ran 8 tests in 16.117s

OK (skipped=1)

> lintrunner test/test_cuda.py
----------------------------------------------------------------------
ok No lint issues.

> lintrunner torch/testing/_internal/common_optimizers.py
----------------------------------------------------------------------
ok No lint issues.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126418
Approved by: https://github.com/janeyx99
2024-05-30 01:47:41 +00:00
67739d8c6f Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)"
This reverts commit 699db7988d84d163ebb6919f78885e4630182a7a.

Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2138496995))
2024-05-30 01:16:57 +00:00
1abcac9dab New Custom Ops Documentation landing page (#127400)
We create a new landing page for PyTorch custom ops (suggested by
jansel). All of our error messages will link here, and I'll work with
the docs team to see if we can boost SEO for this page.

NB: the landing page links some non-searchable webpages. Two of those
(the Python custom ops tutorial and C++ custom ops tutorial) will turn
into actual webpages when PyTorch 2.4 comes around. I'll make the third one
(the Custom Operators Manual) once it stabilizes (we continuously add new
things to it and the length means that we might want to create a custom
website for it to make the presentation more digestible).

Test Plan:
- view docs preview.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127400
Approved by: https://github.com/jansel
ghstack dependencies: #127291, #127292
2024-05-30 01:06:04 +00:00
49ad90349d Correct error message for aten::_local_scalar_dense on meta tensor (#124554)
Registers a meta for aten::_local_scalar_dense with a different error message.

Fixes pytorch#119588

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124554
Approved by: https://github.com/ezyang
2024-05-30 00:50:29 +00:00
d66f12674c Handle tuple and dict during TorchScript to ExportedProgram conversion (#127341)
* Add some test cases for testing List, Tuple, and Dict
* Refactor the conversion code slightly
* Add logic to handle Dict
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127341
Approved by: https://github.com/SherlockNoMad, https://github.com/angelayi
2024-05-30 00:08:09 +00:00
f14dc3bde8 Fix check message (#126951)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126951
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2024-05-29 23:58:09 +00:00
76fc58c160 Document the legacy constructor for Tensor (#122625)
Fixes https://github.com/pytorch/pytorch/issues/122408

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122625
Approved by: https://github.com/albanD
2024-05-29 23:23:19 +00:00
7931eee5c5 Support torch.dtype as parameter in pybind11 cpp extension. (#126865)
Support torch.dtype as parameter in pybind11 cpp extension.
Example:
`cpp_extension.my_ops(self, other, torch.dtype)`

@ezyang @bdhirsh
Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126865
Approved by: https://github.com/ezyang
2024-05-29 23:19:32 +00:00
cyy
8ea1dc8748 Use Python::NumPy target (#127399)
Now that we use FindPython, use it again for numpy detection.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127399
Approved by: https://github.com/malfet
2024-05-29 23:17:58 +00:00
0fa2c5b049 Fix mask propagation in the presence of where (#125574)
Before, when calling ops.where, masks were not properly propagated. We
now restrict the optimisation to `ops.masked`, which I think is what
the original code intended to do.

I'm not 100% sure that even in the masked case this code is not
introducing some bugs, but this is a strict improvement over the
previous state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125574
Approved by: https://github.com/peterbell10
ghstack dependencies: #114471, #126783
2024-05-29 23:17:41 +00:00
15a7916c0e Ability to capture Process Groups information into Execution Traces (#126995)
Contains a method added to the ExecutionTraceObserver class to record the snapshot of the current process group config upon tracing start.

Unit test:

```
(pytorch) [dsang@devgpu021.nha2 ~/github/pytorch-fork (viable/strict)]$ touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_execution_trace
/home/dsang/github/pytorch-fork/torch/distributed/optim/__init__.py:28: UserWarning: TorchScript support for functional optimizers isdeprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead.
  warn("TorchScript support for functional optimizers is"
test_ddp_profiling_execution_trace (__main__.TestDistBackendWithSpawn.test_ddp_profiling_execution_trace) ... /home/dsang/github/pytorch-fork/torch/distributed/optim/__init__.py:28: UserWarning: TorchScript support for functional optimizers isdeprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead.
  warn("TorchScript support for functional optimizers is"
/home/dsang/github/pytorch-fork/torch/distributed/optim/__init__.py:28: UserWarning: TorchScript support for functional optimizers isdeprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead.
  warn("TorchScript support for functional optimizers is"
NCCL version 2.20.5+cuda12.0
[rank1]:[W523 16:06:01.705774398 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W523 16:06:01.705905760 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank1]:[W523 16:06:01.715182258 execution_trace_observer.cpp:819] Enabling Execution Trace Observer
printing pg info into trace
[rank0]:[W523 16:06:01.715841805 execution_trace_observer.cpp:819] Enabling Execution Trace Observer
printing pg info into trace
[rank1]:[W523 16:06:01.727881877 execution_trace_observer.cpp:831] Disabling Execution Trace Observer
[rank0]:[W523 16:06:01.728792871 execution_trace_observer.cpp:831] Disabling Execution Trace Observer
Execution trace saved at /tmp/tmpdsov4ngi.et.json
[{'id': 3, 'name': '## process_group:init ##', 'ctrl_deps': 2, 'inputs': {'values': ['[{"pg_name": "0", "pg_desc": "default_pg", "backend_config": "cuda:nccl", "ranks": [], "group_size": 2, "group_count": 1}]'], 'shapes': [[]], 'types': ['String']}, 'outputs': {'values': [], 'shapes': [], 'types': []}, 'attrs': [{'name': 'rf_id', 'type': 'uint64', 'value': 1}, {'name': 'fw_parent', 'type': 'uint64', 'value': 0}, {'name': 'seq_id', 'type': 'int64', 'value': -1}, {'name': 'scope', 'type': 'uint64', 'value': 7}, {'name': 'tid', 'type': 'uint64', 'value': 1}, {'name': 'fw_tid', 'type': 'uint64', 'value': 0}, {'name': 'op_schema', 'type': 'string', 'value': ''}, {'name': 'kernel_backend', 'type': 'string', 'value': ''}, {'name': 'kernel_file', 'type': 'string', 'value': ''}]}]
Execution trace saved at /tmp/tmpsdiqy6az.et.json
[{'id': 3, 'name': '## process_group:init ##', 'ctrl_deps': 2, 'inputs': {'values': ['[{"pg_name": "0", "pg_desc": "default_pg", "backend_config": "cuda:nccl", "ranks": [], "group_size": 2, "group_count": 1}]'], 'shapes': [[]], 'types': ['String']}, 'outputs': {'values': [], 'shapes': [], 'types': []}, 'attrs': [{'name': 'rf_id', 'type': 'uint64', 'value': 1}, {'name': 'fw_parent', 'type': 'uint64', 'value': 0}, {'name': 'seq_id', 'type': 'int64', 'value': -1}, {'name': 'scope', 'type': 'uint64', 'value': 7}, {'name': 'tid', 'type': 'uint64', 'value': 1}, {'name': 'fw_tid', 'type': 'uint64', 'value': 0}, {'name': 'op_schema', 'type': 'string', 'value': ''}, {'name': 'kernel_backend', 'type': 'string', 'value': ''}, {'name': 'kernel_file', 'type': 'string', 'value': ''}]}]
ok

----------------------------------------------------------------------
Ran 1 test in 24.447s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126995
Approved by: https://github.com/briancoutinho, https://github.com/sraikund16
2024-05-29 23:16:17 +00:00
3174e6cb8e [Temp][CI] Run older MPS tests/Mac builds on MacOS 13 (#127428)
To avoid ambiguity while the migration outlined in https://github.com/pytorch-labs/pytorch-gha-infra/pull/399 is in progress. Otherwise, MPS jobs for Ventura, or builds, can be accidentally scheduled on Sonoma, which might result in flaky failures on trunk.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127428
Approved by: https://github.com/huydhn
2024-05-29 22:58:41 +00:00
9257a0698b [Split Build] Load dependencies from libtorch in __init__.py (#126826)
This PR makes it such that we search for a libtorch wheel when initializing pytorch in order to find the necessary shared libraries.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126826
Approved by: https://github.com/huydhn, https://github.com/atalman, https://github.com/ZainRizvi
2024-05-29 22:03:50 +00:00
b0ef363972 [dtensor] rename _Partial -> Partial for all imports (#127420)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127420
Approved by: https://github.com/awgu
2024-05-29 21:42:40 +00:00
d99b115eb3 Fix delete old branches workflow (#127442)
The ubuntu runner started using 2.45.1 (prev 2.43.2), which includes 1f49f7506f (changes +00:00 timezone to Z)

Python versions prior to 3.11 do not support Z when parsing isoformat, so update the workflow to use 3.11
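
For reference, the parsing difference can also be worked around in code. The fix in this PR is simply to run the workflow on 3.11; the sketch below is an alternative for scripts that must stay on older interpreters, and `parse_iso8601` is a hypothetical helper name.

```python
from datetime import datetime


def parse_iso8601(stamp):
    # Before Python 3.11, datetime.fromisoformat rejects a trailing "Z",
    # so rewrite it to the equivalent "+00:00" offset first.
    if stamp.endswith("Z"):
        stamp = stamp[:-1] + "+00:00"
    return datetime.fromisoformat(stamp)


old_style = parse_iso8601("2024-05-29T21:17:09+00:00")
new_style = parse_iso8601("2024-05-29T21:17:09Z")
```

Both spellings then parse to the same timezone-aware timestamp regardless of the interpreter version.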

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127442
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-05-29 21:17:09 +00:00
38a33c3202 don't call .item in onehot for XLA (#127335)
We found that `nn.functional.one_hot` will cause a graph break due to the `.item()` call in the native implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127335
Approved by: https://github.com/ezyang
2024-05-29 20:37:26 +00:00
cyy
84b5aa9a68 [Caffe2] [Reland] Remove Caffe2 proto files (#127394)
Reland of #126134, which was reverted due to the wrong base. Now that #126705 has been relanded, it's time to reland this one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127394
Approved by: https://github.com/r-barnes
2024-05-29 20:37:02 +00:00
92d081e228 [Docs] Add str type to cuda.get_device_name() and cuda. get_device_capability() function (#126743)
Fixes #126400

The `get_device_name()` and `get_device_capability()` allow passing in a string, but it's not stated in the doc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126743
Approved by: https://github.com/eqy, https://github.com/kit1980
2024-05-29 20:09:52 +00:00
24a4bfdcc2 [AdaRound] Make versatile for data / extra param for callback function (#126891)
Summary:
For a speech sequential model, there can be cases where calling model(data) directly does not work for the feed-forward pass.

The speech model uses a different type of Criterion (a.k.a. loss function) to feed data to individual components such as the encoder, predictor, and joiner.

Hence we need an extra parameter to pass a feed-forward wrapper.

Differential Revision: D57680391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126891
Approved by: https://github.com/jerryzh168
2024-05-29 20:05:27 +00:00
c404b2968c Support min/max carry over for eager mode from_float method (#127309)
Summary:
After QAT is completed, or when a pre-tuned weight observer is provided by a tunable PTQ algorithm, the observer should not be overwritten again from the given weight, at least never for static QAT.

Dynamic QAT, by design, also does not need to re-run the weight observer.

This PR fixes that.

Test Plan: Signals

Differential Revision: D57747749

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127309
Approved by: https://github.com/jerryzh168
2024-05-29 19:33:26 +00:00
82a370ae3a Revert "Refresh OpOverloadPacket if a new OpOverload gets added (#126863)" (#127366)
This reverts commit ed734178abc99bc1d83ad2c61d3a1e4d4f5d20c8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127366
Approved by: https://github.com/zou3519
2024-05-29 19:26:06 +00:00
05e99154ee Allow int vals to go down the fastpath for _foreach_max (#127303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127303
Approved by: https://github.com/albanD
ghstack dependencies: #127187
2024-05-29 19:08:58 +00:00
601c5e085d Add _foreach_max (#127187)
This PR adds _foreach_max support, the second reduction foreach op we have :D

I did have to change the autogen slightly for foreach. I can promise that the existing foreach ops' derivative behavior has not changed as I've added a skip list for the harder requirement I am setting (that the arg list should match in length). I needed to add this requirement as there is another wrong max (the one that does take in a dim for reduction) that keeps getting matched first.

Caveats!
- We do not fast path if the shapes, dtypes, device, the regular shebang for foreach are not met. We fall back to slowpath!
- MORE IMPORTANTLY, we also do not fast path for int8 and int16 and bool, but that's really a skill issue on my end as I've hardcoded -INFINITY into the CUDA kernels, and -INFINITY is not defined for small ints. It'd be nice to know how to do this properly, but that work can also come later.
- This does NOT support empty Tensors in the list, because the original max op also does not support empty Tensors. ~I think this should be allowed though, and this PR may come later.~ I understand why this is not allowed.
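
The `-INFINITY` caveat above comes down to choosing an identity element for `max` per dtype. A plain-Python sketch of one possible approach follows. It is not what the PR does, and the dtype strings and function names are hypothetical stand-ins, not a PyTorch API:

```python
import math

# Hypothetical dtype table, for illustration only.
_INT_BITS = {"int8": 8, "int16": 16, "int32": 32, "int64": 64}
_FLOAT_DTYPES = ("float16", "bfloat16", "float32", "float64")


def max_identity(dtype: str):
    """Return a value v such that max(v, x) == x for every x of `dtype`."""
    if dtype in _FLOAT_DTYPES:
        return -math.inf
    if dtype == "bool":
        return False
    if dtype in _INT_BITS:
        # Smallest representable two's-complement value, e.g. -128 for int8.
        return -(2 ** (_INT_BITS[dtype] - 1))
    raise ValueError(f"unsupported dtype: {dtype}")


def foreach_max_reference(lists, dtype: str):
    """Reference semantics: one max per input list; empty lists are rejected."""
    results = []
    for values in lists:
        if not values:
            raise ValueError("empty inputs are not supported")
        acc = max_identity(dtype)
        for v in values:
            acc = max(acc, v)
        results.append(acc)
    return results
```

Seeding the reduction with the dtype's minimum keeps the result representable for small integer types, where negative infinity has no encoding.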

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127187
Approved by: https://github.com/albanD
2024-05-29 19:08:58 +00:00
90f4b3fcb2 PyTorch Distributed security assumptions (#127403)
To highlight, that PyTorch Distributed should only be used in a trusted environment and never on the nodes with open network access, which is very similar in spirit to https://github.com/tensorflow/tensorflow/blob/master/SECURITY.md#running-a-tensorflow-server

Thanks to @Xbalien and @K1ingzzz for drawing attention to missing documentation on distributed workloads security assumptions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127403
Approved by: https://github.com/wconstab
2024-05-29 19:08:20 +00:00
5196ef1b59 support builtin id function on user defined object variables. (#127146)
Fix: https://github.com/pytorch/pytorch/pull/127146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127146
Approved by: https://github.com/anijain2305
ghstack dependencies: #126444
2024-05-29 19:00:37 +00:00
ff65b18fcf Update the is_causal explanation in the SDPA doc (#127209)
Fixes #126873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127209
Approved by: https://github.com/drisspg
2024-05-29 18:53:17 +00:00
cyy
9cc0d56fdc Remove unused variables in tests (#127379)
Reland test fixes in #127161 and reduce reduce_ops_test into floating point types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127379
Approved by: https://github.com/ezyang
2024-05-29 18:30:51 +00:00
d938170314 Add torchao nightly testing workflow (#126885)
Add and test torchao nightly testing workflow.

This workflow will be triggered under the following conditions:
1. If the PR has ciflow/torchao label
2. Manual trigger

It will run the torchao benchmark on torchbench/timm/huggingface model workloads with 5 configs (noquant, autoquant, int8dynamic, int8weightonly, int4weightonly). The output will be updated to the PT2 Dashboard: https://hud.pytorch.org/benchmark/compilers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126885
Approved by: https://github.com/huydhn
2024-05-29 18:22:29 +00:00
090a031d6f Use bit_cast instead of UB type-pun-via-union in Half.h (#127321)
Summary: Type punning via union has undefined behavior due to the strict aliasing rule. bit_cast does the same thing safely (using memcpy under the hood).
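
As a rough Python analogue of what a bit cast does for half-precision values (this is not the C++ code in Half.h; the function names are hypothetical), the same two bytes can be reinterpreted with the standard `struct` module, which copies bytes much like `memcpy` does:

```python
import struct


def half_to_bits(value: float) -> int:
    """Encode `value` as IEEE 754 binary16 and reread the bytes as uint16."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def bits_to_half(bits: int) -> float:
    """Reinterpret a 16-bit pattern as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


one_bits = half_to_bits(1.0)         # sign 0, exponent 01111, mantissa 0
minus_two = bits_to_half(0xC000)     # sign 1, exponent 10000, mantissa 0
```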

Test Plan: CI

Godbolt demonstrates that doing this via memcpy still generates the same instructions: https://godbolt.org/z/PhePzd4Ex
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127321
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-05-29 17:43:50 +00:00
8b5cbb7c68 Improve NLLLoss docs (#127346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127346
Approved by: https://github.com/mikaylagawarecki
2024-05-29 17:29:06 +00:00
28de9143a3 opcheck should be usable without optional dependencies (#127292)
This PR excises opcheck's dependency on
torch.testing._internal.common_utils, (which comes with dependencies on
expecttest and hypothesis). We do this by moving what we need to
torch.testing._utils and adding a test for it.

Fixes #126870, #126871

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127292
Approved by: https://github.com/williamwen42
ghstack dependencies: #127291
2024-05-29 17:17:49 +00:00
8a31c2aa84 [export] allow complex guards as runtime asserts (#127129)
With the current state of export's dynamic shapes, we struggle with guards and constraints that are beyond the current dynamic shapes language, expressed with dims and derived dims. While we can compile and guarantee correctness for guards within the current language (e.g. min/max ranges, linear relationships, integer divisibility) we struggle to dynamically compile guards which extend beyond that.

For these "complex" guards, we typically do either of the following: 1) raise a constraint violation error, along the lines of "not all values of <symbol> in the specified range satisfy <guard>", with or without suggested fixes, 2) specialize to the provided static values and suggest removing dynamism, or 3) fail compilation due to some arbitrary unsupported case. Previous [work](https://github.com/pytorch/pytorch/pull/124949) went towards resolving this by disabling forced specializations, instead allowing the user to fail at runtime with incorrect inputs.

In this PR, relying on [hybrid backed-unbacked symints](https://github.com/pytorch/pytorch/issues/121749), [deferred runtime asserts](https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/runtime_assert.py), and the function [_is_supported_equivalence()](d7de4c9d80/torch/fx/experimental/symbolic_shapes.py (L1824)), we add a flag `_allow_complex_guards_as_runtime_asserts` which allows the user to compile exported programs containing these guards and maintain dynamism, while adding correctness checks as runtime assertions in the graph.

Hybrid backed-unbacked symints allow us to easily bypass "implicit" guards emitted from computation - guards that we ~expect to be true. Popular examples revolve around reshapes:
```
# reshape
def forward(self, x, y):  # x: [s0, s1], y: [s2]
    return x.reshape([-1]) + y  # guard s0 * s1 = s2
```

This leads to the following exported program:

```
class GraphModule(torch.nn.Module):
    def forward(self, x: "f32[s0, s1]", y: "f32[s2]"):
        sym_size_int: "Sym(s2)" = torch.ops.aten.sym_size.int(y, 0)
        mul: "Sym(-s2)" = -1 * sym_size_int;  sym_size_int = None
        sym_size_int_1: "Sym(s0)" = torch.ops.aten.sym_size.int(x, 0)
        sym_size_int_2: "Sym(s1)" = torch.ops.aten.sym_size.int(x, 1)
        mul_1: "Sym(s0*s1)" = sym_size_int_1 * sym_size_int_2;  sym_size_int_1 = sym_size_int_2 = None
        add: "Sym(s0*s1 - s2)" = mul + mul_1;  mul = mul_1 = None
        eq: "Sym(Eq(s0*s1 - s2, 0))" = add == 0;  add = None
        _assert_scalar = torch.ops.aten._assert_scalar.default(eq, "Runtime assertion failed for expression Eq(s0*s1 - s2, 0) on node 'eq'");  eq = None

        view: "f32[s0*s1]" = torch.ops.aten.view.default(x, [-1]);  x = None
        add_1: "f32[s0*s1]" = torch.ops.aten.add.Tensor(view, y);  view = y = None
        return (add_1,)
```
Another case is symbol divisibility:
```
def forward(self, x):  # x: [s0, s1]
    return x.reshape([-1, x.shape[0] - 1])  # Eq(Mod(s0 * s1, s0 - 1), 0)
```
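
The two implicit guards above can be written out as ordinary integer checks. This is a sketch only: `assert_scalar` is a hypothetical stand-in for the `_assert_scalar` op that appears in the exported graph, and the guard functions are not part of export.

```python
def assert_scalar(condition: bool, message: str) -> None:
    """Hypothetical stand-in for the runtime assertion op in the graph."""
    if not condition:
        raise RuntimeError(message)


def reshape_add_guard(s0: int, s1: int, s2: int) -> bool:
    # Eq(s0*s1 - s2, 0): flattening x must give exactly len(y) elements.
    return s0 * s1 - s2 == 0


def reshape_divisibility_guard(s0: int, s1: int) -> bool:
    # Eq(Mod(s0*s1, s0 - 1), 0): the element count must divide evenly into
    # rows of length s0 - 1. This sketch assumes s0 != 1.
    return (s0 * s1) % (s0 - 1) == 0


# Inputs x: [2, 3], y: [6] satisfy the first guard, so nothing is raised.
assert_scalar(reshape_add_guard(2, 3, 6),
              "Runtime assertion failed for expression Eq(s0*s1 - s2, 0)")
```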

Applying deferred runtime asserts also helps dynamic compilation for "explicit" complex guards that typically cause problems for export. For example we can generate runtime asserts for not-equal guards, and complex conditions like the following:
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        # check that negation of first guard also shows up as runtime assertion
        if x.shape[0] == y.shape[0]:  # False
            return x + y
        elif x.shape[0] == y.shape[0] ** 3:  # False
            return x + 2, y + 3
        elif x.shape[0] ** 2 == y.shape[0] * 3:  # True
            return x * 2.0, y * 3.0
```
For the above graph we will generate 3 runtime assertions: the negation of the first 2, and the 3rd condition as a guard.
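
Spelled out on plain integers, the three assertions for `Foo` would look roughly like this (a sketch; `x0` and `y0` stand for `x.shape[0]` and `y.shape[0]`, and the function names are hypothetical):

```python
def foo_runtime_asserts(x0: int, y0: int):
    """Return (expression, holds) pairs for the branch taken during tracing."""
    return [
        ("Ne(x0, y0)", x0 != y0),               # negation of the first branch
        ("Ne(x0, y0**3)", x0 != y0 ** 3),       # negation of the second branch
        ("Eq(x0**2, 3*y0)", x0 ** 2 == 3 * y0), # the third branch itself
    ]


def inputs_match_traced_branch(x0: int, y0: int) -> bool:
    return all(holds for _, holds in foo_runtime_asserts(x0, y0))
```

For example, sizes `x0 = 6`, `y0 = 12` satisfy all three conditions, while `x0 = y0 = 3` would trip the first assertion at runtime.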

One additional benefit here over the current state of exported programs is that this adds further correctness guarantees - previously with explicit complex guards, if compilation succeeded, the guards would be ignored at runtime, treated as given.

As shown above, the runtime asserts appear as math ops in the graph, generated by the sympy interpreter, resulting in an _assert_scalar call. There is an option to avoid adding these asserts into the graph, by setting `TORCH_DYNAMO_DO_NOT_EMIT_RUNTIME_ASSERTS=1`. This results in the "original" computation graph, with dynamism, and any incorrect inputs will fail on ops during runtime. Further work could go into prettifying the printer, so the majority of the graph isn't guard-related.

Ideally this PR would subsume and remove the recently added [_disable_forced_specializations](https://github.com/pytorch/pytorch/pull/124949) flag, but that flag still handles one additional case of specialization: single-variable equalities where the symbol is solvable for a concrete value: see this [PR](https://github.com/pytorch/pytorch/pull/126925)

This PR doesn't change any behavior around data-dependent errors/unbacked symints yet, that could be further work.

NOTE: will take naming change suggestions for the flag :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127129
Approved by: https://github.com/avikchaudhuri
2024-05-29 17:15:25 +00:00
cc6e72d882 Drop caffe2 core tests and some other stuff (#127089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127089
Approved by: https://github.com/Skylion007
2024-05-29 17:11:45 +00:00
cyy
e8e327ba82 Cover clang-tidy to torch/csrc/onnx/init.cpp (#127393)
Enabling it will not cause issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127393
Approved by: https://github.com/Skylion007
2024-05-29 17:05:28 +00:00
cyy
7de1352457 [1/N] Replace exceptions with static_assert(false) in some templates (#127371)
This PR tries to report some failures at build time. Once the build fails, it generally indicates that we can wrap the code inside some conditional macros, and it is a hint to further reduce the built code size. The sizeof operations were used to ensure that the assertion depends on specific template instantiations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127371
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-05-29 16:14:00 +00:00
cyy
c69562caf9 [Caffe2]Remove more caffe2 files (#126628)
They are not used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126628
Approved by: https://github.com/albanD
2024-05-29 16:08:48 +00:00
80a8fc07b2 [dynamo] Handle np.iinfo/finfo/dtype as input (#124482)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124482
Approved by: https://github.com/lezcano
ghstack dependencies: #124481
2024-05-29 16:00:15 +00:00
9a8e8101a8 Fix wording in nn.Linear docstring. (#127240)
Definition (Linear Transformation):
A mapping $T : V \to W$ between $F$-vector spaces $V,W$ is called a *linear transformation* if and only if

a) $T(u+v)=T(u)+T(v)$,
b) $T(cv)=cT(v)$

for all $u, v \in V$, $c \in F$.

Consequently, $T(0_V)=0_W$.

Thus $x \mapsto xA^T+b$ for nonzero $b$ is **not** a linear transformation, but is often referred to as an affine linear transformation.
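
Both claims follow in one line each from the definition above. For any $v \in V$,

$$T(0_V) = T(0 \cdot v) = 0 \cdot T(v) = 0_W,$$

whereas for $f(x) = xA^T + b$,

$$f(0) = 0 A^T + b = b,$$

so $f(0) \neq 0$ whenever $b \neq 0$, and $f$ violates the consequence $T(0_V) = 0_W$.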

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127240
Approved by: https://github.com/soulitzer, https://github.com/albanD
2024-05-29 14:55:40 +00:00
ade075444f [dynamo] Support numpy.dtype (#124481)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124481
Approved by: https://github.com/lezcano
2024-05-29 14:45:14 +00:00
bf966588f1 [BE][Ez]: Update cudnn_frontend submodule to v1.4.0 (#127175)
Updates the cudnn_frontend submodule to the latest 1.4.0 version.

Should be a straightforward, header-only submodule update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127175
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-05-29 14:23:38 +00:00
0910429d72 [BE][CMake] Use FindPython module (#124613)
As FindPythonInterp and FindPythonLibs have been deprecated since cmake-3.12

Replace `PYTHON_EXECUTABLE` with `Python_EXECUTABLE` everywhere (CMake variable names are case-sensitive)

This makes PyTorch buildable with python3 binary shipped with XCode on MacOS

TODO: Get rid of `FindNumpy` as it's part of the Python package
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124613
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2024-05-29 13:17:35 +00:00
942d9abd66 [AOTI] Update reinplace to cover mutated buffer (#127297)
Summary: Unlike JIT Inductor, AOTI currently unlifts weights and buffers from input args, so the reinplace pass didn't really work for AOTI because it only checks mutation on placeholders, which led to excessive memory copies for kv_cache updates in LLM models. This PR removes those memory copies and offers roughly a 2x speedup. In the future, we will revert the unlift logic in AOTI and make the behavior consistent with JIT Inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127297
Approved by: https://github.com/peterbell10, https://github.com/chenyang78
2024-05-29 13:07:53 +00:00
af69a52f06 Reapply "Remove more of caffe2 (#126705)" (#127317)
This reverts commit 00fe0a0d795680ade029fc552f33fffed75c0250.

Originally was unnecessarily reverted by an oncall. Landing again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127317
Approved by: https://github.com/izaitsevfb
2024-05-29 12:20:25 +00:00
749a132fb0 [BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.

Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.

UPDATE: Use `FutureWarning` instead of `DeprecationWarning`.
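
A minimal sketch of the fallback path (the `warnings.warn` call with an explicit category), using only the standard library. `old_api` and `new_api` are hypothetical names, and the `typing_extensions.deprecated` decorator path is not shown here:

```python
import warnings


def new_api(x: int) -> int:
    return x * 2


def old_api(x: int) -> int:
    """Hypothetical deprecated wrapper around new_api."""
    warnings.warn(
        "`old_api` is deprecated, please use `new_api` instead.",
        category=FutureWarning,
        stacklevel=2,  # attribute the warning to the caller, not this wrapper
    )
    return new_api(x)


# Record the warning instead of printing it, to inspect its category.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_api(21)
```

With Python's default filters, `FutureWarning` is shown to end users, while `DeprecationWarning` is ignored unless it is triggered from code running in `__main__`.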

Resolves #126888

- #126888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898
Approved by: https://github.com/albanD
2024-05-29 12:09:27 +00:00
cyy
699db7988d [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-29 11:58:03 +00:00
02b1cdab23 [Sync torch_FA2 and FA2 flash_api] + [Expose seqused_k & alibi_slopes arguments] (#126520)
1. **Expose seqused_k & alibi_slopes arguments**:
- This can be used when your sequence length k is not the full extent of the tensor. This is useful for kv cache scenarios and was not previously supported in the FA2 TORCH integration. We need these arguments for external xformers lib call to the _flash_attention_forward API.
Before:
```
  std::optional<Tensor> seqused_k = c10::nullopt;
  std::optional<Tensor> alibi_slopes = c10::nullopt;
```
After:
```
_flash_attention_forward(...
    std::optional<Tensor>& seqused_k,
    std::optional<Tensor>& alibi_slopes,
```

2. There is a difference between the **TORCH_FA2_flash_api:mha_fwd** and **FA2_flash_api:mha_fwd** (same for **mha_varlen_fwd**) at the query transposition (GQA) step.

The **CHECK_SHAPE** is applied on the original query vs the reshaped query. This causes an error (because of the shape constraint) for such inputs:
```
q = torch.randn([7, 1, 4, 256], dtype=torch.bfloat16, device='cuda')
k = torch.randn([7, 51, 1, 256], dtype=torch.bfloat16, device='cuda')
v = torch.randn([7, 51, 1, 256], dtype=torch.bfloat16, device='cuda')
```

![image](https://github.com/pytorch/pytorch/assets/927999/77ea6bf6-b6e9-4f3f-96a9-8d952956ddd9)

- I've modified the code as little as possible, but if you prefer a more verbose change like the following, don't hesitate to tell me:
```
at::Tensor swapped_q = seqlenq_ngroups_swapped
    ? q.reshape({batch_size, num_heads_k, num_heads / num_heads_k, head_size_og}).transpose(1, 2)
    : q;

if (seqlenq_ngroups_swapped) {
    seqlen_q = num_heads / num_heads_k;
    num_heads = num_heads_k;
}

CHECK_SHAPE(swapped_q, batch_size, seqlen_q, num_heads, head_size_og);
```
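
The shape bookkeeping in the snippet above can be traced in plain Python. This is a sketch under a simplifying assumption: the swap condition is reduced here to `seqlen_q == 1 and num_heads > num_heads_k`, whereas the real kernel checks more, and `swap_query_shape` is a hypothetical name.

```python
def swap_query_shape(q_shape, num_heads_k):
    """Return (shape to pass to CHECK_SHAPE, whether the swap applied)."""
    batch_size, seqlen_q, num_heads, head_size = q_shape
    # Assumed, simplified swap condition (see lead-in).
    swapped = seqlen_q == 1 and num_heads > num_heads_k
    if not swapped:
        return q_shape, False
    ngroups = num_heads // num_heads_k
    # q.reshape({batch, num_heads_k, ngroups, head}).transpose(1, 2)
    # gives (batch, ngroups, num_heads_k, head): the groups act as seqlen_q.
    return (batch_size, ngroups, num_heads_k, head_size), True


# The failing input from above: q is [7, 1, 4, 256], k has a single head.
checked_shape, was_swapped = swap_query_shape((7, 1, 4, 256), num_heads_k=1)
```

Checking the original `[7, 1, 4, 256]` shape against the swapped `seqlen_q = 4`, `num_heads = 1` is what produces the shape-constraint error; checking the reshaped query avoids it.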
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126520
Approved by: https://github.com/drisspg
2024-05-29 11:54:44 +00:00
dae33a4961 [inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)
As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019
2024-05-29 11:15:41 +00:00
65af1a9c26 FIX the document of distributed.new_group() (#122703)
Currently, the documentation of distributed.new_group() says that it returns `None` when the current rank is not in the newly created process group. However, it actually returns `GroupMember.NON_GROUP_MEMBER`. I have checked the code and think it is more appropriate to fix the documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122703
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-05-29 09:40:25 +00:00
6c81856dca [inductor] Add a subprocess-based parallel compile (#126816)
Summary:
Adds a "safe" parallel compile implementation that a) Popens a sub-process with an entry point we control, and b) uses a ProcessPoolExecutor in that sub-process to perform parallel compiles. This change essentially squashes these two implementations from jansel, but removes the "thread-based" approach since benchmarking revealed that compile-time performance was poor compared to the existing impl:
https://github.com/pytorch/pytorch/pull/124682
https://github.com/pytorch/pytorch/pull/122941

This PR adds the implementation, but defaults to the existing "fork". I'll submit a separate change to enable.
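
The two-level structure described above (a sub-process with a controlled entry point, which owns a `ProcessPoolExecutor`) can be sketched with the standard library alone. This is not Inductor's implementation: `math.factorial` stands in for a compile job, and the JSON-over-pipes protocol and the `run_jobs_in_subprocess` name are assumptions made for the example.

```python
import json
import subprocess
import sys

# Entry point of the child process; it owns the worker pool.
CHILD_SCRIPT = r"""
import json
import math
import sys
from concurrent.futures import ProcessPoolExecutor

if __name__ == "__main__":
    jobs = json.loads(sys.stdin.read())
    with ProcessPoolExecutor(max_workers=2) as pool:
        # math.factorial is a stand-in for an expensive compile job.
        results = list(pool.map(math.factorial, jobs))
    print(json.dumps(results))
"""


def run_jobs_in_subprocess(jobs):
    """Send jobs to a fresh interpreter and read the results back."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],
        input=json.dumps(jobs),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(proc.stdout)


if __name__ == "__main__":
    print(run_jobs_in_subprocess([3, 4, 5]))
```

Because the pool lives in a freshly started interpreter, the workers never inherit state from the parent process by forking it.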

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126816
Approved by: https://github.com/jansel
2024-05-29 09:40:21 +00:00
92bc444ee3 [inductor][cpp] epilogue support for gemm template (#126019)
As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
ghstack dependencies: #124021
2024-05-29 09:12:03 +00:00
00999fd8b1 Prefer flip over index_select (#126783)
It's faster and has a lower memory footprint in eager.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126783
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #114471
2024-05-29 09:10:25 +00:00
8a21532e53 Fix constant propagation pass (#114471)
This pass was broken in a number of ways, as we were not generating
asserts whenever we took it, even though we need to. While doing so,
we found that the analysis we were using for choosing
whether to generate asserts or not for dynamic shapes was completely
broken.

Eliminating indirect indexing in this way allows for a number of optimisations.
In particular, we can now fuse against these kernels (indirect indexing disallows fusions).

The new strategy is as follows:

- We always propagate sympy expressions if we can.
- If an expression was an indirect_indexing, we call `check_bounds`
- We also call `check_bounds` within `CSEProxy.indirect_indexing`
- The checks are issued in the buffer where they would go if they were used in a load
   - This makes them always be codegen'd before the load and stores
   - In the case of stores, they will be generated potentially much earlier than the stores themselves, which is fine.
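
The strategy above can be illustrated in plain Python: prove the bound statically when the index's value range is known, otherwise check it at runtime before the load. This is a sketch, not Inductor code; both function names and the `index_range` parameter are hypothetical.

```python
def can_prove_in_bounds(lower: int, upper: int, size: int) -> bool:
    """Static proof: every value in [lower, upper] is a valid index."""
    return 0 <= lower and upper < size


def indirect_load(data, indices, index_range=None):
    """Gather data[i] for each i, checking bounds unless they are proven.

    `index_range` is an optional (lower, upper) bound on the index values,
    standing in for what a static analysis might know about them.
    """
    size = len(data)
    proven = index_range is not None and can_prove_in_bounds(*index_range, size)
    out = []
    for i in indices:
        if not proven and not (0 <= i < size):
            # The check is issued before the load it protects.
            raise IndexError(f"index {i} out of bounds for size {size}")
        out.append(data[i])
    return out
```

Note that the runtime check also rejects negative indices, which Python would otherwise silently wrap around.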

We add quite a few asserts to preexisting tests to strengthen them. In particular, we make sure
that issuing an assert plays well with all kinds of C++ vectorisation.

For now, we rely on the logic within `_maybe_evaluate_static` to prove
these bounds. This logic is rather limited though. In the future, we might want
to rely on Z3 here to be able to prove bounds in a more general way.

Supersedes https://github.com/pytorch/pytorch/pull/113068
Fixes https://github.com/pytorch/pytorch/issues/121251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114471
Approved by: https://github.com/peterbell10
2024-05-29 09:10:25 +00:00
51b22d9cf2 [dynamo] Support enum construction (#127364)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127364
Approved by: https://github.com/yanboliang
ghstack dependencies: #127263
2024-05-29 08:09:21 +00:00
ad7700bfdb [inductor] Misc changes (#127307)
Pulling unrelated changes out of the larger halide PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127307
Approved by: https://github.com/yanboliang
2024-05-29 08:00:06 +00:00
cef776bcd1 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Similar template abstractions to the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since this is an initial implementation, we are still working on further performance improvements in follow-up PRs, including kernel optimizations as well as fusions. Perf gains are observed only on a select number of models compared to the ATen kernels, which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details.

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x
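
The blocking structure described in item 2 above (a prepacked constant `B`, an outer loop over blocks of `M`, and a micro-kernel per block) can be sketched in pure Python. This only illustrates the loop structure, not the generated C++; all function names and block sizes here are hypothetical.

```python
def pack_b(b, n_block):
    """Prepack B (K x N) into N // n_block panels, each K x n_block."""
    n = len(b[0])
    assert n % n_block == 0, "N must be a multiple of the register block"
    return [[row[j:j + n_block] for row in b] for j in range(0, n, n_block)]


def micro_gemm(a_rows, b_panel, acc):
    """Accumulate a_rows (m_r x K) @ b_panel (K x n_block) into acc."""
    for i, a_row in enumerate(a_rows):
        for k, a_val in enumerate(a_row):
            b_row = b_panel[k]
            for j in range(len(b_row)):
                acc[i][j] += a_val * b_row[j]


def packed_gemm(a, packed_b, m_block=2):
    """Compute A @ B with B given as panels; M may be any size."""
    m = len(a)
    n_block = len(packed_b[0][0])
    c = [[0.0] * (n_block * len(packed_b)) for _ in range(m)]
    for i0 in range(0, m, m_block):            # blocking over M
        a_rows = a[i0:i0 + m_block]
        for p, panel in enumerate(packed_b):   # one panel of B at a time
            acc = [[0.0] * n_block for _ in a_rows]
            micro_gemm(a_rows, panel, acc)
            for di, acc_row in enumerate(acc):
                c[i0 + di][p * n_block:(p + 1) * n_block] = acc_row
    return c


a = [[1, 2], [3, 4], [5, 6]]          # M = 3, K = 2
b = [[1, 0, 2, 1], [0, 1, 1, 3]]      # K = 2, N = 4
c = packed_gemm(a, pack_b(b, n_block=2))
```

Packing `B` once up front is what makes the constant-weight assumption pay off: every block of `A` reuses the same contiguous panels.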

Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-29 07:37:41 +00:00
719589c9bf [dynamo] move bytecode tests from test_misc to new bytecode test file (#127329)
Also merge with bytecode hook test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127329
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-05-29 06:10:59 +00:00
a60b06bd2b [dtensor] update public API docs (#127340)
This PR updates the API documentation for the public-facing APIs.

Each API needs more examples, but I plan to add them in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127340
Approved by: https://github.com/wz337
ghstack dependencies: #127338, #127339
2024-05-29 05:18:47 +00:00
2c9a420da3 [dtensor] move some modules to private namespace (#127339)
as titled, moving some modules that are mainly for DTensor private usage
to be a private module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127339
Approved by: https://github.com/awgu
ghstack dependencies: #127338
2024-05-29 05:18:47 +00:00
72ef2555e3 [dtensor] make Partial placement public (#127338)
As titled, partial placement is standardized right now and I think we
would want to expose this as a public API to allow users to annotate the
sharding layout more easily, given that we already have use cases
internally and externally that use Partial.

Keeping the old _Partial name for a while for BC reasons.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127338
Approved by: https://github.com/awgu, https://github.com/wz337, https://github.com/kwen2501
2024-05-29 05:18:47 +00:00
5359af0c7e [dynamo] wrap GraphModule exceptions in dynamo-wrapped tests (#126341)
Better approach to https://github.com/pytorch/pytorch/pull/126197 to catch issues like https://github.com/pytorch/pytorch/issues/125568.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126341
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-05-29 05:18:04 +00:00
cdf2133186 Add compile time profiler for non fbcode targets (#126904)
This is a tool that allows profiling compile time using the strobelight profiler. It is a Meta-only tool,
but it works on non-fbcode targets.

A follow up diff will unify this with caffe2/fb/strobelight/compile_time_profiler.py.
example test:

```
run  python tools/strobelight/examples/compile_time_profile_example.py
```

```
python torch/utils/_strobelight/examples/compile_time_profile_example.py
strobelight_compile_time_profiler, line 61, 2024-05-23 10:49:28,101, INFO: compile time strobelight profiling enabled
strobelight_compile_time_profiler, line 93, 2024-05-23 10:49:28,102, INFO: Unique sample tag for this run is: 2024-05-23-10:49:282334638devvm4561.ash0.facebook.com
strobelight_compile_time_profiler, line 94, 2024-05-23 10:49:28,102, INFO: You can use the following link to access the strobelight profile at the end of the run: https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22purposes%22%3A[]%2C%22end%22%3A%22now%22%2C%22start%22%3A%22-30%20days%22%2C%22filterMode%22%3A%22DEFAULT%22%2C%22modifiers%22%3A[]%2C%22sampleCols%22%3A[]%2C%22cols%22%3A[%22namespace_id%22%2C%22namespace_process_id%22]%2C%22derivedCols%22%3A[]%2C%22mappedCols%22%3A[]%2C%22enumCols%22%3A[]%2C%22return_remainder%22%3Afalse%2C%22should_pivot%22%3Afalse%2C%22is_timeseries%22%3Afalse%2C%22hideEmptyColumns%22%3Afalse%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22compare%22%3A%22none%22%2C%22samplingRatio%22%3A%221%22%2C%22metric%22%3A%22count%22%2C%22aggregation_field%22%3A%22async_stack_complete%22%2C%22top%22%3A10000%2C%22aggregateList%22%3A[]%2C%22param_dimensions%22%3A[%7B%22dim%22%3A%22py_async_stack%22%2C%22op%22%3A%22edge%22%2C%22param%22%3A%220%22%2C%22anchor%22%3A%220%22%7D]%2C%22order%22%3A%22weight%22%2C%22order_desc%22%3Atrue%2C%22constraints%22%3A[[%7B%22column%22%3A%22sample_tags%22%2C%22op%22%3A%22all%22%2C%22value%22%3A[%22[%5C%222024-05-23-10:49:282334638devvm4561.ash0.facebook.com%5C%22]%22]%7D]]%2C%22c_constraints%22%3A[[]]%2C%22b_constraints%22%3A[[]]%2C%22ignoreGroupByInComparison%22%3Afalse%7D&view=GraphProfilerView&&normalized=1712358002&pool=uber
strobelight_function_profiler, line 241, 2024-05-23 10:49:34,943, INFO: strobelight run id is: 3507039740348330
strobelight_function_profiler, line 243, 2024-05-23 10:50:00,907, INFO: strobelight profiling running
strobelight_function_profiler, line 224, 2024-05-23 10:50:02,741, INFO: strobelight profiling stopped
strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: Total samples: 7
strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/75cxdro3
strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/qsgydsee
strobelight_compile_time_profiler, line 120, 2024-05-23 10:50:06,174, INFO: 1 strobelight success runs out of 1 non-recursive compilation events.
strobelight_function_profiler, line 241, 2024-05-23 10:50:08,137, INFO: strobelight run id is: 8721740011604497
strobelight_function_profiler, line 243, 2024-05-23 10:50:34,801, INFO: strobelight profiling running
strobelight_function_profiler, line 224, 2024-05-23 10:50:36,803, INFO: strobelight profiling stopped
strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: Total samples: 3
strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/qmi2ucwp
strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/7fjkhs9i
strobelight_compile_time_profiler, line 120, 2024-05-23 10:50:41,289, INFO: 2 strobelight success runs out of 2 non-recursive compilation events.
strobelight_function_profiler, line 241, 2024-05-23 10:50:43,597, INFO: strobelight run id is: 1932476082259558
strobelight_function_profiler, line 243, 2024-05-23 10:51:09,791, INFO: strobelight profiling running
strobelight_function_profiler, line 224, 2024-05-23 10:51:11,883, INFO: strobelight profiling stopped
strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: Total samples: 3
strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/vy1ujxec
strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/2xgadviv
strobelight_compile_time_profiler, line 120, 2024-05-23 10:51:16,219, INFO: 3 strobelight success runs out of 3 non-recursive compilation events.
```

Alternatively, set `TORCH_COMPILE_STROBELIGHT=TRUE` when running any torch.compile Python program.
For example, running on XLNetLMHeadModel:
```
TORCH_COMPILE_STROBELIGHT=TRUE TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 time python benchmarks/dynamo/huggingface.py --ci --accuracy --timing --explain --inductor --device cuda --training --amp  --only XLNetLMHeadModel
```
 result:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126904
Approved by: https://github.com/aorenste
ghstack dependencies: #126444
2024-05-29 05:06:37 +00:00
2b72e2a596 [Cudagraph] better support for streams (#126809)
This PR fixes Issue #124391.

There are two root causes.

### Root Cause 1 [better support for stream during cudagraph capture]

When recording a new function, CUDA graph tree records memory block states (e.g., address, size, allocated, etc) via `getCheckpointState`. Let's say the record is called `block_state`.

Later, the CUDA graph tree restores exactly the same memory block states via `apply_checkpoint_execution_state_in_allocator`, which a) frees all memory blocks; b) allocates all recorded block states (regardless of `block_state->allocated`); c) frees blocks with `block_state->allocated == False`; and d) checks that `block_state` matches the remaining blocks (e.g., `block_state->ptr == block->ptr`).

An error may occur when multiple streams exist during recording. [Note](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L2149-L2152) that a block will not be merged with other blocks if it is used by some streams, even if `block->allocated==False`. This may lead to a mismatch between `block_state->ptr` and `block->ptr` in `apply_checkpoint_execution_state_in_allocator`.

This PR solves the issue by not inserting events that come from a stream used during cudagraph capture. This is safe because all events and streams used during cudagraph capture must have completed before the capture finishes.

### Root Cause 2 [fix a bug in checkpoint state]
When we call `getCheckpointState`, we create the block state. At that time, we do not record `block->device`, so `block_state->device == 0` regardless of the real value of `block->device`. See [how](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L744-L750) BlockState is created from a block.

When we use the block state during `setSegmentStateToCheckpoint`, we use [block_state.device (=0)](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L1526). This leads to errors.

We fixed this issue by recording block->device into block_state in getCheckpointState.
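
The effect of the missing field can be illustrated with a small pure-Python model. This is only a sketch: `Block`, `BlockState`, `snapshot_buggy`, and `snapshot_fixed` are hypothetical names chosen for illustration, not the allocator's C++ types.

```python
from dataclasses import dataclass


@dataclass
class Block:
    device: int
    ptr: int
    size: int


@dataclass
class BlockState:
    # Defaults mimic a zero-initialized field that is never filled in.
    device: int = 0
    ptr: int = 0
    size: int = 0


def snapshot_buggy(block: Block) -> BlockState:
    # Forgets to copy `device`, so it silently stays 0.
    return BlockState(ptr=block.ptr, size=block.size)


def snapshot_fixed(block: Block) -> BlockState:
    # Records the device alongside the other fields.
    return BlockState(device=block.device, ptr=block.ptr, size=block.size)


block = Block(device=1, ptr=0x1000, size=512)
print(snapshot_buggy(block).device)  # 0, wrong for a block on device 1
print(snapshot_fixed(block).device)  # 1
```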

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126809
Approved by: https://github.com/eellison
2024-05-29 04:52:35 +00:00
a41f828da7 [c10d] fix group_name/group_desc set up in eager initialization (#127053)
Summary:
ProcessGroupNCCL sets up group_name/desc in the c10d log and in NCCL when initializing the NCCL communicator. In eager initialization mode, pg_name and pg_desc are set after communicator initialization, so the information is not available in the PyTorch log or the NCCL communicator.

This PR fixes this by setting pg_name/desc earlier.

Differential Revision: D57759816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127053
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-05-29 04:42:34 +00:00
932e04142d extract calculate_time_spent from print_time_report (#127362)
Fixes #ISSUE_NUMBER

wrap certain steps in a separate function for easier TTFB instrumentation (fb internal use case)
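
The shape of this kind of refactor can be sketched as follows. The function names follow the commit title, but the bodies, the `timings` data, and the signatures are assumptions made for illustration; they are not the actual PyTorch code.

```python
from typing import Dict, List

# Hypothetical per-phase timings, in seconds.
timings: Dict[str, List[float]] = {
    "backend_compile": [1.5, 0.5],
    "inductor_compile": [0.75],
}


def calculate_time_spent(data: Dict[str, List[float]]) -> Dict[str, float]:
    """Pure computation: total seconds per phase, no printing."""
    return {phase: sum(values) for phase, values in data.items()}


def print_time_report(data: Dict[str, List[float]]) -> None:
    """Presentation only: formats whatever calculate_time_spent returns."""
    for phase, total in calculate_time_spent(data).items():
        print(f"{phase}: {total:.2f}s")


print_time_report(timings)
```

Keeping the computation separate lets other instrumentation reuse the totals without parsing printed output.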

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127362
Approved by: https://github.com/yanboliang, https://github.com/mengluy0125
2024-05-29 04:37:15 +00:00
a25b28a753 [Split Build] Add option to create libtorch wheel and use it to build pytorch as a separate wheel (#126328)
Creates an option to build just the libtorch portion of pytorch so that we have the necessary .so files. Then it builds a torch package using the libtorch wheel. These options are enabled using `BUILD_LIBTORCH_WHL` and `BUILD_PYTHON_ONLY`.

We run

```
BUILD_LIBTORCH_WHL=1 python setup.py install
python setup.py clean
BUILD_PYTHON_ONLY=1 python setup.py install
```

to produce

```
sahanp@devgpu086 ~/pytorch (detached HEAD|REBASE-i 3/5)> ls /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/torch/lib/
libshm.so*  libtorch_global_deps.so*  libtorch_python.so*
sahanp@devgpu086 ~/pytorch (detached HEAD|REBASE-i 3/5)> ldd build/lib/libtorch_python.so
        linux-vdso.so.1 (0x00007ffdc2d37000)
        libtorch.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libtorch.so (0x00007f539fe99000)
        libshm.so => /home/sahanp/pytorch/build/lib/libshm.so (0x00007f539fe90000)
        libcudnn.so.8 => /usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn.so.8 (0x00007f539e800000)
        libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007f539e400000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f539e000000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f539fda5000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f539ebe5000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f539dc00000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f539fea0000)
        libtorch_cpu.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libtorch_cpu.so (0x00007f5392400000)
        libtorch_cuda.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libtorch_cuda.so (0x00007f5380000000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f539fd9e000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f539fd99000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f539fd94000)
        libc10.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libc10.so (0x00007f539eb07000)
        libmkl_intel_lp64.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_intel_lp64.so.2 (0x00007f537ec00000)
        libmkl_gnu_thread.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_gnu_thread.so.2 (0x00007f537ce00000)
        libmkl_core.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_core.so.2 (0x00007f5378800000)
        libomp.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/libomp.so (0x00007f539e707000)
        libcupti.so.12 => /usr/local/cuda/lib64/libcupti.so.12 (0x00007f5377e00000)
        libcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x00007f5377a00000)
        libc10_cuda.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libc10_cuda.so (0x00007f539ea6a000)
        libcusparse.so.12 => /usr/local/cuda/lib64/libcusparse.so.12 (0x00007f5368400000)
        libcufft.so.11 => /usr/local/cuda/lib64/libcufft.so.11 (0x00007f535ee00000)
        libcusolver.so.11 => /usr/local/cuda/lib64/libcusolver.so.11 (0x00007f534c800000)
        libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x00007f5346200000)
        libcublas.so.12 => /usr/local/cuda/lib64/libcublas.so.12 (0x00007f533f800000)
        libcublasLt.so.12 => /usr/local/cuda/lib64/libcublasLt.so.12 (0x00007f531e800000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007f539ea63000)
        libnvJitLink.so.12 => /usr/local/cuda/lib64/libnvJitLink.so.12 (0x00007f531b800000)
sahanp@devgpu086 ~/pytorch (detached HEAD|REBASE-i 3/5)> ldd build/lib/libtorch_global_deps.so
        linux-vdso.so.1 (0x00007ffc265df000)
        libmkl_intel_lp64.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_intel_lp64.so.2 (0x00007fa93fc00000)
        libmkl_gnu_thread.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_gnu_thread.so.2 (0x00007fa93de00000)
        libmkl_core.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_core.so.2 (0x00007fa939800000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fa940f05000)
        libcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x00007fa939400000)
        libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007fa939000000)
        libgomp.so.1 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libgomp.so.1 (0x00007fa93fb07000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fa938c00000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fa940efe000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa940ef9000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fa940ff5000)
        librt.so.1 => /lib64/librt.so.1 (0x00007fa940ef2000)
        libstdc++.so.6 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libstdc++.so.6 (0x00007fa93921d000)
        libgcc_s.so.1 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libgcc_s.so.1 (0x00007fa93faec000)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126328
Approved by: https://github.com/atalman
2024-05-29 04:33:56 +00:00
8090145936 [pipelining] add back support for multi-use parameters/buffers (#126653)
## Motivation
Resolves #126626 to support TorchTitan.

With this PR, we add back support for cases where a parameter or buffer is used in multiple stages. An example of such usage is in LLaMA (torchtitan), code snippet:
```
for layer in self.layers.values():
    h = layer(h, self.freqs_cis)
```

## Solution
Step 1:
Remove the previous guards of `if len(node.users) == 1`.
Step 2:
Call `move_param_to_callee` multiple times, one for each stage ("callee").
Step 3:
Delay deletion of the `get_attr` node (for getting the param) from root till this param has been sunk into each stage that uses it.

The PR also cleans up the old code around this (dropping the TRANSMIT mode and supporting REPLICATE mode only).
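
The three steps can be modeled in plain Python. This is a toy sketch under stated assumptions: `root_params`, `stage_uses`, and `sink_params` are hypothetical names, and dictionaries stand in for the FX graph and its `get_attr` nodes.

```python
from typing import Dict, List

# Toy model: the root holds parameters; each stage lists the
# parameter names it uses. All names here are hypothetical.
root_params: Dict[str, float] = {"freqs_cis": 0.5, "w0": 1.0, "w1": 2.0}
stage_uses: Dict[str, List[str]] = {
    "stage0": ["w0", "freqs_cis"],
    "stage1": ["w1", "freqs_cis"],
}


def sink_params(
    root: Dict[str, float], uses: Dict[str, List[str]]
) -> Dict[str, Dict[str, float]]:
    stages: Dict[str, Dict[str, float]] = {name: {} for name in uses}
    for param in list(root):
        users = [stage for stage, names in uses.items() if param in names]
        # No `len(users) == 1` guard: replicate into every stage that uses it.
        for stage in users:
            stages[stage][param] = root[param]
        # Delete from the root only after all users have their copy.
        if users:
            del root[param]
    return stages


stages = sink_params(root_params, stage_uses)
print(stages)
print(root_params)  # empty: every parameter was sunk into its users
```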

## Test
Changed the `ExampleCode` model to use `mm_param1` in multiple stages.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126653
Approved by: https://github.com/pianpwk
2024-05-29 03:36:47 +00:00
781f26240a Add script to copy distributed commits to stable branch (#126918)
This will be used as part of a prototype of a stable pytorch with a fast-moving distributed folder

Tasks: T189915739

Test plan:

I ran the script in a few configurations on my local machine. It worked as expected
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126918
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-05-29 03:33:44 +00:00
10d2373abd Add a registry for GraphModuleSerializer (#126550)
This PR adds a registration function and a global registry for GraphModuleSerializer. After this PR, custom serialization methods can be done through registration instead of subclassing for ease of maintenance.
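
The general registration-over-subclassing pattern can be sketched as below. Every name in this block (`_SERIALIZER_REGISTRY`, `register_serializer`, `MyCustomOp`, `serialize`) is hypothetical and chosen for illustration; the actual registry API in this PR may look different.

```python
from typing import Any, Callable, Dict

# Hypothetical global registry: maps an op type to its serialization handler.
_SERIALIZER_REGISTRY: Dict[type, Callable[[Any], dict]] = {}


def register_serializer(op_type: type):
    """Decorator that registers a handler for one op type."""
    def decorator(fn: Callable[[Any], dict]) -> Callable[[Any], dict]:
        _SERIALIZER_REGISTRY[op_type] = fn
        return fn
    return decorator


class MyCustomOp:
    def __init__(self, name: str) -> None:
        self.name = name


@register_serializer(MyCustomOp)
def serialize_my_custom_op(op: MyCustomOp) -> dict:
    return {"kind": "custom", "name": op.name}


def serialize(op: Any) -> dict:
    # Look up the handler by exact type; no serializer subclass is needed.
    handler = _SERIALIZER_REGISTRY.get(type(op))
    if handler is None:
        raise TypeError(f"no serializer registered for {type(op).__name__}")
    return handler(op)


print(serialize(MyCustomOp("foo")))
```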

## Changes
- Add a test case where it injects custom op to test serialization.
- Add custom op handler
- Change allowed op for verifier
Co-authored-by: Zhengxu Chen <zhxchen17@outlook.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126550
Approved by: https://github.com/zhxchen17
2024-05-29 03:12:48 +00:00
cdbb2c9acc Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)"
This reverts commit 4fdbaa794f9d5af2f171f772a51cb710c51c925f.

Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2136428735))
2024-05-29 03:02:35 +00:00
7a506dd005 Revert "[Caffe2]Remove Caffe2 proto files (#126134)"
This reverts commit a40658481ada9ecfd5716513a8537818c79cb3ef.

Reverted https://github.com/pytorch/pytorch/pull/126134 on behalf of https://github.com/malfet due to Broke bazel builds, see https://github.com/pytorch/pytorch/actions/runs/9278148147/job/25528691981 ([comment](https://github.com/pytorch/pytorch/pull/126134#issuecomment-2136373096))
2024-05-29 01:53:45 +00:00
cyy
669560d51a Use hidden visibility in OBJECTCXX files (#127265)
Since it can eliminate some linker warnings on MacOS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127265
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-05-29 01:40:23 +00:00
52e448a7f9 Revert "Enable Wunused-variable on tests (#127161)"
This reverts commit 6436a6407d9d65c42efb8e55beeb8b391b67fd64.

Reverted https://github.com/pytorch/pytorch/pull/127161 on behalf of https://github.com/malfet due to Broke ReduceTests on Windows (by testing more), see https://github.com/pytorch/pytorch/actions/runs/9274944325/job/25519484937 ([comment](https://github.com/pytorch/pytorch/pull/127161#issuecomment-2136339435))
2024-05-29 01:09:45 +00:00
85172fbe84 Back out "Prevent partitioner from ever saving views (#126446)" (#127316)
Summary: Revert "Prevent partitioner from ever saving views (#126446)" due to a torchinductor failure on CU Training Framework tests.

Reviewed By: Chillee

Differential Revision: D57868343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127316
Approved by: https://github.com/Chillee
2024-05-29 00:29:44 +00:00
cyy
a40658481a [Caffe2]Remove Caffe2 proto files (#126134)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126134
Approved by: https://github.com/r-barnes
2024-05-29 00:22:14 +00:00
f4cbcff8ef [TorchScript] Expand TorchScript __init__ annotation warning (#127045)
Summary:
Expand TorchScript `__init__` annotation warning to `list` and `dict` with reference to GSD task T187638414 and annotation warning reproduction D56834720.

Currently, the TorchScript compiler ignores and throws `UserWarning`s for the following annotation types for empty values within the `__init__` function: `List`, `Dict`, `Optional`. However, the compiler should additionally cover warnings for `list` and `dict`. This diff adds support for `list` and `dict`.
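
Which annotation shapes are covered can be illustrated with a standalone check. This is a sketch only: `warn_if_empty_annotated` is a hypothetical helper built on the standard `typing` module, not the TorchScript compiler's implementation, and the lowercase `list[int]` syntax assumes Python 3.9 or newer.

```python
import typing
import warnings
from typing import Any, Dict, List, Optional


def warn_if_empty_annotated(annotation: Any, value: Any) -> bool:
    """Warn (and return True) for List/Dict/Optional/list/dict on empty values."""
    origin = typing.get_origin(annotation)
    is_container = origin in (list, dict)  # List[...] and list[...] share an origin
    is_optional = origin is typing.Union and type(None) in typing.get_args(annotation)
    is_empty = value is None or (isinstance(value, (list, dict)) and len(value) == 0)
    if (is_container or is_optional) and is_empty:
        warnings.warn(f"annotation {annotation} on an empty value is ignored", UserWarning)
        return True
    return False


cases = [
    (List[int], []),
    (Dict[str, int], {}),
    (Optional[int], None),
    (list[int], []),       # lowercase forms, the new coverage
    (dict[str, int], {}),
]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    for annotation, value in cases:
        warn_if_empty_annotated(annotation, value)
print(len(caught))  # 5
```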

Test Plan:
Added 4 new unit tests:

`test_annotated_empty_list_lowercase` and `test_annotated_empty_dict_lowercase` verify that TorchScript throws UserWarnings for the list and dict type annotations on empty values.
```
(base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_empty_list_lowercase
...
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
(base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_empty_dict_lowercase
...
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```

`test_annotated_with_jit_empty_list_lowercase` and `test_annotated_with_jit_empty_dict_lowercase` verify that TorchScript throws UserWarnings for the list and dict type annotations on empty values with the jit annotation.
```
(base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_with_jit_empty_list_lowercase
...
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
(base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_with_jit_empty_dict_lowercase
...
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D57752002

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127045
Approved by: https://github.com/davidberard98
2024-05-28 23:49:10 +00:00
1be7e4086a Drop caffe2 nomnigraph (#127086)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127086
Approved by: https://github.com/Skylion007
2024-05-28 23:20:46 +00:00
f6ef832e87 [inductor] Use symbolic_hint when bounding fallback size hint (#127262)
The previous fallback ignores any known hint values in the expression and only
looks at the value ranges. By using the `symbolic_hint` we will use both hints
and value ranges.

Also removed the recursive use of `size_hint` on the bounds, since these should
always be constants.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127262
Approved by: https://github.com/lezcano
ghstack dependencies: #127251
2024-05-28 22:51:45 +00:00
26a8fa3a06 [inductor] Restore ExpandView sanity checks (#127251)
This restores the assertion removed in #124864

The handling of unbacked symints is incidental, the main purpose of this assert
was to catch bugs in lowerings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127251
Approved by: https://github.com/lezcano
2024-05-28 22:51:45 +00:00
db0a0ecb60 [FSDP2] Added test for N-way TP and 1-way FSDP with CPU offloading (#127024)
This PR shows that we can use FSDP solely for CPU offloading when composing with N-way TP. Each FSDP mesh is just 1 rank.

This was motivated from an ask on Slack :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127024
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
2024-05-28 22:51:36 +00:00
6b24155827 [dtensor][debug] added c10d gather, reduce, scatter tracing to CommDebugMode (#127134)
**Summary**
Added c10d gather, reduce, and scatter tracing to CommDebugMode and edited test case in test_comm_mode to include added features.

**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127134
Approved by: https://github.com/XilunWu
ghstack dependencies: #127025, #127029, #127040
2024-05-28 22:48:07 +00:00
eqy
a76faff71c [NCCL][CUDA] Optionally avoid rethrowing CUDA Errors in NCCL Watchdog (#126587)
Doesn't affect current behavior by default, for #126544
I'm not sure what the exact mechanism is here but CUDA errors appear to already be thrown in the main process, meaning that the watchdog is separately throwing CUDA errors again. However this rethrown error causes the process to be terminated as it cannot be handled from user code (which doesn't have visibility of the watchdog thread).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126587
Approved by: https://github.com/kwen2501
2024-05-28 22:17:15 +00:00
93bfe57144 cudagraphs: fix backward hooks & fsdp interaction (#126914)
Fixes

> ERROR: expected to be in states [<TrainingState.FORWARD_BACKWARD: 2>] but current state is TrainingState.IDLE

Error that would occur when composing pt2 fsdp and cudagraphs. Cudagraphs caches output tensor impls in the fast path, so we were inadvertently accumulating multiple hooks on what should have been fresh allocations.
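
The accumulation can be reproduced with a toy model. `FakeTensor`, `CachedRunner`, and `hooks_after_three_runs` are hypothetical stand-ins written for illustration; they do not use the real cudagraph or autograd machinery.

```python
class FakeTensor:
    """Stand-in for a tensor object that stores backward hooks."""

    def __init__(self) -> None:
        self.backward_hooks = []


class CachedRunner:
    """Returns the same output object on every run, like a fast path would."""

    def __init__(self, reset_hooks: bool) -> None:
        self._output = FakeTensor()
        self._reset_hooks = reset_hooks

    def run(self) -> FakeTensor:
        if self._reset_hooks:
            # Treat the cached object as a fresh allocation.
            self._output.backward_hooks = []
        return self._output


def hooks_after_three_runs(reset_hooks: bool) -> int:
    runner = CachedRunner(reset_hooks)
    out = runner.run()
    out.backward_hooks.append(lambda grad: grad)
    for _ in range(2):
        out = runner.run()
        out.backward_hooks.append(lambda grad: grad)  # one hook per run
    return len(out.backward_hooks)


print(hooks_after_three_runs(reset_hooks=False))  # 3: hooks accumulate
print(hooks_after_three_runs(reset_hooks=True))   # 1: only the latest hook
```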

from code comment:
```
# this output represents a fresh allocated tensor.
# We return the same TensorImpl from run to run to avoid overhead.
# autograd.Function will reset the Autograd meta of output tensors
# as part of aot_autograd, but _backward_hooks are stored on tensors separately,
# so we need to manually reset hooks.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126914
Approved by: https://github.com/awgu, https://github.com/xmfan
2024-05-28 22:07:41 +00:00
4154c8358a [BE] Wrap store check in a try/catch (#127030)
Summary:
Global store may already have been destroyed when we do the check.
This leads to a Null Pointer Exception. This caused a SEV in Production.
Stack trace from crash:
```
[trainer2]:# 5  c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)
[trainer2]:# 6  c10d::ProcessGroupNCCL::heartbeatMonitor()
```
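
The defensive pattern can be sketched in Python, although the actual fix lives in the C++ heartbeat monitor. `DestroyedStore`, `LiveStore`, and `safe_check` are hypothetical names, and returning `False` on failure is an assumption made for this illustration.

```python
from typing import Iterable, List


class DestroyedStore:
    """Models a store that has already been torn down."""

    def check(self, keys: List[str]) -> bool:
        raise RuntimeError("store has already been destroyed")


class LiveStore:
    def __init__(self, keys: Iterable[str]) -> None:
        self._keys = set(keys)

    def check(self, keys: List[str]) -> bool:
        return all(key in self._keys for key in keys)


def safe_check(store, keys: List[str]) -> bool:
    try:
        return store.check(keys)
    except Exception:
        # Assumption: an unreachable store is treated as "key not set"
        # so the monitoring loop can keep running instead of crashing.
        return False


print(safe_check(LiveStore(["dump_signal"]), ["dump_signal"]))  # True
print(safe_check(DestroyedStore(), ["dump_signal"]))            # False
```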

Test Plan:
Will deploy in small training job and with `NCCL_DUMP_ON_TIMEOUT` set.
Job should complete with no exceptions.

Tasks: T190163458

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127030
Approved by: https://github.com/Skylion007, https://github.com/shuqiangzhang
2024-05-28 20:57:36 +00:00
f206c5c628 [export] handle new roots & root swapping in derived dims suggested fixes (#125543)
This PR addresses 2 issues with derived dim suggested fixes: 1) newly introduced roots, and 2) root swapping.

1 | Newly introduced roots appear with modulo guards, e.g. Mod(dx, 2) = 0 suggests dx is a derived dim equal to 2 * _dx, introducing a new root _dx. Currently the final suggested fixes handle this correctly, but we can get intermediate results where related derived dims don't rely on a unified root, and are a mixture of min/max range and derived suggestions.

For example:
```
"dx": {"eq": 3*_dx-1, "max": 36}
"dy": {"eq": dx+1}
This should lead to suggested fixes
  _dx = Dim('_dx', max=12)
  dx = 3 * _dx - 1
  dy = 3 * _dx
```
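
The arithmetic behind this example can be checked directly. The variable names below (`a`, `b`, `dx_max`, `root_max`) are introduced only for this sketch; it verifies the numbers above and is not the solver's code.

```python
# dx = a * _dx + b with dx <= dx_max, taken from the example above.
a, b, dx_max = 3, -1, 36

# Largest integer root such that a * _dx + b <= dx_max.
root_max = (dx_max - b) // a
print(root_max)  # 12

for _dx in range(1, root_max + 1):
    dx = a * _dx + b
    dy = dx + 1
    assert dx <= dx_max   # the range on dx is respected
    assert dy == 3 * _dx  # dy is expressed in terms of the same root

# One step past the bound violates the max on dx.
print(a * (root_max + 1) + b)  # 38
```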

This PR prettifies the suggested fixes routine by unifying to a single root, and making each intermediate suggestion either a derived dim or min/max range, not both.

2 | The current suggested fixes for derived dims can lead to root dims/derived dims being swapped, e.g. `dy - 1, dy` -> `dx, dx + 1`. This leads to problematic suggested fixes that look like `dy - 1 = Dim("dy - 1")` since we don't have access to the original variable name.

This PR only adds a suggested fix for the root dim, and removes all other derived suggestions.

For example, with the export test case test_derived_dim_out_of_order_simplified:
```
_dimz = torch.export.Dim("_dimz", min=6, max=8)
dimy = _dimz - 1
dimx = dimy - 1
dimz = torch.export.Dim("dimz", min=6, max=8)  # doesn't work, should be = _dimz

class Foo(torch.nn.Module):
    def forward(self, x, y, z):
        return x + y[1:] + z[2:]

foo = Foo()
u, v, w = torch.randn(5), torch.randn(6), torch.randn(7)
export(
    foo,
    (u, v, w),
    dynamic_shapes=({0: dimx}, {0: dimy}, {0: dimz}),
)
```

Before:
```
Suggested fixes:
  _dimz = Dim('_dimz', min=3, max=9223372036854775807)  # 2 <= _dimz - 1 <= 9223372036854775806
  _dimz - 2 = Dim('_dimz - 2', min=4, max=6)
  _dimz = Dim('_dimz', min=2, max=9223372036854775806)  # 2 <= _dimz <= 9223372036854775806
  _dimz - 1 = _dimz - 1
  dimz = _dimz
```

New suggested fixes:
```
Suggested fixes:
  dimz = _dimz
```

Note: This assumes the specified derived relations between dims are correct. This should be valid because: 1) if the relation is plainly wrong (e.g. (dx, dx - 1) provided with inputs (6, 4)), this gets caught beforehand in produce_guards; 2) if the relation is correct but does not match the emitted guard, for example:
```
def forward(self, x, y):
    return x.reshape([-1]) + y  # guard: s0 * 2 = s1
dx = Dim("dx")
export(
    model,
    (torch.randn(6, 2), torch.randn(12)),
    dynamic_shapes={"x": (dx, 2), "y": (dx + 6, )}
)
```
This produces two linear equations, leading to specialization since a) produce_guards is able to solve for a concrete value, and b) the export constraint solver will force specializations anyway due to range constraints.
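
To see why a concrete value falls out, note that the guard requires the length of `y` to equal `2 * dx` while the spec says it equals `dx + 6`. A minimal sketch of solving that pair follows; `solve_linear` is a hypothetical helper, not part of produce_guards.

```python
from fractions import Fraction
from typing import Optional


def solve_linear(a1: int, b1: int, a2: int, b2: int) -> Optional[Fraction]:
    """Solve a1*d + b1 == a2*d + b2 for d; None if there is no unique solution."""
    if a1 == a2:
        return None
    return Fraction(b2 - b1, a1 - a2)


# Guard: len(y) = 2 * dx.  User spec: len(y) = dx + 6.
dx = solve_linear(2, 0, 1, 6)
print(dx)  # 6

# The only consistent shapes are the ones used in the example inputs.
assert 2 * dx == dx + 6 == 12
```

Because only `dx = 6` satisfies both equations, the dimension cannot remain dynamic.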

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125543
Approved by: https://github.com/avikchaudhuri
2024-05-28 20:41:43 +00:00
cyy
0a9d73a814 Remove c10::guts::bool_constant and c10::guts::negation (#127300)
They are not used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127300
Approved by: https://github.com/r-barnes
2024-05-28 19:55:20 +00:00
03005bb655 Improve the clarity of the torch.Tensor.backward doc (#127201)
Improve the clarity of the torch.Tensor.backward doc, particularly wrt the arg `gradient`.
Reference https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html,
```
We need to explicitly pass a gradient argument in Q.backward() because it is a vector. gradient is a tensor of the same shape as Q, and it represents the gradient of Q w.r.t. itself
```

@janeyx99 feel free to assign to the corresponding reviewers, thanks
Co-authored-by: Jeffrey Wan <soulitzer@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127201
Approved by: https://github.com/soulitzer
2024-05-28 19:25:51 +00:00
f600faf248 [metal] Improve perf of int4pack_mm shader (#127135)
Using vectorized data types and using SIMD groups to optimize memory access pattern

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127135
Approved by: https://github.com/malfet
2024-05-28 18:22:58 +00:00
c9172d4471 print default value in FunctionSignature (#127059)
Fixes #[126758](https://github.com/pytorch/pytorch/issues/126758) and #[126759](https://github.com/pytorch/pytorch/issues/126759)

The output in the issues is inaccurate because `FunctionSignature::toString()` prints the schema strings without defaults.
cb6ef68caa/torch/csrc/utils/python_arg_parser.cpp (L1282-L1283)
This PR adds a `default_value` field that stores the default string to be printed. Alternatively, a new API could convert `default_bool`/`default_int` back to a string, but that is slightly more complicated.
result:
![image](https://github.com/pytorch/pytorch/assets/37650440/f58a4cbf-b0f4-4c81-9106-59f0d35c54ea)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127059
Approved by: https://github.com/janeyx99
2024-05-28 18:04:31 +00:00
045309aa35 [MPS] Enable torch.mm and friends for complex dtypes (#127241)
- Add `supportedFloatingOrComplexType`
- Change dtype check to those
- Extend low-precision fp32 list to complex types
- Mark conv2d as supported now, as it was failing due to tighter accuracy constraints than the same op for the float32 dtype

Fixes https://github.com/pytorch/pytorch/issues/127178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127241
Approved by: https://github.com/janeyx99
2024-05-28 17:56:13 +00:00
829f594d7d [small] guard_size_oblivious, skip check for meta (#127298)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127298
Approved by: https://github.com/ezyang
2024-05-28 17:53:08 +00:00
9521528f71 Log export result of torch.jit.trace to scuba (#126900)
Summary: We want to track at scale how well torch.jit.trace modules can be converted to export. As a first step, for every torch.jit.trace unit test we log whether the traced module can be converted to an exported module or the model can be exported directly.

Test Plan: CI

Differential Revision: D57629682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126900
Approved by: https://github.com/SherlockNoMad
2024-05-28 17:49:34 +00:00
3f79e09515 Revert "Made some minor improvements to flexattention perf + added more autotune configs (#126811)"
This reverts commit 84e59f052d4342ac9453703be55758de102e20d3.

Reverted https://github.com/pytorch/pytorch/pull/126811 on behalf of https://github.com/PaliC due to breaking on V100s / internal tests ([comment](https://github.com/pytorch/pytorch/pull/126811#issuecomment-2135798983))
2024-05-28 17:48:26 +00:00
254783ce80 [Fix]: populate input parameter name when convert TorchScript to ExportedProgram (#126787)
## Goal
As title

## Design
Based on the fact that each TorchScript module has a `code` property which provides the original source code for the `forward` function, I implemented a function to extrapolate `forward` function signature by using the AST parser.

Alternatives considered:
* Parsing the source code directly as a string: very error-prone.
* Using Python's `compile` function to get the function object: raises many exceptions because of missing packages or undefined variable names.
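
A minimal sketch of the AST-based approach, using only the standard `ast` module. `forward_arg_names` and the sample `code` string are hypothetical; the PR's actual helper may differ.

```python
import ast
from typing import List


def forward_arg_names(source: str) -> List[str]:
    """Return the positional parameter names of `forward`, excluding `self`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == "forward":
            return [arg.arg for arg in node.args.args if arg.arg != "self"]
    raise ValueError("no `forward` function found")


# Stand-in for a TorchScript module's `code` property.
code = '''
def forward(self, x: Tensor, y: Tensor) -> Tensor:
    return torch.add(x, y)
'''
print(forward_arg_names(code))  # ['x', 'y']
```

Parsing never evaluates names such as `Tensor` or `torch`, so undefined symbols in the source do not raise.
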
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126787
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
2024-05-28 17:33:44 +00:00
122282111d [inductor][reland] Various improvements to error handling during autotuning (#126847)
This is a reland of [D56764094](https://www.internalfb.com/diff/D56764094) / https://github.com/pytorch/pytorch/pull/125762. It was originally reverted due to rebase conflicts.
Original commit changeset: 45875a1e5de2
Original Phabricator Diff: [D56764094](https://www.internalfb.com/diff/D56764094)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126847
Approved by: https://github.com/chenyang78
2024-05-28 17:22:26 +00:00
df360e2add Update derivatives.yaml (#127193)
Fixed a typo in docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127193
Approved by: https://github.com/soulitzer
2024-05-28 16:56:03 +00:00
cbb79a2baf [export] Disable backend decomps for capture_pre_autograd (#127120)
Differential Revision: D57785713

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127120
Approved by: https://github.com/ydwu4
2024-05-28 16:37:13 +00:00
cyy
c40408850a [1/N] Fix clang-tidy warnings in aten/src/ATen/cuda/ (#127183)
Fixes clang-tidy warnings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127183
Approved by: https://github.com/soulitzer, https://github.com/Skylion007
2024-05-28 15:35:29 +00:00
cyy
3d88c618d5 Concat namespaces in torch/csrc/profiler and other fixes (#127266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127266
Approved by: https://github.com/soulitzer
2024-05-28 15:21:32 +00:00
4d4d2a96f2 Add space in MetaFallbackKernel.cpp error message (#127291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127291
Approved by: https://github.com/Skylion007
2024-05-28 13:54:38 +00:00
a6b994ed54 Fix lint after #126845 (#127286)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127286
Approved by: https://github.com/NicolasHug, https://github.com/DanilBaibak
2024-05-28 12:38:27 +00:00
ec8b254ef4 Refactored template codegen to explicitly set current body when generating code (#127144)
The main motivation for this refactor is that today, when generating templates, this is what happens.

```
def_kernel() # registers hook for fully generating function definition
store_output() # registers hook for generating the output store. *also* keeps a number of things generated on `self.body`.
```

Later on, when we codegen the template: f8c4c268da/torch/_inductor/codegen/simd.py (L1402)

```
epilogue_node.codegen() # Also writes to body!
template.finalize() # Calls the above two hooks for def_kernel and store_output, which then reads from the accumulated `self.body`
```

Today, this is fine as long as `store_output` is the last function called in the template. However, there are a few things we probably want to do with kernels that make this awkward.

1. In FlexAttention backwards, we might want a `modification` to be positioned *after* the `store_output` (just logically from a code organization POV). This doesn't work today because `modification` also needs to codegen a subgraph, but writing to `body` here conflicts with `store_output`'s implicit saved state on `self.body`.
2. If we want to support prologue fusion, we need to go through a bunch of contortions today to call the template hook finalization a couple times (https://github.com/pytorch/pytorch/pull/121211/files#diff-73b89475038a5b4705da805f1217783883fb90398ee1164995db392fc4a342c1R322)
3. The current code also makes it quite difficult to support fusion into multiple output nodes.

To resolve this, I do two things:
1. I *remove* the default `self.body` on `TritonTemplateKernel`. Instead, I have a dict of `self.subgraph_bodies`, which can be enabled in a context with `TritonTemplateKernel.set_subgraph_body`. This allows multiple different template functions to write to their own isolated bodies.
2. I add functions that allow you to finalize specific hooks on `PartialRender`.
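
The idea of isolated per-subgraph bodies can be sketched as follows. This is a minimal toy, not the Inductor implementation: `ToyTemplateKernel`, `writeline`, and the subgraph names are hypothetical; only the `subgraph_bodies` dict and the `set_subgraph_body` context manager are named in the description above.

```python
import contextlib
from typing import Dict, Iterator, List, Optional


class ToyTemplateKernel:
    """Hypothetical stand-in for a template kernel; not the Inductor class."""

    def __init__(self) -> None:
        # One list of generated lines per named subgraph, instead of one shared body.
        self.subgraph_bodies: Dict[str, List[str]] = {}
        self._current: Optional[str] = None

    @contextlib.contextmanager
    def set_subgraph_body(self, name: str) -> Iterator[None]:
        # Activate the body for `name`, restoring the previous one on exit.
        previous = self._current
        self.subgraph_bodies.setdefault(name, [])
        self._current = name
        try:
            yield
        finally:
            self._current = previous

    def writeline(self, line: str) -> None:
        # Writing is only allowed while some subgraph body is active.
        if self._current is None:
            raise RuntimeError("no subgraph body is active")
        self.subgraph_bodies[self._current].append(line)


kernel = ToyTemplateKernel()
with kernel.set_subgraph_body("store_output"):
    kernel.writeline("store the accumulator")
with kernel.set_subgraph_body("modification"):
    kernel.writeline("apply the modification")
```

Because each hook writes into its own list, the order in which `store_output` and `modification` are called no longer matters in this toy.
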
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127144
Approved by: https://github.com/jansel
2024-05-28 09:49:13 +00:00
457b9f7397 Optimize mask memory for flash attention (#126961)
The PR optimizes the mask memory for flash attention. Instead of converting the whole mask to fp32 at once, we do the conversion block by block. This decreases peak memory usage (tested on https://huggingface.co/microsoft/Phi-3-mini-128k-instruct, where peak memory usage drops by ~50%) and also brings some performance improvements.
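
A rough pure-Python sketch of block-wise conversion, under stated assumptions: the real kernel works on tensors inside the attention loop, whereas this toy uses a boolean mask, Python lists, and the hypothetical helpers `iter_mask_blocks` and `apply_mask_blockwise`. The point is only that a full-size float copy of the mask is never materialized.

```python
from typing import Iterator, List, Sequence


def iter_mask_blocks(mask: Sequence[bool], block_size: int) -> Iterator[List[float]]:
    """Yield the mask one block at a time, converted to an additive float mask."""
    for start in range(0, len(mask), block_size):
        block = mask[start:start + block_size]
        # Only this block exists in float form at any given time.
        yield [0.0 if keep else float("-inf") for keep in block]


def apply_mask_blockwise(
    scores: List[float], mask: Sequence[bool], block_size: int
) -> List[float]:
    """Add the mask to the scores block by block."""
    out: List[float] = []
    offset = 0
    for block in iter_mask_blocks(mask, block_size):
        chunk = scores[offset:offset + len(block)]
        out.extend(s + m for s, m in zip(chunk, block))
        offset += len(block)
    return out


masked = apply_mask_blockwise(
    [1.0, 2.0, 3.0, 4.0, 5.0], [True, False, True, True, False], block_size=2
)
```
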

### Performance result
Single socket on Intel(R) Xeon(R) CPU Max 9480
batch_size = 12, q_seq_len = 1030, kv_seq_len = 1179, n_head = 3, head_dim = 33, mask_dim = 4, bool_mask = 0
dtype | Forward speedup | Backward speedup
-- | -- | --
float64 | 0.82% | 3.76%
float32 | 2.2% | 3.9%
bfloat16 | 16.15% | 7.56%

**segment-anything-fast**
Follow https://github.com/pytorch-labs/segment-anything-fast/tree/main/experiments
Single socket on Intel(R) Xeon(R) CPU Max 9480
Dtype: bfloat16; models: vit_b and vit_h; tested with `SDPA` and `Triton` (see https://github.com/pytorch-labs/segment-anything-fast/blob/main/experiments/run_experiments.py#L199-L200); the reported time is that of the 20th iteration.
  | vit_b, attn_mask w/o block-wise | vit_b, attn_mask w/ block-wise | vit_h, attn_mask w/o block-wise | vit_h, attn_mask w/ block-wise
-- | -- | -- | -- | --
SDPA | 10.95s/it | 6.59s/it | 19.93s/it | 12.33s/it
Triton | 10.66s/it | 7.12s/it | 19.87s/it | 12.26s/it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126961
Approved by: https://github.com/Valentine233, https://github.com/jgong5
2024-05-28 09:12:18 +00:00
1507d5205a [dynamo][fsdp] Skip Dynamo tracing of __getattr__ if its top-level frame (#127263)
The generated bytecode for the first frame is below. The inline comments explain the LOAD_ATTR that causes Dynamo to trigger again on `__getattr__`.

~~~
[__bytecode] MODIFIED BYTECODE fn /data/users/anijain/pytorch2/test/dynamo/test_activation_checkpointing.py line 1129
[__bytecode] 1129           0 COPY_FREE_VARS           1
[__bytecode]                2 RESUME                   0
[__bytecode]                4 PUSH_NULL
[__bytecode]                6 LOAD_GLOBAL             10 (__compiled_fn_1)
[__bytecode]               18 LOAD_FAST                0 (x)
[__bytecode]               20 LOAD_DEREF               1 (mod)
[__bytecode]               22 LOAD_ATTR                6 (_checkpoint_wrapped_module)
[__bytecode]               32 LOAD_CONST               1 (0)
[__bytecode]               34 BINARY_SUBSCR
[__bytecode]               44 LOAD_ATTR                7 (weight)
[__bytecode]               54 LOAD_DEREF               1 (mod)
[__bytecode]               56 LOAD_ATTR                6 (_checkpoint_wrapped_module)
[__bytecode]               66 LOAD_CONST               1 (0)
[__bytecode]               68 BINARY_SUBSCR
[__bytecode]               78 LOAD_ATTR                8 (bias)

# When this optimized bytecode is executed, these two lines call the __getattr__ of ActivationWrapper module.
# Dynamo gets invoked on __getattr__.

# If we had inlined __getattr__ during tracing, we would have seen the LOAD_ATTR
# on lower-level data structures like _modules, obviating the need for CPython
# to call the Python-overridden __getattr__. But today, UnspecializedNNModuleVariable
# calls Python getattr at tracing time (instead of inlining it), resulting in LOAD_ATTR
# on the module itself.

# To keep Dynamo from tracing __getattr__ when running the optimized bytecode,
# we can check whether it is the top-level frame and just skip it.

[__bytecode]               88 LOAD_DEREF               1 (mod)
[__bytecode]               90 LOAD_ATTR                0 (a)

[__bytecode]              100 PRECALL                  4
[__bytecode]              104 CALL                     4
[__bytecode]              114 UNPACK_SEQUENCE          1
[__bytecode]              118 RETURN_VALUE
~~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127263
Approved by: https://github.com/yf225
2024-05-28 08:16:53 +00:00
cyy
d6e3e89804 Remove c10::void_t (#127248)
OSS version doesn't use it anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127248
Approved by: https://github.com/ezyang
2024-05-28 06:59:20 +00:00
246311c944 Unconditionally add asserts after export (#127132)
Summary: Today AOTAutograd drops some of the assert nodes, so we reapply them after strict export.

Test Plan: CI

Reviewed By: angelayi

Differential Revision: D57786907

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127132
Approved by: https://github.com/zhxchen17
2024-05-28 06:31:39 +00:00
cyy
e4b245292f Remove caffe2::tensorrt target code from cuda.cmake (#127204)
Following #126542.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127204
Approved by: https://github.com/ezyang
2024-05-28 04:42:14 +00:00
cyy
c6b36ec2f9 Remove calls of deprecated _aminmax (#127182)
While #125995 is pending, the calls should be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127182
Approved by: https://github.com/ezyang
2024-05-28 03:51:45 +00:00
d957c2d5de [Doc] update default magma cuda version in readme (#122125)
Since we use cuda 12.1 by default now, it would be better to update the doc.

Many people (including me) want to directly copy-paste commands from the readme 😉  Let's make our lives easier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122125
Approved by: https://github.com/malfet
2024-05-28 03:37:23 +00:00
7c61e7be5c Address issue #125307 (#126351)
PyTorch overrides SymPy's Mod and does its own symbolic simplification. Inspired by issue #125307, this PR adds one more simplification tactic.

Fixes #125307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126351
Approved by: https://github.com/ezyang
2024-05-28 02:03:24 +00:00
8979412442 Enable ufmt format on test files (#126845)
Fixes some files in  #123062

Run lintrunner on files:

test/test_nnapi.py,
test/test_numba_integration.py,
test/test_numpy_interop.py,
test/test_openmp.py,
test/test_optim.py

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126845
Approved by: https://github.com/ezyang
2024-05-28 01:42:07 +00:00
cyy
57000708fc Remove c10::invoke_result (#127160)
Following #124169, it can be safely removed from the OSS version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127160
Approved by: https://github.com/ezyang
2024-05-28 01:39:28 +00:00
cyy
6436a6407d Enable Wunused-variable on tests (#127161)
This PR enables unused-variable warnings in tests and fixes some test code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127161
Approved by: https://github.com/ezyang
2024-05-28 01:37:46 +00:00
cyy
70d8bc2da1 Fix various errors in TCPStoreLibUvBackend.cpp (#127230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127230
Approved by: https://github.com/Skylion007
2024-05-27 19:14:01 +00:00
0ff2f8b522 update kineto submodule hash (#126780)
Summary: update kineto submodule hash

Test Plan: CIs

Differential Revision: D57620964

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126780
Approved by: https://github.com/Skylion007
2024-05-27 18:11:48 +00:00
25a9262ba4 Add structured logging for fx graph cache hash (#127156)
Summary: Add structured logging for fx graph cache hash so that we can debug MAST jobs easily.

Test Plan: ad hoc testing

Differential Revision: D57791537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127156
Approved by: https://github.com/jamesjwu
2024-05-27 17:18:41 +00:00
26f4f10ac8 [5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
2024-05-27 14:49:57 +00:00
c7f6fbfa9d Revert "[FSDP2] Added test for N-way TP and 1-way FSDP with CPU offloading (#127024)"
This reverts commit 9117779b0a178ec5ca548585a97bcb44be631644.

Reverted https://github.com/pytorch/pytorch/pull/127024 on behalf of https://github.com/atalman due to failing in CI ([comment](https://github.com/pytorch/pytorch/pull/127024#issuecomment-2133566325))
2024-05-27 14:12:09 +00:00
7121ea6f70 Revert "Add compile time profiler for non fbcode targets (#126904)"
This reverts commit 575cb617db4043dd7a76aaf523dc3ab7ee07e7a5.

Reverted https://github.com/pytorch/pytorch/pull/126904 on behalf of https://github.com/atalman due to Broke nightly smoke test ([comment](https://github.com/pytorch/pytorch/pull/126904#issuecomment-2133418687))
2024-05-27 12:52:09 +00:00
00fe0a0d79 Revert "Remove more of caffe2 (#126705)"
This reverts commit f95dbc12761cb4466099b0e9a3667057ca39272b.

Reverted https://github.com/pytorch/pytorch/pull/126705 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126705#issuecomment-2133325449))
2024-05-27 11:59:14 +00:00
1110edb94b Fix stream type to generic in comms default hooks (#120069)
In the comms `default_hooks`, the decompress stream is hardcoded to the CUDA type. This fixes it to use a generic stream type based on the device of the grad tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120069
Approved by: https://github.com/jgong5, https://github.com/fegin
2024-05-27 10:27:30 +00:00
55c0ab2887 Revert "[5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)"
This reverts commit 7763c83af67eebfdd5185dbe6ce15ece2b992a0f.

Reverted https://github.com/pytorch/pytorch/pull/127126 on behalf of https://github.com/XuehaiPan due to Broken CI ([comment](https://github.com/pytorch/pytorch/pull/127126#issuecomment-2133044286))
2024-05-27 09:22:08 +00:00
4608971f7a Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021)"
This reverts commit 0d1e22855022a04a8601a2d94f3079950283ba5d.

Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2133002071))
2024-05-27 09:01:45 +00:00
343a41fba8 Revert "[inductor][cpp] epilogue support for gemm template (#126019)"
This reverts commit 56c412d9063de3dc8163b8e1b0b9b5bf9581ad05.

Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2133002071))
2024-05-27 09:01:45 +00:00
68fddebf84 Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)"
This reverts commit 4aa43d11f332b2d7b8f19b4da5ceba612133889d.

Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2133002071))
2024-05-27 09:01:45 +00:00
ed9951ace7 Revert "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545)"
This reverts commit 43baabe9b94c86bd36ba4a00f501e52d833d7ec8.

Reverted https://github.com/pytorch/pytorch/pull/126545 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2133002071))
2024-05-27 09:01:45 +00:00
4c2e671a3b Revert "[Inductor][CPP] Add Min/Max with VecMask (#126841)"
This reverts commit 1ef4306ab11410a506e0868543a466e87ea879b5.

Reverted https://github.com/pytorch/pytorch/pull/126841 on behalf of https://github.com/DanilBaibak due to Blocks reverting of the broken PR ([comment](https://github.com/pytorch/pytorch/pull/126841#issuecomment-2132995404))
2024-05-27 08:58:01 +00:00
5247446396 Revert "[Inductor][CPP] Add ne with VecMask (#126940)"
This reverts commit f8c4c268da67e9684f3287b7468f36a5a27c6a0b.

Reverted https://github.com/pytorch/pytorch/pull/126940 on behalf of https://github.com/DanilBaibak due to Blocks reverting of the broken PR ([comment](https://github.com/pytorch/pytorch/pull/126841#issuecomment-2132995404))
2024-05-27 08:58:01 +00:00
60523fa674 Revert "Move MKLDNN Specific IR to Separate File (#126504)"
This reverts commit bf2909b871579a78e841b661b9b0c302f311d010.

Reverted https://github.com/pytorch/pytorch/pull/126504 on behalf of https://github.com/DanilBaibak due to Blocks reverting of the broken PR ([comment](https://github.com/pytorch/pytorch/pull/126841#issuecomment-2132995404))
2024-05-27 08:58:01 +00:00
ff63e8bac8 [CI] fix doctest case by adding requires (#126855)
With the triton update, the new dependency `llnl-hatchet` will be introduced. And `pydot` is a dependency of `llnl-hatchet`. So the doctest case `torch/fx/passes/graph_drawer.py::FxGraphDrawer.get_dot_graph:0` won't be skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126855
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/peterbell10
2024-05-27 07:40:27 +00:00
22712ba5c5 Radam support the flag for "maximize" (#126765)
Fixes #[126642](https://github.com/pytorch/pytorch/issues/126642)

This follows the `maximize` implementation in `Adam` and adds the `maximize` flag to `RAdam`. If this PR is OK, I will open another PR for `NAdam`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126765
Approved by: https://github.com/janeyx99
2024-05-27 06:34:50 +00:00
cyy
5cca904c51 [3/N] Enable clang-tidy in aten/src/ATen/detail/ (#127184)
Following #127168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127184
Approved by: https://github.com/jansel
2024-05-27 06:28:07 +00:00
1c2e221e25 CUDA 12.4 ARM wheel integration to CD - nightly build (#126174)
Rebases https://github.com/pytorch/pytorch/pull/124112.
That PR had too many conflicting files, so this starts a new PR.

Test https://github.com/pytorch/builder/pull/1775 (merged) for ARM wheel addition
Test https://github.com/pytorch/builder/pull/1828 (merged) for setting MAX_JOBS

Current issue to follow up:
https://github.com/pytorch/pytorch/issues/126980

Co-authored-by: Aidyn-A <aidyn.b.aitzhan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126174
Approved by: https://github.com/nWEIdia, https://github.com/atalman
2024-05-27 05:50:36 +00:00
7763c83af6 [5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
2024-05-27 04:22:18 +00:00
cyy
4fdbaa794f [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-27 03:54:03 +00:00
6aa5bb1a76 [inductor] Support persistent reductions for dynamic shapes (#126684)
Currently, persistent reductions are only supported when the reduction dimension
is static; however, we only really need to know that the rnumel is bounded.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126684
Approved by: https://github.com/lezcano
2024-05-27 02:30:20 +00:00
bf2909b871 Move MKLDNN Specific IR to Separate File (#126504)
**Summary**
Following the discussion in https://github.com/pytorch/pytorch/pull/122593#discussion_r1604144782, move the Inductor MKLDNN-specific IRs to a separate file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126504
Approved by: https://github.com/desertfire, https://github.com/jgong5
ghstack dependencies: #126841, #126940
2024-05-27 00:48:09 +00:00
39de62845a [decomp] Fix default values missing from inplace rrelu decomposition (#126978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126978
Approved by: https://github.com/lezcano
2024-05-26 23:49:40 +00:00
06934518a2 [AMD] Fix deprecated amdsmi api (#126962)
Summary: https://github.com/pytorch/pytorch/pull/119182 uses an API that has already been deprecated by c551c3caed. This fixes it in a backward-compatible way.

Differential Revision: D57711088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126962
Approved by: https://github.com/eqy, https://github.com/izaitsevfb
2024-05-26 20:11:23 +00:00
ee6cb6daa1 Turn the mutation dependency of MutationOutput to weak deps (#127151)
A writeup of how mutation works in Inductor: https://docs.google.com/document/d/1P0fSq4Nm-3CvdUe9v-mLdEWD3dgIHUf1czQXMmQsuxc/edit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127151
Approved by: https://github.com/oulgen
ghstack dependencies: #127148, #127149
2024-05-26 01:21:03 +00:00
f8c4c268da [Inductor][CPP] Add ne with VecMask (#126940)
**Summary**
Fixes https://github.com/pytorch/pytorch/issues/126824#issuecomment-2125039161, which reports missing support for `ne` with `VecMask`.

**Test Plan**
```
python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_ne_cpu_bool
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126940
Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #126841
2024-05-25 23:54:48 +00:00
1ef4306ab1 [Inductor][CPP] Add Min/Max with VecMask (#126841)
**Summary**
Fixes https://github.com/pytorch/pytorch/issues/126824, which reports missing support for `min/max` with `VecMask`.

**Test Plan**
```
python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_clamp_max_cpu_bool
python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_clamp_min_cpu_bool
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126841
Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10
2024-05-25 23:52:21 +00:00
b8ee7d0cc1 Change direct uses of MutationOutput to mark_node_as_mutating (#127149)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127149
Approved by: https://github.com/oulgen
ghstack dependencies: #127148
2024-05-25 23:47:39 +00:00
3817c4f9fa Unify add_fake_dep and add_mutation_dep, as they're literally the same thing (#127148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127148
Approved by: https://github.com/oulgen
2024-05-25 23:47:39 +00:00
cyy
9bead53519 [2/N] Fix clang-tidy warnings in aten/src/ATen/detail/ (#127168)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127168
Approved by: https://github.com/Skylion007
2024-05-25 22:50:02 +00:00
a28bfb5ed5 [4/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort functorch (#127125)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127125
Approved by: https://github.com/Skylion007
ghstack dependencies: #127122, #127123, #127124
2024-05-25 22:45:38 +00:00
35ea5c6b22 [3/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torchgen (#127124)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127124
Approved by: https://github.com/Skylion007
ghstack dependencies: #127122, #127123
2024-05-25 19:20:03 +00:00
0dae2ba5bd [2/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort caffe2 (#127123)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127123
Approved by: https://github.com/Skylion007
ghstack dependencies: #127122
2024-05-25 18:26:34 +00:00
da141b096b Enable UFMT on test/test_hub.py (#127155)
Partially addresses #123062

Ran lintrunner on:
test/test_hub.py

Detail:
```
$ lintrunner -a --take UFMT test/test_hub.py
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127155
Approved by: https://github.com/Skylion007
2024-05-25 18:23:24 +00:00
12d11fe4e5 Revert "reset dynamo cache before each test (#126586)"
This reverts commit bd24991f461476036d6ba20fed92651c7e46ef7c.

Reverted https://github.com/pytorch/pytorch/pull/126586 on behalf of https://github.com/malfet due to Broke tons of tests, see bd24991f46  ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2131365576))
2024-05-25 17:17:19 +00:00
71eafe9e97 Refactor dispatch logic to clarify control flow (#126402)
As discussed, this cleans up the code so that create_aot_dispatcher literally chooses an aot_dispatch function and runs it. Moves wrapper logic to jit_compile_runtime_wrappers, and adds aot_dispatch_export to handle export cases in one place.

This also makes aot_dispatch_* return the same type always: a Callable and the forward metadata, instead of returning different number of arguments in export cases. Callers that don't care about fw_metadata can just ignore it. Added return type hints to enforce the same exact interface among all the aot_dispatch_* functions.
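
A uniform dispatch interface of this kind might look roughly like the sketch below. All names here (`ToyForwardMetadata`, `toy_dispatch_base`, `toy_dispatch_export`, `toy_create_dispatcher`) are hypothetical stand-ins and do not mirror the real AOTAutograd signatures.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class ToyForwardMetadata:
    """Hypothetical stand-in for the forward metadata."""
    mode: str


# Every dispatch function shares this return type.
DispatchResult = Tuple[Callable[..., Any], ToyForwardMetadata]


def toy_dispatch_base(fn: Callable[..., Any]) -> DispatchResult:
    return fn, ToyForwardMetadata(mode="base")


def toy_dispatch_export(fn: Callable[..., Any]) -> DispatchResult:
    return fn, ToyForwardMetadata(mode="export")


def toy_create_dispatcher(fn: Callable[..., Any], *, export: bool) -> DispatchResult:
    # Choose one dispatch function, then run it; no wrapper logic lives here.
    dispatch = toy_dispatch_export if export else toy_dispatch_base
    return dispatch(fn)


# A caller that does not care about the metadata simply ignores it.
compiled, _ = toy_create_dispatcher(lambda x: x + 1, export=False)
_, fw_metadata = toy_create_dispatcher(lambda x: x + 1, export=True)
```
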

It'd be nice to move the checks from the synthetic base and dedup wrappers that have to do with export outside of those wrappers, but it's probably fine for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126402
Approved by: https://github.com/oulgen, https://github.com/bdhirsh
ghstack dependencies: #126193
2024-05-25 16:06:34 +00:00
7642cdef25 Improve fusable_read_and_write() (#127061)
Related to https://github.com/pytorch/pytorch/issues/98467

The tacotron2 benchmark creates a lot of nodes which fusion then checks. This improves some of the perf of that checking.

`can_fuse_vertical` calls `fusable_read_and_write` on O(read deps * write deps) combinations - but only cares about write deps that are MemoryDeps - so do the isinstance check outside the inner loop to save O(read deps) when it won't matter anyway.

Also moves `fusable_read_and_write` to an instance method (instead of a closure) since it doesn't actually capture any variables.

I also tried pre-splitting the read deps into `StarDep` vs `MemoryDep` but that didn't actually make any perf difference.
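
The hoisting described above can be illustrated with a small sketch. `ToyMemoryDep`, `ToyStarDep`, and the name-equality match are hypothetical simplifications; the real fusability check is more involved.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union


@dataclass(frozen=True)
class ToyMemoryDep:
    name: str


@dataclass(frozen=True)
class ToyStarDep:
    name: str


ToyDep = Union[ToyMemoryDep, ToyStarDep]


def fusable_pairs(
    reads: Sequence[ToyDep], writes: Sequence[ToyDep]
) -> List[Tuple[ToyDep, ToyDep]]:
    pairs = []
    for write in writes:
        # Type check done once per write, outside the inner loop:
        # a non-MemoryDep write skips all O(read deps) comparisons.
        if not isinstance(write, ToyMemoryDep):
            continue
        for read in reads:
            # Toy criterion (assumption): a read and write match by buffer name.
            if read.name == write.name:
                pairs.append((read, write))
    return pairs


pairs = fusable_pairs(
    reads=[ToyMemoryDep("buf0"), ToyStarDep("buf1")],
    writes=[ToyStarDep("buf0"), ToyMemoryDep("buf1"), ToyMemoryDep("buf0")],
)
```
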

Testing:
```
time python benchmarks/dynamo/torchbench.py --accuracy --inference --amp --backend inductor --disable-cudagraphs --device cuda --only tacotron2
```
Before this change: 10m15s
After this change: 9m31s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127061
Approved by: https://github.com/peterbell10, https://github.com/jansel
ghstack dependencies: #127060
2024-05-25 15:17:25 +00:00
6c79299a35 Improve score_fusion_memory() (#127060)
Related to #98467

The tacotron2 benchmark creates a lot of nodes which fusion then checks. This
improves some of the perf of that checking.

`score_fusion_memory` is called O(n^2) times - so by moving the set union, `has_unbacked_symbols` check, and `numbytes_hint` out of the loop we call them O(n) times and the O(n^2) call gets cheaper.
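
As a rough illustration of moving per-node work out of the pairwise loop, the sketch below computes each node's read/write union once at construction time, so the O(n^2) scoring step only does an intersection. `ToyNode`, `score_pair`, and the `sizes` table are hypothetical and much simpler than the scheduler's real data structures.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


class ToyNode:
    def __init__(self, name: str, reads: Iterable[str], writes: Iterable[str]) -> None:
        self.name = name
        self.reads = frozenset(reads)
        self.writes = frozenset(writes)
        # Computed once per node (O(n) in total) rather than once per pair.
        self.read_writes = self.reads | self.writes


def score_pair(a: ToyNode, b: ToyNode, sizes: Dict[str, int]) -> int:
    # The per-pair work is now just an intersection and a sum.
    common = a.read_writes & b.read_writes
    return sum(sizes[buf] for buf in common)


def score_all(nodes: List[ToyNode], sizes: Dict[str, int]) -> Dict[Tuple[str, str], int]:
    return {
        (a.name, b.name): score_pair(a, b, sizes)
        for a, b in combinations(nodes, 2)
    }


sizes = {"x": 4, "y": 8, "z": 2, "w": 1}
nodes = [
    ToyNode("a", reads=["x"], writes=["y"]),
    ToyNode("b", reads=["y"], writes=["z"]),
    ToyNode("c", reads=["x", "z"], writes=["w"]),
]
scores = score_all(nodes, sizes)
```
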

Testing:
```
time python benchmarks/dynamo/torchbench.py --accuracy --inference --amp --backend inductor --disable-cudagraphs --device cuda --only tacotron2
```

Before this change: 12m33s
After this change: 10m15s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127060
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-05-25 15:17:25 +00:00
ba3b05fdf3 [1/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort stdlib (#127122)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122
Approved by: https://github.com/kit1980
2024-05-25 08:25:50 +00:00
4a997de8b9 [AOTI] support freezing for MKLDNN (#124350)
## Description
Fixes https://github.com/pytorch/pytorch/issues/114450. This PR builds upon the work from @imzhuhl done in https://github.com/pytorch/pytorch/pull/114451.

This PR requires https://github.com/pytorch/pytorch/pull/122472 to land first.

We leverage the serialization and deserialization API from oneDNN v3.4.1 to save the opaque MKLDNN tensor during compilation and restore it when loading the compiled .so.
The ideep version is updated so that we won't break any pipeline even if third_party/ideep is not updated at the same time.

### Test plan:
```sh
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_conv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_deconv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_linear_freezing_non_abi_compatible_cpu
```

### TODOs in follow-up PRs
1. We found that using `AOTI_TORCH_CHECK` causes a performance drop on several models (`DistillGPT2`, `MBartForConditionalGeneration`, `T5ForConditionalGeneration`, `T5Small`) compared with JIT Inductor, which uses `TORCH_CHECK`. How to address this may need further discussion (`AOTI_TORCH_CHECK` was introduced in https://github.com/pytorch/pytorch/pull/119220).
2. Freezing in non-ABI-compatible mode works with the support in this PR. For ABI-compatible mode, we first need to address this issue: `AssertionError: None, i.e. optional output is not supported`.
6c4f43f826/torch/_inductor/codegen/cpp_wrapper_cpu.py (L2023-L2024)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124350
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-05-25 07:15:36 +00:00
e7a42702f9 generalize custom_fwd&custom_bwd to be device-agnostic (#126531)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126531
Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #126527
2024-05-25 06:48:16 +00:00
c09205a057 Deprecate device-specific GradScaler autocast API (#126527)
# Motivation

## for `torch.amp.GradScaler`,
- `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cpu", args...)`.
- `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`.

So, we intend to deprecate them and **strongly recommend** that developers use `torch.amp.GradScaler`.

## for `custom_fwd` and `custom_bwd`,
These decorators let a custom function run correctly whether or not it is called inside an autocast-enabled region, and the approach can be shared by other backends, such as CPU and XPU.
So we generalize them to be device-agnostic, move them into `torch/amp/autocast_mode.py`, and re-expose them as `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`.

# Additional Context
Add a UT to cover the deprecation warning.
No additional UTs are needed for the functionality of `torch.amp.custom_f/bwd`; the existing UTs that covered `torch.cuda.amp.custom_f/bwd` also cover them.
To facilitate the review, we split these code changes into two PRs. The first PR covers `torch.amp.GradScaler`. The follow-up covers `custom_fwd` and `custom_bwd`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527
Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang
2024-05-25 06:41:34 +00:00
ef86a27dba Mark test_set_per_process_memory_fraction serial (#127087)
Occasionally OOMs

Also should probably give the entire GPU for this anyways
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127087
Approved by: https://github.com/huydhn
2024-05-25 06:26:47 +00:00
0f67d38f0f add TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS (#127017)
tlparse prints a failure description like this:

> dynamic shape operator: aten._unique2.default; to enable, set torch._dynamo.config.capture_dynamic_output_shape_ops = True

This PR adds an OS environment variable so the flag is easier to set for testing.
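
An environment-variable override like this is typically read once when the config module loads. The sketch below shows the general pattern only; the helper `read_capture_flag` is hypothetical, and treating `"1"` as the enabling value is an assumption rather than something stated in this commit.

```python
import os
from typing import Mapping

ENV_VAR = "TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS"


def read_capture_flag(environ: Mapping[str, str] = os.environ) -> bool:
    # Assumption: "1" enables the flag; anything else keeps the default (False).
    return environ.get(ENV_VAR, "0") == "1"


# Stand-in for the config attribute named in the tlparse message above.
capture_dynamic_output_shape_ops = read_capture_flag()
```

Passing the environment in as a mapping keeps the helper easy to exercise in a test without mutating `os.environ`.
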

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127017
Approved by: https://github.com/jackiexu1992
2024-05-25 05:42:41 +00:00
84e59f052d Made some minor improvements to flexattention perf + added more autotune configs (#126811)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126811
Approved by: https://github.com/drisspg, https://github.com/yanboliang, https://github.com/Neilblaze
2024-05-25 05:03:31 +00:00
cyy
9f11fc666a [1/N] Fix clang-tidy warnings in aten/src/ATen/detail/ (#127057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127057
Approved by: https://github.com/Skylion007
2024-05-25 04:55:52 +00:00
bd24991f46 reset dynamo cache before each test (#126586)
In https://github.com/pytorch/pytorch/issues/125967, we found that test results depend on test order. The root cause is that earlier tests populate the dynamo cache and affect later tests.

This PR clears the dynamo cache before each unit test so that unit test results are more deterministic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126586
Approved by: https://github.com/jansel
2024-05-25 04:48:09 +00:00
8bd26ecf0b [pipelining] test composability with DDP and FSDP (#127066)
Added to `multigpu` test config, which is run periodically.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127066
Approved by: https://github.com/H-Huang, https://github.com/wconstab
ghstack dependencies: #127136, #126931
2024-05-25 04:30:40 +00:00
c1d2564acf [pipelining] Add grad test for interleaved schedules (#126931)
Added `test_grad_with_manual_interleaved`:
- Model: `MultiMLP`
- Tested schedules: Interleaved1F1B, LoopedBFS
- Two stages per rank
```
Rank 0 stages: [0, 2]
Rank 1 stages: [1, 3]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126931
Approved by: https://github.com/wconstab
ghstack dependencies: #127136
2024-05-25 04:13:28 +00:00
eaace67444 [pipelining] do not check inputs for non-0 stages (#127136)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127136
Approved by: https://github.com/wconstab
2024-05-25 04:13:28 +00:00
cc9a3412d4 Implement a post_compile step for aot_dispatch_autograd (#126193)
This PR moves the post compile portion of aot_dispatch_autograd into runtime_wrappers.py. Completing this allows us to run the post compile section on its own when warm starting.

I considered leaving this thing in jit_compile_runtime_wrappers, but we're gonna run into circular dependency issues later if we don't move it over
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126193
Approved by: https://github.com/bdhirsh
ghstack dependencies: #126907
2024-05-25 03:24:20 +00:00
52bcf120e5 Make inductor config hashing more portable (#127022)
Summary: masnesral and I noticed that the config contains non-portable artifacts. Let's fix that.

Test Plan: adhoc testing

Differential Revision: D57748025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127022
Approved by: https://github.com/masnesral
2024-05-25 03:01:33 +00:00
665637714f Remove SparseAdam weird allowance of raw Tensor input (#127081)
This continues the full deprecation after https://github.com/pytorch/pytorch/pull/114425. It's been 6 months! And I'm fairly certain no one is going to yell at me as this patch is not really used.

------

# BC Breaking note

As of this PR, SparseAdam becomes consistent with the rest of our optimizers: it only accepts containers of Tensors/Parameters/param groups, which completes the deprecation of the raw-Tensor path. Previously, the SparseAdam constructor allowed a raw tensor as the `params` argument. Now, the following code raises an error similar to every other optimizer: "params argument given to the optimizer should be an iterable of Tensors or dicts"

```
import torch
param = torch.rand(16, 32)
optimizer = torch.optim.SparseAdam(param)
```

Instead you should replace the last line with
```
optimizer = torch.optim.SparseAdam([param])
```
to no longer error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127081
Approved by: https://github.com/soulitzer
2024-05-25 02:58:24 +00:00
cyy
29a1f62f23 Replace c10::invoke_result with std::invoke_result (#124169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124169
Approved by: https://github.com/swolchok
2024-05-25 02:42:13 +00:00
9ef6f8dfc1 Fix typo in inductor workflow for CUDA 12.4 jobs (#127121)
Discovered by @clee2000.  The change was introduced in https://github.com/pytorch/pytorch/pull/121956
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127121
Approved by: https://github.com/clee2000, https://github.com/Skylion007
2024-05-25 02:36:39 +00:00
ed838793df [pipelining] Remove qualname mapping (#127018)
`QualnameMapMixin` was intended to provide a mapping from the new FQNs of the piped model to the FQNs of the original model. It was there because previous tracers and flattening during tracing would modify the FQNs.

Now that we use the unflattener, the FQNs of the stage modules are the same as the original FQNs. We don't need `QualnameMapMixin` any more.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127018
Approved by: https://github.com/H-Huang
2024-05-25 02:32:40 +00:00
5f15110499 Update dispatch stub to make SDPA routing cleaner (#126832)
# Summary

Adds a public method to dispatchstub to check if a fn has been registered for a device. We use this new function to clean up the dispatching logic for SDPA, as well as make the private use dispatching simpler:
#126392
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126832
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-05-25 01:40:53 +00:00
db9c6aeec6 Revert "Skip test_memory_format_nn_BatchNorm2d in inductor (#125970)" (#126594)
This reverts commit 0a9c6e92f8d1a35f33042c8dab39f23b7f39d6e7.

Re-enable the test since it's fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126594
Approved by: https://github.com/huydhn
ghstack dependencies: #126593
2024-05-25 01:27:02 +00:00
b03dc3d167 don't check memory format for empty tensors (#126593)
Fix https://github.com/pytorch/pytorch/issues/125967. The test actually fails for empty 4D or 5D tensors when checking the memory format.

I'm not exactly sure which recent inductor change caused the failure, but it may not be that important to maintain strides for an empty tensor. (?)

I just skip the check for empty tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126593
Approved by: https://github.com/ezyang
2024-05-25 01:19:45 +00:00
84f8cd22ac [dynamo][TensorVariable] Support "if param.grad_fn" usecase (#126960)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126960
Approved by: https://github.com/jansel
ghstack dependencies: #126922
2024-05-25 01:09:26 +00:00
bbeb0906c4 Register creak_node_hook (#126671)
Differential Revision: D57469157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126671
Approved by: https://github.com/angelayi
2024-05-24 23:32:15 +00:00
72f0bdcc22 Remove torch._constrain_as_value (#127103)
Summary: This API doesn't do anything useful and should be subsumed by torch._check.

Test Plan: CI

Differential Revision: D57786740

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127103
Approved by: https://github.com/angelayi
2024-05-24 22:49:46 +00:00
d5bf3a98db [inductor] Refactor indexing() into triton.py (#127047)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127047
Approved by: https://github.com/shunting314
ghstack dependencies: #126944, #126945
2024-05-24 22:46:20 +00:00
92433217cb [inductor] Misc refactors (#126945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126945
Approved by: https://github.com/shunting314
ghstack dependencies: #126944
2024-05-24 22:46:20 +00:00
1b6e3e3bcb [inductor] Refactor part of IterationRangesEntry into triton.py (#126944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126944
Approved by: https://github.com/shunting314
2024-05-24 22:46:20 +00:00
83617017e0 [dtensor][debug] add c10d allreduce_coalesced_ tracing to CommDebugMode (#127040)
**Summary**
Added c10d all_reduce_coalesced tracing to CommDebugMode and added test case to test_comm_mode.py.

**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127040
Approved by: https://github.com/XilunWu
ghstack dependencies: #127025, #127029
2024-05-24 22:25:44 +00:00
59052071b7 Disallow fusions of foreach and reductions (#127048)
Fixes https://github.com/pytorch/pytorch/issues/120857

This currently isn't supported until we enable foreach reduction kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127048
Approved by: https://github.com/weifengpy
2024-05-24 21:35:06 +00:00
023c1baf82 Add global configurations to cache key (#126907)
This adds a bunch of global configurations to the cache key. There's definitely more I haven't added, but this is just an audit of all of the `torch.*` globals that are used in jit_compile_runtime_wrappers, runtime_wrappers, etc.

It also makes the hash details object subclass FXGraphHashDetails, which implements other hashed data like configs inductor depends on.
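
A minimal sketch of why global configuration belongs in the cache key. This is not the `FXGraphHashDetails` implementation; `cache_key` and the config names are hypothetical, and the only assumption is that the key is a hash over both the graph and the globals that influence compilation.

```python
import hashlib
import json


def cache_key(graph_repr: str, global_config: dict) -> str:
    """Hash the graph together with the global settings that affect compilation."""
    # sort_keys makes the key independent of dictionary insertion order.
    payload = json.dumps({"graph": graph_repr, "config": global_config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base = cache_key("graph-0", {"grad_enabled": True, "deterministic": False})
same = cache_key("graph-0", {"deterministic": False, "grad_enabled": True})
changed = cache_key("graph-0", {"grad_enabled": False, "deterministic": False})
```

Here `base` and `same` agree, while `changed` differs, so a compiled artifact cached under one global setting is not reused under another.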

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126907
Approved by: https://github.com/aorenste
2024-05-24 21:26:46 +00:00
c133665d4a [CUDA] Parallelize upsampling OPS across the batch/channel dimension. (#127082)
This can make this operation 200x+ faster on modern GPUs for small grid sizes, as otherwise this kernel is scheduled with a single block (!)

Tested on A100 with:
```
python test/test_nn.py TestNNDeviceTypeCUDA
```

**Benchmarks FW**
Ran on A100 / bf16
## Forward pass benchmarks

| batch size | input size | output size | before runtime (mem bandwidth) | after runtime (mem bandwidth) | speedup |
|------------|------------|-------------|------------------|-----------------|---------|
| 768 | 16x16 | 6x6 | 5855us (0.07 GB/s) | 38us (10 GB/s) | 154x |
| 768 | 16x16 | 7x7 | 5214us (0.08 GB/s) | 37us (11 GB/s) | 138x |
| 768 | 16x16 | 14x14 | 2314us (0.27 GB/s) | 36us (17 GB/s) | 63x |
| 768 | 16x16 | 16x16 | 1232us (0.59 GB/s) | 33us (21 GB/s) | 36x |
| 768 | 32x32 | 6x6 | 19442us (0.07 GB/s) | 98us (15 GB/s) | 197x |
| 768 | 32x32 | 7x7 | 16918us (0.09 GB/s) | 89us (17 GB/s) | 188x |
| 768 | 32x32 | 14x14 | 6023us (0.28 GB/s) | 69us (25 GB/s) | 86x |
| 768 | 32x32 | 16x16 | 3455us (0.52 GB/s) | 55us (32 GB/s) | 62x |
| 768 | 48x48 | 6x6 | 38597us (0.08 GB/s) | 179us (18 GB/s) | 214x |
| 768 | 48x48 | 7x7 | 34700us (0.09 GB/s) | 163us (20 GB/s) | 211x |
| 768 | 48x48 | 14x14 | 10647us (0.33 GB/s) | 112us (31 GB/s) | 94x |
| 768 | 48x48 | 16x16 | 7388us (0.49 GB/s) | 100us (36 GB/s) | 73x |
| 768 | 64x64 | 6x6 | 76288us (0.07 GB/s) | 310us (19 GB/s) | 246x |
| 768 | 64x64 | 7x7 | 54981us (0.1 GB/s) | 257us (23 GB/s) | 213x |
| 768 | 64x64 | 14x14 | 16565us (0.37 GB/s) | 169us (36 GB/s) | 97x |
| 768 | 64x64 | 16x16 | 12037us (0.51 GB/s) | 141us (43 GB/s) | 84x |
| 1024 | 16x16 | 6x6 | 8123us (0.06 GB/s) | 44us (12 GB/s) | 183x |
| 1024 | 16x16 | 7x7 | 7017us (0.08 GB/s) | 45us (12 GB/s) | 155x |
| 1024 | 16x16 | 14x14 | 3150us (0.27 GB/s) | 45us (18 GB/s) | 69x |
| 1024 | 16x16 | 16x16 | 1695us (0.57 GB/s) | 41us (23 GB/s) | 40x |
| 1024 | 32x32 | 6x6 | 25918us (0.07 GB/s) | 120us (16 GB/s) | 214x |
| 1024 | 32x32 | 7x7 | 22622us (0.09 GB/s) | 108us (18 GB/s) | 208x |
| 1024 | 32x32 | 14x14 | 8245us (0.28 GB/s) | 87us (26 GB/s) | 94x |
| 1024 | 32x32 | 16x16 | 4599us (0.53 GB/s) | 68us (35 GB/s) | 67x |
| 1024 | 48x48 | 6x6 | 51486us (0.08 GB/s) | 219us (20 GB/s) | 234x |
| 1024 | 48x48 | 7x7 | 46501us (0.09 GB/s) | 202us (22 GB/s) | 229x |
| 1024 | 48x48 | 14x14 | 14280us (0.33 GB/s) | 145us (32 GB/s) | 98x |
| 1024 | 48x48 | 16x16 | 9877us (0.49 GB/s) | 125us (39 GB/s) | 79x |
| 1024 | 64x64 | 6x6 | 101731us (0.07 GB/s) | 378us (20 GB/s) | 268x |
| 1024 | 64x64 | 7x7 | 73465us (0.1 GB/s) | 320us (24 GB/s) | 229x |
| 1024 | 64x64 | 14x14 | 22109us (0.37 GB/s) | 218us (37 GB/s) | 101x |
| 1024 | 64x64 | 16x16 | 16081us (0.51 GB/s) | 178us (46 GB/s) | 90x |
| 1536 | 16x16 | 6x6 | 12546us (0.06 GB/s) | 61us (13 GB/s) | 205x |
| 1536 | 16x16 | 7x7 | 11064us (0.07 GB/s) | 63us (13 GB/s) | 175x |
| 1536 | 16x16 | 14x14 | 4839us (0.26 GB/s) | 62us (20 GB/s) | 77x |
| 1536 | 16x16 | 16x16 | 2630us (0.55 GB/s) | 59us (24 GB/s) | 44x |
| 1536 | 32x32 | 6x6 | 38898us (0.07 GB/s) | 170us (17 GB/s) | 227x |
| 1536 | 32x32 | 7x7 | 34079us (0.09 GB/s) | 155us (19 GB/s) | 219x |
| 1536 | 32x32 | 14x14 | 12632us (0.27 GB/s) | 124us (28 GB/s) | 101x |
| 1536 | 32x32 | 16x16 | 6900us (0.53 GB/s) | 98us (37 GB/s) | 70x |
| 1536 | 48x48 | 6x6 | 77272us (0.08 GB/s) | 316us (21 GB/s) | 243x |
| 1536 | 48x48 | 7x7 | 70153us (0.09 GB/s) | 291us (23 GB/s) | 240x |
| 1536 | 48x48 | 14x14 | 21500us (0.33 GB/s) | 208us (34 GB/s) | 103x |
| 1536 | 48x48 | 16x16 | 14851us (0.49 GB/s) | 181us (40 GB/s) | 81x |
| 1536 | 64x64 | 6x6 | 152669us (0.07 GB/s) | 548us (21 GB/s) | 278x |
| 1536 | 64x64 | 7x7 | 110348us (0.1 GB/s) | 466us (25 GB/s) | 236x |
| 1536 | 64x64 | 14x14 | 33350us (0.36 GB/s) | 316us (38 GB/s) | 105x |
| 1536 | 64x64 | 16x16 | 24173us (0.51 GB/s) | 263us (47 GB/s) | 91x |
| 4096 | 16x16 | 6x6 | 34638us (0.06 GB/s) | 138us (16 GB/s) | 249x |
| 4096 | 16x16 | 7x7 | 31590us (0.07 GB/s) | 144us (16 GB/s) | 218x |
| 4096 | 16x16 | 14x14 | 13203us (0.26 GB/s) | 149us (23 GB/s) | 88x |
| 4096 | 16x16 | 16x16 | 7328us (0.53 GB/s) | 143us (27 GB/s) | 51x |
| 4096 | 32x32 | 6x6 | 103802us (0.07 GB/s) | 405us (19 GB/s) | 256x |
| 4096 | 32x32 | 7x7 | 91354us (0.08 GB/s) | 372us (22 GB/s) | 245x |
| 4096 | 32x32 | 14x14 | 34501us (0.26 GB/s) | 312us (29 GB/s) | 110x |
| 4096 | 32x32 | 16x16 | 18465us (0.52 GB/s) | 247us (39 GB/s) | 74x |
## Backward pass benchmarks

| batch size | input size | output size | before runtime (mem bandwidth) | after runtime (mem bandwidth) | speedup |
|------------|------------|-------------|------------------|-----------------|---------|
| 768 | 16x16 | 6x6 | 78656us (0.0 GB/s) | 323us (1 GB/s) | 243x |
| 768 | 16x16 | 7x7 | 67167us (0.0 GB/s) | 292us (1 GB/s) | 230x |
| 768 | 16x16 | 14x14 | 27478us (0.02 GB/s) | 229us (2 GB/s) | 119x |
| 768 | 16x16 | 16x16 | 131us (5.59 GB/s) | 56us (13 GB/s) | 2x |
| 768 | 32x32 | 6x6 | 271752us (0.0 GB/s) | 888us (1 GB/s) | 305x |
| 768 | 32x32 | 7x7 | 224110us (0.0 GB/s) | 813us (1 GB/s) | 275x |
| 768 | 32x32 | 14x14 | 85365us (0.02 GB/s) | 450us (3 GB/s) | 189x |
| 768 | 32x32 | 16x16 | 67700us (0.02 GB/s) | 360us (5 GB/s) | 187x |
| 768 | 48x48 | 6x6 | 593709us (0.0 GB/s) | 1988us (1 GB/s) | 298x |
| 768 | 48x48 | 7x7 | 485566us (0.0 GB/s) | 1694us (1 GB/s) | 286x |
| 768 | 48x48 | 14x14 | 164059us (0.02 GB/s) | 897us (3 GB/s) | 182x |
| 768 | 48x48 | 16x16 | 134317us (0.02 GB/s) | 674us (5 GB/s) | 199x |
| 768 | 64x64 | 6x6 | 1026651us (0.0 GB/s) | 3360us (1 GB/s) | 305x |
| 768 | 64x64 | 7x7 | 770901us (0.0 GB/s) | 2584us (2 GB/s) | 298x |
| 768 | 64x64 | 14x14 | 277850us (0.02 GB/s) | 1556us (3 GB/s) | 178x |
| 768 | 64x64 | 16x16 | 236245us (0.02 GB/s) | 1144us (5 GB/s) | 206x |
| 1024 | 16x16 | 6x6 | 106638us (0.0 GB/s) | 341us (1 GB/s) | 312x |
| 1024 | 16x16 | 7x7 | 90886us (0.0 GB/s) | 314us (1 GB/s) | 288x |
| 1024 | 16x16 | 14x14 | 36572us (0.02 GB/s) | 292us (2 GB/s) | 124x |
| 1024 | 16x16 | 16x16 | 171us (5.69 GB/s) | 56us (17 GB/s) | 3x |
| 1024 | 32x32 | 6x6 | 356900us (0.0 GB/s) | 936us (2 GB/s) | 380x |
| 1024 | 32x32 | 7x7 | 299139us (0.0 GB/s) | 870us (2 GB/s) | 343x |
| 1024 | 32x32 | 14x14 | 113205us (0.02 GB/s) | 576us (4 GB/s) | 196x |
| 1024 | 32x32 | 16x16 | 90886us (0.02 GB/s) | 458us (5 GB/s) | 198x |
| 1024 | 48x48 | 6x6 | 786896us (0.0 GB/s) | 2127us (2 GB/s) | 369x |
| 1024 | 48x48 | 7x7 | 640515us (0.0 GB/s) | 1837us (2 GB/s) | 348x |
| 1024 | 48x48 | 14x14 | 218720us (0.02 GB/s) | 1152us (4 GB/s) | 189x |
| 1024 | 48x48 | 16x16 | 178827us (0.02 GB/s) | 863us (5 GB/s) | 207x |
| 1024 | 64x64 | 6x6 | 1379991us (0.0 GB/s) | 3589us (2 GB/s) | 384x |
| 1024 | 64x64 | 7x7 | 1047466us (0.0 GB/s) | 2774us (2 GB/s) | 377x |
| 1024 | 64x64 | 14x14 | 370139us (0.02 GB/s) | 1999us (4 GB/s) | 185x |
| 1024 | 64x64 | 16x16 | 316501us (0.02 GB/s) | 1470us (5 GB/s) | 215x |
| 1536 | 16x16 | 6x6 | 159057us (0.0 GB/s) | 477us (1 GB/s) | 332x |
| 1536 | 16x16 | 7x7 | 135578us (0.0 GB/s) | 441us (1 GB/s) | 306x |
| 1536 | 16x16 | 14x14 | 53002us (0.02 GB/s) | 400us (3 GB/s) | 132x |
| 1536 | 16x16 | 16x16 | 252us (5.79 GB/s) | 55us (26 GB/s) | 4x |
| 1536 | 32x32 | 6x6 | 545653us (0.0 GB/s) | 1323us (2 GB/s) | 412x |
| 1536 | 32x32 | 7x7 | 447491us (0.0 GB/s) | 1248us (2 GB/s) | 358x |
| 1536 | 32x32 | 14x14 | 173491us (0.02 GB/s) | 787us (4 GB/s) | 220x |
| 1536 | 32x32 | 16x16 | 136395us (0.02 GB/s) | 633us (5 GB/s) | 215x |
| 1536 | 48x48 | 6x6 | 1198639us (0.0 GB/s) | 3057us (2 GB/s) | 392x |
| 1536 | 48x48 | 7x7 | 985549us (0.0 GB/s) | 2645us (2 GB/s) | 372x |
| 1536 | 48x48 | 14x14 | 331419us (0.02 GB/s) | 1581us (4 GB/s) | 209x |
| 1536 | 48x48 | 16x16 | 270972us (0.02 GB/s) | 1186us (6 GB/s) | 228x |
| 1536 | 64x64 | 6x6 | 2094282us (0.0 GB/s) | 5214us (2 GB/s) | 401x |
| 1536 | 64x64 | 7x7 | 1593449us (0.0 GB/s) | 4086us (2 GB/s) | 389x |
| 1536 | 64x64 | 14x14 | 559244us (0.02 GB/s) | 2828us (4 GB/s) | 197x |
| 1536 | 64x64 | 16x16 | 469471us (0.02 GB/s) | 2057us (6 GB/s) | 228x |
| 4096 | 16x16 | 6x6 | 430494us (0.0 GB/s) | 1008us (2 GB/s) | 427x |
| 4096 | 16x16 | 7x7 | 360346us (0.0 GB/s) | 1015us (2 GB/s) | 354x |
| 4096 | 16x16 | 14x14 | 142868us (0.02 GB/s) | 988us (3 GB/s) | 144x |
| 4096 | 16x16 | 16x16 | 658us (5.93 GB/s) | 56us (69 GB/s) | 11x |
| 4096 | 32x32 | 6x6 | 1425928us (0.0 GB/s) | 2796us (2 GB/s) | 509x |
| 4096 | 32x32 | 7x7 | 1188862us (0.0 GB/s) | 2906us (2 GB/s) | 409x |
| 4096 | 32x32 | 14x14 | 464286us (0.02 GB/s) | 1965us (4 GB/s) | 236x |
| 4096 | 32x32 | 16x16 | 363903us (0.02 GB/s) | 1588us (6 GB/s) | 229x |
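
As a sanity check on the speedup column, the ratio can be recomputed from the before and after runtimes. This is a small sketch over three rows copied from the tables above; because the listed runtimes are rounded, the recomputed ratios only approximately match the reported values.

```python
def speedup(before_us: float, after_us: float) -> float:
    """Speedup is the ratio of the runtime before the change to the runtime after it."""
    return before_us / after_us


# (label, before runtime in us, after runtime in us, speedup reported in the table)
rows = [
    ("forward, batch 768, 16x16 -> 6x6", 5855, 38, 154),
    ("forward, batch 768, 64x64 -> 6x6", 76288, 310, 246),
    ("backward, batch 4096, 32x32 -> 6x6", 1425928, 2796, 509),
]

for label, before_us, after_us, reported in rows:
    print(f"{label}: {speedup(before_us, after_us):.1f}x (table: {reported}x)")
```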
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127082
Approved by: https://github.com/fmassa
2024-05-24 21:17:12 +00:00
b0871f9b33 [DSD] Add a test to verify FSDP lazy initialization case (#127069)
Summary:
Distributed state_dict should not error out because the `model.state_dict()` will trigger FSDP to initialize.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127069
Approved by: https://github.com/wz337
2024-05-24 21:09:11 +00:00
7394ec7123 [AOTI][refactor] Update DTYPE_TO_CPP mapping (#126915)
Summary: Use more consistent cpp int types in DTYPE_TO_CPP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126915
Approved by: https://github.com/chenyang78
2024-05-24 21:03:12 +00:00
800f461b2a [User-Written Triton] Handle the scf.for and scf.while case (#127065)
Summary:
This is the official fix of the issue, reported in https://fb.workplace.com/groups/1075192433118967/permalink/1427865377851669/

The root cause is that the MLIR mutation analysis doesn't find the mutated tensors, which made AOT autograd think there are no users of the Triton kernel and then remove it 😔

---

Triton IR: P1369315213
Wrong Analyze Graph: P1364305956
Right Analyze Graph: P1369324977

Test Plan:
buck2 run mode/opt scripts/liptds/domain_kernels:triton_dcpp_flash

unit tests

Differential Revision: D57606053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127065
Approved by: https://github.com/oulgen, https://github.com/chenyang78
2024-05-24 21:01:13 +00:00
dce29a8a87 Replaced same with assertEqual in two files (#126994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126994
Approved by: https://github.com/masnesral
2024-05-24 20:50:36 +00:00
c34f8c7f91 Revert "Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946)"
This reverts commit 5e69e11d098a2cfccc8a59377c431e9c71cab9a8.

Reverted https://github.com/pytorch/pytorch/pull/125946 on behalf of https://github.com/clee2000 due to sorry the Dr CI fix hasn't been merged yet and its still failing 5e69e11d09 https://github.com/pytorch/pytorch/actions/runs/9228887299/job/25393895252 ([comment](https://github.com/pytorch/pytorch/pull/125946#issuecomment-2130305958))
2024-05-24 20:26:07 +00:00
fdda9a22c3 Performance parity for 32-bit-precision in FP16 ARM matrix-vector kernel using FMLAL instruction (#127033)
Summary: I discovered this instruction by checking all the intrinsics on https://arm-software.github.io/acle/neon_intrinsics/advsimd.html .

Test Plan: Existing test coverage
benchmarked custom sizes with https://github.com/malfet/llm_experiments benchmarks/benchmark/torch_mm.py:

```
m=1024, n=1024, k=1
====================
trans_b  torch.float16   43.93 usec

Using FP16 accumulation
trans_b  torch.float16   43.76 usec
m=4100, n=4100, k=1
====================
trans_b  torch.float16  719.35 usec

Using FP16 accumulation
trans_b  torch.float16  719.33 usec
m=4104, n=4104, k=1
====================
trans_b  torch.float16  727.79 usec

Using FP16 accumulation
trans_b  torch.float16  702.72 usec
m=16384, n=16384, k=1
====================
trans_b  torch.float16 18465.11 usec

Using FP16 accumulation
trans_b  torch.float16 11435.28 usec
```

also checked the default sizes. Relevant output before:
```
mv_nt    torch.float16   13.05 usec
trans_b  torch.float16   13.69 usec

Using FP16 accumulation
mv_nt    torch.float16    8.65 usec
trans_b  torch.float16    9.24 usec
```

after:
```
mv_nt    torch.float16    8.66 usec
trans_b  torch.float16    8.85 usec

Using FP16 accumulation
mv_nt    torch.float16    8.52 usec
trans_b  torch.float16    8.60 usec
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127033
Approved by: https://github.com/malfet, https://github.com/Skylion007
ghstack dependencies: #126745, #126746, #126793, #126794, #126877, #127016
2024-05-24 19:47:50 +00:00
1d3aa08327 Cleanup: use c10::ForceUnroll and constexpr variables in ARM FP16 matrix-vector fast path (#127016)
Summary: Just straightforward code cleanup in this path.

Test Plan: Existing CI, double-checked benchmark_torch_mm didn't regress as per previous diffs in stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127016
Approved by: https://github.com/peterbell10
ghstack dependencies: #126745, #126746, #126793, #126794, #126877
2024-05-24 19:47:50 +00:00
cyy
67d52d7fcb [caffe2] Remove import_legacy.cpp (#126149)
I think they are for Caffe2 and should be deleted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126149
Approved by: https://github.com/r-barnes
2024-05-24 19:47:32 +00:00
5e69e11d09 Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946)
PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`:
* `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()`
* `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()`

CPU impls for these new ATen ops will be added in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946
Approved by: https://github.com/davidberard98
2024-05-24 19:16:29 +00:00
9d4731f952 [AOTI] Disable stack allocation for OSS (#125732)
Summary: Stack allocation is for certain small CPU models, but its coverage still needs improvement, so default to OFF for OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125732
Approved by: https://github.com/chenyang78
ghstack dependencies: #126720, #126801
2024-05-24 19:10:33 +00:00
72d30aa026 [AOTI] Fix an int array codegen issue (#126801)
Summary: fixes https://github.com/pytorch/pytorch/issues/126779. When an int array contains a symbolic expression, we can't declare it with constexpr.
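
A minimal sketch of the rule stated above, not the actual Inductor codegen: the helper `declare_int_array` is hypothetical, and symbolic expressions are modeled as plain strings. The point is only that `constexpr` is valid when every element is a compile-time integer, and that a runtime-valued element forces a non-`constexpr` declaration.

```python
def declare_int_array(name, elems):
    """Emit a C++ array declaration, using constexpr only for all-integer contents."""
    is_static = all(isinstance(e, int) for e in elems)
    # A symbolic element (a string here) is only known at run time, so constexpr is invalid.
    qualifier = "static constexpr" if is_static else "const"
    body = ", ".join(str(e) for e in elems)
    return f"{qualifier} int64_t {name}[] = {{{body}}};"


print(declare_int_array("sizes", [2, 3]))
print(declare_int_array("sizes", [2, "s0"]))
```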

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126801
Approved by: https://github.com/chenyang78
ghstack dependencies: #126720
2024-05-24 19:10:33 +00:00
71f1aebe1f [AOTI] Add more fallback ops (#126720)
Summary: These ops appear in either unit tests or TorchBench. Fixes https://github.com/pytorch/pytorch/issues/122050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126720
Approved by: https://github.com/chenyang78
2024-05-24 19:10:33 +00:00
f508cd6e00 Update assigntome job (#127027)
Updating for the new docathon

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127027
Approved by: https://github.com/kit1980
2024-05-24 19:04:51 +00:00
3cb16ebf08 [BE]: Update ruff to 0.4.5 (#126979)
Update ruff to 0.4.5 and address some false negatives found in the newer version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126979
Approved by: https://github.com/ezyang
2024-05-24 18:38:35 +00:00
4a09117d16 Introduce ProcessGroupCudaP2P (#122163)
## Context
This stack prototypes automatic micro-pipelining of `all-gather -> matmul` and `matmul -> reduce-scatter` via Inductor. The idea originates from the paper [Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3567955.3567959). The implementation and some key optimizations are heavily influenced by @lw's implementation in xformers.

The stack contains several components:
- `ProcessGroupCudaP2P` - a thin wrapper around `ProcessGroupNCCL`. It in addition maintains a P2P workspace that enables SM-free, one-sided P2P communication which is needed for optimal micro-pipelining.
- `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops.
- Post-grad fx pass that detects `all-gather -> matmul` and `matmul -> reduce-scatter` and replaces them with the fused dispatcher ops.
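
The detect-and-replace step in the last bullet can be illustrated with a toy sketch. It works on a flat list of op names rather than an fx graph, and `FUSIONS` and `fuse_adjacent` are hypothetical helpers, so it shows only the shape of the rewrite, not the real pass.

```python
# Adjacent (producer, consumer) pairs and the fused op that replaces each pair.
FUSIONS = {
    ("all_gather", "matmul"): "fused_all_gather_matmul",
    ("matmul", "reduce_scatter"): "fused_matmul_reduce_scatter",
}


def fuse_adjacent(ops):
    """Replace each adjacent pair listed in FUSIONS with its fused op, left to right."""
    out = []
    i = 0
    while i < len(ops):
        pair = tuple(ops[i:i + 2])
        if pair in FUSIONS:
            out.append(FUSIONS[pair])
            i += 2  # both ops of the pair are consumed
        else:
            out.append(ops[i])
            i += 1
    return out


ops = ["all_gather", "matmul", "gelu", "matmul", "reduce_scatter"]
fused = fuse_adjacent(ops)
print(fused)
```

A real graph pass would also have to check that the intermediate result has no other users before fusing, which a flat list cannot express.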

To enable the prototype feature:
- Set the distributed backend to `cuda_p2p`.
- Set `torch._inductor.config._micro_pipeline_tp` to `True`.

*NOTE: the prototype sets nothing in stone w.r.t. each component's design. The purpose is to have a performant baseline with a reasonable design on which each component can be further improved.*

## Benchmark
Setup:
- 8 x H100 (500W) + 3rd gen NVSwitch.
- Llama3 8B training w/ torchtitan.
- 8-way TP. Reduced the number of layers from 32 to 8 for benchmarking purpose.

Trace (baseline): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpjaz8zgx0
<img width="832" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4addba77-5abc-4d2e-93ea-f68078587fe1">

Trace (w/ micro pipelining): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpn073b4wn
<img width="963" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4f44e78d-8196-43ab-a1ea-27390f07e9d2">

## This PR
`ProcessGroupCudaP2P` is a thin wrapper around `ProcessGroupNCCL`. By default, it routes all collectives to the underlying `ProcessGroupNCCL`. In addition, `ProcessGroupCudaP2P` initializes a P2P workspace that allows direct GPU memory access among the members. The workspace can be used in Python to optimize intra-node communication patterns or to create custom intra-node collectives in CUDA.

`ProcessGroupCudaP2P` aims to bridge the gap where certain important patterns can be better optimized via fine-grained P2P memory access than with collectives in the latest version of NCCL. It is meant to complement NCCL rather than replacing it.
Usage:
```
    # Using ProcessGroupCudaP2P
    dist.init_process_group(backend="cuda_p2p", ...)

    # Using ProcessGroupCudaP2P while specifying ProcessGroupCudaP2P.Options
    pg_options = ProcessGroupCudaP2P.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Using ProcessGroupCudaP2P while specifying ProcessGroupNCCL.Options
    pg_options = ProcessGroupNCCL.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Using ProcessGroupCudaP2P while specifying both
    # ProcessGroupCudaP2P.Options and ProcessGroupNCCL.Options
    pg_options = ProcessGroupCudaP2P.Options()
    pg_options.nccl_options = ProcessGroupNCCL.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Down-casting the backend to access p2p buffers for cuda_p2p specific
    # optimizations
    if is_cuda_p2p_group(group):
        backend = get_cuda_p2p_backend(group)
        if required_p2p_buffer_size > backend.get_buffer_size():
            # fallback
        p2p_buffer = backend.get_p2p_buffer(...)
    else:
        # fallback
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122163
Approved by: https://github.com/wanchaol
2024-05-24 18:33:18 +00:00
01f04230cf [cond] support torch built in function as subgraph (#126909)
Fixes https://github.com/pytorch/pytorch/issues/126818.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126909
Approved by: https://github.com/zou3519
ghstack dependencies: #127026
2024-05-24 18:31:43 +00:00
2d6d2dbc0b [dynamo] make callable(nn_module) return True (#127026)
Before this PR, `callable(nn_module)` caused a graph break:
```python
class M(nn.Module):
    def forward(self, x):
        return x.sin()

def f(m):
    return callable(m)

res = torch.compile(f, fullgraph=True)(M())
```

```
Traceback (most recent call last):
  File "/data/users/yidi/pytorch/t.py", line 17, in <module>
    out = torch.compile(f, backend="eager", fullgraph=True)(M())
  File "/data/users/yidi/pytorch/torch/_dynamo/eval_frame.py", line 414, in _fn
    return fn(*args, **kwargs)
  File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 1077, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state, skip=1)
  File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 456, in _convert_frame_assert
    return _compile(
  File "/data/users/yidi/pytorch/torch/_utils_internal.py", line 74, in wrapper_function
    return function(*args, **kwargs)
  File "/home/yidi/.conda/envs/pytorch/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 799, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/data/users/yidi/pytorch/torch/_dynamo/utils.py", line 210, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 618, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/data/users/yidi/pytorch/torch/_dynamo/bytecode_transformation.py", line 1167, in transform_code_object
    transformations(instructions, code_options)
  File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 177, in _fn
    return fn(*args, **kwargs)
  File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 564, in transform
    tracer.run()
  File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 2244, in run
    super().run()
  File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 886, in run
    while self.step():
  File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 801, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 496, in wrapper
    return inner_fn(self, inst)
  File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 1255, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 739, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 948, in call_function
    return handler(tx, args, kwargs)
  File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 711, in <lambda>
    return lambda tx, args, kwargs: obj.call_function(
  File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 948, in call_function
    return handler(tx, args, kwargs)
  File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 835, in builtin_dipatch
    unimplemented(error_msg)
  File "/data/users/yidi/pytorch/torch/_dynamo/exc.py", line 216, in unimplemented
    raise Unsupported(msg)
torch._dynamo.exc.Unsupported: builtin: callable [<class 'torch._dynamo.variables.nn_module.NNModuleVariable'>] False
```
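
For reference, the eager behavior that the compiled function is expected to match: Python's built-in `callable()` returns `True` for any instance whose class defines `__call__`, which `nn.Module` does. A minimal sketch with a stand-in class (`ModuleLike` and `PlainObject` are hypothetical, used here so the snippet does not depend on torch):

```python
class ModuleLike:
    """Stand-in for nn.Module: defines __call__, which forwards to forward()."""

    def __call__(self, x):
        return self.forward(x)

    def forward(self, x):
        return x + 1


class PlainObject:
    """No __call__ defined, so instances are not callable."""


m = ModuleLike()
print(callable(m), callable(PlainObject()))
```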

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127026
Approved by: https://github.com/jansel
2024-05-24 18:31:43 +00:00
cyy
f2c6fddbe1 Remove unnecessary const_cast and other fixes (#127054)
Removes unnecessary const casts and copies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127054
Approved by: https://github.com/Skylion007
2024-05-24 18:05:06 +00:00
9117779b0a [FSDP2] Added test for N-way TP and 1-way FSDP with CPU offloading (#127024)
This PR shows that we can use FSDP solely for CPU offloading when composing with N-way TP. Each FSDP mesh is just 1 rank.

This was motivated by an ask on Slack :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127024
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #127004
2024-05-24 17:09:12 +00:00
87f79af24d Fix map_location for wrapper subclass and device tensors that go through numpy (#126728)
Fixes https://github.com/pytorch/pytorch/issues/124418

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126728
Approved by: https://github.com/albanD
2024-05-24 16:39:30 +00:00
4ff9113e3d [MPS] Add _weight_int8pack_mm tests (#127041)
Also extend the test to cover MV cases (where the A matrix is 1xM). Limit int8 op testing to 32x32 matrix sizes for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127041
Approved by: https://github.com/larryliu0820, https://github.com/manuelcandales
2024-05-24 16:08:06 +00:00
194950c0ca Default ThreadPool size to number of physical cores (#125963)
TODO: Some benchmarks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125963
Approved by: https://github.com/janeyx99, https://github.com/Skylion007, https://github.com/gajjanag, https://github.com/jgong5
2024-05-24 16:06:48 +00:00
5ae9daa4a2 Revert "[AOTI] support freezing for MKLDNN (#124350)"
This reverts commit 654afb6f3ae3ddbd926a753f9af95a6f6e22131c.

Reverted https://github.com/pytorch/pytorch/pull/124350 on behalf of https://github.com/clee2000 due to Seems to have broken inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_freezing_non_abi_compatible_cpu 654afb6f3a https://github.com/pytorch/pytorch/actions/runs/9224838183/job/25382780192 ([comment](https://github.com/pytorch/pytorch/pull/124350#issuecomment-2129889809))
2024-05-24 16:03:07 +00:00
2ac739cc80 [DOCS] Fixed KLDiv example (#126857)
Small import fix to make the example run
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126857
Approved by: https://github.com/albanD
2024-05-24 15:39:50 +00:00
4105f91cfc [inductor] fix an assertion for node debug str (#127021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127021
Approved by: https://github.com/aorenste
2024-05-24 13:37:05 +00:00
654afb6f3a [AOTI] support freezing for MKLDNN (#124350)
## Description
Fixes https://github.com/pytorch/pytorch/issues/114450. This PR builds upon the work from @imzhuhl done in https://github.com/pytorch/pytorch/pull/114451.

This PR requires https://github.com/pytorch/pytorch/pull/122472 to land first.

We leverage the serialization and deserialization API from oneDNN v3.4.1 to save the opaque MKLDNN tensor during the compilation and restore the opaque tensor when loading the compiled .so.
ideep version is updated so that we won't break any pipeline even if third_party/ideep is not updated at the same time.

### Test plan:
```sh
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_conv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_deconv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_linear_freezing_non_abi_compatible_cpu
```

### TODOs in follow-up PRs
1. We found that using `AOTI_TORCH_CHECK` causes a performance drop on several models (`DistillGPT2`, `MBartForConditionalGeneration`, `T5ForConditionalGeneration`, `T5Small`) compared with JIT Inductor, which uses `TORCH_CHECK`. How to address this may need further discussion (`AOTI_TORCH_CHECK` was introduced in https://github.com/pytorch/pytorch/pull/119220).
2. Freezing in non-ABI compatible mode works with the support in this PR. For ABI compatible mode, we first need to address this issue: `AssertionError: None, i.e. optional output is not supported`.
6c4f43f826/torch/_inductor/codegen/cpp_wrapper_cpu.py (L2023-L2024)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124350
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-05-24 13:34:04 +00:00
43baabe9b9 [inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545)
As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows:
1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling.
2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops.
3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular, for the reuses for output buffers of GEMM, template and epilogues. This is not correct since the output buffer is an "output" not an "in-place" buffer of the template kernel itself. Now, we use a dedicated "aliases" dict to manage such buffer reuses and the intermediate aliasing buffers are removed after codegen.
4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019, #126068
2024-05-24 12:29:06 +00:00
4aa43d11f3 [inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)
As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR.
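
A minimal pure-Python sketch of why a low-precision GEMM might keep its accumulator in fp32. It is not the template's code: `to_fp16`, `to_fp32` and both dot-product helpers are hypothetical names, and fp16/fp32 are emulated by rounding through the `struct` module after each step.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_fp16_accumulate(a, b):
    """Dot product where every intermediate is rounded back to fp16."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(to_fp16(x) * to_fp16(y))
        acc = to_fp16(acc + prod)
    return acc


def dot_fp32_accumulate(a, b):
    """Dot product with fp16 inputs, fp32 accumulation, one final cast to fp16."""
    acc = 0.0
    for x, y in zip(a, b):
        # Inputs are fp16 values widened to fp32 before the multiply-add.
        acc = to_fp32(acc + to_fp32(to_fp16(x)) * to_fp32(to_fp16(y)))
    return to_fp16(acc)


ones = [1.0] * 4096
print(dot_fp16_accumulate(ones, ones))  # stalls once the fp16 spacing exceeds 1
print(dot_fp32_accumulate(ones, ones))
```

In this toy, the fp16 accumulator stops growing at 2048 because the spacing between adjacent fp16 values there is 2, while the fp32 accumulator reaches 4096 and casts back exactly.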

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019
2024-05-24 12:24:35 +00:00
56c412d906 [inductor][cpp] epilogue support for gemm template (#126019)
As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result.
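
A toy illustration of applying an epilogue to a sub-slice of the output, assuming the epilogue can be expressed as a function of global indices and the accumulated value. `store_tile_with_epilogue` is a hypothetical helper, not the template kernel's `store_output`; it only shows why a tile needs its row and column offsets when the epilogue indexes other buffers such as a bias.

```python
def store_tile_with_epilogue(c, tile, row0, col0, epilogue):
    """Write `tile` into `c` at (row0, col0), applying `epilogue` per element.

    The epilogue receives global indices, so the tile-local indices are
    shifted by the tile's offsets before the call.
    """
    for i, tile_row in enumerate(tile):
        for j, acc in enumerate(tile_row):
            c[row0 + i][col0 + j] = epilogue(row0 + i, col0 + j, acc)


# Example epilogue: bias add over columns followed by ReLU.
bias = [0.0, 0.0, -5.0, 1.0]


def bias_relu(row, col, acc):
    return max(0.0, acc + bias[col])


c = [[0.0] * 4 for _ in range(2)]
store_tile_with_epilogue(c, [[-1.0, 2.0]], row0=1, col0=2, epilogue=bias_relu)
print(c)
```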

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
ghstack dependencies: #124021
2024-05-24 12:14:12 +00:00
dd64ca2a02 Inductor respects strides for custom ops by default (#126986)
Previously, the default was that Inductor did not respect strides for
all (builtin and custom) ops unless the op has a
"needs_fixed_stride_order" tag on it. This PR changes it so that:

- inductor doesn't respect strides for builtin ops. To change the
  behavior, one can add the "needs_fixed_stride_order" tag
- inductor does respect strides for custom ops. To change the behavior,
  one can add the "does_not_need_fixed_stride_order" tag
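
The new default can be summarised as a small decision function. This is a sketch of the rule as described above, not Inductor's implementation; `respects_strides` is a hypothetical name, and the precedence when both tags are present is an assumption.

```python
NEEDS_FIXED = "needs_fixed_stride_order"
DOES_NOT_NEED_FIXED = "does_not_need_fixed_stride_order"


def respects_strides(is_custom_op: bool, tags: frozenset = frozenset()) -> bool:
    """Return True if strides are respected for an op under the rule above."""
    # Assumption: an explicit "needs" tag wins if both tags are present.
    if NEEDS_FIXED in tags:
        return True
    if DOES_NOT_NEED_FIXED in tags:
        return False
    # No explicit tag: custom ops respect strides, builtin ops do not.
    return is_custom_op


print(respects_strides(is_custom_op=True))
print(respects_strides(is_custom_op=False))
```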

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126986
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-05-24 11:11:18 +00:00
f14cdc570d Fix to #126656 (#127050)
Fix failure from fbcode - in the case of a foreach node the fake `group` needs to be hashable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127050
Approved by: https://github.com/DanilBaibak
ghstack dependencies: #126656
2024-05-24 10:56:53 +00:00
47c976b904 Revert "[AOTI] Add more fallback ops (#126720)"
This reverts commit 19cd4484ec8449b8c5ebf46be1f8f2fcbace8c6c.

Reverted https://github.com/pytorch/pytorch/pull/126720 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126720#issuecomment-2129011751))
2024-05-24 09:07:07 +00:00
f749c5def8 Revert "[AOTI] Fix an int array codegen issue (#126801)"
This reverts commit ff617ab6c8f6f67ae912fbcd45a913a89e19effb.

Reverted https://github.com/pytorch/pytorch/pull/126801 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126720#issuecomment-2129011751))
2024-05-24 09:07:07 +00:00
fd9cdeed19 Revert "[AOTI] Disable stack allocation for OSS (#125732)"
This reverts commit 599e684ad6f34dd069eff8611f45e25b7695a339.

Reverted https://github.com/pytorch/pytorch/pull/125732 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126720#issuecomment-2129011751))
2024-05-24 09:07:07 +00:00
f95dbc1276 Remove more of caffe2 (#126705)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126705
Approved by: https://github.com/malfet
2024-05-24 06:53:08 +00:00
0d1e228550 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Similar template abstractions as the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`. The MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements in follow-up PRs, including optimizations in kernels as well as fusions. The perf gains are only observed on a select number of models compared to the ATen kernels, which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details.

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x

Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-24 06:26:33 +00:00
505b8ceaa2 Double registers per iteration in FP32-arithmetic FP16 ARM gemv kernel (#126877)
Summary: I found that doubling this significantly improved performance, but doubling again did not, so I stopped here.

Test Plan: CI
Benchmarked with llm_experiments repo as previously in stack; relevant data:

before:
trans_b torch.float16 1396.11 usec (4100)
trans_b torch.float16 1399.54 usec (4104)

after:
trans_b  torch.float16 1096.00 usec (4100)
trans_b  torch.float16 1093.47 usec (4104)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126877
Approved by: https://github.com/malfet
ghstack dependencies: #126745, #126746, #126793, #126794
2024-05-24 05:57:09 +00:00
e8fa0f10c5 Quadruple registers per iteration in ARM64 FP16 kernel (#126794)
The machine has plenty of registers we weren't using. This looks like it might improve performance a couple percent, though there is noise so I'm not certain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126794
Approved by: https://github.com/malfet
ghstack dependencies: #126745, #126746, #126793
2024-05-24 05:57:09 +00:00
f6366454db Add privateuse1 in FSDP's sharded grad scaler (#126971)
1. add privateuse1 in FSDP's sharded grad scaler
2. support found_inf copy for more devices

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126971
Approved by: https://github.com/awgu, https://github.com/weifengpy
2024-05-24 05:54:25 +00:00
2f6954c7c3 Update the modification api (#127035)
# Summary
Updates the modification jinja template's API so that it can specify the output_name for the fixed buffer. Also updates flex-attention's usage to make the algorithm clearer and more closely aligned with the vmap impl.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127035
Approved by: https://github.com/Chillee
2024-05-24 04:45:34 +00:00
894efcd0e9 [DTensor] Supported simple replicate strategy for SVD (#127004)
This PR adds a simple strategy to always replicate for `torch.linalg.svd()`. This is to help unblock some GaLore exploration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127004
Approved by: https://github.com/wanchaol
2024-05-24 04:34:43 +00:00
70dc59c55f Fix perf regression caused by #122074 (#126996)
The original change was about 9.5% slower than before #122074.
This improves it to be only about 1.4% slower.

Also touched up some unrelated nits that the linter complained about.

Fixes #126293

Ran torchbench 3 times on each change. Perf values before (stable), after (fix),
and with #122074 backed out (backout):
```
../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_isoneutral_mixing amp first dynamic cpp
stable:
43.948x
45.754x
44.906x

fix:
47.505x
49.987x
47.493x

backout:
48.243x
48.199x
48.192x

../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_equation_of_state amp first static default
stable:
15.224x
13.286x
15.354x

fix:
16.402x
16.370x
16.183x

backout:
16.554x
16.675x
16.787x

../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench lennard_jones float32 first static default
stable:
1.712x
1.651x
1.640x

fix:
1.804x
1.798x
1.792x

backout:
1.864x
1.824x
1.836x
```
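
The message does not say how the 9.5% and 1.4% figures were derived. One plausible reading, sketched below, is to average the three runs per benchmark, compare each variant against the backout, and then average the per-benchmark slowdowns. The numbers are copied from the results above; treating them as speedups where higher is better is an assumption.

```python
from statistics import mean

# Speedups copied from the benchmark output above (three runs each).
runs = {
    "pyhpc_isoneutral_mixing": {
        "stable": [43.948, 45.754, 44.906],
        "fix": [47.505, 49.987, 47.493],
        "backout": [48.243, 48.199, 48.192],
    },
    "pyhpc_equation_of_state": {
        "stable": [15.224, 13.286, 15.354],
        "fix": [16.402, 16.370, 16.183],
        "backout": [16.554, 16.675, 16.787],
    },
    "lennard_jones": {
        "stable": [1.712, 1.651, 1.640],
        "fix": [1.804, 1.798, 1.792],
        "backout": [1.864, 1.824, 1.836],
    },
}


def slowdown_pct(variant, baseline="backout"):
    """Mean over benchmarks of the percentage slowdown relative to the baseline."""
    per_bench = [
        (1.0 - mean(r[variant]) / mean(r[baseline])) * 100.0
        for r in runs.values()
    ]
    return mean(per_bench)


stable_slowdown = slowdown_pct("stable")
fix_slowdown = slowdown_pct("fix")
print(f"stable vs backout: {stable_slowdown:.1f}% slower")
print(f"fix vs backout:    {fix_slowdown:.1f}% slower")
```

Under these assumptions the results land close to the quoted figures.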

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126996
Approved by: https://github.com/jansel
2024-05-24 04:27:22 +00:00
cb6ef68caa Propagate tokens in aotautograd (#127028)
Test Plan: `buck run mode/dev-nosan //aimp/experimental/pt2:pt2_export -- --model-entity-id 938593492 --output /tmp/938593492.zip --use-torchrec-eager-mp --use-manifold`

Differential Revision: D57750072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127028
Approved by: https://github.com/tugsbayasgalan
2024-05-24 03:23:17 +00:00
99a11efc8a Revert "Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946)"
This reverts commit e2f081837f4276c1a6a37739bd28157f62004a06.

Reverted https://github.com/pytorch/pytorch/pull/125946 on behalf of https://github.com/clee2000 due to I think dr ci is wrong and the windows build failure is real e2f081837f https://github.com/pytorch/pytorch/actions/runs/9216826622/job/25357819877 ([comment](https://github.com/pytorch/pytorch/pull/125946#issuecomment-2128388126))
2024-05-24 02:37:46 +00:00
cfb374dc73 [BE] Create grad check util (#126991)
# Summary
Add a small utility func for deciding if we should compute LSE, and update it to also check for gradMode
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126991
Approved by: https://github.com/cpuhrsch
2024-05-24 02:36:00 +00:00
27594be3ed [dtensor][be] remove repeated test in test_comm_mode.py (#127029)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127029
Approved by: https://github.com/XilunWu
ghstack dependencies: #127025
2024-05-24 01:42:13 +00:00
89c638f9a5 [dtensor][debug] add all_reduce_coalesced tracing to CommDebugMode (#127025)
**Summary**
Added all_reduce_coalesced tracing to CommDebugMode and added test case to test_comm_mode test suite.

**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127025
Approved by: https://github.com/XilunWu
2024-05-24 01:42:13 +00:00
575cb617db Add compile time profiler for non fbcode targets (#126904)
This is a tool that allows profiling compile time using the strobelight profiler. It is a Meta-only tool,
but it works on non-fbcode targets.

A follow-up diff will unify this with caffe2/fb/strobelight/compile_time_profiler.py.
Example test:

```
run  python tools/strobelight/examples/compile_time_profile_example.py
```

```
python torch/utils/_strobelight/examples/compile_time_profile_example.py
strobelight_compile_time_profiler, line 61, 2024-05-23 10:49:28,101, INFO: compile time strobelight profiling enabled
strobelight_compile_time_profiler, line 93, 2024-05-23 10:49:28,102, INFO: Unique sample tag for this run is: 2024-05-23-10:49:282334638devvm4561.ash0.facebook.com
strobelight_compile_time_profiler, line 94, 2024-05-23 10:49:28,102, INFO: You can use the following link to access the strobelight profile at the end of the run: https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22purposes%22%3A[]%2C%22end%22%3A%22now%22%2C%22start%22%3A%22-30%20days%22%2C%22filterMode%22%3A%22DEFAULT%22%2C%22modifiers%22%3A[]%2C%22sampleCols%22%3A[]%2C%22cols%22%3A[%22namespace_id%22%2C%22namespace_process_id%22]%2C%22derivedCols%22%3A[]%2C%22mappedCols%22%3A[]%2C%22enumCols%22%3A[]%2C%22return_remainder%22%3Afalse%2C%22should_pivot%22%3Afalse%2C%22is_timeseries%22%3Afalse%2C%22hideEmptyColumns%22%3Afalse%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22compare%22%3A%22none%22%2C%22samplingRatio%22%3A%221%22%2C%22metric%22%3A%22count%22%2C%22aggregation_field%22%3A%22async_stack_complete%22%2C%22top%22%3A10000%2C%22aggregateList%22%3A[]%2C%22param_dimensions%22%3A[%7B%22dim%22%3A%22py_async_stack%22%2C%22op%22%3A%22edge%22%2C%22param%22%3A%220%22%2C%22anchor%22%3A%220%22%7D]%2C%22order%22%3A%22weight%22%2C%22order_desc%22%3Atrue%2C%22constraints%22%3A[[%7B%22column%22%3A%22sample_tags%22%2C%22op%22%3A%22all%22%2C%22value%22%3A[%22[%5C%222024-05-23-10:49:282334638devvm4561.ash0.facebook.com%5C%22]%22]%7D]]%2C%22c_constraints%22%3A[[]]%2C%22b_constraints%22%3A[[]]%2C%22ignoreGroupByInComparison%22%3Afalse%7D&view=GraphProfilerView&&normalized=1712358002&pool=uber
strobelight_function_profiler, line 241, 2024-05-23 10:49:34,943, INFO: strobelight run id is: 3507039740348330
strobelight_function_profiler, line 243, 2024-05-23 10:50:00,907, INFO: strobelight profiling running
strobelight_function_profiler, line 224, 2024-05-23 10:50:02,741, INFO: strobelight profiling stopped
strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: Total samples: 7
strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/75cxdro3
strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/qsgydsee
strobelight_compile_time_profiler, line 120, 2024-05-23 10:50:06,174, INFO: 1 strobelight success runs out of 1 non-recursive compilation events.
strobelight_function_profiler, line 241, 2024-05-23 10:50:08,137, INFO: strobelight run id is: 8721740011604497
strobelight_function_profiler, line 243, 2024-05-23 10:50:34,801, INFO: strobelight profiling running
strobelight_function_profiler, line 224, 2024-05-23 10:50:36,803, INFO: strobelight profiling stopped
strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: Total samples: 3
strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/qmi2ucwp
strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/7fjkhs9i
strobelight_compile_time_profiler, line 120, 2024-05-23 10:50:41,289, INFO: 2 strobelight success runs out of 2 non-recursive compilation events.
strobelight_function_profiler, line 241, 2024-05-23 10:50:43,597, INFO: strobelight run id is: 1932476082259558
strobelight_function_profiler, line 243, 2024-05-23 10:51:09,791, INFO: strobelight profiling running
strobelight_function_profiler, line 224, 2024-05-23 10:51:11,883, INFO: strobelight profiling stopped
strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: Total samples: 3
strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/vy1ujxec
strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/2xgadviv
strobelight_compile_time_profiler, line 120, 2024-05-23 10:51:16,219, INFO: 3 strobelight success runs out of 3 non-recursive compilation events.
```

Alternatively, set TORCH_COMPILE_STROBELIGHT=TRUE for any torch.compile Python program,
e.g. running on XLNetLMHeadModel:
```
 TORCH_COMPILE_STROBELIGHT=TRUE TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 time python benchmarks/dynamo/huggingface.py --ci --accuracy --timing --explain --inductor --device cuda --training --amp  --only XLNetLMHeadModel
 ```
 result:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126904
Approved by: https://github.com/aorenste
ghstack dependencies: #126693
2024-05-24 01:39:40 +00:00
e2f081837f Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946)
PyTorch can't depend on `fbgemm_gpu` because `fbgemm_gpu` already depends on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`:
* `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()`
* `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()`

CPU impls for these new ATen ops will be added in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946
Approved by: https://github.com/davidberard98
2024-05-24 00:42:59 +00:00
3f5b59eef4 [codemod] c10::optional -> std::optional in caffe2/aten/src/ATen/DeviceGuard.h +117 (#126901)
Summary:
Generated with
```
fbgs -f '.*\.(cpp|cxx|cc|h|hpp|cu|cuh)$' c10::optional -l | perl -pe 's/^fbsource.fbcode.//' | grep -v executorch | xargs -n 50 perl -pi -e 's/c10::optional/std::optional/g'
```

 - If you approve of this diff, please use the "Accept & Ship" button :-)

(117 files modified.)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126901
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-05-24 00:26:15 +00:00
cyy
95e5c994f9 [Submodule] Clear USE_QNNPACK build option (#126941)
Following the removal of the QNNPACK third-party module in #126657, we can clear more build system code. third_party/neon2sse was also removed because it is not used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126941
Approved by: https://github.com/ezyang
2024-05-24 00:12:56 +00:00
dfabae5b89 Revert "[pipelining] Add grad test for interleaved schedules (#126931)"
This reverts commit abf6d4e6bc1a9a0e08bfc2204560ca7858fa90cd.

Reverted https://github.com/pytorch/pytorch/pull/126931 on behalf of https://github.com/clee2000 due to newly added test fails distributed/pipelining/test_schedule.py::ScheduleTest::test_grad_with_manual_interleaved_ScheduleClass0 abf6d4e6bc https://github.com/pytorch/pytorch/actions/runs/9214413308/job/25352507591, pull workflow failed on startup on PR, so no distributed tests ran at all ([comment](https://github.com/pytorch/pytorch/pull/126931#issuecomment-2128228496))
2024-05-23 23:51:29 +00:00
2db13633e7 [export] disable forced specializations, even when solvable with single var (#126925)
Summary:
Previously https://github.com/pytorch/pytorch/pull/124949 added the ability to disable forced specializations on dynamic shapes for export, keeping dynamism for complex guards instead of specializing, allowing unsoundness by having the user fail at runtime.

It avoided disabling one case: single-variable equality guards, where a variable is specified as dynamic but can be solved for a concrete value, suggesting the correct behavior is specialization. For example, the guard Eq(s0 // 4, 400) suggests s0 should specialize to 1600.
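
A brute-force sketch of what "solvable for a concrete value" means for a single-variable guard. This is not the export solver; `solve_single_var` and the search range are hypothetical. Note that with plain floor division the example guard admits several values, so pinning s0 to exactly 1600 additionally assumes s0 is divisible by 4 (an assumption here, since the message does not state it).

```python
def solve_single_var(guard, lo=1, hi=10_000):
    """Return every integer in [lo, hi] that satisfies the guard."""
    return [s for s in range(lo, hi + 1) if guard(s)]


# Guard from the example, with Python floor division.
floor_only = solve_single_var(lambda s0: s0 // 4 == 400)

# Same guard plus an assumed divisibility constraint on s0.
with_divisibility = solve_single_var(lambda s0: s0 % 4 == 0 and s0 // 4 == 400)

print(floor_only)
print(with_divisibility)
```

When the solution set has exactly one element, specializing to it is sound; keeping the variable dynamic instead defers the check to runtime.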

In debugging, some users (e.g. APS) would like to keep this dynamic, and defer to failing at runtime instead. This PR adds this, so now all forced specializations should be turned off. Mostly this should be used for debugging, since it produces unsoundness, and lets the user proceed with (probably) incorrect dynamism.

Test Plan: export tests

Differential Revision: D57698601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126925
Approved by: https://github.com/angelayi
2024-05-23 23:43:30 +00:00
6eac3f45c7 Add basic sanity checks for graph ops to cache key (#124745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124745
Approved by: https://github.com/bdhirsh
2024-05-23 23:37:43 +00:00
ff82e2e7cf [traced-graph][sparse] propagate sparsity metadata into traced graph (#117907)
Propagate sparsity metadata from sparse tensors of torch.sparse into the traced graph representation (which would be useful for a JIT backend that supports a "sparse compiler"). This is a first careful attempt, since the actual "meta" feature still seems incomplete for coo and completely lacking for csr/csc/bsr/bsc.

For background see forum postings (with examples):
  https://discuss.pytorch.org/t/connecting-pytorch-sparse-tensors-with-mlir/195145
  https://dev-discuss.pytorch.org/t/connecting-pytorch-sparse-tensors-with-mlir/1803

And feature request:
  https://github.com/pytorch/pytorch/issues/117188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117907
Approved by: https://github.com/pearu, https://github.com/ezyang
2024-05-23 22:46:46 +00:00
93ba5e7291 Fix typo for input (#126981)
The variable name should be `cloned_inputs` rather than `clone_inputs`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126981
Approved by: https://github.com/xuzhao9
2024-05-23 22:08:14 +00:00
d11e44c0d0 Reset grad state across unittests (#126345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126345
Approved by: https://github.com/ezyang
2024-05-23 21:16:39 +00:00
a31a60d85b Change run_test.py arg parsing to handle additional args better (#126709)
Do not inherit parser from common_utils
* I don't think we use any variables in run_test that depend on those, and I think all tests except doctests run in a subprocess so they will parse the args in common_utils and set the variables.  I don't think doctests wants any of those variables?

Parse known args, add the extra args as extra, pass the extra ones along to the subprocess
Removes the first instance of `--`
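
A minimal sketch of this parsing pattern using `argparse.parse_known_args`. The option names and helper functions are illustrative only and are not copied from run_test.py.

```python
import argparse


def parse_runner_args(argv):
    """Split argv into options the runner understands and extras for the subprocess."""
    parser = argparse.ArgumentParser(description="toy test runner")
    # Illustrative options only; a real runner defines many more.
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--include", default=None)
    args, extra = parser.parse_known_args(argv)
    # Drop only the first "--" separator, if one is present in the extras.
    if "--" in extra:
        extra.remove("--")
    return args, extra


def build_subprocess_command(test_file, extra):
    """Forward the unparsed arguments to the test subprocess unchanged."""
    return ["python", test_file, *extra]


args, extra = parse_runner_args(
    ["--verbose", "--include", "test_nn", "--", "--flagx", "-k", "foo"]
)
print(args.verbose, args.include, extra)
print(build_subprocess_command("test_nn.py", extra))
```

The trade-off noted above follows directly: because unknown arguments are forwarded rather than rejected, the runner no longer reports a misspelled flag.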

I think I will miss run_test telling me if an arg is valid or not

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126709
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/Flamefire
2024-05-23 21:08:12 +00:00
09a73da190 Downgrade requests to 2.31.0 for ios and android (#126989)
Ex https://github.com/pytorch/pytorch/actions/runs/9211850483/job/25342181353
https://github.com/pytorch/pytorch/actions/runs/9211850483/job/25342182105

2.32.0 isn't on the conda channels yet?

Is there a way to add them?

If not, here's a PR to downgrade
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126989
Approved by: https://github.com/atalman, https://github.com/malfet
2024-05-23 21:02:50 +00:00
0d2ac9782b [FSDP1] Update docstring to include device_mesh arg (#126589)
Fixes #126548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126589
Approved by: https://github.com/wanchaol
2024-05-23 20:40:48 +00:00
0902929d58 [CUDA] [CI]: Enable CUDA 12.4 CI (#121956)
Reference PR: https://github.com/pytorch/pytorch/pull/93406

Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121956
Approved by: https://github.com/atalman
2024-05-23 20:37:47 +00:00
abf6d4e6bc [pipelining] Add grad test for interleaved schedules (#126931)
Added `test_grad_with_manual_interleaved`:
- Model: `MultiMLP`
- Tested schedules: Interleaved1F1B, LoopedBFS
- Two stages per rank
```
Rank 0 stages: [0, 2]
Rank 1 stages: [1, 3]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126931
Approved by: https://github.com/wconstab
ghstack dependencies: #126812, #126721, #126735, #126927
2024-05-23 20:26:08 +00:00
c46b38bc75 [pipelining] Generalize definition of MultiMLP for testing interleaved schedules (#126927)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126927
Approved by: https://github.com/wconstab
ghstack dependencies: #126812, #126721, #126735
2024-05-23 20:26:08 +00:00
6b39146b3f [pipelining] Validate stage input/output shape/dtype (#126732)
Address the classes of user errors stemming from (possibly)
unintentional dynamic shapes usage or mismatch of configuration time and
run time data shapes/dtypes.

The goal is to ensure a clear error is raised rather than relying on some underlying
error to bubble up when a tensor shape is not compatible, or worse,
having a silent correctness issue.

**Classes of shape/dtype errors**
* (a) error is thrown within the stage-module forward code, but may be
hard to understand/trace back to an input issue
* (b) silent correctness issue happens inside the stage-module forward,
but the correct output shape is still produced
* (c) the stage-module produces an output that is locally correct, but not
matching the expectation of the following stage, leading to a hang or
correctness issue down the line

**How validation helps**

Input shape validation
- improves debuggability of case (a)
- guards against case (b)
- only needed on first stage, since subsequent stages use pre-allocated recv
  buffers that can't change shape/size even if they wanted to

Output shape validation
- guards against case (c)

Validation of first stage input and all stages' outputs inductively verifies all shapes

Shape/dtype are most critical as they literally affect the number of
bytes on the wire.  Strides and other tensor properties may also (?)
matter, and the validation function can be adjusted accordingly if needed.
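
A torch-free sketch of the kind of check described above, assuming each tensor is summarised by its shape and dtype. `TensorMeta`, `StageValidationError` and `validate_tensors` are hypothetical names for illustration, not the pipelining API.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class TensorMeta:
    """Shape and dtype recorded at configuration time or observed at run time."""
    shape: Tuple[int, ...]
    dtype: str


class StageValidationError(ValueError):
    """Raised when runtime tensors do not match what the stage was configured for."""


def validate_tensors(desc: str, expected: Sequence[TensorMeta], actual: Sequence[TensorMeta]) -> None:
    """Raise a clear error on the first count, shape, or dtype mismatch."""
    if len(expected) != len(actual):
        raise StageValidationError(
            f"{desc}: expected {len(expected)} tensors, got {len(actual)}"
        )
    for i, (exp, act) in enumerate(zip(expected, actual)):
        if exp.shape != act.shape:
            raise StageValidationError(
                f"{desc}[{i}]: expected shape {exp.shape}, got {act.shape}"
            )
        if exp.dtype != act.dtype:
            raise StageValidationError(
                f"{desc}[{i}]: expected dtype {exp.dtype}, got {act.dtype}"
            )


configured = [TensorMeta((4, 8), "float32")]
validate_tensors("stage 0 input", configured, [TensorMeta((4, 8), "float32")])  # passes
try:
    validate_tensors("stage 0 output", configured, [TensorMeta((2, 8), "float32")])
except StageValidationError as err:
    print(err)
```

Calling such a check on the first stage's inputs and on every stage's outputs matches the inductive argument above.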

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126732
Approved by: https://github.com/kwen2501
2024-05-23 20:16:06 +00:00
9b91c91e64 Don't add to replacements when guard is suppressed (#126210)
Also improve logging when guards are suppressed

Partially addresses https://github.com/pytorch/pytorch/issues/125641

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126210
Approved by: https://github.com/jbschlosser
2024-05-23 20:10:29 +00:00
f8857cef45 [Reland] Verify types in custom op schemas (#126861)
Summary:
co-dev reland of https://github.com/pytorch/pytorch/pull/124520, which requires
the removal of some executorch tests.

Before this PR, we didn't check that types in a schema were valid. This
is because TorchScript treats unknown types as type variables.

This PR checks types in a schema for the TORCH_LIBRARY APIs. To do this,
we add an `allow_typevars` flag to parseSchema so that TorchScript can
use allow_typevars=True. We also add some error messages for common
mistakes (e.g. using int64_t or double in schema).
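
A toy model of the check, to illustrate the role of the flag. The set of known types is heavily simplified, the function name is hypothetical, and the choice that `allow_typevars=True` accepts any unknown name (including the common mistakes) is an assumption, not the behaviour of the real parser.

```python
# Simplified stand-in for the schema type vocabulary (lists, optionals, etc. omitted).
KNOWN_TYPES = {"Tensor", "int", "float", "bool", "str", "Scalar", "SymInt"}

# C++ spellings that are easy to write by mistake, mapped to the schema spelling.
COMMON_MISTAKES = {"int64_t": "int", "double": "float"}


def check_schema_type(type_name: str, allow_typevars: bool = False) -> None:
    """Raise ValueError for a type name that is not valid in a schema."""
    if type_name in KNOWN_TYPES:
        return
    if allow_typevars:
        # Assumption: unknown names are treated as type variables, as before.
        return
    if type_name in COMMON_MISTAKES:
        raise ValueError(
            f"unknown type '{type_name}'; use '{COMMON_MISTAKES[type_name]}' instead"
        )
    raise ValueError(f"unknown type '{type_name}'")


check_schema_type("Tensor")
check_schema_type("T", allow_typevars=True)
try:
    check_schema_type("int64_t")
except ValueError as err:
    print(err)
```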

Test Plan: Wait for tests

Differential Revision: D57666659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126861
Approved by: https://github.com/albanD
2024-05-23 19:53:52 +00:00
c921c5cc77 [c10d] Print certain logs only on head rank of each node (#125432)
Recently we added the following warning, which is printed on every rank and makes the log a bit verbose.

This PR dedups certain logs that are identical across ranks and prints them only on the head rank of each node.
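The idea can be sketched in Python as below. This is an illustration only: the actual change is in the C++ c10d code, and the sketch assumes the head rank of a node is the process whose `LOCAL_RANK` environment variable (as set by `torchrun`) is 0. The helper names are hypothetical.

```python
import os


def is_node_head_rank(env=None) -> bool:
    """Assumption: the head rank of a node is the process with LOCAL_RANK == 0."""
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", "0")) == 0


def log_once_per_node(message: str, env=None) -> bool:
    """Print the message only on the node's head rank; return whether it was printed."""
    if is_node_head_rank(env):
        print(message)
        return True
    return False


# Simulate four ranks on one node: only local rank 0 prints the warning.
for local_rank in range(4):
    log_once_per_node("WARNING: process group has NOT been destroyed",
                      env={"LOCAL_RANK": str(local_rank)})
```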

Resolves https://github.com/pytorch/pytorch/issues/126275

=========================================

[rank0]:[W502 14:06:55.821964708 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4
[rank1]:[W502 14:06:57.994276972 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4
[rank2]:[W502 14:07:00.353013116 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4
[rank3]:[W502 14:07:02.515511670 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125432
Approved by: https://github.com/wconstab
2024-05-23 19:16:11 +00:00
0625f92993 [inductor] Run some tests on correct device (#126943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126943
Approved by: https://github.com/yanboliang
2024-05-23 18:47:44 +00:00
abf40320dd remove ax/ay arrays in fp16 ARM matmul kernels (#126793)
These shouldn't do anything as only two elements are live at once, so we can simplify the code. (I checked assembly for the inner loops in instruments and it seems to be the same.)

Differential Revision: [D57732738](https://our.internmc.facebook.com/intern/diff/D57732738)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126793
Approved by: https://github.com/malfet
ghstack dependencies: #126745, #126746
2024-05-23 18:42:45 +00:00
5dcf3d0f9e use arith-by-dot-products approach for fp32 accumulation in fp16 matmul (#126746)
Summary: The faster fp16-native kernel is gated off by default. Let's give people better performance in the default case.
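As background on why fp32 accumulation matters for fp16 inputs, here is a small pure-Python sketch. It is not the ARM kernel from this PR; it only emulates rounding to half and single precision with the standard `struct` module (`to_fp16`, `to_fp32`, and the two dot-product helpers are illustrative names).

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_fp16_accum(a, b) -> float:
    """Dot product where every partial sum is rounded back to fp16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x * y))
    return acc


def dot_fp32_accum(a, b) -> float:
    """Dot product of fp16 inputs where the accumulator is kept in fp32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp32(acc + x * y)
    return acc


ones = [1.0] * 4100
print(dot_fp16_accum(ones, ones))  # stalls once the sum can no longer absorb +1 in fp16
print(dot_fp32_accum(ones, ones))  # exact for this input
```

With an fp16 accumulator the sum stops growing at 2048, because 2049 is not representable in half precision and rounds back to 2048; the fp32 accumulator reaches 4100.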

Test Plan: CI
benchmarked matmul of size 4100x4100x1 and 4104x4104x1 using https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py (4100 % 32 = 4100 % 8 = 4). Relevant timing numbers without FP16 reduction (which then uses this kernel):

after:
trans_b  torch.float16 1396.11 usec (4100)
trans_b  torch.float16 1399.54 usec (4104)

before:
trans_b  torch.float16 1840.79 usec (4100)
trans_b  torch.float16 1786.67 usec (4104)

Differential Revision: [D57732736](https://our.internmc.facebook.com/intern/diff/D57732736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126746
Approved by: https://github.com/malfet
ghstack dependencies: #126745
2024-05-23 18:42:45 +00:00
fd4fd24080 add tail fixup for fp16 gemv transposed fast path (#126745)
Summary: We previously had restrictive gating for the fp16 kernel; now it supports arbitrary m & n.
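A tail fixup in general looks like the sketch below: full blocks go through the fast path, and the leftover elements are handled separately so that arbitrary sizes work. This is a plain-Python illustration, not the fp16 kernel itself; the block size of 32 and the function name are assumptions made for the example.

```python
def dot_with_tail(a, b, block: int = 32) -> float:
    """Dot product that processes full blocks first, then fixes up the tail."""
    n = len(a)
    main_end = n - n % block  # largest multiple of `block` not exceeding n
    total = 0.0
    # Main loop: stand-in for the vectorized path over full blocks.
    for start in range(0, main_end, block):
        total += sum(x * y for x, y in zip(a[start:start + block],
                                           b[start:start + block]))
    # Tail fixup: the remaining n % block elements, handled one at a time.
    for i in range(main_end, n):
        total += a[i] * b[i]
    return total


a = [1.0] * 4100
b = [2.0] * 4100
print(4100 % 32)            # size of the tail for n = 4100
print(dot_with_tail(a, b))
```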

Test Plan: 1) ran test coverage added in  #126700, passes
2) benchmarked matmul of size 4100x4100x1 and 4104x4104x1 using https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py (4100 % 32 = 4100 % 8 = 4). Relevant timing numbers with FP16 reduction enabled (which gates this kernel):

after:
trans_b  torch.float16  716.42 usec (4100)
trans_b  torch.float16  711.10 usec (4104)

before:
trans_b  torch.float16 1808.66 usec (4100)
trans_b  torch.float16 1083.18 usec (4104)

Differential Revision: [D57732737](https://our.internmc.facebook.com/intern/diff/D57732737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126745
Approved by: https://github.com/malfet
2024-05-23 18:42:35 +00:00
b36e390b6c Revert "Default XLA to use swap_tensors path in nn.Module._apply (#126814)"
This reverts commit eb41ed5d90e946e62dd664d7037ebbb021baf33e.

Reverted https://github.com/pytorch/pytorch/pull/126814 on behalf of https://github.com/mikaylagawarecki due to broke xla ci ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2127719337))
2024-05-23 17:43:06 +00:00
6a06d36296 Revert "Default meta device to use swap_tensors in nn.Module._apply (.to_empty and .to('meta')) (#126819)"
This reverts commit ab61309ab8f6452975021994a6d4a102d55feba8.

Reverted https://github.com/pytorch/pytorch/pull/126819 on behalf of https://github.com/mikaylagawarecki due to broke xla ci ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2127719337))
2024-05-23 17:43:06 +00:00
041e8d73fd Separate non/strict functions in _export (#126718)
Move non/strict _export to different functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126718
Approved by: https://github.com/angelayi
2024-05-23 17:41:23 +00:00
cyy
e5db6758c8 [BE]: Use make_unique (#126966)
Adds make_unique in places

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126966
Approved by: https://github.com/Skylion007
2024-05-23 17:39:48 +00:00
264155a8d7 [DCP][AC] Add test for apply AC with FSDP1 (#126935)
Adding test for this cherry pick. https://github.com/pytorch/pytorch/pull/126559/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126935
Approved by: https://github.com/fegin
2024-05-23 17:35:54 +00:00
bbe68a16b9 [codemod][lowrisk] Remove extra semi colon from caffe2/caffe2/core/observer.h (#126976)
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D57632765

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126976
Approved by: https://github.com/Skylion007
2024-05-23 17:31:19 +00:00
a63310eebc TorchScript 2 ExportedProgram Converter (#126920)
Summary:
Initial commit for TorchScript 2 ExportedProgram Converter.

TODO:
- Improve TorchScript IR coverage
- parameter and buffers should be owned by output ExportedProgram
- Experiment on conditional op conversion

Test Plan: buck2 run mode/dev-nosan fbcode//caffe2/test:test_export -- -r TestConverter

Differential Revision: D57694784

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126920
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
2024-05-23 17:00:18 +00:00
1b29c16e5e Revert "Introduce ProcessGroupCudaP2P (#122163)"
This reverts commit 2dd269986027ea25c092f769ef8e9524920aaef6.

Reverted https://github.com/pytorch/pytorch/pull/122163 on behalf of https://github.com/jithunnair-amd due to This is breaking ROCm distributed CI on trunk ([comment](https://github.com/pytorch/pytorch/pull/122163#issuecomment-2127518473))
2024-05-23 16:06:14 +00:00
ab61309ab8 Default meta device to use swap_tensors in nn.Module._apply (.to_empty and .to('meta')) (#126819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126819
Approved by: https://github.com/albanD
ghstack dependencies: #126814
2024-05-23 15:43:32 +00:00
eb41ed5d90 Default XLA to use swap_tensors path in nn.Module._apply (#126814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126814
Approved by: https://github.com/JackCaoG, https://github.com/albanD
2024-05-23 15:43:32 +00:00
f0366de414 [dynamo] Support __contains__ on obj.__dict__ (#126922)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126922
Approved by: https://github.com/jansel, https://github.com/yanboliang
2024-05-23 09:01:29 +00:00
25b8dbc3e4 Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021)"
This reverts commit 9da7efa6774777890c8e4a713f6d23ea5cfcf6a4.

Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126568331))
2024-05-23 08:50:18 +00:00
45784cd229 Revert "[inductor][cpp] epilogue support for gemm template (#126019)"
This reverts commit 08f57b4bffe6edfdb016703219744482b4d03e23.

Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126568331))
2024-05-23 08:50:18 +00:00
926327e8fc Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)"
This reverts commit 31412cb2f25bda0fe31dae7b2afc88278794cad6.

Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126568331))
2024-05-23 08:50:18 +00:00
30c9ca0899 Revert "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545)"
This reverts commit 7b6d036c05bd782f5e59bdb353f9e47865e9db50.

Reverted https://github.com/pytorch/pytorch/pull/126545 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126568331))
2024-05-23 08:50:18 +00:00
da7bf1d588 [export] Fix unflatten with empty nn_module_stack (#126785)
Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1433418843962989/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126785
Approved by: https://github.com/tugsbayasgalan
2024-05-23 08:34:25 +00:00
a6155d23d1 [easy] Delete dead code global (#126903)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126903
Approved by: https://github.com/aorenste
ghstack dependencies: #126083
2024-05-23 08:29:29 +00:00
cc61d03ac9 Do not trace into triton/backends (#126083)
Fixes #125807

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126083
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-05-23 08:29:29 +00:00
558c4413ce add strobelight cli function profiler (#126693)
This is a Meta-only tool. It allows users to profile any Python function by annotating it with the **strobelight** decorator, which uses the strobelight profiler.
Example:
```
    def fn(x, y, z):
        return x * y + z

    # use decorator with default profiler.
    @strobelight()
    @torch.compile()
    def work():
        for i in range(100):
            for j in range(5):
                fn(torch.rand(j, j), torch.rand(j, j), torch.rand(j, j))

    work()
```

Test:
```
 python torch/utils/strobelight/examples/cli_function_profiler_example.py
strobelight_cli_function_profiler, line 274, 2024-05-20 11:05:41,513, INFO: strobelight run id is: -6222660165281106
strobelight_cli_function_profiler, line 276, 2024-05-20 11:06:08,318, INFO: strobelight profiling running
strobelight_cli_function_profiler, line 257, 2024-05-20 11:06:11,867, INFO: strobelight profiling stopped
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:16,164, INFO: Total samples: 2470
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:16,164, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/oiqmyltg
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:16,164, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/b10x92x0
strobelight_cli_function_profiler, line 274, 2024-05-20 11:06:18,476, INFO: strobelight run id is: -4112659701221677
strobelight_cli_function_profiler, line 276, 2024-05-20 11:06:45,096, INFO: strobelight profiling running
strobelight_cli_function_profiler, line 257, 2024-05-20 11:06:52,366, INFO: strobelight profiling stopped
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:56,222, INFO: Total samples: 1260
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:56,222, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/0yyx6el5
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:56,223, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/8m2by4ea
(base) [lsakka@devvm4561.ash0 /data/users/lsakka/pytorch/pytorch (strobelight2)]$ python torch/profiler/strobelight_cli_function_profiler_example.py
strobelight_cli_function_profiler, line 274, 2024-05-20 11:07:26,701, INFO: strobelight run id is: -2373009368202256
strobelight_cli_function_profiler, line 276, 2024-05-20 11:07:53,477, INFO: strobelight profiling running
strobelight_cli_function_profiler, line 257, 2024-05-20 11:07:56,827, INFO: strobelight profiling stopped
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:01,138, INFO: Total samples: 2372
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:01,138, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/dk797xg9
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:01,138, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/4w6c8vnm
strobelight_cli_function_profiler, line 274, 2024-05-20 11:08:03,235, INFO: strobelight run id is: -1919086123693716
strobelight_cli_function_profiler, line 276, 2024-05-20 11:08:29,848, INFO: strobelight profiling running
strobelight_cli_function_profiler, line 257, 2024-05-20 11:08:37,233, INFO: strobelight profiling stopped
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:41,138, INFO: Total samples: 1272
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:41,138, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/43r58aew
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:41,138, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/9g52onmw
(base) [lsakka@devvm4561.ash0 /data/users/lsakka/pytorch/pytorch (strobelight2)]$
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126693
Approved by: https://github.com/aorenste
2024-05-23 07:42:25 +00:00
7b6d036c05 [inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545)
As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows:
1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling.
2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops.
3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular, for the reuses for output buffers of GEMM, template and epilogues. This is not correct since the output buffer is an "output" not an "in-place" buffer of the template kernel itself. Now, we use a dedicated "aliases" dict to manage such buffer reuses and the intermediate aliasing buffers are removed after codegen.
4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019, #126068
2024-05-23 07:39:29 +00:00
31412cb2f2 [inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)
As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019
2024-05-23 07:39:29 +00:00
08f57b4bff [inductor][cpp] epilogue support for gemm template (#126019)
As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
ghstack dependencies: #124021
2024-05-23 07:39:29 +00:00
9da7efa677 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Template abstractions similar to those of the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm` which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements with follow-up PRs, including optimizations in kernels as well as fusions. The perf gains are only observed on a select number of models compared to the ATen kernels, which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details.

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x
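The blocking hierarchy described in item 2 above (cache blocking around a register-blocked micro-kernel) can be sketched in plain Python. This is only an illustration of the loop structure: it is not the generated C++ code, it omits thread decomposition and weight prepacking, and the block sizes and function names are arbitrary assumptions.

```python
def micro_gemm(A, B, C, m0, n0, k0, mr, nr, kc):
    """Register-blocked kernel: update an mr x nr tile of C over kc steps of K."""
    for i in range(m0, m0 + mr):
        for j in range(n0, n0 + nr):
            acc = C[i][j]
            for k in range(k0, k0 + kc):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def blocked_gemm(A, B, Mc=4, Nc=4, Kc=4, Mr=2, Nr=2):
    """C = A @ B using cache blocks (Mc, Nc, Kc) and register blocks (Mr, Nr)."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for mc in range(0, M, Mc):                  # cache blocking over M
        m_end = min(mc + Mc, M)
        for nc in range(0, N, Nc):              # cache blocking over N
            n_end = min(nc + Nc, N)
            for kc in range(0, K, Kc):          # cache blocking over K
                k_len = min(Kc, K - kc)
                for m in range(mc, m_end, Mr):  # register blocking inside the cache block
                    for n in range(nc, n_end, Nr):
                        micro_gemm(A, B, C, m, n, kc,
                                   min(Mr, m_end - m), min(Nr, n_end - n), k_len)
    return C


def naive_gemm(A, B):
    """Reference triple loop used to check the blocked version."""
    M, K, N = len(A), len(A[0]), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]


A = [[float(i + k) for k in range(6)] for i in range(5)]
B = [[float(k * j + 1) for j in range(7)] for k in range(6)]
print(blocked_gemm(A, B) == naive_gemm(A, B))
```

The sizes 5x6 and 6x7 are chosen so that none of the dimensions is a multiple of the block sizes, which exercises the edge handling in every loop.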

Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-23 07:39:29 +00:00
aa6de76181 Fix silu test for flexattention (#126641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126641
Approved by: https://github.com/ezyang, https://github.com/drisspg
ghstack dependencies: #126615, #126446
2024-05-23 05:45:07 +00:00
36e70572d0 [Dynamo] make bytecode of resume function resemble natural bytecode (#126630)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126630
Approved by: https://github.com/williamwen42
2024-05-23 05:06:33 +00:00
2c90b99267 Revert "reset dynamo cache before each test (#126586)"
This reverts commit 43f2f43eb3b6d8cbe8eb7f45acb50376092f1a16.

Reverted https://github.com/pytorch/pytorch/pull/126586 on behalf of https://github.com/clee2000 due to broke tests on inductor? test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float64 43f2f43eb3 https://github.com/pytorch/pytorch/actions/runs/9200644034/job/25308511495 ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2126228689))
2024-05-23 04:54:28 +00:00
b1e214ceb1 Revert "don't check memory format for empty tensors (#126593)"
This reverts commit 12dee4f2046d07db97cddc7b3c5bdf06fc304ae3.

Reverted https://github.com/pytorch/pytorch/pull/126593 on behalf of https://github.com/clee2000 due to broke tests on inductor? test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float64 43f2f43eb3 https://github.com/pytorch/pytorch/actions/runs/9200644034/job/25308511495 ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2126228689))
2024-05-23 04:54:28 +00:00
df4b7cb5f7 Reapply "Skip test_memory_format_nn_BatchNorm2d in inductor (#125970)" (#126594)
This reverts commit ce6e36bf8b524c3f4b07605c5b3af2b7d5ba8fd9.

Reverted https://github.com/pytorch/pytorch/pull/126594 on behalf of https://github.com/clee2000 due to broke tests on inductor? test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float64 43f2f43eb3 https://github.com/pytorch/pytorch/actions/runs/9200644034/job/25308511495 ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2126228689))
2024-05-23 04:54:28 +00:00
4f14282e35 Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021)"
This reverts commit 2ac33a9f663269e6060246337c776a20c3b7c858.

Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a land race and failing in trunk 2ac33a9f66 ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126016522))
2024-05-23 01:13:29 +00:00
657d39e44c Revert "[inductor][cpp] epilogue support for gemm template (#126019)"
This reverts commit 57108d9a4990f6b2ed3578cee58354ab01505dd3.

Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a land race and failing in trunk 2ac33a9f66 ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126016522))
2024-05-23 01:13:29 +00:00
205f08140e Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)"
This reverts commit 57c185b4c765c522a7f2908a773d128c66def190.

Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a land race and failing in trunk 2ac33a9f66 ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126016522))
2024-05-23 01:13:29 +00:00
2b57652278 Update requests to 2.32.2 (#126805)
To address CVE-2024-35195 (though it does not really affect PyTorch, only CI)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126805
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/seemethere, https://github.com/Skylion007
2024-05-23 00:21:28 +00:00
eqy
ebbd431d9e [CPU] Bump test_complex_2d thresholds for LBFGS on complex64 (#126358)
Is this supposed to be bitwise identical? Wasn't sure how to interpret the comment but it seems to be giving mismatches like:
```
Mismatched elements: 1 / 2 (50.0%)
Greatest absolute difference: 4.6372413635253906e-05 at index (1,) (up to 1e-05 allowed)
Greatest relative difference: 3.4600801882334054e-05 at index (1,) (up to 1.3e-06 allowed)

To execute this test, run the following from the base repo dir:
     python test/test_optim.py -k test_complex_2d_LBFGS_cpu_complex64
```

on Neoverse-N2 SBSA ARM CPUs.
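For context, the tolerances in the message above (1e-05 absolute, 1.3e-06 relative) are combined in the usual closeness criterion `|actual - expected| <= atol + rtol * |expected|`. The sketch below applies that criterion in plain Python; the sample values and the bumped `atol` are hypothetical and are not the thresholds chosen in this PR.

```python
def is_close(actual, expected, rtol=1.3e-6, atol=1e-5) -> bool:
    """Closeness check of the form |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Hypothetical values with an absolute difference of about 4.6e-05.
expected = 1.34
actual = expected + 4.6e-5

print(is_close(actual, expected))             # fails with the default tolerances
print(is_close(actual, expected, atol=1e-4))  # passes with a (hypothetical) bumped atol
```

Because `abs` also works on Python complex numbers, the same function applies to complex values such as the complex64 results in this test.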

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126358
Approved by: https://github.com/lezcano, https://github.com/janeyx99
2024-05-23 00:16:45 +00:00
57c185b4c7 [inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)
As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019
2024-05-23 00:12:38 +00:00
57108d9a49 [inductor][cpp] epilogue support for gemm template (#126019)
As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
ghstack dependencies: #124021
2024-05-23 00:07:52 +00:00
2ac33a9f66 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Template abstractions similar to those of the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm` which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements with follow-up PRs, including optimizations in kernels as well as fusions. The perf gains are only observed on a select number of models compared to the ATen kernels, which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details.

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x

Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-22 23:59:12 +00:00
e3db9ba37a [FSDP2] Added test for manual reshard with reshard_after_forward=False (#126892)
This test shows that we could always set `reshard_after_forward=False` but manually insert calls to `module.reshard()` to implement the resharding after forward. This is useful for advanced PP schedules.
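The control flow being tested can be illustrated with a toy model. `ToyShardedLayer` below is a hypothetical stand-in that only tracks whether its parameters are sharded; it is not the FSDP2 implementation, and its `forward` just adds one to its input.

```python
class ToyShardedLayer:
    """Hypothetical stand-in for a sharded module; it only tracks sharded state."""

    def __init__(self, reshard_after_forward: bool = True):
        self.reshard_after_forward = reshard_after_forward
        self.is_sharded = True

    def unshard(self) -> None:
        self.is_sharded = False  # stands in for all-gathering the parameters

    def reshard(self) -> None:
        self.is_sharded = True   # stands in for freeing the unsharded parameters

    def forward(self, x: float) -> float:
        self.unshard()
        out = x + 1.0
        if self.reshard_after_forward:
            self.reshard()
        return out


# Automatic resharding is disabled; the caller reshards each layer explicitly.
layers = [ToyShardedLayer(reshard_after_forward=False) for _ in range(3)]
x = 0.0
for layer in layers:
    x = layer.forward(x)
    assert not layer.is_sharded  # parameters stay unsharded after forward
    layer.reshard()              # manual reshard at a point the caller chooses
print(x, all(layer.is_sharded for layer in layers))
```

Moving the `reshard()` call lets a schedule decide when memory is freed, which is the flexibility the message refers to for advanced PP schedules.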

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126892
Approved by: https://github.com/wanchaol
ghstack dependencies: #126887
2024-05-22 23:35:06 +00:00
203f2641e9 [FSDP2] Used CommDebugMode for comm. count test (#126887)
simplify the test :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126887
Approved by: https://github.com/wanchaol
2024-05-22 23:35:06 +00:00
69325e4de6 [FSDP] Warned on wrapping ModuleList/ModuleDict (#124764)
This partially addresses https://github.com/pytorch/pytorch/issues/113794.

To avoid being BC breaking, we just issue a warning when wrapping `ModuleList` or `ModuleDict`. We want to add this warning since this is a common pitfall.
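A minimal sketch of the warn-instead-of-error pattern, using toy classes rather than the real FSDP wrapper. `ToyModuleList` and `toy_wrap` are hypothetical names; the comment about why containers are a pitfall (they have no `forward` of their own, so wrapper logic tied to `forward` never runs) is an assumption about the motivation, not a statement from this PR.

```python
import warnings


class ToyModuleList:
    """Hypothetical container: it holds children but defines no forward of its own."""

    def __init__(self, children):
        self.children = list(children)


def toy_wrap(module):
    """Stand-in for a wrapper API that warns, rather than raises, on containers."""
    if isinstance(module, ToyModuleList):
        # Assumed motivation: logic hooked to forward() never runs for a container.
        warnings.warn(
            "Wrapping a container module directly is likely a mistake; "
            "wrap its children instead.",
            stacklevel=2,
        )
    return module  # existing callers keep working, so this is not BC breaking


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    toy_wrap(ToyModuleList([]))
print(len(caught), caught[0].message)
```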

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124764
Approved by: https://github.com/wanchaol
2024-05-22 23:34:52 +00:00
b0e849870e Change error message when nn module inlining is enabled for MiscTests.test_map_side_effects (#126444)
#fix https://github.com/pytorch/pytorch/issues/126355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126444
Approved by: https://github.com/anijain2305
2024-05-22 23:24:03 +00:00
17186bd5b6 [inductor] make conv lowering work with dynamic shapes (#126823)
Fix an issue reported by an internal user that conv lowering does not work well with dynamic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126823
Approved by: https://github.com/jansel
2024-05-22 23:15:29 +00:00
14c5c753de [inductor] use smaller RBLOCK for expensive reduction kernels (#126477)
Triton sometimes uses fewer registers for more expensive kernels, which results in worse perf ( https://github.com/pytorch/pytorch/issues/126463 ). This may make inductor end up with a sub-optimal config. Use a smaller max RBLOCK if the reduction potentially needs many registers.

Will run perf test.
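The shape of such a heuristic can be sketched as below. Everything concrete here is a placeholder: the function names, the use of a load count as a proxy for register pressure, and all numeric thresholds are assumptions made for illustration, not inductor's actual values.

```python
def max_rblock_for(num_loads: int, default_max: int = 2048,
                   reduced_max: int = 64, load_threshold: int = 8) -> int:
    """Return a cap on RBLOCK; all numbers here are hypothetical placeholders."""
    # Assumption: many loads suggest high register pressure, so cap RBLOCK lower.
    return reduced_max if num_loads >= load_threshold else default_max


def candidate_rblocks(reduction_numel: int, num_loads: int) -> list:
    """Power-of-two RBLOCK candidates up to the cap and the reduction size."""
    cap = min(max_rblock_for(num_loads), reduction_numel)
    candidates = []
    r = 1
    while r <= cap:
        candidates.append(r)
        r *= 2
    return candidates


print(candidate_rblocks(4096, num_loads=2)[-1])   # cheap reduction: large RBLOCK allowed
print(candidate_rblocks(4096, num_loads=12)[-1])  # expensive reduction: smaller cap
```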

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126477
Approved by: https://github.com/jansel
2024-05-22 22:47:10 +00:00
ce6e36bf8b Revert "Skip test_memory_format_nn_BatchNorm2d in inductor (#125970)" (#126594)
This reverts commit 0a9c6e92f8d1a35f33042c8dab39f23b7f39d6e7.

enable the test since it's fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126594
Approved by: https://github.com/huydhn
ghstack dependencies: #126586, #126593
2024-05-22 22:43:09 +00:00
12dee4f204 don't check memory format for empty tensors (#126593)
Fix https://github.com/pytorch/pytorch/issues/125967 . The test actually fails for empty 4D or 5D tensors when checking the memory format.

I'm not exactly sure which recent inductor change caused the failure, but it may not be that important to maintain strides for an empty tensor. (?)

I just skip the check for empty tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126593
Approved by: https://github.com/ezyang
ghstack dependencies: #126586
2024-05-22 22:43:09 +00:00
43f2f43eb3 reset dynamo cache before each test (#126586)
In https://github.com/pytorch/pytorch/issues/125967, we found that test results depend on test order. The root cause is that earlier tests populate the dynamo cache and affect later tests.

This PR clears the dynamo cache before each unit test so that unit test results are more deterministic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126586
Approved by: https://github.com/jansel
2024-05-22 22:43:09 +00:00
08c260bc29 [pipelining] Test schedules against manual stage (#126735)
Added manual stage in test_schedule.py so that we can test various schedules against it.

In this file we now have:
- test_schedule_with_tracer
- test_schedule_with_manual
- test_grad_with_tracer
- test_grad_with_manual

Tested schedules are:
- ScheduleGPipe
- Schedule1F1B

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126735
Approved by: https://github.com/wconstab, https://github.com/H-Huang
ghstack dependencies: #126812, #126721
2024-05-22 21:54:27 +00:00
6a539e80dd Update descriptor fields to resolve fft precision issue (#125328)
Fixes #124096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125328
Approved by: https://github.com/kulinseth, https://github.com/malfet
2024-05-22 21:48:49 +00:00
5ccc634603 [CI] Pin uv==0.1.45 for lintrunner (#126908)
e4623de4cf/1
```

2024-05-22T19:10:48.5974515Z + python3 -m pip install uv
2024-05-22T19:10:48.5975198Z Collecting uv
2024-05-22T19:10:48.5976496Z   Downloading uv-0.1.45-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (32 kB)
2024-05-22T19:10:48.5977828Z Downloading uv-0.1.45-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.8 MB)
2024-05-22T19:10:48.5986243Z [?25l   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/12.8 MB ? eta -:--:--
2024-05-22T19:10:48.5988326Z    ━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━ 6.8/12.8 MB 205.8 MB/s eta 0:00:01
2024-05-22T19:10:48.5990300Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 12.8/12.8 MB 215.1 MB/s eta 0:00:01
2024-05-22T19:10:48.5991645Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 12.8/12.8 MB 215.1 MB/s eta 0:00:01
2024-05-22T19:10:48.5992724Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 97.8 MB/s eta 0:00:00
2024-05-22T19:10:48.5993443Z [?25hInstalling collected packages: uv
2024-05-22T19:10:48.5993950Z Successfully installed uv-0.1.45
2024-05-22T19:10:48.5994363Z + CACHE_DIRECTORY=/tmp/.lintbin
2024-05-22T19:10:48.5994772Z + [[ -d /tmp/.lintbin ]]
2024-05-22T19:10:48.5995157Z + cp -r /tmp/.lintbin .
2024-05-22T19:10:48.5995497Z + lintrunner init
2024-05-22T19:10:48.5995839Z + [[ 1 == \1 ]]
```
vs
```

2024-05-22T20:33:53.5563991Z + python3 -m pip install uv
2024-05-22T20:33:53.5564921Z Collecting uv
2024-05-22T20:33:53.5566259Z   Downloading uv-0.2.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (32 kB)
2024-05-22T20:33:53.5568142Z Downloading uv-0.2.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.9 MB)
2024-05-22T20:33:53.5570253Z [?25l   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/12.9 MB ? eta -:--:--
2024-05-22T20:33:53.5571889Z    ━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━ 7.0/12.9 MB 208.8 MB/s eta 0:00:01
2024-05-22T20:33:53.5573716Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 12.9/12.9 MB 206.7 MB/s eta 0:00:01
2024-05-22T20:33:53.5575478Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 12.9/12.9 MB 206.7 MB/s eta 0:00:01
2024-05-22T20:33:53.5577240Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.9/12.9 MB 101.6 MB/s eta 0:00:00
2024-05-22T20:33:53.5578531Z [?25hInstalling collected packages: uv
2024-05-22T20:33:53.5579316Z Successfully installed uv-0.2.1
2024-05-22T20:33:53.5580033Z + CACHE_DIRECTORY=/tmp/.lintbin
2024-05-22T20:33:53.5580640Z + [[ -d /tmp/.lintbin ]]
2024-05-22T20:33:53.5581229Z + cp -r /tmp/.lintbin .
2024-05-22T20:33:53.5581799Z + lintrunner init
2024-05-22T20:33:53.5603302Z Traceback (most recent call last):
2024-05-22T20:33:53.5604857Z   File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/test-infra/.github/scripts/run_with_env_secrets.py", line 101, in <module>
2024-05-22T20:33:53.5605805Z     main()
2024-05-22T20:33:53.5606687Z   File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/test-infra/.github/scripts/run_with_env_secrets.py", line 97, in main
2024-05-22T20:33:53.5607762Z     run_cmd_or_die(f"docker exec -t {container_name} /exec")
2024-05-22T20:33:53.5608949Z   File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/test-infra/.github/scripts/run_with_env_secrets.py", line 38, in run_cmd_or_die
2024-05-22T20:33:53.5610107Z     raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
2024-05-22T20:33:53.5611328Z RuntimeError: Command docker exec -t e551764bdba0c87c2fc392fba9ea265e8821a552915b36010f18299d8035b304 /exec failed with exit code 1
2024-05-22T20:33:53.5626540Z ##[error]Process completed with exit code 1.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126908
Approved by: https://github.com/huydhn
2024-05-22 21:41:21 +00:00
a30baec0c3 [Docs] Fix NumPy + backward example (#126872)
We were calling backward on a tensor not a scalar...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126872
Approved by: https://github.com/albanD
2024-05-22 21:29:31 +00:00
e4623de4cf typing scheduler.py [2/2]: Apply types (#126656)
Add `# mypy: disallow-untyped-defs` to scheduler.py and then fix the resulting fallout.

We probably should eventually add a new node between BaseSchedulerNode and all the non-FusedSchedulerNode types to indicate the split between nodes that have a valid `self.node` and ones that don't. That would cause a lot of the `assert self.node is not None` churn to go away - but was a bigger change because a lot of code makes assumptions about types that aren't reflected in the types themselves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126656
Approved by: https://github.com/eellison
2024-05-22 20:33:31 +00:00
3591bce6c7 Add usage explanation in torch.dot ducment (#125908)
Fixes #125842

Add a note on unsupported usage to the `torch.dot` documentation to avoid misuse like:

```python
>>> t1, t2 = torch.tensor([0,1]), torch.tensor([2,3])
>>> torch.dot(input=t1, other=t2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: dot() missing 1 required positional arguments: "tensor"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125908
Approved by: https://github.com/albanD
2024-05-22 20:33:12 +00:00
0939b68980 Support dtype kwarg in _foreach_norm (#125665)
Fixes #125040

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125665
Approved by: https://github.com/janeyx99
2024-05-22 20:27:50 +00:00
d62b025efc [TorchElastic] Option for sharing TCPStore created by rdzv handlers (#125743)
Summary:

1. Define explicit `use_agent_store` on rdzv handlers. Handlers that set it to true can share the store.
2. Instead of the agent coordinating master_addr/master_port values, the logic is now encapsulated by a *rdzv_handler*, where `RendezvousInfo` will have a `RendezvousStoreInfo` object that handlers must return.
    - Depending on the implementation they can either:
         - point to an existing store (and are expected to set `use_agent_store` to true, see point 1). Client code will rely on the `TORCHELASTIC_USE_AGENT_STORE` env variable to know if the store is shared.
         - build args that `torch.distributed.init_process_group` can bootstrap by creating a new store.

Additional points:

- When TCPStore is shared, it should be wrapped in PrefixStore to qualify/scope the namespace for other use cases.
- The `next_rendezvous` signature changed to return an instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple for extensibility purposes.
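
The signature change in the last point can be illustrated with a minimal sketch. The class and function names below are stand-ins and the field names are assumptions for illustration, not the actual torchelastic definitions:

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass
class StoreInfoSketch:
    # Stand-in for `RendezvousStoreInfo`: where a new store can be bootstrapped.
    master_addr: str
    master_port: int


@dataclass
class RendezvousInfoSketch:
    # Stand-in for `RendezvousInfo`; the field names are assumptions.
    store: Any
    rank: int
    world_size: int
    bootstrap_store_info: StoreInfoSketch


def old_next_rendezvous() -> Tuple[Any, int, int]:
    # Old style: a positional tuple. Adding a field breaks every unpacking caller.
    return ({}, 0, 2)


def new_next_rendezvous() -> RendezvousInfoSketch:
    # New style: a named result. Fields can be added without touching callers.
    return RendezvousInfoSketch(
        store={},
        rank=0,
        world_size=2,
        bootstrap_store_info=StoreInfoSketch("localhost", 29500),
    )


store, rank, world_size = old_next_rendezvous()
info = new_next_rendezvous()
print(info.rank, info.world_size, info.bootstrap_store_info.master_port)
```

Callers that read `info.rank` keep working when a new field is added, whereas tuple unpacking would need to be updated at every call site.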

Why:
- Reduce moving parts
   - easier to swap implementation
   - improve tractability
   - addressing perf/debug-ability will benefit all usecases
Test Plan: CI

Differential Revision: D57055235

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743
Approved by: https://github.com/d4l3k
2024-05-22 18:24:11 +00:00
fde1e8af7a [dtensor] implement distributed topk operator (#126711)
as titled. Implemented the topk operator in DTensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126711
Approved by: https://github.com/wz337
ghstack dependencies: #126710
2024-05-22 18:11:56 +00:00
af633e4a7b [dtensor] remove unused failed_reason (#126710)
as titled, this field is not actively used, so removing it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126710
Approved by: https://github.com/wz337
2024-05-22 18:11:56 +00:00
a8195f257e [custom_op] use new python custom ops API on prims ops (#124665)
Also adds a non-decorator version of `custom_op`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124665
Approved by: https://github.com/zou3519
2024-05-22 17:48:33 +00:00
db0b74bbc5 [CUDA Caching Allocator] Allow division of 0 (#126833)
Summary: Setting the number of divisions to 0 disables roundup.

Test Plan: CI

Differential Revision: D57651410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126833
Approved by: https://github.com/banitag1
2024-05-22 17:40:39 +00:00
d4ec18bdad Prevent partitioner from ever saving views (#126446)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126446
Approved by: https://github.com/anijain2305
ghstack dependencies: #126615
2024-05-22 17:28:46 +00:00
51e707650f Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126615
Approved by: https://github.com/yanboliang, https://github.com/drisspg, https://github.com/xmfan
2024-05-22 17:28:46 +00:00
3e826c477a [pipelining] Add pipeline stage test (#126721)
Test tracer's and manual's stage creation by using a basic schedule (GPipe).

(Migrated from https://github.com/pytorch/PiPPy/blob/main/test/test_pipeline_stage.py)

Test command:
```
$ python test_stage.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126721
Approved by: https://github.com/wconstab, https://github.com/H-Huang
ghstack dependencies: #126812
2024-05-22 16:24:51 +00:00
403012b50a [pipelining] expose APIs per pytorch rule (#126812)
Rule is enforced by #126103.

The rule:
- If `torch.a.b` defines a public class `C` (i.e. to be exposed in torch API namespace), then `torch.a.b` must be a public path, i.e. no `_`.
- `torch.a.b` should ideally have an `__all__` that defines what should be imported from this file when it is imported.
- All other definitions in `torch.a.b` that you don't want to expose should have a `_` prefix.
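
A rough sketch of what the rule asks for, using a hypothetical module `torch.a.b` built in memory. The helper names `is_public_path` and `exported_names` are made up for illustration and are not part of the enforcement in #126103:

```python
from types import ModuleType


def is_public_path(dotted: str) -> bool:
    """True if no component of the dotted path starts with an underscore."""
    return all(not part.startswith("_") for part in dotted.split("."))


def exported_names(module: ModuleType) -> list:
    """Names that `from module import *` would expose."""
    if hasattr(module, "__all__"):
        return sorted(module.__all__)
    return sorted(n for n in vars(module) if not n.startswith("_"))


# Hypothetical module `torch.a.b` with one public class and one private helper.
mod = ModuleType("torch.a.b")
exec("__all__ = ['C']\nclass C: pass\ndef _helper(): pass\n", mod.__dict__)

print(is_public_path("torch.a.b"))    # True
print(is_public_path("torch.a._b"))   # False
print(exported_names(mod))            # ['C']
```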

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126812
Approved by: https://github.com/wconstab
2024-05-22 16:21:13 +00:00
599e684ad6 [AOTI] Disable stack allocation for OSS (#125732)
Summary: Stack allocation is for certain small CPU models, but its coverage still needs improvement, so default to OFF for OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125732
Approved by: https://github.com/chenyang78
ghstack dependencies: #126720, #126801
2024-05-22 15:33:24 +00:00
ff617ab6c8 [AOTI] Fix an int array codegen issue (#126801)
Summary: fixes https://github.com/pytorch/pytorch/issues/126779. When an int array contains a symbolic expression, we can't declare it with constexpr.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126801
Approved by: https://github.com/chenyang78
ghstack dependencies: #126720
2024-05-22 15:33:24 +00:00
19cd4484ec [AOTI] Add more fallback ops (#126720)
Summary: These ops appear in either unit tests or TorchBench. Fixes https://github.com/pytorch/pytorch/issues/122050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126720
Approved by: https://github.com/chenyang78
2024-05-22 15:33:24 +00:00
0d17aae242 Teach FakeTensor to fill in item_memo when converting scalar CPU tensor (#126245)
This PR requires a little justification, but let's start with what it does first:

1. When you have a 0d CPU scalar int64/float64 tensor input to a graph, we will preallocate a backed SymInt/SymFloat corresponding to what you would get if you call item() on this tensor. This means you can freely change your input to be a Python int/float or a Tensor with an item() call and end up with exactly the same level of expressivity (specifically, you can guard on the internal SymInt/SymFloat no matter what). By default, the source of the backed SymInt/SymFloat is `L['tensor'].item()`, but if you have promoted a float input into a Tensor, we will cancel out `torch.as_tensor(L['float']).item()` into just `L['float']`.
2. We switch wrap_symfloat to use this, instead of hand crafting the new SymNodeVariable. Everything works out, except that we carefully pass the item() result to tracked fakes (and not the fake Tensor argument)

OK, so why do this at all? There is some marginal benefit where now some item() calls on scalar inputs can be guarded on, but IMO this is a pretty marginal benefit, and if it was the only reason, I wouldn't do this. The real reason for this is that I need to be able to propagate fake tensors through the graphs that are produced by Dynamo, and if I am doing the old custom wrap_symfloat logic, there's no way I can do this, because ordinarily an item() call will cause an unbacked SymInt when I reallocate.

The other obvious way to solve the problem above is to make a HOP alternative to item() that "bakes in" the backed SymInt it's supposed to return. But this strategy seems more parsimonious, and it does have the marginal benefit I mentioned above. The main downside is that what I have to do next is make it so that when I run tensor computation, I also apply the equivalent operations to the SymInt/SymFloat as well. That's the next PR.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126245
Approved by: https://github.com/eellison
ghstack dependencies: #126637
2024-05-22 15:25:38 +00:00
86ad101370 Enable pickling torch._C.Generator (#126271)
Fixes #71398

Add `__reduce__` and `__setstate__` methods for `torch._C.Generator`.

`__reduce__` returns a tuple of 3 values:

1. `torch.Generator` itself.
2. A one-element tuple containing the `torch.device` to create the `Generator` with, since this cannot be changed after the object is created.
3. The state, a three-element tuple: the initial seed, the offset (or `None` if a CPU `Generator`), and the RNG state tensor.

`__setstate__` calls `manual_seed`, `set_offset` (if not `None`), and `set_state` on each respective element of the state.
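
The protocol described above can be sketched with a toy class in plain Python. `ToyGenerator` and `rebuild` are illustrative names, a list stands in for the RNG state tensor, and `rebuild` spells out the steps that `pickle` performs on load for a three-element `__reduce__` result:

```python
class ToyGenerator:
    """Toy stand-in for a generator whose device is fixed at construction."""

    def __init__(self, device="cpu"):
        self.device = device
        self.seed = 0
        self.offset = None if device == "cpu" else 0
        self.state = [0, 0, 0, 0]  # stands in for the RNG state tensor

    def __reduce__(self):
        # (callable, constructor args, state), mirroring the three values above.
        state = (self.seed, self.offset, list(self.state))
        return (ToyGenerator, (self.device,), state)

    def __setstate__(self, state):
        seed, offset, rng_state = state
        self.seed = seed
        if offset is not None:
            self.offset = offset
        self.state = rng_state


def rebuild(reduced):
    """Replay a reduce tuple: construct the object, then restore its state."""
    factory, args, state = reduced
    obj = factory(*args)
    obj.__setstate__(state)
    return obj


g = ToyGenerator("cuda")
g.seed, g.offset, g.state = 42, 8, [1, 2, 3, 4]
restored = rebuild(g.__reduce__())
print(restored.device, restored.seed, restored.offset, restored.state)
```

The device goes through the constructor arguments rather than the state because, as noted above, it cannot be changed after the object is created.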

Added test demonstrating successful reserialization with cpu and cuda `Generator`s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126271
Approved by: https://github.com/ezyang
2024-05-22 14:38:47 +00:00
ed734178ab Refresh OpOverloadPacket if a new OpOverload gets added (#126863)
If a user accesses an OpOverloadPacket, then creates a new OpOverload,
then uses the OpOverloadPacket, the new OpOverload never gets hit. This
is because OpOverloadPacket caches OpOverloads when it is constructed.

This PR fixes the problem by "refreshing" the OpOverloadPacket if a new
OpOverload gets constructed and the OpOverloadPacket exists.
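
A toy model of the stale-cache problem and the refresh, in plain Python. `ToyPacket` and `ToyRegistry` are illustrative stand-ins, not the real `torch._ops` classes:

```python
class ToyPacket:
    """Toy stand-in for an OpOverloadPacket that caches its overload names."""

    def __init__(self, registry, name):
        self._registry = registry
        self._name = name
        self._overloads = self._lookup()  # cached at construction

    def _lookup(self):
        return sorted(self._registry.ops.get(self._name, {}))

    def refresh(self):
        self._overloads = self._lookup()

    def overloads(self):
        return list(self._overloads)


class ToyRegistry:
    def __init__(self):
        self.ops = {}      # op name -> {overload name: callable}
        self.packets = {}  # op name -> already constructed packet

    def packet(self, name):
        if name not in self.packets:
            self.packets[name] = ToyPacket(self, name)
        return self.packets[name]

    def define(self, name, overload, fn):
        self.ops.setdefault(name, {})[overload] = fn
        # The fix described above: refresh a packet that already exists.
        if name in self.packets:
            self.packets[name].refresh()


reg = ToyRegistry()
reg.define("mul", "Tensor", lambda a, b: a * b)
pkt = reg.packet("mul")  # the packet is built and caches ["Tensor"] here
reg.define("mul", "Scalar", lambda a, b: a * b)
print(pkt.overloads())   # ['Scalar', 'Tensor']
```

Without the `refresh()` call in `define`, `pkt.overloads()` would still report only `Tensor`.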

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126863
Approved by: https://github.com/albanD
2024-05-22 14:13:27 +00:00
082251e76b fix invalid call to aoti_torch_tensor_copy_ (#126668)
Fixes #123039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126668
Approved by: https://github.com/desertfire
2024-05-22 13:02:02 +00:00
2dd2699860 Introduce ProcessGroupCudaP2P (#122163)
## Context
This stack prototypes automatic micro-pipelining of `all-gather -> matmul` and `matmul -> reduce-scatter` via Inductor. The idea originates from the paper [Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3567955.3567959). The implementation and some key optimizations are heavily influenced by @lw's implementation in xformers.

The stack contains several components:
- `ProcessGroupCudaP2P` - a thin wrapper around `ProcessGroupNCCL`. It additionally maintains a P2P workspace that enables SM-free, one-sided P2P communication, which is needed for optimal micro-pipelining.
- `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops.
- Post-grad fx pass that detects `all-gather -> matmul` and `matmul -> reduce-scatter` and replaces them with the fused dispatcher ops.

To enable the prototype feature:
- Set the distributed backend to `cuda_p2p`.
- Set `torch._inductor.config._micro_pipeline_tp` to `True`.

*NOTE: the prototype sets nothing in stone w.r.t. each component's design. The purpose is to have a performant baseline with reasonable design on which each component can be further improved.*

## Benchmark
Setup:
- 8 x H100 (500W) + 3rd gen NVSwitch.
- Llama3 8B training w/ torchtitan.
- 8-way TP. Reduced the number of layers from 32 to 8 for benchmarking purposes.

Trace (baseline): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpjaz8zgx0
<img width="832" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4addba77-5abc-4d2e-93ea-f68078587fe1">

Trace (w/ micro pipelining): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpn073b4wn
<img width="963" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4f44e78d-8196-43ab-a1ea-27390f07e9d2">

## This PR
`ProcessGroupCudaP2P` is a thin wrapper around `ProcessGroupNCCL`. By default, it routes all collectives to the underlying `ProcessGroupNCCL`. In addition, `ProcessGroupCudaP2P` initializes a P2P workspace that allows direct GPU memory access among the members. The workspace can be used in Python to optimize intra-node communication patterns or to create custom intra-node collectives in CUDA.

`ProcessGroupCudaP2P` aims to bridge the gap where certain important patterns can be better optimized via fine-grained P2P memory access than with collectives in the latest version of NCCL. It is meant to complement NCCL rather than replacing it.
Usage:
```
    # Using ProcessGroupCudaP2P
    dist.init_process_group(backend="cuda_p2p", ...)

    # Using ProcessGroupCudaP2P while specifying ProcessGroupCudaP2P.Options
    pg_options = ProcessGroupCudaP2P.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Using ProcessGroupCudaP2P while specifying ProcessGroupNCCL.Options
    pg_options = ProcessGroupNCCL.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Using ProcessGroupCudaP2P while specifying both
    # ProcessGroupCudaP2P.Options and ProcessGroupNCCL.Options
    pg_options = ProcessGroupCudaP2P.Options()
    pg_options.nccl_options = ProcessGroupNCCL.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Down-casting the backend to access p2p buffers for cuda_p2p specific
    # optimizations
    if is_cuda_p2p_group(group):
        backend = get_cuda_p2p_backend(group)
        if required_p2p_buffer_size > backend.get_buffer_size():
            # fallback
        p2p_buffer = backend.get_p2p_buffer(...)
    else:
        # fallback
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122163
Approved by: https://github.com/wanchaol
2024-05-22 09:33:05 +00:00
8a4597980c Revert "Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615)"
This reverts commit 831efeeadf5fa8d9e7f973057e634a57e3bcf04b.

Reverted https://github.com/pytorch/pytorch/pull/126615 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126615#issuecomment-2124169157))
2024-05-22 08:23:40 +00:00
0f37fd06d9 Revert "Prevent partitioner from ever saving views (#126446)"
This reverts commit da2292ce6b37028746bf5beeae04442eef1e803d.

Reverted https://github.com/pytorch/pytorch/pull/126446 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126615#issuecomment-2124169157))
2024-05-22 08:23:40 +00:00
d2cbbdee31 Revert "Fix silu test for flexattention (#126641)"
This reverts commit cd3a71f754a2248bcfe500de7c9860bd7d2002bf.

Reverted https://github.com/pytorch/pytorch/pull/126641 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126615#issuecomment-2124169157))
2024-05-22 08:23:40 +00:00
4575d3be83 [Quant][onednn] fix performance regression of depth-wise qconv (#126761)
Fixes #125663

The original implementation did not handle groups correctly.

Test plan:
Functionality is covered by UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126761
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-05-22 07:53:11 +00:00
aede940975 [inductor] Fix cuda compilation under fbcode remote execution (#126408)
Differential Revision: D57390072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126408
Approved by: https://github.com/desertfire
2024-05-22 07:51:35 +00:00
edea2b81b5 [ONNX] Adds Support for Some Bitwise Ops in Onnx Exporter (#126229)
Addresses #126194

Adds support for
- "aten::bitwise_right_shift"
- "aten::bitwise_left_shift"
- "aten::bitwise_and"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126229
Approved by: https://github.com/justinchuby
2024-05-22 07:47:43 +00:00
b516de8cac [halide-backend] Add HalideCodeCache (#126416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126416
Approved by: https://github.com/shunting314
ghstack dependencies: #126631, #126655
2024-05-22 06:52:50 +00:00
d937d0db0f [SAC] fix ignored ops in eager mode to recompute (#126751)
As titled. I found an issue in eager-mode SAC where recompute would
sometimes pop ops from storage that are missing; these ops are detach
ops. This PR refactors the two modes so that they always recompute
ignored ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126751
Approved by: https://github.com/yf225
2024-05-22 06:47:22 +00:00
3b0f6cce5c [pytree] freeze attributes of TreeSpec (#124011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124011
Approved by: https://github.com/zou3519
2024-05-22 05:57:00 +00:00
6edf989e2f [CUDA Caching Allocator] Round to nearest 512 bytes boundary if number of divisions=1 (#126830)
Summary: This diff fixes an issue when the number of divisions=1, resulting in unaligned memory accesses.
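
A simplified model of the rounding behavior, as a sketch only. It assumes a 512-byte minimum block size and a power-of-two `divisions` value, and it is not the allocator's actual code:

```python
MIN_BLOCK = 512  # assumed minimum block size in bytes


def round_to_min_block(size: int) -> int:
    """Round up to the next multiple of MIN_BLOCK."""
    return MIN_BLOCK * ((size + MIN_BLOCK - 1) // MIN_BLOCK)


def round_size(size: int, divisions: int) -> int:
    """Simplified model of size rounding with power-of-two divisions."""
    if size < MIN_BLOCK:
        return MIN_BLOCK
    if divisions <= 1 or size <= MIN_BLOCK * divisions:
        # With 0 or 1 divisions, only the 512-byte alignment is applied.
        return round_to_min_block(size)
    floor_pow2 = 1 << (size.bit_length() - 1)
    if size == floor_pow2:
        return size
    # Split [floor_pow2, 2 * floor_pow2) into `divisions` equal steps and
    # round up to the next step boundary.
    step = floor_pow2 // divisions
    return ((size + step - 1) // step) * step


print(round_size(1000, 1))  # 1024
print(round_size(5000, 4))  # 5120
```

In this model a request of 5000 bytes with 4 divisions lands on 5120 (steps of 1024 above 4096), while `divisions=1` always yields a multiple of 512.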

Reviewed By: 842974287

Differential Revision: D57648763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126830
Approved by: https://github.com/842974287
2024-05-22 04:57:24 +00:00
ae66c94eaa Capture dtype in Flight Recorder (#126581)
Summary:
Capture dtype in flight recorder.
Mismatched dtypes can lead to hangs.

Newly added logs in the job show a mismatching dtype for the op, which
affects the data size, even though the sizes match and the dtype does not
appear in the FR log.

We end up capturing the type as follows:
```
{'entries': [{'record_id': 0, 'pg_id': 0, 'process_group': ('0', 'default_pg'), 'collective_seq_id': 1, 'p2p_seq_id': 0, 'op_id': 1, 'profiling_name': 'nccl:all_reduce', 'time_created_ns': 1715989097552775261, 'duration_ms': 6.697696208953857, 'input_sizes': [[3, 4]], 'input_dtypes': [6], 'output_sizes': [[3, 4]], 'output_dtypes': [6], 'state': 'completed', 'time_discovered_started_ns': 1715989097593778240, 'time_discovered_completed_ns': 1715989097593778461, 'retired': True,
```
Notice the new fields:
input_dtypes: [6]
output_dtypes: [6]
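
A minimal sketch of how the new fields could be used to spot such a mismatch across ranks. It assumes the integers follow the c10 `ScalarType` numbering (so `6` is float32), and the per-rank dumps below are hypothetical and trimmed to the relevant keys:

```python
# Subset of the c10 ScalarType numbering (assumed to match the dtype ids).
SCALAR_TYPE_NAMES = {
    0: "uint8", 1: "int8", 2: "int16", 3: "int32",
    4: "int64", 5: "float16", 6: "float32", 7: "float64",
}


def dtype_names(ids):
    return [SCALAR_TYPE_NAMES.get(i, f"unknown({i})") for i in ids]


def find_dtype_mismatches(entries_by_rank):
    """Return {collective_seq_id: {rank: dtype names}} where ranks disagree."""
    by_seq = {}
    for rank, entries in entries_by_rank.items():
        for entry in entries:
            seq = entry["collective_seq_id"]
            by_seq.setdefault(seq, {})[rank] = tuple(entry["input_dtypes"])
    return {
        seq: {rank: dtype_names(ids) for rank, ids in per_rank.items()}
        for seq, per_rank in by_seq.items()
        if len(set(per_rank.values())) > 1
    }


# Hypothetical dumps from two ranks: same sizes, different dtypes.
dumps = {
    0: [{"collective_seq_id": 1, "input_sizes": [[3, 4]], "input_dtypes": [6]}],
    1: [{"collective_seq_id": 1, "input_sizes": [[3, 4]], "input_dtypes": [5]}],
}
mismatches = find_dtype_mismatches(dumps)
print(mismatches)
```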

Test Plan:
unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/issues/126554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126581
Approved by: https://github.com/wconstab
2024-05-22 03:38:09 +00:00
7530cfe7e4 [dynamo][flaky tests] test_conv_empty_input_* (#126790)
Run CI, maybe fixes https://github.com/pytorch/pytorch/issues/126178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126790
Approved by: https://github.com/mikaylagawarecki
2024-05-22 03:14:21 +00:00
ac1f0befcf Remove redundant serialization code (#126803)
After https://github.com/pytorch/pytorch/pull/123308, we no longer need a separate serialization path to handle the different types that exist in the nn_module metadata. This PR cleans up the redundant code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126803
Approved by: https://github.com/angelayi
2024-05-22 03:14:17 +00:00
608a11c496 [pipelining] Retire PIPPY_VERBOSITY in favor of TORCH_LOGS=pp (#126828)
https://github.com/pytorch/pytorch/pull/126499/ established:

`TORCH_LOGS=pp` --> info
`TORCH_LOGS=-pp` --> warn
`TORCH_LOGS=+pp` --> debug
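
The mapping above can be restated as a small function. This is an illustration only: `pp_log_level` is a made-up helper, not the real `TORCH_LOGS` parser, and the default level when `pp` is not mentioned is an assumption:

```python
import logging


def pp_log_level(torch_logs: str) -> int:
    """Map a TORCH_LOGS-style string to a level for the `pp` logger."""
    level = logging.WARNING  # assumed default when `pp` is not mentioned
    for item in torch_logs.split(","):
        item = item.strip()
        if item == "pp":
            level = logging.INFO
        elif item == "-pp":
            level = logging.WARNING
        elif item == "+pp":
            level = logging.DEBUG
    return level


for setting in ("pp", "-pp", "+pp"):
    print(setting, logging.getLevelName(pp_log_level(setting)))
```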

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126828
Approved by: https://github.com/wconstab
2024-05-22 02:52:58 +00:00
e3c96935c2 Support CUDA_INC_PATH env variable when compiling extensions (#126808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126808
Approved by: https://github.com/amjames, https://github.com/ezyang
2024-05-22 02:44:32 +00:00
5fa7aefb49 [pipelining] Do not print loss (#126829)
`loss` is a tensor, printing it would induce a GPU-CPU sync, which would slow down the program more than regular debug overhead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126829
Approved by: https://github.com/wconstab
2024-05-22 02:32:04 +00:00
e6f655697b [AOTI] Fix unsupported type of output=s1 (#126797)
Fixes #123036

In unit test `DynamicShapesCudaWrapperCudaTests.test_scaled_dot_product_attention_cuda_dynamic_shapes_cuda_wrapper`, computed buffer buf3 is compiled to a fallback kernel `aoti_torch_cuda__scaled_dot_product_flash_attention`. It has 9 outputs whose types are `[MultiOutput, MultiOutput, None, None, s1, s1, MultiOutput, MultiOutput,MultiOutput]`. The type `s1` here is passed from [generate_output](acfe237a71/torch/_inductor/ir.py (L5658)).

The type check for Symbol is missing in fallback kernel output generation. This PR fixes this issue by checking `output.is_Symbol`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126797
Approved by: https://github.com/desertfire
2024-05-22 02:15:43 +00:00
a379ed6e98 Fix SobolEngine default dtype handling (#126781)
- Change the default `dtype` argument to `None` and fetch its value via a `torch.get_default_dtype()` call if not defined
- Fix bug in first draw handling logic, that would ignore dtype in favor of default one due to type promotion
- Add regression tests

Fixes https://github.com/pytorch/pytorch/issues/126478
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126781
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-05-22 01:55:48 +00:00
28f29e074b Dont mutate tensor stride in place in cudnn conv (#126786)
Fix for https://github.com/pytorch/pytorch/issues/126241.

Within the cudnn convolution, we were in-place updating the strides of the tensor to disambiguate for size-1 dims and contiguous and channels last tensors. Instead of mutating the tensor's stride, just use a temporary. Inside cudnn it is then copied: d7ccb5b3c4/include/cudnn_frontend_Tensor.h (L201-L203).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126786
Approved by: https://github.com/ezyang, https://github.com/shunting314, https://github.com/eqy
2024-05-22 01:53:44 +00:00
66c23cb021 Add micro-benchmark framework and multi_layer_norm as an example (#126754)
```micro_benchmark.py``` output csv example (all numbers are fake, just for demo)
```
name,metric,target,actual
multi_layer_norm,inference_time(s),20,19.87
multi_layer_norm,memory_bandwidth(GB/s),108,108.04
llama2-int8,token_per_sec,155,156
llama2-int8,memory_bandwidth(GB/s),92,92.7
```
Expected dashboard looks like:
```
| name             | metric                 | target | actual | change |
|------------------|------------------------|--------|--------|--------|
| multi_layer_norm | inference_time(s)      | 20     | 19.87  | 99%    |
|                  | memory_bandwidth(GB/s) | 108    | 108.04 | 101%   |
| llama2-int8      | token_per_sec          | 155    | 156    | 100%   |
|                  | memory_bandwidth(GB/s) | 92     | 92.7   | 101%   |

```
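
A small sketch of how the `change` column could be derived from the CSV, assuming it is simply `actual / target` shown as a rounded percentage. The real dashboard may compute it differently, and since the numbers above are placeholders, they need not agree with this formula:

```python
import csv
import io

# Two rows taken from the example CSV above (placeholder numbers).
CSV_TEXT = """name,metric,target,actual
multi_layer_norm,inference_time(s),20,19.87
multi_layer_norm,memory_bandwidth(GB/s),108,108.04
"""


def add_change(csv_text):
    """Parse the CSV and attach a `change` column (actual / target)."""
    rows = []
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        target, actual = float(row["target"]), float(row["actual"])
        rows.append({**row, "change": f"{actual / target:.0%}"})
    return rows


rows = add_change(CSV_TEXT)
for row in rows:
    print(row["name"], row["metric"], row["target"], row["actual"], row["change"])
```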

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126754
Approved by: https://github.com/Chillee
2024-05-22 01:27:37 +00:00
636e79991c [FSDP2] Fixed 2D clip grad norm test (#126497)
This fixes https://github.com/pytorch/pytorch/issues/126484.

We change from transformer to MLP stack since transformer seems to introduce slight numeric differences when using TP. We include a sequence parallel layer norm module in the MLP stack to exercise `(S(0), R)` placement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126497
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-05-22 00:29:13 +00:00
25ea32567e [caffe2][1/n] migrate global Static Initializer (#126688)
Summary:
Caffe2 lib has 200+ usages of global static initializers, which are paper cuts for startup perf. Details in this post https://fb.workplace.com/groups/arglassesperf/permalink/623909116287154.
Kick off a stack to migrate all usages of global static initializers in caffe2.

Test Plan: TODO: Please advise how can i test this change?

Differential Revision: D57531083

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126688
Approved by: https://github.com/ezyang
2024-05-22 00:16:06 +00:00
10a5c1b26c [Dynamo][TVM] Fix tvm backend interface (#126529)
Fixes #126528

The repro in the above issue works fine with this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126529
Approved by: https://github.com/xmfan
2024-05-21 23:31:15 +00:00
1e818db547 [torchbench] Fix torchao benchmarking script (#126736)
As the title says.

Test Plan:

```
python benchmarks/dynamo/torchbench.py --only BERT_pytorch --bfloat16 --quantization int8dynamic --performance --inference --print-memory

cuda eval  BERT_pytorch
[XZ Debug] Torch grad status: False
memory: eager: 0.82 GB, dynamo: 0.92 GB, ratio: 0.89
running benchmark: 100%
1.001x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126736
Approved by: https://github.com/jerryzh168, https://github.com/huydhn
2024-05-21 23:15:12 +00:00
9dba1aca0e [inductor] Relax type annotations for statically_known_* (#126655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126655
Approved by: https://github.com/Skylion007, https://github.com/shunting314
ghstack dependencies: #126631
2024-05-21 23:12:42 +00:00
c08afbb3da [inductor] Add kernel_code logging artifact (#126631)
This is useful for some compile errors where we don't finish outputting the full graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126631
Approved by: https://github.com/shunting314
2024-05-21 23:12:42 +00:00
4e921593a4 [c10d]skip nan tests for lower versions of CUDA (#126701)
Summary:
We found that the UNIT tests would hang only in one test,
linux-focal-cuda11.8-py3.9-gcc9 / test (multigpu, 1, 1,
linux.g5.12xlarge.nvidia.gpu),
in which DSA would still be raised, but somehow the process would cause
errors like:
P1369649418

Test Plan:
Run CI tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126701
Approved by: https://github.com/wconstab
ghstack dependencies: #126409
2024-05-21 22:25:29 +00:00
f6ffe32a9d [AOTInductor] Automatic detection for buffer mutation and binary linking (#126706)
Summary: Instead of an explicit config for users to determine buffer mutation, we automatically detect whether there's buffer mutation in the model and determine which section the constants will be placed in. If the constants are too large and don't fit within the section, we error out directly.

Test Plan: Existing tests for buffer mutation and large weight linking

Differential Revision: D57579800

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126706
Approved by: https://github.com/desertfire
2024-05-21 21:49:13 +00:00
fed536dbcf [DTensor][Optim] Add support for fused_adam and fused_adamw when lr is a tensor (#126750)
Fixes #126670

In this PR, we update the following:
1. lr is a kwarg. Add support to automatically turn on implicit replication for kwargs. We only did this for args previously.
2. add associated tensor_lr ops in pointwises.py
3. add associated unit test in test_optimizers.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126750
Approved by: https://github.com/wanchaol, https://github.com/msaroufim
2024-05-21 21:38:05 +00:00
7ee74d986a Enable UFMT format on test/typing files (#126038)
Fixes some files in #123062

Run lintrunner on files:
test/typing/**/*

```
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126038
Approved by: https://github.com/shink, https://github.com/ezyang
2024-05-21 21:37:07 +00:00
1cc9354cb0 Unify the dtype to VecMask<float, N> in ops.masked (#126662)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/126449. For `ops.masked` in CPP backend, when input dtype is `bool`, we actually load it as `VecMask<float, N>`. So, we should unify the type of `other` and `mask` to `VecMask<float, N>` in order to invoke the `blendv` method.

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_ops_masked_with_bool_input
clear && PYTORCH_ALL_SAMPLES=1 python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive__chunk_cat_cpu_bool
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126662
Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10
2024-05-21 20:52:25 +00:00
fd7293db71 Bump rexml from 3.2.5 to 3.2.8 in /ios/TestApp (#126455)
Bumps [rexml](https://github.com/ruby/rexml) from 3.2.5 to 3.2.8.
- [Release notes](https://github.com/ruby/rexml/releases)
- [Changelog](https://github.com/ruby/rexml/blob/master/NEWS.md)
- [Commits](https://github.com/ruby/rexml/compare/v3.2.5...v3.2.8)

---
updated-dependencies:
- dependency-name: rexml
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-05-21 13:47:12 -07:00
fe0a36fd7c Fix a link in the compiler backend doc (#126079)
The core aten is the core subset of aten and seems to be the correct link to replace the broken one.

Fixes #125961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126079
Approved by: https://github.com/svekars
2024-05-21 20:16:04 +00:00
5325a6de64 [dtensor] remove output_ prefix from OpStrategy properties (#126359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126359
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2024-05-21 19:54:29 +00:00
c73c9457aa Add guard_size_oblivious to vector_norm (#126772)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126772
Approved by: https://github.com/lezcano, https://github.com/Skylion007
ghstack dependencies: #126771
2024-05-21 19:53:21 +00:00
97eef61474 Don't assume compare_arg is fx.Node (#126771)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126771
Approved by: https://github.com/Skylion007
2024-05-21 19:53:21 +00:00
fc594ed219 Remove lint from retryable_workflows (#126806)
Related to https://github.com/pytorch/test-infra/pull/4934

Lint workflow now uses Docker, so there should not be network-related errors for pip installing stuff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126806
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/huydhn
2024-05-21 19:47:23 +00:00
4e6673e244 Remove MAX_STACK_ENTRY from _build_table (#126583)
Summary:
As reported by this issue: https://github.com/pytorch/pytorch/issues/83584

We already store the entries in evt.stack so there is no need to cap the limit when we output the table to 5

Test Plan: Regression testing should cover this. We have unit tests to check the stack already.

Differential Revision: D57513565

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126583
Approved by: https://github.com/nmacchioni
2024-05-21 18:52:04 +00:00
0c76018714 [inductor] Don't inherit __future__ flags from the calling scope when compile -ing generated modules (#126454)
This file includes `from __future__ import annotations`, which interacts with `compile` by causing type annotations to be populated as strings. Triton does not parse the string annotations correctly. Avoid this behavior by passing `dont_inherit=True` to `compile`.
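
The interaction can be reproduced with the standard library alone. The sketch below is illustrative and is not the inductor code: `CALLER_SRC` stands in for a module that enables postponed annotations, and `kernel` and `build` are hypothetical names. It compiles the same generated source twice, once inheriting the caller's future flags and once with `dont_inherit=True`.

```python
# Source for a stand-in "caller" module that has postponed annotations
# enabled, like the file that invokes compile() on generated code.
CALLER_SRC = '''
from __future__ import annotations

def build(src, dont_inherit):
    namespace = {}
    # compile() is called from code that has the future flag set, so the
    # flag is inherited unless dont_inherit=True is passed.
    code = compile(src, "<generated>", "exec", dont_inherit=dont_inherit)
    exec(code, namespace)
    return namespace["kernel"].__annotations__
'''

# Hypothetical generated module; `kernel` is an illustrative name.
GENERATED_SRC = "def kernel(x: int):\n    return x\n"

caller = {}
exec(compile(CALLER_SRC, "<caller>", "exec", dont_inherit=True), caller)

# Annotation stored as the string 'int' when the flag is inherited.
inherited = caller["build"](GENERATED_SRC, dont_inherit=False)
# Annotation is the real `int` type when the flag is not inherited.
isolated = caller["build"](GENERATED_SRC, dont_inherit=True)
```

Any consumer that inspects `__annotations__` and expects real types, as Triton does according to the description above, only sees them in the `isolated` case.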

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126454
Approved by: https://github.com/peterbell10
2024-05-21 18:51:13 +00:00
cyy
7428fd19fe Remove outdated options from setup.py (#125988)
Since the recent removal of Caffe2 files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125988
Approved by: https://github.com/ezyang
2024-05-21 18:48:23 +00:00
b40fb2de59 [AOTI] Fix a codegen issue when .item() is used for kernel arg (#126575)
Summary: fixes https://github.com/pytorch/pytorch/issues/126574 . Pass kernel argument type information into generate_args_decl, so it can generate the argument declaration instead of relying on string matching.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126575
Approved by: https://github.com/chenyang78
ghstack dependencies: #126369
2024-05-21 18:20:20 +00:00
5e2de16a6f [AOTI] Codegen None as empty tensor (#126369)
Summary: When None denotes an optional tensor, we codegen NULL to represent it; but when None is for actual tensor type, we need to codegen an empty tensor for it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126369
Approved by: https://github.com/chenyang78
2024-05-21 18:20:20 +00:00
ac51920656 Reapply "c10d: add Collectives abstraction (#125978)" (#126695)
This reverts commit d9c3485146913324ab4b3e211d2a4517e138f4af.

Reapplies #125978.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126695
Approved by: https://github.com/c-p-i-o
2024-05-21 18:00:09 +00:00
d8f5627a88 prune back configs (#126570)
We had a previous PR that added configs for an internal model. Running the below script on output from autotuning, we can prune back the added configs with negligible perf loss: P1365917790.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126570
Approved by: https://github.com/nmacchioni
2024-05-21 17:44:32 +00:00
85fd76f76d Add test coverage for fp16 matrix-vector specialized kernel (#126700)
Summary: This kernel is special-cased on ARM because it's important for LLMs, so let's have test coverage.

Test Plan: Ran locally and it passes. Intentionally broke fp16_gemv_trans and saw it fail, confirming it provides coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126700
Approved by: https://github.com/malfet
2024-05-21 17:23:16 +00:00
bae3b17fd9 Tweak a comment and fix spelling (#126681)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126681
Approved by: https://github.com/Skylion007
2024-05-21 17:19:06 +00:00
0756f9f5fd Remove debug breakpoint (#126756)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126756
Approved by: https://github.com/BowenBao, https://github.com/Skylion007
2024-05-21 17:04:50 +00:00
140ab89c02 typing scheduler.py [1/2]: Bug fix (#126610)
Found while getting scheduler.py to typecheck - split off to make reviewing easier.

1. is_template: I'm pretty sure this is a bug.  Based on the definition of `is_template` I'm pretty sure we want to return the node's `get_template_node()`, not the node itself.

2. can_free: It seems that this was intended to be a raise, not a return.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126610
Approved by: https://github.com/eellison
2024-05-21 16:59:37 +00:00
ac2c547838 [TD] Upload names of failures to s3 for pytest cache (#126315)
Some tests don't get run through pytest and pytest crashes when a test segfaults, so in both cases, the pytest cache won't have an entry (similar to https://github.com/pytorch/test-infra/pull/5205).

Instead, manually upload/download an extra file that lists the failing test files

Technically this would be more general than the pytest cache
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126315
Approved by: https://github.com/ZainRizvi
2024-05-21 16:29:31 +00:00
4a7b46be3d small changes to padding (#126716)
Add cost of writing padding 0s to benchmark, skip dimension that can be squeezed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126716
Approved by: https://github.com/shunting314
2024-05-21 16:09:32 +00:00
980f5ac049 Revert "[Quant][PT2E] enable qlinear post op fusion for dynamic quant & qat (#122667)"
This reverts commit 3642e51ea527e23ded10afc266f298b0cb5350c8.

Reverted https://github.com/pytorch/pytorch/pull/122667 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/122667#issuecomment-2122642317))
2024-05-21 13:45:07 +00:00
b36e01801b [3.12, inductor] re-enable AsyncCompile.warm_pool for 3.12 (#126724)
Somehow working now? Fixes https://github.com/pytorch/pytorch/issues/124192 and https://github.com/pytorch/pytorch/issues/125979.

Still getting the warning
```
/home/williamwen/local/installs/python3.12/debug/install/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=2360707) is multi-threaded, use of fork() may lead to deadlocks in the child.
  self.pid = os.fork()
```
though

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126724
Approved by: https://github.com/masnesral, https://github.com/jansel
2024-05-21 08:50:13 +00:00
cyy
faa72dca41 Remove QNNPACK submodule (#126657)
QNNPACK has been integrated into ATen for a long time, and removing it from third party causes no build issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126657
Approved by: https://github.com/ezyang
2024-05-21 07:25:24 +00:00
7d34cfd28a Update torch-xpu-ops pin (ATen XPU implementation) (#126744)
Regular bi-weekly pin update. 85 new ATen operators are implemented in the XPU backend.
https://github.com/intel/torch-xpu-ops/blob/release/2.4/yaml/xpu_functions.yaml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126744
Approved by: https://github.com/EikanWang
2024-05-21 07:21:52 +00:00
4b23c4fc5d [Pipelining] Clean up function names in 1f1b schedule (#126582)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126582
Approved by: https://github.com/kwen2501
ghstack dependencies: #126539
2024-05-21 06:50:02 +00:00
8c9d332953 [c10d] fix excepthook crash on exc after destroy_process_group (#126739)
fixes #126379

This is the easy fix.  An additional fix that I did not do is to
deregister the excepthook (or rather, restore the original one) when
calling dist.destroy_process_group.  This might be a bit complicated in
practice, so landing as is for now.

Also, couldn't figure out a clean way to test this.  assertRaisesRegex
wasn't getting a string value, probably because of the stderr
redirection done via the excepthook in the first place.

Output from the original repro is cleaned up with the fix:

```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/users/whc/pytorch/except.py", line 6, in <module>
[rank0]:     raise ZeroDivisionError
[rank0]: ZeroDivisionError
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126739
Approved by: https://github.com/yf225
2024-05-21 06:39:18 +00:00
e363a8a222 Revert "[pipelining] Add pipeline stage test (#126721)"
This reverts commit b948b1ad7a9cf61c9692506c60c295fd40e00f43.

Reverted https://github.com/pytorch/pytorch/pull/126721 on behalf of https://github.com/clee2000 due to The test_public_bindings failure is real, you just got unlucky since it was also broken on trunk for a different reason ([comment](https://github.com/pytorch/pytorch/pull/126721#issuecomment-2121725408))
2024-05-21 04:40:05 +00:00
dc2560f073 [Pipelining] Add debug logs for batch p2p ops (#126539)
logs from torchtitan:

<img width="2878" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/4039c85f-0bf1-4924-92fa-2c55e8e4da2a">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126539
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
2024-05-21 03:54:46 +00:00
b96d9090d2 [C10D] make get_node_local_rank() accept fallback_rank (#126737)
Addresses follow up comments on #123992 and allows the use case of
writing code that checks `get_node_local_rank(fallback_rank=0)` and
runs correctly whether in the presence of a launcher (e.g. torchrun),
or run locally on a single device.
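
A minimal sketch of the fallback behavior described here, written as a standalone helper rather than the real `torch.distributed` function. It assumes the launcher exports `LOCAL_RANK` (as torchrun does); the helper name `node_local_rank` and the `env` parameter are hypothetical and exist only to make the example testable.

```python
import os
from typing import Mapping, Optional


def node_local_rank(
    fallback_rank: Optional[int] = None,
    env: Mapping[str, str] = os.environ,
) -> int:
    """Illustrative stand-in, not torch.distributed's implementation."""
    # Assumption: a launcher such as torchrun sets LOCAL_RANK per process.
    if "LOCAL_RANK" in env:
        return int(env["LOCAL_RANK"])
    # No launcher: use the caller-provided fallback if there is one.
    if fallback_rank is not None:
        return int(fallback_rank)
    raise RuntimeError("LOCAL_RANK is not set and no fallback_rank was given")


# Launched by a launcher: the environment value wins over the fallback.
under_launcher = node_local_rank(fallback_rank=0, env={"LOCAL_RANK": "3"})
# Run locally without a launcher: the fallback is used.
standalone = node_local_rank(fallback_rank=0, env={})
```

With a fallback of 0, the same script can pick a device index whether or not it was started by a launcher.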

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126737
Approved by: https://github.com/shuqiangzhang
2024-05-21 03:38:02 +00:00
c1b90a4e8a [Dynamo] Treat integers stored on nn.Modules as dynamic (#126466)
Fixes #115711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126466
Approved by: https://github.com/jansel
2024-05-21 03:31:20 +00:00
a83e745356 [BE] split seq_id to collective_seq_id and p2p_seq_id (#125727)
Summary:
Split out `seq_id` into `collective_seq_id` and `p2p_seq_id`. The main idea here is that collectives that go to all machines should have identical `collective_seq_id` values, which makes it easier to spot if one of the machines isn't handling a collective operation.
Next, we can attempt to match up p2p operations to ensure that the sender(s)/receiver(s) are in sync.
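
As a rough illustration of why a dedicated `collective_seq_id` helps, the sketch below compares the last id recorded by each rank and reports the ranks that trail behind. The snapshot data and the helper name `lagging_ranks` are hypothetical; this is not the flight recorder's actual analysis code.

```python
from typing import Dict, List


def lagging_ranks(collective_seq_ids: Dict[int, int]) -> List[int]:
    """Return ranks whose collective_seq_id trails the furthest-ahead rank."""
    if not collective_seq_ids:
        return []
    latest = max(collective_seq_ids.values())
    return sorted(
        rank for rank, seq in collective_seq_ids.items() if seq < latest
    )


# Hypothetical snapshot: rank -> last collective_seq_id it recorded.
snapshot = {0: 42, 1: 42, 2: 41, 3: 42}
stuck = lagging_ranks(snapshot)
```

Because p2p operations no longer advance the same counter, ranks that only differ in p2p traffic would not show up as lagging in a comparison like this.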

Resolves issue: https://github.com/pytorch/pytorch/issues/125173

Test Plan:
Unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125727
Approved by: https://github.com/zdevito
2024-05-21 03:26:49 +00:00
eqy
5f64086d08 [NT][SDPA] Bump tolerances for test_sdpa_with_packed_in_proj_cuda_bfloat16 (#126356)
Current tolerances fail on RTX 6000 (Ada) with `Mismatched elements: 2 / 144 (1.4%)`

```
AssertionError: Tensor-likes are not close!

Mismatched elements: 2 / 144 (1.4%)
Greatest absolute difference: 0.002197265625 at index (5, 0, 0) (up to 0.001 allowed)
Greatest relative difference: 0.08203125 at index (3, 0, 0) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
     python test/test_nestedtensor.py -k test_sdpa_with_packed_in_proj_cuda_bfloat16

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126356
Approved by: https://github.com/drisspg
2024-05-21 03:25:30 +00:00
40cc616909 Fix caching allocator of out-of-tree device is destructed before the … (#126677)
…destruction of tensors cached by autocast

## Root Cause
For an out-of-tree device extension, the library is loaded after torch (it lives in a different .so), so the global variable `cached_casts` may be constructed before the caching allocator and then, at exit, destructed after it, because destruction runs in reverse order.

## Fix
Lazily initialize `cached_casts` to correct the order.

## How to Reproduce && Test
Modify the test case `TestAutocastGPU.test_cast_cache_is_global` in test/test_autocast.py to run on your out-of-tree device. You will see the following failure at the end of the test.
```bash
----------------------------------------------------------------------
Ran 1 test in 4.812s

OK
free: 0x30080ff44000400
terminate called after throwing an instance of 'c10::Error'
  what():  invalid device pointer: 0x30080ff44000400
Exception raised from free at /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/framework/core/caching_allocator.cpp:1609 (most recent call first):
frame #0: <unknown function> + 0x118fe1 (0x7ffaef4d3fe1 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #1: <unknown function> + 0x11b1c4 (0x7ffaef4d61c4 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #2: <unknown function> + 0x117677 (0x7ffaef4d2677 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #3: <unknown function> + 0x11a2bf (0x7ffaef4d52bf in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #4: <unknown function> + 0x11a186 (0x7ffaef4d5186 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #5: <unknown function> + 0x119fde (0x7ffaef4d4fde in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #6: <unknown function> + 0x119d2e (0x7ffaef4d4d2e in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #7: <unknown function> + 0x119be0 (0x7ffaef4d4be0 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #8: <unknown function> + 0x119977 (0x7ffaef4d4977 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #9: <unknown function> + 0x119313 (0x7ffaef4d4313 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #10: <unknown function> + 0x118b4c (0x7ffaef4d3b4c in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #11: c10::Error::Error(c10::SourceLocation, std::string) + 0x34 (0x7ffaef4d27c4 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #12: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x7f (0x7ffaef4d04ed in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #13: torch_mlu::MLUCachingAllocator::Native::NativeCachingAllocator::free(void*) + 0xe6 (0x7ff9a8eeb112 in /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/lib/libtorch_mlu.so)
frame #14: torch_mlu::MLUCachingAllocator::Native::local_raw_delete(void*) + 0x3b (0x7ff9a8ed9480 in /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/lib/libtorch_mlu.so)
frame #15: std::unique_ptr<void, void (*)(void*)>::~unique_ptr() + 0x50 (0x7ffb0a5ea322 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x1269890 (0x7ffb0a5e4890 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #17: <unknown function> + 0x1269928 (0x7ffb0a5e4928 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #18: <unknown function> + 0x127572c (0x7ffb0a5f072c in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0x1275758 (0x7ffb0a5f0758 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #20: <unknown function> + 0xb9bc7 (0x7ffaef474bc7 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #21: <unknown function> + 0xb97bc (0x7ffaef4747bc in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #22: <unknown function> + 0xdbc50 (0x7ffaef496c50 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #23: c10::TensorImpl::~TensorImpl() + 0x82 (0x7ffaef49157e in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #24: c10::TensorImpl::~TensorImpl() + 0x1c (0x7ffaef4915aa in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #25: <unknown function> + 0x2f596d9 (0x7ffaf24fc6d9 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #26: <unknown function> + 0x2f589c2 (0x7ffaf24fb9c2 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #27: <unknown function> + 0x2f57b92 (0x7ffaf24fab92 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #28: <unknown function> + 0x2f5c228 (0x7ffaf24ff228 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #29: <unknown function> + 0x30f3f70 (0x7ffaf2696f70 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #30: <unknown function> + 0x30f3f90 (0x7ffaf2696f90 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #31: <unknown function> + 0x30f5004 (0x7ffaf2698004 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #32: <unknown function> + 0x30f5024 (0x7ffaf2698024 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #33: <unknown function> + 0x31207f0 (0x7ffaf26c37f0 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #34: <unknown function> + 0x3120814 (0x7ffaf26c3814 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #35: <unknown function> + 0x30f51e8 (0x7ffaf26981e8 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #36: <unknown function> + 0x30f5148 (0x7ffaf2698148 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #37: <unknown function> + 0x316ecea (0x7ffaf2711cea in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #38: <unknown function> + 0x468a7 (0x7ffb0c9ed8a7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #39: on_exit + 0 (0x7ffb0c9eda60 in /lib/x86_64-linux-gnu/libc.so.6)
<omitting python frames>
frame #47: __libc_start_main + 0xf3 (0x7ffb0c9cb083 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126677
Approved by: https://github.com/ezyang
2024-05-21 03:20:17 +00:00
51c07f9f69 [dynamo] Allow asserts to fail (#126661)
Currently if an assertion is statically known to be false, dynamo converts it to
`_assert_async` which inductor currently ignores. Instead this graph breaks to
raise the original assertion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126661
Approved by: https://github.com/ezyang
2024-05-21 02:42:13 +00:00
d777685ef9 Script for choosing template configurations (#126560)
This adds logging that marks every invocation of a matmul for a particular set of input shapes and records the performance of every template config on it. We can then feed that into a script that minimizes the total mm execution time given N allowed templates. In the future, this could support other experiments.
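
The selection problem can be sketched as follows. This is a brute-force illustration under stated assumptions, not the script added by the PR: the timing table is hypothetical, every shape is assumed to have a measurement for every config, and the real script may use a different search strategy.

```python
from itertools import combinations
from typing import Dict, Tuple

# Hypothetical timings (ms): input shape -> {config name -> runtime}.
Timings = Dict[str, Dict[str, float]]


def best_config_subset(timings: Timings, n: int) -> Tuple[float, Tuple[str, ...]]:
    """Pick the n configs that minimize total time across all shapes.

    Each shape is assumed to run with the fastest config in the subset.
    """
    configs = sorted({c for per_shape in timings.values() for c in per_shape})

    def total(subset: Tuple[str, ...]) -> float:
        return sum(
            min(per_shape[c] for c in subset) for per_shape in timings.values()
        )

    # Exhaustive search; fine for small config sets, exponential in general.
    return min((total(subset), subset) for subset in combinations(configs, n))


timings = {
    "64x64x64": {"A": 1.0, "B": 2.0, "C": 3.0},
    "512x512x512": {"A": 9.0, "B": 4.0, "C": 4.5},
    "1024x64x1024": {"A": 8.0, "B": 7.0, "C": 2.0},
}
best_total, best_subset = best_config_subset(timings, n=2)
```

Here keeping two of the three configs costs 7.5 ms in total, compared with 7.0 ms if all three were allowed, which is the kind of trade-off such a script would surface.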

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126560
Approved by: https://github.com/nmacchioni, https://github.com/jansel
2024-05-21 02:28:39 +00:00
d30cdc4321 [ROCm] amdsmi library integration (#119182)
Adds monitoring support for ROCm using amdsmi in place of pynvml.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell
2024-05-21 01:59:26 +00:00
b948b1ad7a [pipelining] Add pipeline stage test (#126721)
Test tracer's and manual's stage creation by using a basic schedule (GPipe).

(Migrated from https://github.com/pytorch/PiPPy/blob/main/test/test_pipeline_stage.py)

Test command:
```
$ python test_stage.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126721
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2024-05-21 01:22:10 +00:00
31ba6ee49b Traceable wrapper subclass support for deferred runtime asserts (#126198)
The padded dense -> jagged conversion op has the signature:
```
_fbgemm_dense_to_jagged_forward(Tensor dense, Tensor[] offsets, SymInt? total_L=None) -> Tensor
```

when `total_L` is not specified, the meta registration has a data-dependent output shape (based on `offsets[0][-1]`). Returning an unbacked SymInt here should work in theory, but traceable wrapper subclass support is missing in later code to handle deferred runtime asserts. This PR fixes this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126198
Approved by: https://github.com/ezyang
2024-05-21 01:21:46 +00:00
82b4528788 [cudagraph] fix verbose graph logging (#126694)
According to the [doc](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g0907ca7a1e7d0211b71ee49c5403072b):

> enum cudaGraphDebugDotFlags
> CUDA Graph debug write options
>
> Values
> cudaGraphDebugDotFlagsVerbose = 1<<0
> Output all debug data as if every debug flag is enabled
> cudaGraphDebugDotFlagsKernelNodeParams = 1<<2
> Adds cudaKernelNodeParams to output
> cudaGraphDebugDotFlagsMemcpyNodeParams = 1<<3
> Adds cudaMemcpy3DParms to output
> cudaGraphDebugDotFlagsMemsetNodeParams = 1<<4
> Adds cudaMemsetParams to output
> cudaGraphDebugDotFlagsHostNodeParams = 1<<5
> Adds cudaHostNodeParams to output
> cudaGraphDebugDotFlagsEventNodeParams = 1<<6
> Adds cudaEvent_t handle from record and wait nodes to output
> cudaGraphDebugDotFlagsExtSemasSignalNodeParams = 1<<7
> Adds cudaExternalSemaphoreSignalNodeParams values to output
> cudaGraphDebugDotFlagsExtSemasWaitNodeParams = 1<<8
> Adds cudaExternalSemaphoreWaitNodeParams to output
> cudaGraphDebugDotFlagsKernelNodeAttributes = 1<<9
> Adds cudaKernelNodeAttrID values to output
> cudaGraphDebugDotFlagsHandles = 1<<10
> Adds node handles and every kernel function handle to output
> cudaGraphDebugDotFlagsConditionalNodeParams = 1<<15
> Adds cudaConditionalNodeParams to output
>

`1 << 10` is not the most verbose flag. It is just one flag that adds node handles and every kernel function handle to the output. `1 << 0` is the most verbose flag, under the name `cudaGraphDebugDotFlagsVerbose`.
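
For quick reference, the bit values from the quoted documentation can be mirrored in a small Python enum. This is only a lookup aid built from the values listed above; the class name `GraphDebugDotFlags` is hypothetical and is not a PyTorch or CUDA binding.

```python
from enum import IntFlag


class GraphDebugDotFlags(IntFlag):
    """Subset of the cudaGraphDebugDotFlags values quoted above."""

    VERBOSE = 1 << 0              # all debug data, as if every flag were set
    KERNEL_NODE_PARAMS = 1 << 2   # adds cudaKernelNodeParams
    MEMCPY_NODE_PARAMS = 1 << 3   # adds cudaMemcpy3DParms
    HANDLES = 1 << 10             # adds node and kernel function handles


# The value that was previously treated as "verbose".
old_flag = GraphDebugDotFlags(1 << 10)
# The value the documentation actually names Verbose.
new_flag = GraphDebugDotFlags(1 << 0)
```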

Here is an example graph, dumped with `1 << 10`:

```dot
digraph dot {
subgraph cluster_1 {
label="graph_1" graph[style="dashed"];
"graph_1_node_0"[style="solid" shape="rectangle" label="0
MEM_ALLOC
node handle: 0x000055D2889750F0
"];

"graph_1_node_1"[style="bold" shape="octagon" label="1
_Z3addPhS_S_m
node handle: 0x000055D288979A20
func handle: 0x000055D288978D40
"];

"graph_1_node_2"[style="solid" shape="trapezium"label="2
MEMCPY
node handle: 0x000055D28897A130
(DtoH,1024)
"];

"graph_1_node_3"[style="solid" shape="rectangle" label="3
MEM_FREE
node handle: 0x000055D2889890C0
"];

"graph_1_node_0" -> "graph_1_node_1";
"graph_1_node_1" -> "graph_1_node_2";
"graph_1_node_2" -> "graph_1_node_3";
}
}
```

The same graph dumped with `1 << 0`:

```dot
digraph dot {
subgraph cluster_1 {
label="graph_1" graph[style="dashed"];
"graph_1_node_0"[style="solid" shape="record" label="{
MEM_ALLOC
| {{ID | node handle} | {0 (topoId: 3) | 0x000055D2889750F0}}
| {{{poolProps | {allocType | handleTypes | {location | {type | id}}} | {PINNED | NONE | DEVICE | 0}}}}
| {{bytesize | dptr} | {1024 | 0x0000000A02000000}}
}"];

"graph_1_node_1"[style="bold" shape="record" label="{KERNEL
| {ID | 1 (topoId: 2) | _Z3addPhS_S_m\<\<\<4,256,0\>\>\>}
| {{node handle | func handle} | {0x000055D288979A20 | 0x000055D288978D40}}
| {accessPolicyWindow | {base_ptr | num_bytes | hitRatio | hitProp | missProp} | {0x0000000000000000 | 0 | 0.000000 | N | N}}
| {cooperative | 0}
| {priority | 0}
}"];

"graph_1_node_2"[style="solid" shape="record" label="{
MEMCPY
| {{ID | node handle} | {2 (topoId: 1) | 0x000055D28897A130}}
| {kind | DtoH (DEVICE to HOST PAGEABLE)}
| {{srcPtr | dstPtr} | {pitch | ptr | xsize | ysize | pitch | ptr | xsize | ysize} | {0 | 0x0000000A02000000 | 0 | 0 | 0 | 0x000055D287CA6DB0 | 0 | 0}}
| {{srcPos | {{x | 0} | {y | 0} | {z | 0}}} | {dstPos | {{x | 0} | {y | 0} | {z | 0}}} | {Extent | {{Width | 1024} | {Height | 1} | {Depth | 1}}}}
}"];

"graph_1_node_3"[style="solid" shape="record" label="{
MEM_FREE
| {{ID | node handle} | {3 (topoId: 0) | 0x000055D2889890C0}}
| {{dptr} | {0x0000000A02000000}}
}"];

"graph_1_node_0" -> "graph_1_node_1" [headlabel=0];
"graph_1_node_1" -> "graph_1_node_2" [headlabel=0];
"graph_1_node_2" -> "graph_1_node_3" [headlabel=0];
}
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126694
Approved by: https://github.com/eqy, https://github.com/eellison
2024-05-21 00:55:15 +00:00
4644611b14 [cprofile] log manifold link instead of raw data to trace_structured (#126451)
Internal D57459752 returns manifold URL and this PR adds to tlparse payload

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126451
Approved by: https://github.com/jamesjwu
2024-05-21 00:44:55 +00:00
b85f9d7fa2 Add symbolic_shape_specialization structured trace (#126450)
This is typically the information you want when diagnosing why something
overspecialized in dynamic shapes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126450
Approved by: https://github.com/albanD
2024-05-21 00:34:05 +00:00
cd3a71f754 Fix silu test for flexattention (#126641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126641
Approved by: https://github.com/ezyang, https://github.com/drisspg
ghstack dependencies: #126615, #126446
2024-05-20 23:40:56 +00:00
da2292ce6b Prevent partitioner from ever saving views (#126446)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126446
Approved by: https://github.com/anijain2305
ghstack dependencies: #126615
2024-05-20 23:40:56 +00:00
831efeeadf Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126615
Approved by: https://github.com/yanboliang, https://github.com/drisspg, https://github.com/xmfan
2024-05-20 23:40:56 +00:00
14dc8d4f63 Protect codecache against cache failures (#126696)
When there are manifold, memcache, or filesystem-related issues or network outages, we should not completely fail to compile but instead fall back to a cold start.
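
The general pattern looks roughly like the sketch below: any failure in the cache backend is treated as a miss, and compilation proceeds from scratch. All names here (`load_or_compile`, `lookup`, `compile_fn`, `broken_lookup`) are hypothetical and do not correspond to the codecache API.

```python
from typing import Callable, Optional


def load_or_compile(
    lookup: Callable[[str], Optional[str]],
    compile_fn: Callable[[str], str],
    key: str,
) -> str:
    """Return a cached artifact if available, else compile from scratch."""
    try:
        cached = lookup(key)
    except Exception:
        # Cache backend failure (network, filesystem, ...): treat as a miss
        # instead of failing the whole compilation.
        cached = None
    if cached is not None:
        return cached
    return compile_fn(key)


def broken_lookup(key: str) -> Optional[str]:
    # Simulates an unreachable remote cache.
    raise ConnectionError("cache backend unreachable")


result = load_or_compile(broken_lookup, lambda key: f"compiled:{key}", "graph0")
```

The broad `except Exception` is deliberate in this sketch, since the point is that no cache-side error should surface as a compile failure.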

Differential Revision: [D57573835](https://our.internmc.facebook.com/intern/diff/D57573835/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126696
Approved by: https://github.com/aorenste
2024-05-20 22:22:41 +00:00
6f1935b0b5 doc: torch.utils.data.Sampler: __len__ is optional (#125938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125938
Approved by: https://github.com/andrewkho, https://github.com/xmfan
2024-05-20 22:20:36 +00:00
74b053d7c4 Pass model path to observer (#126503)
Summary: Passing model path to observer so that they can get additional info if needed.

Test Plan: contbuild & OSS CI

Differential Revision: D57475129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126503
Approved by: https://github.com/kirklandsign
2024-05-20 22:17:56 +00:00
acfe237a71 Fix C++ compilation error for tensor array in abi_compatible mode (#126412)
Fixes #122048

There is a compilation error (https://github.com/pytorch/pytorch/issues/122048) when the element type in an array is tensor. This is because `val_to_arg_str` does not take the arg type as input, and always generates an int array.

This PR changes the underlying `codegen_int_array_var` to `codegen_var_array` by adding type checks and the corresponding code generation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126412
Approved by: https://github.com/desertfire
2024-05-20 20:57:50 +00:00
3d4f1c3083 [export] Make error name private (#126715)
Fixes CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126715
Approved by: https://github.com/clee2000
2024-05-20 20:50:11 +00:00
d28868c7e8 Change skipIfs to xfails in test_mps.py for test_isin (#125412)
Follow-up to #124896 to move the added test to use expectedFailure instead of skip.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125412
Approved by: https://github.com/kulinseth
2024-05-20 20:23:53 +00:00
8bca0847c2 Revert "[TD] Upload names of failures to s3 for pytest cache (#126315)"
This reverts commit 655038687afd19a4a4c9371b77ff046fd6c84be1.

Reverted https://github.com/pytorch/pytorch/pull/126315 on behalf of https://github.com/clee2000 due to broke inductor ([comment](https://github.com/pytorch/pytorch/pull/126315#issuecomment-2121133045))
2024-05-20 20:15:08 +00:00
2813f0672a fix huggingface models input issue in torchbench (#126579)
Fixes https://github.com/pytorch/benchmark/issues/2263.

According to https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/common.py#L509, example_inputs are formatted as dictionaries for HuggingFace models. However, this forward_pass function passes all inputs to mod with *, which may only pass the input_ids key in HuggingFace model's example inputs.

To reproduce, run the following command.
```bash
python pytorch/benchmarks/dynamo/torchbench.py --performance --inference -dcuda --only=hf_Bert --output=torchbench_inference.csv
```
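
The difference between `*` and `**` on a dict input, which is the root of the issue described above, can be shown with a toy function. This is only an illustration: `forward` here is a hypothetical stand-in, not the benchmark's `forward_pass` or a real HuggingFace model.

```python
def forward(input_ids=None, attention_mask=None):
    """Toy stand-in for a model forward that takes keyword inputs."""
    return {"input_ids": input_ids, "attention_mask": attention_mask}


example_inputs = {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]}

# `*` iterates the dict, so the *keys* (strings) are passed positionally.
positional = forward(*example_inputs)

# `**` passes each key/value pair as a keyword argument.
keyword = forward(**example_inputs)
```

With `*`, the function receives the key strings rather than the values, so dict-style inputs need `**` (or an explicit branch on the input type).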

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126579
Approved by: https://github.com/xuzhao9
2024-05-20 19:10:46 +00:00
11c2d127ec [AOTInductor] Add config to allow buffer mutation (#126584)
Summary:
Add an additional config to allow buffer mutation.
For data that's greater than 2GB, we would need to set it as read-only, otherwise overflow would occur.
This is a temporary solution since it won't handle cases that require mutable data greater than 2GB.

Test Plan: Included in commit.

Differential Revision: D57514729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126584
Approved by: https://github.com/chenyang78
2024-05-20 18:16:00 +00:00
2068dadbe8 [torchbench] Add torchao to PT2 Benchmark Runner (#126469)
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2268

Support torchao performance and accuracy tests in PT2 Benchmark Runner, using the inductor backend as the baseline.

Test Plan:
```
$ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench -- --only BERT_pytorch --bfloat16 --quantization int8dynamic --performance --inference --print-memory

loading model: 0it [00:50, ?it/s]
cuda eval  BERT_pytorch
memory: eager: 0.75 GB, dynamo: 0.75 GB, ratio: 1.00
running benchmark: 100%
1.003x
```

Reviewed By: jerryzh168

Differential Revision: D57463273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126469
Approved by: https://github.com/huydhn
2024-05-20 17:53:44 +00:00
022adf8c5e Fix bug for comptime.get_local for cells/closures (#126637)
I wasn't paying enough attention and didn't notice that LOAD_DEREF is
defined differently for InliningInstructionTranslator.  Match it up with
the code there.

This also fixes comptime.print(), which was broken, because closing over
an argument turned it into a cell rather than a regular local.
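
The cell-versus-local distinction can be observed directly in CPython code objects. The sketch below uses only the standard `__code__` attributes and is independent of Dynamo; the function names are made up for illustration.

```python
def plain(x):
    return x


def closes_over_arg(x):
    def inner():
        return x  # `x` is captured, so the compiler makes it a cell

    return inner


# An argument that is not captured stays a regular local.
plain_cells = plain.__code__.co_cellvars

# Once an inner function closes over it, the argument becomes a cell variable...
closure_cells = closes_over_arg.__code__.co_cellvars

# ...and the inner function sees it as a free variable.
inner_free = closes_over_arg(1).__code__.co_freevars
```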

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126637
Approved by: https://github.com/yanboliang
2024-05-20 17:51:28 +00:00
f9de510121 [dynamo] Graph break on set_num_threads (#126623)
Fixes #125364

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126623
Approved by: https://github.com/yanboliang
2024-05-20 17:44:32 +00:00
89c1cfe144 [export] Allow modules to be created in the forward (#125725)
Fixes the error in non-strict export when we're tracing a module that initializes another module in its forward function. This appears in [many huggingface models](https://github.com/search?q=repo%3Ahuggingface%2Ftransformers+CrossEntropyLoss%28%29&type=code&fbclid=IwAR285uKvSevJM6SDbXmb4-monj4iH7wf8opkvnec-li7sKpn4lUMjIvbGKc). It's probably not good practice to do this, but since it appears in so many places, and strict-export supports this, we will also support this.

The approach we'll take for these cases is that we will inline the call to the module. Parameters and buffers initialized as constants (with `torch.tensor`) will be represented as constant tensors, and those initialized with tensor factory functions (`torch.ones`) will show up as an operator in the graph. The module stack for the ops in the inlined module will reflect the toplevel's module stack.

One issue is that strict-export seems to segfault when there is an `nn.Parameter` call in the constructor (https://github.com/pytorch/pytorch/issues/126109). Non-strict export will succeed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125725
Approved by: https://github.com/ydwu4
2024-05-20 17:42:20 +00:00
655038687a [TD] Upload names of failures to s3 for pytest cache (#126315)
Some tests don't get run through pytest and pytest crashes when a test segfaults, so in both cases, the pytest cache won't have an entry (similar to https://github.com/pytorch/test-infra/pull/5205).

Instead, manually upload/download an extra file that lists the failing test files

Technically this would be more general than the pytest cache
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126315
Approved by: https://github.com/ZainRizvi
2024-05-20 17:36:30 +00:00
8c38d0cd64 [inductor] Fix edge case in JIT vs. AOT fusion after finalizing MultiTemplateBuffer (#126622)
# Context
Here's a peripheral scenario causing the JIT-pass and AOT-pass to pick different fusions.
```py
# JIT -- buf3 is a MultiTemplateBuffer
V.graph.buffers = [buf0, buf1, buf2, buf3, buf4]
                                ^          ^
# JIT pass calls finalize_multi_template_buffers()
V.graph.buffers = [buf0, buf1, buf2, buf4, *buf3*]

# AOT, note proximity_score(buf2, buf4) is "better" for fusion than JIT
V.graph.buffers = [buf0, buf1, buf2, buf4, *buf3*]
                                ^    ^
```

It happens like this:
* JIT starts with the original set of nodes using V.graph.buffers
* In JIT, finalize_multi_template_buffers() is called which can change the order of the buffers.
* This makes the order of buffers/scheduler nodes different.
* Now, each node's min/max-order is different than before.
* As a result, the proximity between two nodes is different. ad67553c5c/torch/_inductor/scheduler.py (L2316-L2335)
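
The effect of the reordering listed above can be sketched with a deliberately simplified proximity measure. Plain index distance is an assumption made for illustration; it is not the scheduler's actual scoring function, and `index_distance` and `move_to_end` are hypothetical helpers.

```python
def index_distance(order, a, b):
    """Simplified proximity stand-in: distance between two buffers in the order."""
    return abs(order.index(a) - order.index(b))


def move_to_end(order, name):
    """Mimic the reordering described above: the finalized buffer ends up last."""
    return [b for b in order if b != name] + [name]


jit_before = ["buf0", "buf1", "buf2", "buf3", "buf4"]
after_finalize = move_to_end(jit_before, "buf3")

# buf2 and buf4 start two slots apart and end up adjacent after finalization.
dist_before = index_distance(jit_before, "buf2", "buf4")
dist_after = index_distance(after_finalize, "buf2", "buf4")
```

Any fusion heuristic that depends on node order can therefore make a different choice depending on whether it runs before or after the buffers are finalized.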

# Error
```
$ TORCH_LOGS="+fusion" python test/inductor/test_max_autotune.py -k test_jit_fusion_matches_aot_fusion
======================================================================
FAIL: test_jit_fusion_matches_aot_fusion (__main__.TestMaxAutotune)
----------------------------------------------------------------------
Traceback (most recent call last):
  ...
  File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1718, in compile_to_fn
    code, linemap = self.codegen_with_cpp_wrapper()
  File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1618, in codegen_with_cpp_wrapper
    return self.codegen()
  File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1636, in codegen
    self.scheduler.codegen()
  File "/data/users/colinpeppler/pytorch/torch/_dynamo/utils.py", line 210, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2602, in codegen
    self.get_backend(device).codegen_node(node)  # type: ignore[possibly-undefined]
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cuda_combined_scheduling.py", line 66, in codegen_node
    return self._triton_scheduling.codegen_node(node)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3377, in codegen_node
    return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3602, in codegen_node_schedule
    final_kernel.call_kernel(final_kernel.kernel_name)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3055, in call_kernel
    grid = wrapper.generate_default_grid(name, grid)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cpp_wrapper_cuda.py", line 174, in generate_default_grid
    params is not None
AssertionError: cuda kernel parameters for triton_poi_fused_add_0 should already exist at this moment, only found dict_keys(['Placeholder.DESCRIPTIVE_NAME', 'triton_poi_fused_add_mul_0', 'triton_poi_fused_pow_1'])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126622
Approved by: https://github.com/chenyang78
ghstack dependencies: #125982
2024-05-20 16:58:08 +00:00
7aa853a54e [CI] Install sccache on XLA build job (#126117)
The XLA build job uses a docker image from XLA, which doesn't have sccache installed.  The XLA build job only builds pytorch; XLA itself gets built during the test job.  The pytorch build was taking 1+ hrs; with a warm cache it takes <30 min.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126117
Approved by: https://github.com/malfet
2024-05-20 16:39:14 +00:00
3642e51ea5 [Quant][PT2E] enable qlinear post op fusion for dynamic quant & qat (#122667)
**Description**
Add fusion path for dynamic quant and for QAT.
The following patterns can be matched for static quant with QAT cases:
`qx -> qlinear -> add -> optional relu -> optional type convert -> optional quant`

The following patterns can be matched for dynamic quant cases:
`qx -> qlinear -> add -> optional relu`

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear
python test/inductor/test_cpu_cpp_wrapper.py -k test_qlinear
python test/test_quantization.py -k test_linear_unary
python test/test_quantization.py -k test_linear_binary

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122667
Approved by: https://github.com/jgong5
2024-05-20 15:55:18 +00:00
2f53747ec6 Speedup bf16 gemm fallback on ARM (#126592)
By dispatching it to multiple threads and using a vectorized dot operation (with bf16 to fp32 upcasts via left shift)
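
The left-shift upcast works because bfloat16 is the top 16 bits of an IEEE fp32 value. The scalar sketch below illustrates only that bit trick; it is not the vectorized ARM kernel, and the helper names are hypothetical.

```python
import struct


def float_to_bf16_bits(value):
    """Truncate an fp32 value to its top 16 bits (round-toward-zero bfloat16)."""
    (fp32_bits,) = struct.unpack("<I", struct.pack("<f", value))
    return fp32_bits >> 16


def bf16_bits_to_float(bf16_bits):
    """Upcast bfloat16 to fp32 by shifting its 16 bits into the high half."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bf16_bits & 0xFFFF) << 16))
    return value


def dot_bf16(a_bits, b_bits):
    """Dot product of two bf16 vectors, accumulating in Python floats."""
    return sum(bf16_bits_to_float(x) * bf16_bits_to_float(y)
               for x, y in zip(a_bits, b_bits))


a = [float_to_bf16_bits(v) for v in (1.0, 2.0, -0.5)]
b = [float_to_bf16_bits(v) for v in (4.0, 0.25, 8.0)]
result = dot_bf16(a, b)
```

The inputs are chosen to be exactly representable in bfloat16, so no rounding error enters the example.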

This bumps stories110M eval from 22 to 55 tokens/sec using bfloat16

TODO:
 - Refactor tinygemm template and use it here

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126592
Approved by: https://github.com/mikekgfb
2024-05-20 12:39:51 +00:00
cb69c51b6f Revert " Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127)"
This reverts commit cf35a591b95220aa1bfcc04ff8a943efd1d6d6eb.

Reverted https://github.com/pytorch/pytorch/pull/125127 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/125127#issuecomment-2120337584))
2024-05-20 12:14:22 +00:00
7100a72950 [inductor] Fix ops.scan for non-commutative operators (#126633)
`tl.associative_scan` supports non-commutative combine functions but `tl.reduce`
doesn't. This affects non-persistent scans, where we use the reduction from
the previous loop iterations as the base for future iterations.

Here I work around this by taking the last element of the scan output and using
that as the reduced value. This is done using a trick where we create a
mask that is 1 at the desired element and 0 elsewhere, then sum over it.
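
The workaround can be sketched in plain Python, assuming a blocked scan where each block is seeded with the carry from earlier blocks. This is an illustration of the idea rather than the generated Triton code; the affine-map `combine` is just one convenient associative, non-commutative operator.

```python
def combine(left, right):
    """Compose affine maps x -> a*x + b; associative but not commutative."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def inclusive_scan(values, carry=None):
    """Left-to-right inclusive scan, optionally seeded with a carry."""
    out = []
    acc = carry
    for v in values:
        acc = v if acc is None else combine(acc, v)
        out.append(acc)
    return out


def last_via_mask(scanned):
    """Pick the final scan element with a one-hot mask and a plain sum."""
    n = len(scanned)
    mask = [1 if i == n - 1 else 0 for i in range(n)]
    a = sum(m * s[0] for m, s in zip(mask, scanned))
    b = sum(m * s[1] for m, s in zip(mask, scanned))
    return (a, b)


def blocked_scan(values, block):
    """Non-persistent scan: process fixed-size blocks, carrying the running total."""
    out, carry = [], None
    for start in range(0, len(values), block):
        scanned = inclusive_scan(values[start:start + block], carry)
        out.extend(scanned)
        carry = last_via_mask(scanned)
    return out


data = [(2, 1), (3, 0), (1, 5), (4, 2), (2, 2)]
blocked = blocked_scan(data, block=2)
reference = inclusive_scan(data)
```

Because addition is commutative, the masked sum is safe to compute with an ordinary reduction even though `combine` itself is order-sensitive.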

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126633
Approved by: https://github.com/Chillee, https://github.com/lezcano
2024-05-20 10:27:17 +00:00
d9c3485146 Revert "c10d: add Collectives abstraction (#125978)"
This reverts commit 4b2ae2ac338f3a0de340c9711b03989b8ce66fc6.

Reverted https://github.com/pytorch/pytorch/pull/125978 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/125978#issuecomment-2119858015))
2024-05-20 07:40:41 +00:00
53f73cdeb6 Revert "Add symbolic_shape_specialization structured trace (#126450)"
This reverts commit da1fc85d60fcf0bd1e8638d643a7c0c6560c3a5f.

Reverted https://github.com/pytorch/pytorch/pull/126450 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126450#issuecomment-2119798075))
2024-05-20 06:59:58 +00:00
5ad2f10034 Revert "[inductor] Load python modules using importlib (#126454)"
This reverts commit faa26df72e2a3ff08f9dd564bb50756916826854.

Reverted https://github.com/pytorch/pytorch/pull/126454 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126454#issuecomment-2119771267))
2024-05-20 06:41:11 +00:00
cf35a591b9 Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127)
This PR is meant to address issue #123451, more specifically, the ```test_graph_optims``` and ```test_graph_scaling_fused_optimizers``` functions in ```test_cuda.py``` have been updated so that they now use the new OptimizerInfo infrastructure.

Lintrunner passed:
```
$ lintrunner test/test_cuda.py
ok No lint issues.
```
Tests passed:
```
>python test_cuda.py -k test_graph_optims
Ran 19 tests in 7.463s

OK (skipped=9)

>python test_cuda.py -k test_graph_scaling_fused_optimizers
Ran 6 tests in 2.800s

OK (skipped=3)
```
Both the functions have been moved to the newly created TestCase class ```TestCudaOptims```. The test is mostly the same except the ```@optims``` decorator is used at the top of the function to implicitly call the function using each of the optimizers mentioned in the decorator instead of explicitly using a for loop to iterate through each of the optimizers.

I was unable to use the ```_get_optim_inputs_including_global_cliquey_kwargs``` to get all kwargs for each of the optimizers since some of the kwargs that are used in the original ```test_graph_optims``` function are not being returned by the new OptimizerInfo infrastructure, more specifically, for the ```torch.optim.rmsprop.RMSprop``` optimizer, the following kwargs are not returned whenever ```_get_optim_inputs_including_global_cliquey_kwargs``` is called:
```
{'foreach': False, 'maximize': True, 'weight_decay': 0}
{ 'foreach': True, 'maximize': True, 'weight_decay': 0}
```
I ran into the same issue for ```test_graph_scaling_fused_optimizers```, for the ```torch.optim.adamw.AdamW``` optimizer, whenever ```optim_info.optim_inputs_func(device=device)``` was called, the following kwarg was not returned:
```
{'amsgrad': True}
```

Due to this issue, I resorted to using a dictionary to store the kwargs for each of the optimizers; I am aware that this is less than ideal. I was wondering whether I should use the OptimizerInfo infrastructure to get all the kwargs even though it lacks some of them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125127
Approved by: https://github.com/janeyx99
2024-05-20 06:20:45 +00:00
5fb11cda4f [compiled autograd] Better cache miss logging (#126602)
- log only first node key cache miss
- log existing node key sizes
- log which node's collected sizes became dynamic
e.g.
```
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]
...
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to new autograd node: torch::autograd::AccumulateGrad (NodeCall 5) with key size 32, previous key sizes=[21]
...
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 0 of torch::autograd::GraphRoot (NodeCall 0)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of SumBackward0 (NodeCall 1)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 4 of SumBackward0 (NodeCall 1)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of ReluBackward0 (NodeCall 2)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 9 of AddmmBackward0 (NodeCall 3)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of torch::autograd::AccumulateGrad (NodeCall 5)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of ReluBackward0 (NodeCall 6)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126602
Approved by: https://github.com/jansel
ghstack dependencies: #126144, #126146, #126148, #126483
2024-05-19 23:49:52 +00:00
be67985bd7 [compiled autograd] log in cpp using python logger (#126483)
Internal infra may not preserve python and c++ log ordering e.g. MAST logs: https://fburl.com/mlhub/38576cxn, all the `[python_compiled_autograd.cpp] Creating cache entry [...]` logs of the entire run are at the beginning of the file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126483
Approved by: https://github.com/jansel
ghstack dependencies: #126144, #126146, #126148
2024-05-19 23:49:52 +00:00
cyy
574ae9afb8 [Submodule] Remove third-party onnx-tensorrt (#126542)
It seems that tensorrt is not used by the C++ code, possibly due to the removal of Caffe2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126542
Approved by: https://github.com/ezyang
2024-05-19 22:34:24 +00:00
cyy
853081a8e7 Replace torch.library.impl_abstract with torch.library.register_fake (#126606)
To remove the disruptive warning
```
      warnings.warn("torch.library.impl_abstract was renamed to "
                    "torch.library.register_fake. Please use that instead; "
                    "we will remove torch.library.impl_abstract in a future "
                    "version of PyTorch.",
                    DeprecationWarning, stacklevel=2)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126606
Approved by: https://github.com/ezyang
2024-05-19 13:21:39 +00:00
5ea956a61f Update hf_BirdBird periodic-dynamo-benchmarks results (#126414)
Can't repro this regression. Also, nothing in the faulty PR range would cause it for only one model. The job is still causing noise, so we should mute it. I think just updating the graph break count is better than skipping the model here since it's still passing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126414
Approved by: https://github.com/ezyang
2024-05-19 10:58:07 +00:00
c4dfd783f4 UFMT torch.utils._sympy.functions (#126553)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126553
Approved by: https://github.com/lezcano, https://github.com/Skylion007
ghstack dependencies: #126511
2024-05-19 10:35:48 +00:00
7dae7d3ca5 Remove unnecessary implementations from MockHandler (#126511)
Dead implementations are confusing and can cause bugs when people
accidentally hit them.  Better for it to be missing.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126511
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-05-19 04:43:54 +00:00
71b6459edc Revert "[Dynamo] Treat integers stored on nn.Modules as dynamic (#126466)"
This reverts commit 6bb9d6080d33c817fcbf9e5ae8a59b76812a53d2.

Reverted https://github.com/pytorch/pytorch/pull/126466 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the ONNX test failure looks legit, not flaky, as it starts failing in trunk 6bb9d6080d ([comment](https://github.com/pytorch/pytorch/pull/126466#issuecomment-2119078245))
2024-05-19 02:52:11 +00:00
e3230f87aa Cached required_fw_nodes creation (#126613)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126613
Approved by: https://github.com/anijain2305
2024-05-19 01:48:52 +00:00
abc4b66124 Forward fix the failed new test from D57474327 (#126596)
Summary: TSIA.  The two look the same to me, but buck was failing with the following error when `with torch._inductor.utils.fresh_inductor_cache()` is used:

```
_________________________ ReproTests.test_issue126128 __________________________

self = <caffe2.test.dynamo.test_repros.ReproTests testMethod=test_issue126128>

    def test_issue126128(self):
        def fn():
            x = torch.randn(1, 10)
            y = torch.randn(10, 1)
            return torch.mm(x, y).sum()

        def fn2():
            x = torch.randn(10, 100)
            y = torch.randn(100, 10)
            return torch.mm(x, y).sum()

>       with torch._inductor.utils.fresh_inductor_cache():
E       AttributeError: module 'torch._inductor' has no attribute 'utils'
```

Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_repros.py::ReproTests::test_issue126128'`

Differential Revision: D57516676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126596
Approved by: https://github.com/xmfan
2024-05-18 23:56:03 +00:00
ad67553c5c Updated test_torch.py to use new OptimizerInfo infrastructure (#125538)
Fixes #123451 (only addresses test_torch.py cases)

This PR solves the specific task to update `test_grad_scaling_autocast` and `test_params_invalidated_with_grads_invalidated_between_unscale_and_step` in `test/test_torch.py` to use the new OptimizerInfo infrastructure.

I have combined tests that call `_grad_scaling_autocast_test` into one called `test_grad_scaling_autocast` and used `_get_optim_inputs_including_global_cliquey_kwargs` to avoid hard-coded configurations.

```
$ lintrunner test/test_cuda.py
ok No lint issues.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125538
Approved by: https://github.com/janeyx99
2024-05-18 15:42:45 +00:00
99af1b3ab0 Refactor variables / function names related to non-strict export (#126458)
Improve variable and function naming for better clarity: `non strict` --> `aten`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126458
Approved by: https://github.com/angelayi
2024-05-18 06:05:14 +00:00
6bb9d6080d [Dynamo] Treat integers stored on nn.Modules as dynamic (#126466)
Fixes #115711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126466
Approved by: https://github.com/jansel
2024-05-18 05:02:16 +00:00
a44d0cf227 [Traceable FSDP2] Change from register_multi_grad_hook to per-tensor backward hook (#126350)
As discussed with Andrew before, under compile we will register per-tensor backward hook instead of multi-grad hook, because it's difficult for Dynamo to support `register_multi_grad_hook` (or anything `.grad_fn` related). We expect both to have the same underlying behavior, ~~and we will add integration test (in subsequent PR) to show that compile and eager has same numerics.~~

As discussed below, we will change eager path to use per-tensor backward hook as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126350
Approved by: https://github.com/awgu
2024-05-18 04:44:29 +00:00
d4704dcacc Map float8 types to uint8 for allgather (#126556)
# Summary
Different take on this one:
https://github.com/pytorch/pytorch/issues/126338

We should probably not allow this mapping for 'compute' ops e.g. reductions

### Corresponding fp8 PR
https://github.com/pytorch-labs/float8_experimental/pull/263

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126556
Approved by: https://github.com/wanchaol
2024-05-18 03:19:16 +00:00
bf099a08f0 [2/N] Non-Tensor: Scalar Support: Add scalar to the cache for eager-through-torch.compile (#124070)
Add scalar information to the kernel configuration.

#### Additional Context
Currently, the input parameters are orchestrated by input order in the kernel configuration and loaded/mapped to the kernel at runtime. For example, the cache order of the input parameters of `torch.add(a, b, alpha=2.0)` is `a` first, followed by `b` and then `alpha`. The same order is used for cache loading.

However, the orchestration mechanism does not support kwargs because the order of kwargs is not meaningful. For example, the `out` of `aten::gelu.out(Tensor self, *, str approximate='none', Tensor(a!) out) -> Tensor(a!)` may be before `approximate`. We will support it with subsequent PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124070
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-05-18 03:08:37 +00:00
c1767d8626 Faster(?) FP16 gemv kernel (#126297)
Differential Revision: [D57369266](https://our.internmc.facebook.com/intern/diff/D57369266/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D57369266/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126297
Approved by: https://github.com/malfet
2024-05-18 03:03:03 +00:00
b98decfc38 [halide-backend] Refactor codegen/triton.py into codegen/simd.py (#126415)
This PR is primarily just moving stuff around.  It creates a new
common baseclass for TritonCodegen and the (upcoming) HalideCodegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126415
Approved by: https://github.com/shunting314
2024-05-18 02:43:42 +00:00
cyy
74b99438f2 [Submodule] Remove third-party CUB (#126540)
Because it was updated 4 years ago, and now all supported CUDA versions provide CUB.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126540
Approved by: https://github.com/Skylion007
2024-05-18 02:28:17 +00:00
1191168c45 [pipelining] Follow improvements in export.unflatten (#126217)
Previously, we made a copy of `torch.export.unflatten` in pippy/_unflatten.py.

But it turned out to be too hard to track bug fixes and improvements in the upstream version. For example, `torch.export.unflatten` recently added support for tied parameters, which is something pipelining needs.

Now that we have moved into pytorch, we make a reference to `torch.export.unflatten` instead of maintaining a copy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126217
Approved by: https://github.com/H-Huang
2024-05-18 02:24:01 +00:00
661ecedbd0 gitmodules: switch cpp-httplib to https (#126580)
Fixes issue introduced in https://github.com/pytorch/pytorch/pull/126470#issuecomment-2118374811

Test plan:

CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126580
Approved by: https://github.com/PaliC, https://github.com/jeffdaily
2024-05-18 01:31:28 +00:00
224f2bef9f [C10D] Add __repr__ to P2POp class (#126538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126538
Approved by: https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/c-p-i-o
ghstack dependencies: #126419
2024-05-18 00:58:57 +00:00
bcee6f708a [Pipelining] Fix 1f1b schedule (#126419)
This schedule was running fine locally but failing (hanging) on CI.

After analysis (https://fburl.com/gdoc/xt80h1gd), it seems like the
schedule was not correct previously but may still work depending on the
runtime.

The fix bundles together fwd-recv(s->s+1) and bwd-send(s+1->s) into one
coalesced group so they would not block each other.

Design drawing
<img width="803" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/906a9a66-39ae-4a6a-bc1a-18b77eaaa784">

Flight recorder traces show the same coalescing pattern as designed
<img width="1013" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/ab10646e-eaef-4191-83dd-73f448876c27">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126419
Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501
2024-05-18 00:58:57 +00:00
41fb4bcc73 [AOTI] Flag to include aoti sources when building lite interpreter (#126572)
Summary:
Added USE_LITE_AOTI cmake flag, which is turned OFF by default.
When it is turned on, the AOTI sources (inductor_core_resources) are included when building lite interpreter.

Test Plan:
```
ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DUSE_LITE_AOTI=ON
```

Differential Revision: D57394078

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126572
Approved by: https://github.com/malfet
2024-05-18 00:39:42 +00:00
2863c76b1f [torch-distributed] Make log directory creation idempotent (#126496)
Summary:
https://docs.python.org/3/library/os.html#os.makedirs
> If exist_ok is False (the default), a FileExistsError is raised if the target directory already exists.
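
A small standard-library illustration of the documented behavior quoted above (the directory names are arbitrary): with `exist_ok=True` the call can be repeated safely, while the default raises on the second call.

```python
import os
import tempfile

base = tempfile.mkdtemp()
log_dir = os.path.join(base, "logs", "rank0")

# Safe to call repeatedly: an existing directory is not an error.
os.makedirs(log_dir, exist_ok=True)
os.makedirs(log_dir, exist_ok=True)

# Without exist_ok, calling again on an existing directory raises.
try:
    os.makedirs(log_dir)
    raised = False
except FileExistsError:
    raised = True
```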

Test Plan: Existing tests

Differential Revision: D57471577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126496
Approved by: https://github.com/d4l3k
2024-05-18 00:17:13 +00:00
0d5ba547ec Tool for scouting exportability in one shot (#126471)
Summary:
Tool for scouting exportability issues in one shot.

- Collect sample inputs for all submodules by running eager inference with forward_pre_hook.
- Start from root module, recursively try exporting child modules, if current module export fails.

Limitations:
- Only works for an nn.Module with a tree-like submodule structure; it doesn't work for a flattened GraphModule.

TODO: support dynamic_dims

Sample output: https://docs.google.com/spreadsheets/d/1jnixrqBTYbWO_y6AaKA13XqOZmeB1MQAMuWL30dGoOg/edit?usp=sharing

```
exportability_report =
        {
            '': UnsupportedOperatorException(func=<OpOverload(op='testlib.op_missing_meta', overload='default')>),
            'submod_1': UnsupportedOperatorException(func=<OpOverload(op='testlib.op_missing_meta', overload='default')>),
            'submod_2': None
        }
```
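
The recursive strategy described above can be sketched without PyTorch. Everything here is hypothetical and for illustration only: `ToyModule` stands in for `nn.Module`, `try_export` stands in for an export attempt, and sample-input collection via `forward_pre_hook` is omitted.

```python
class ToyModule:
    """Hypothetical stand-in for nn.Module: a named tree with an 'exportable' flag."""

    def __init__(self, exportable=True, **children):
        self.exportable = exportable
        self.children = children


def try_export(module):
    """Hypothetical export attempt: fails if the module or any descendant is unsupported."""
    if not module.exportable:
        raise RuntimeError("unsupported operator")
    for child in module.children.values():
        try_export(child)


def scout(module, prefix="", report=None):
    """Record None on success; on failure record the error and descend into children."""
    report = {} if report is None else report
    try:
        try_export(module)
        report[prefix] = None
    except RuntimeError as exc:
        report[prefix] = exc
        for name, child in module.children.items():
            child_prefix = f"{prefix}.{name}" if prefix else name
            scout(child, child_prefix, report)
    return report


root = ToyModule(submod_1=ToyModule(exportable=False), submod_2=ToyModule())
report = scout(root)
```

As in the sample output, the root and `submod_1` map to an error while `submod_2` maps to `None`; children of a module that exports cleanly are never visited.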

Test Plan: buck2 run mode/dev-nosan fbcode//caffe2/test:test_export -- -r TestExportTools

Differential Revision: D57466486

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126471
Approved by: https://github.com/zhxchen17
2024-05-18 00:10:46 +00:00
54bc55c515 Remove dist_ prefix from TORCH_LOGS shortcuts (#126499)
e.g. dist_ddp -> ddp

'distributed' shortcut remains unchanged

Feedback has been that it is not appealing to have the dist_ prefix,
and the main reason for it was to keep the distributed shortcuts grouped
together in the help menu.  It's nice to have shorter shortcuts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126499
Approved by: https://github.com/XilunWu, https://github.com/kwen2501
ghstack dependencies: #126322
2024-05-18 00:07:30 +00:00
93844a31b3 Fix aarch64 debug build with GCC (#126290)
By working around GCC's quirks in instantiating templates that require immediate values.
Provide an alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`)

Test plan (after change was reverted): ssh into aarch64 runner and rebuild given file with `-O0`

Fixes https://github.com/pytorch/pytorch/issues/126283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126290
Approved by: https://github.com/atalman, https://github.com/seemethere
2024-05-17 23:47:08 +00:00
d54c28e7fc Added error checks for invalid inputs on thnn_conv2d (#121906)
Fixes #121188
Prevent Segmentation Fault in 'torch._C._nn.thnn_conv2d'

Previously, calling 'torch._C._nn.thnn_conv2d' with invalid arguments for padding, stride, and kernel_size would result in a segmentation fault. This issue has been resolved by implementing argument validation (using Torch Check). Now, when invalid arguments are detected, a runtime error is raised with a debug message detailing the correct format.

Additionally, this commit includes tests to cover the three referenced cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121906
Approved by: https://github.com/janeyx99
2024-05-17 23:41:48 +00:00
173b1d811d [dynamo] Sourceless builder - ordered dict and re.pattern (#126468)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126468
Approved by: https://github.com/Skylion007
2024-05-17 23:24:55 +00:00
faa26df72e [inductor] Load python modules using importlib (#126454)
The `compile` + `exec` workflow is susceptible to behavior drifting from
a "normal" import; use importlib instead to avoid this.

In particular, annotations here were being stored as strings due to
`from __future__ import annotations` in the scope calling `compile`.
Triton cares about annotations on global variables and this makes it
much easier to reliably code-gen them.
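
A minimal sketch of the drift described above, using only the standard library. It is not the Inductor code: `SOURCE`, `load_with_exec`, and `load_with_importlib` are hypothetical names, and the `__future__` flag is passed to `compile` explicitly to simulate a calling scope that has `from __future__ import annotations` in effect (by default `compile` inherits such flags from its caller).

```python
import __future__
import importlib.util
import os
import tempfile

# Hypothetical generated module with one annotated global.
SOURCE = "x: int = 1\n"


def load_with_exec(source):
    # Simulate compile() being called from a scope where
    # `from __future__ import annotations` is active: the flag carries over
    # and annotations in the compiled code are stored as strings.
    flags = __future__.annotations.compiler_flag
    namespace = {}
    exec(compile(source, "<generated>", "exec", flags=flags), namespace)
    return namespace


def load_with_importlib(source, name="generated_mod"):
    # Write the source to a file and load it like a regular import, so the
    # caller's __future__ flags play no part in how it is compiled.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, name + ".py")
        with open(path, "w") as f:
            f.write(source)
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module


exec_annotations = load_with_exec(SOURCE)["__annotations__"]
import_annotations = load_with_importlib(SOURCE).__annotations__
```

In the `exec` path the annotation on `x` is the string `"int"`; in the importlib path it is the `int` type itself, which is the form a consumer that inspects global annotations would expect.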

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126454
Approved by: https://github.com/peterbell10
2024-05-17 23:13:07 +00:00
d7de4c9d80 Fix issue of lowering nn.linear ops with kwargs (#126331)
Summary: Support kwarg bias for nn.linear quantization

Differential Revision: D57403190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126331
Approved by: https://github.com/ZhengkaiZ, https://github.com/huydhn
2024-05-17 21:50:55 +00:00
c26f6548f9 [AOTI] config target platform (#126306)
Test Plan: AOTI compile stories15M for Android

Differential Revision: D57392830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126306
Approved by: https://github.com/desertfire
2024-05-17 21:42:19 +00:00
09fd771485 Disable vulkan test batch_norm_invalid_inputs (#126571)
Fails flakily ex https://github.com/pytorch/pytorch/actions/runs/9130802617/job/25109131748
https://github.com/pytorch/pytorch/actions/runs/9125548571/job/25092535707

First bad I can find is 538877d204

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126571
Approved by: https://github.com/SS-JIA
2024-05-17 21:11:07 +00:00
bed1c600bb Experimental prototype for converting torch.jit.trace modules to export (#124449)
Differential Revision: [D56440613](https://our.internmc.facebook.com/intern/diff/D56440613)

We want to do this for the following reasons:
1. There is a current limitation in export tracing for torch.jit.trace'd modules that cannot be easily upstreamed
2. We need to run internal CI regularly to understand feature gaps and continuously track them
3. Multiple people will be working on this prototype so it is better to have a checked in version so we don't always run into merge conflicts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124449
Approved by: https://github.com/angelayi, https://github.com/avikchaudhuri
2024-05-17 20:42:42 +00:00
30b70b1a63 [ROCm] enable faster_load_save for Fused_SGD (#125456)
Reopen due to rebase error. Fixes https://github.com/pytorch/pytorch/issues/117599

The reported hang test : `test_cuda.py::TestCuda::test_grad_scaling_autocast_fused_optimizers` is passing with this PR

HSA Async copy / host wait on completion signal is resolved in MultiTensorApply.cuh

```
:4:command.cpp              :347 : 8881368803196 us: [pid:1268211 tid:0x7f5af80d7180] Command (InternalMarker) enqueued: 0xc4e2070
:4:rocvirtual.cpp           :556 : 8881368803201 us: [pid:1268211 tid:0x7f5af80d7180] Host wait on completion_signal=0x7f5967df3e00
:3:rocvirtual.hpp           :66  : 8881368803209 us: [pid:1268211 tid:0x7f5af80d7180] Host active wait for Signal = (0x7f5967df3e00) for -1 ns
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125456
Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/janeyx99
2024-05-17 20:36:47 +00:00
d782e43464 Revert "[FSDP2] Fixed 2D clip grad norm test (#126497)"
This reverts commit 3f289063117673650db868c978bf3cb8125a22dc.

Reverted https://github.com/pytorch/pytorch/pull/126497 on behalf of https://github.com/jeanschmidt due to reverting to check if might have introduced inductor cuda 12 issues ([comment](https://github.com/pytorch/pytorch/pull/126497#issuecomment-2118338716))
2024-05-17 20:29:20 +00:00
95b2766864 [BE][Ez]: Use NotADirectoryError in tensorboard writer (#126534)
Slightly improve exception typing for tensorboard writer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126534
Approved by: https://github.com/ezyang
2024-05-17 19:52:13 +00:00
90a5aeea79 [distributed] Add cpp-httplib to pytorch (#126470)
Adds https://github.com/yhirose/cpp-httplib such that we are able to use https for host to host communication in distributed (specifically torchrun)

Todo: We likely need to add cpp-httplib somewhere in the build (cmake/bazel) but first we should write the code for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126470
Approved by: https://github.com/d4l3k, https://github.com/Skylion007
2024-05-17 19:45:08 +00:00
eb0b16db92 Initial implementation of AdaRound (#126153)
Summary:
This is an implementation of AdaRound from a paper https://arxiv.org/abs/2004.10568

This algorithm is going to be used by multiple people, hence we need to make it an official implementation.

Differential Revision: D57227565

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126153
Approved by: https://github.com/jerryzh168, https://github.com/huydhn
2024-05-17 19:44:50 +00:00
875221dedf Revert "Fix aarch64 debug build with GCC (#126290)"
This reverts commit 91bf952d10e9524a9b078900d9807efa5d252f5c.

Reverted https://github.com/pytorch/pytorch/pull/126290 on behalf of https://github.com/huydhn due to There seems to be a mis-match closing curly bracket here and it breaks some internal build in D57474505 ([comment](https://github.com/pytorch/pytorch/pull/126290#issuecomment-2118246756))
2024-05-17 19:30:02 +00:00
f89500030b Revert "Remove redundant serialization code (#126249)"
This reverts commit aab448e381366d4cf499145adffe9fcb1ac2b28d.

Reverted https://github.com/pytorch/pytorch/pull/126249 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing sigmoid/frontend:serialization_test internally ([comment](https://github.com/pytorch/pytorch/pull/126249#issuecomment-2118233656))
2024-05-17 19:19:02 +00:00
de42af4b00 Add coms metadata to execution trace (ET) (#126317)
Add Execution Trace communication collective meta data.
For specification see https://github.com/pytorch/pytorch/issues/124674

New fields look like
```
    {
      "id": 80, "name": "record_param_comms", "ctrl_deps": 79,
      "inputs": {"values": [[[78,74,0,100,4,"cuda:0"]],21,["0","default_pg"],0,"allreduce",[],[],0,1,2], "shapes": [[[100]],[],[[],[]],[],[],[],[],[],[],[]], "types": ["GenericList[Tensor(float)]","Int","Tuple[String,String]","Int","String","GenericList[]","GenericList[]","Int","Int","Int"]},                             "outputs": {"values": [[[78,74,0,100,4,"cuda:0"]]], "shapes": [[[100]]], "types": ["GenericList[Tensor(float)]"]},
      "attrs": [{"name": "rf_id", "type": "uint64", "value": 53},{"name": "fw_parent", "type": "uint64", "value": 0},{"name": "seq_id", "type": "int64", "value": -1},{"name": "scope", "type": "uint64", "value": 0},{"name": "tid", "type": "uint64", "value": 2},{"name": "fw_tid", "type": "uint64", "value": 0},{"name": "op_schema", "type": "string", "value": ""},{"name": "kernel_backend", "type": "string", "value": ""},{"name": "kernel_file", "type": "string", "value": ""},
  {"name": "collective_name", "type": "string", "value": "allreduce"},
  {"name": "dtype", "type": "string", "value": "Float"},
  {"name": "in_msg_nelems", "type": "uint64", "value": 100},
  {"name": "out_msg_nelems", "type": "uint64", "value": 100},
  {"name": "in_split_size", "type": "string", "value": "[]"},
  {"name": "out_split_size", "type": "string", "value": "[]"},
  {"name": "global_rank_start", "type": "uint64", "value": 0},
  {"name": "global_rank_stride", "type": "uint64", "value": 1},
  {"name": "pg_name", "type": "string", "value": "0"},
  {"name": "pg_desc", "type": "string", "value": "default_pg"},
  {"name": "pg_size", "type": "uint64", "value": 2}]
 }
```
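
As a rough illustration of how the new fields could be consumed, the sketch below flattens a node's `attrs` list into a dictionary. The JSON is a trimmed copy of the sample node above; `attrs_by_name` is a hypothetical helper, not part of the Execution Trace tooling.

```python
import json

# Trimmed copy of the sample node from the commit message (inputs/outputs omitted).
NODE_JSON = """
{
  "id": 80, "name": "record_param_comms", "ctrl_deps": 79,
  "attrs": [
    {"name": "collective_name", "type": "string", "value": "allreduce"},
    {"name": "dtype", "type": "string", "value": "Float"},
    {"name": "in_msg_nelems", "type": "uint64", "value": 100},
    {"name": "out_msg_nelems", "type": "uint64", "value": 100},
    {"name": "pg_name", "type": "string", "value": "0"},
    {"name": "pg_desc", "type": "string", "value": "default_pg"},
    {"name": "pg_size", "type": "uint64", "value": 2}
  ]
}
"""


def attrs_by_name(node):
    # Hypothetical helper: map each attribute name to its value.
    return {attr["name"]: attr["value"] for attr in node.get("attrs", [])}


node = json.loads(NODE_JSON)
comm = attrs_by_name(node)
print(comm["collective_name"], comm["dtype"], comm["pg_desc"], comm["pg_size"])
```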

## Unit Test
Added a new unit test to check that the collected execution trace has the right attributes

`touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_execution_trace`

```
STAGE:2024-05-08 17:39:10 62892:62892 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 17:39:10 62893:62893 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 17:39:11 62892:62892 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 17:39:11 62893:62893 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 17:39:11 62892:62892 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 17:39:11 62893:62893 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
[rank1]:[W508 17:39:12.329544411 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model
indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W508 17:39:12.329626774 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model
indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W508 17:39:12.339239982 execution_trace_observer.cpp:825] Enabling Execution Trace Observer
[rank1]:[W508 17:39:12.339364516 execution_trace_observer.cpp:825] Enabling Execution Trace Observer
STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
[rank1]:[W508 17:39:12.352452400 execution_trace_observer.cpp:837] Disabling Execution Trace Observer
STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:322] Completed Stage: Collection
[rank0]:[W508 17:39:12.354019014 execution_trace_observer.cpp:837] Disabling Execution Trace Observer
STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
Execution trace saved at /tmp/tmpy01ngc3w.et.json
Execution trace saved at /tmp/tmptf8543k4.et.json
ok

----------------------------------------------------------------------
```

Also ran the profiler unit test
`touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler`

```
STAGE:2024-05-08 18:24:22 1926775:1926775 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:22 1926774:1926774 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
[rank1]:[W508 18:24:24.508622236 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W508 18:24:24.508622241 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
Trace saved to /tmp/tmpdrw_cmcu.json
Trace saved to /tmp/tmpnio7ec9j.json
ok

----------------------------------------------------------------------
Ran 1 test in 19.772s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126317
Approved by: https://github.com/yoyoyocmu, https://github.com/sanrise
2024-05-17 19:08:55 +00:00
6931f781c2 [quant][pt2e] Allow multi users without output observers (#126487)
Summary: The PT2E quantization flow does not support unquantized
outputs yet. To work around this, users may wish to remove the
output observer from their graphs. However, this fails currently
in some cases because the `PortNodeMetaForQDQ` pass is too
restrictive, for example:

```
conv -> obs -------> output0
         \\-> add -> output1
```

Previously we expected conv to always have exactly 1 user,
which is the observer. When the observer is removed, however,
conv now has 2 users, and this fails the check.

```
conv -------> output0
  \\-> add -> output1
```

This commit relaxes the error into a warning to enable
this workaround.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_multi_users_without_output_observer

Reviewers: jerryzh168

Subscribers: jerryzh168, supriyar

Differential Revision: [D57472601](https://our.internmc.facebook.com/intern/diff/D57472601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126487
Approved by: https://github.com/tarun292
2024-05-17 18:48:21 +00:00
ecd9a4e5c3 Enable FX graph cache for huggingface and timm benchmarks (#126205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126205
Approved by: https://github.com/eellison
2024-05-17 18:36:05 +00:00
66dc8fb7ff Allow tensor subclasses and add torch.serialization.add_safe_globals that allows users to allowlist classes for weights_only load (#124331)
#### Conditions for allowlisting tensor subclasses
We allow tensor subclass types that
(1) Do not override `__setstate__`, `__getattr__`, `__setattr__`, `__get__`, `__set__` or `__getattribute__` of `torch.Tensor` (`torch.Tensor` does not have a definition of `__getattr__`, `__get__` or `__set__` so we check that these are `None`)
(2) Use the generic `tp_alloc`
(3) Are in a module that *has been imported by the user*
to be pushed onto the stack as strings by `GLOBAL` instructions, while storing the type in a dict

The strings will be converted to the classes as appropriate when executing `REBUILD` with `_rebuild_from_type_v2`

*Note that we use `inspect.getattr_static(sys.modules[module], name)` to get the class/function as this method claims to have no code execution.

The rationale for the 3 conditions above is as follows:

The rebuild func provided by `Tensor.__reduce_ex__` is `torch._tensor._rebuild_from_type_v2`, which is defined as such (note the call to `getattr`, `Tensor.__setstate__` and the call to `as_subclass` as well as the call to `_set_obj_state` which calls `setattr`)

4e66aaa010/torch/_tensor.py (L57-L71)

`as_subclass` is implemented with a call to `THPVariable_NewWithVar`

that will eventually call `tp_alloc` here
4e66aaa010/torch/csrc/autograd/python_variable.cpp (L2053)

The `func` arg to `_rebuild_from_type_v2` for wrapper subclasses is `Tensor.rebuild_wrapper_subclass`, which will similarly call into `THPVariable_NewWithVar` and hit the above `tp_alloc`

**Note that we do not call `tp_init` or `tp_new` (i.e. `cls.__init__` or `cls.__new__`) when unpickling**

### How do we check something is a tensor subclass/constraints around imports

In order to check whether `bla` is a tensor subclass in the bytecode `GLOBAL module.name`, we need to do an `issubclass` check, which entails converting the global string to the appropriate type. We *do not* arbitrarily import modules but will perform this check as long as the given subclass (given by `module.name`) has already been imported by the user (i.e. `module in sys.modules` and `issubclass(getattr(sys.modules[module], name), torch.Tensor)`).
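
The check described above can be sketched with the standard library alone. This is an illustration, not the unpickler's code: `resolve_if_imported` is a hypothetical helper, and `dict` stands in for `torch.Tensor` as the required base class so the example runs without PyTorch.

```python
import collections  # ensures "collections" is present in sys.modules
import inspect
import sys


def resolve_if_imported(module, name, base):
    """Return the class `module.name` if it is already imported and subclasses
    `base`; otherwise return None. Never imports anything itself."""
    if module not in sys.modules:
        return None
    try:
        # getattr_static looks the name up without triggering descriptors
        # or a module-level __getattr__.
        obj = inspect.getattr_static(sys.modules[module], name)
    except AttributeError:
        return None
    if isinstance(obj, type) and issubclass(obj, base):
        return obj
    return None


allowed = resolve_if_imported("collections", "OrderedDict", dict)
wrong_base = resolve_if_imported("collections", "deque", dict)
not_imported = resolve_if_imported("module_that_was_never_imported", "Foo", dict)
```

Only the first lookup succeeds: `deque` is imported but is not a `dict` subclass, and the last module is absent from `sys.modules`, so it is rejected rather than imported.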

This PR also allowlisted  `torch._utils._rebuild_wrapper_subclass` and `torch.device` (used by `_rebuild_wrapper_subclass`)

### API for allow listing
This PR also added `torch.serialization.{add/get/clear}_safe_globals`, which enable users to allowlist globals they have deemed safe and manipulate this list (for example, they could allowlist a tensor subclass with a custom `__setstate__` if they have checked that this is safe).

Next steps:
- Add testing and allowlist required classes for all in-core tensor subclasses (e.g. `DTensor`, `FakeTensor` etc.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124331
Approved by: https://github.com/albanD
2024-05-17 17:56:57 +00:00
31ea8290e7 Workflow for uploading additional test stats on workflow dispatch (#126080)
This is kind of an experiment for uploading test stats during the run, and also for test dashboard stuff so it can recalculate the info

Add workflow that is callable via workflow dispatch for uploading additional test stats
Adds script that only calculates the additional info

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126080
Approved by: https://github.com/ZainRizvi
2024-05-17 17:29:44 +00:00
6bcf15669e [inductor] fix unbacked case in pointwise + reduction vertical fusion (#125982)
```
$ INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 python test/inductor/test_unbacked_symints.py -k test_vertical_pointwise_reduction_fusion

  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 1953, in fuse_nodes_once
    for node1, node2 in self.get_possible_fusions():
  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2010, in get_possible_fusions
    check_all_pairs(node_grouping)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 1997, in check_all_pairs
    if self.can_fuse(node1, node2):
  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2252, in can_fuse
    return self.get_backend(device).can_fuse_vertical(node1, node2)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cuda_combined_scheduling.py", line 39, in can_fuse_vertical
    return self._triton_scheduling.can_fuse_vertical(node1, node2)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3237, in can_fuse
    if not all(
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3238, in <genexpr>
    TritonKernel.is_compatible((numel2, rnumel2), n.get_ranges())
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 1543, in is_compatible
    cls._split_iteration_ranges(groups, lengths)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 1507, in _split_iteration_ranges
    while current_group < len(remaining) and sv.size_hint(remaining[current_group]) == 1:
  File "/data/users/colinpeppler/pytorch/torch/_inductor/sizevars.py", line 442, in size_hint
    return int(out)
  File "/home/colinpeppler/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/core/expr.py", line 320, in __int__
    raise TypeError("Cannot convert symbols to int")
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: Cannot convert symbols to int
```

Where the unbacked symints show up:
```
> /data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py(1506)_split_iteration_ranges()
(Pdb) print(groups)
(1, 512*u0)
(Pdb) print(lengths)
([u0, 32, 16], [])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125982
Approved by: https://github.com/jansel
2024-05-17 17:06:24 +00:00
7e9a037b47 [Perf] Vectorize more dtype for int4mm (#126512)
It used to be vectorized only for f16, but no reason not to do the same for bf16 or f32

Spiritual followup of https://github.com/pytorch/pytorch/pull/125290

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126512
Approved by: https://github.com/Skylion007
2024-05-17 16:34:19 +00:00
81277baa0c Remove removed ruff rule TRY200 (#126256)
My TOML linter is complaining that "TRY200" is not acceptable for the `tool.ruff.lint` schema.

From the ruff docs: https://docs.astral.sh/ruff/rules/reraise-no-cause/

> This rule has been removed and its documentation is only available for historical reasons.
>
> This rule is identical to [B904](https://docs.astral.sh/ruff/rules/raise-without-from-inside-except/) which should be used instead.

and we are currently explicitly ignoring B904.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126256
Approved by: https://github.com/Skylion007
2024-05-17 16:31:05 +00:00
402170b22f Early return in _recursive_build if obj is a Tensor (#125639)
Fix issue #125551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125639
Approved by: https://github.com/ezyang
2024-05-17 15:53:37 +00:00
7e166e8057 [optim] Fix: wrong ASGD implementation (#126375)
This PR is based on #125440, additionally merging the latest main branch and fixing the lint failures from #126361.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126375
Approved by: https://github.com/janeyx99
2024-05-17 15:46:39 +00:00
078e530446 Delete refactored function, move changes over (#126407)
Oops, in https://github.com/pytorch/pytorch/pull/125610 I moved this function to runtime_wrappers.py, but forgot to delete the old one. https://github.com/pytorch/pytorch/pull/126234 then modified it which would do nothing, so I'm applying the change correctly now and deleting the function as I intended.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126407
Approved by: https://github.com/eellison
2024-05-17 15:28:18 +00:00
ab307a8992 Default to env variable instead of config value for precompile parallelism (#126333)
Previously, we would default to the config `compile_threads`. That controls the number of forks we use for async compile. It defaults to 1 in fbcode because fork() has known issues with safety. In precompilation, we are using threads, which have no safety issues and should strictly improve compile time. There isn't really any reason to reduce the thread count except for testing, and it doesn't make sense to share the same value as the one used for determining forks.

This changes the default to use as many threads as needed unless the env variable is set.

Differential Revision: [D57473023](https://our.internmc.facebook.com/intern/diff/D57473023)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126333
Approved by: https://github.com/nmacchioni
2024-05-17 14:58:55 +00:00
3f28906311 [FSDP2] Fixed 2D clip grad norm test (#126497)
This fixes https://github.com/pytorch/pytorch/issues/126484.

We change from transformer to MLP stack since transformer seems to introduce slight numeric differences when using TP. We include a sequence parallel layer norm module in the MLP stack to exercise `(S(0), R)` placement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126497
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-05-17 13:38:31 +00:00
55033ab43a Update ops handler documentation some more (#126480)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126480
Approved by: https://github.com/peterbell10
ghstack dependencies: #126292, #126299
2024-05-17 13:31:44 +00:00
cyy
4ed93d6e0c [Submodule] Remove zstd dependency (#126485)
After searching in the codebase, it seems that zstd is not in use now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126485
Approved by: https://github.com/ezyang
2024-05-17 12:49:23 +00:00
6c503f1dbb save the reciprocal of weights for welford_reduce (#125148)
Save the reciprocal of weights for welford_reduce to avoid redundant divisions and improve performance; `weight_recps` will be inserted into the generated vec kernel.

Generated code:

- Before:

```
for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
{
    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)), 16);
    tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
}
```

- After:

```
static WeightRecp<at::vec::Vectorized<float>> weight_recps(64);
for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
{
    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)), 16);
    tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &weight_recps);
}
```
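
The idea behind `weight_recps` can be sketched in scalar Python: a Welford update divides by the running count at every step, and those divisors are the same for every row of the same length, so their reciprocals can be computed once and reused as multiplications. This is an illustrative analogue under that assumption, not the generated C++ kernel; `welford_with_reciprocals` is a hypothetical name.

```python
import statistics


def welford_with_reciprocals(values, recps=None):
    # Reciprocals 1/1, 1/2, ..., 1/n; pass a shared table to reuse it across rows.
    if recps is None:
        recps = [1.0 / k for k in range(1, len(values) + 1)]
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for i, x in enumerate(values):
        delta = x - mean
        mean += delta * recps[i]  # multiplication replaces delta / (i + 1)
        m2 += delta * (x - mean)
    return mean, m2


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.5]
mean, m2 = welford_with_reciprocals(data)
variance = m2 / len(data)  # population variance
```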

Performance:

- Single core:

Op | shape | eager/ms | inductor/ms | optimized inductor/ms
-- | -- | -- | -- | --
layernorm | (56, 384, 1024) | 16.825 | 22.338 | 15.208
var | (56, 384, 1024) | 21.752 | 13.258 | 13.102

- 4 cores:

Op | shape | eager/ms | inductor/ms | optimized inductor/ms
-- | -- | -- | -- | --
layernorm | (56, 384, 1024) | 4.249 | 5.899 | 4.223
var | (56, 384, 1024) | 5.3152 | 3.278 | 2.163

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125148
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-05-17 08:20:12 +00:00
8619fe6214 variable search spaces for gemm autotuning (#126220)
add a switch to change the gemm autotuning search space between the default (the current set of hardcoded configs) and an exhaustive search space that enumerates all block sizes in [16, 32, 64, 128, 256], stages in [1, 2, 3, 4, 5], and warps in [2, 4, 6]
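
To give a sense of the size of the exhaustive space, the sketch below enumerates it with `itertools.product`. It assumes the block size is chosen independently for three dimensions (named `BLOCK_M`, `BLOCK_N`, `BLOCK_K` here); that assumption, the dictionary keys, and `exhaustive_gemm_configs` are illustrative and not taken from the PR.

```python
import itertools

BLOCK_SIZES = [16, 32, 64, 128, 256]
NUM_STAGES = [1, 2, 3, 4, 5]
NUM_WARPS = [2, 4, 6]


def exhaustive_gemm_configs():
    # Assumption: one independent block size per GEMM dimension (M, N, K).
    for block_m, block_n, block_k, stages, warps in itertools.product(
        BLOCK_SIZES, BLOCK_SIZES, BLOCK_SIZES, NUM_STAGES, NUM_WARPS
    ):
        yield {
            "BLOCK_M": block_m,
            "BLOCK_N": block_n,
            "BLOCK_K": block_k,
            "num_stages": stages,
            "num_warps": warps,
        }


configs = list(exhaustive_gemm_configs())
print(len(configs))
```

Under that assumption the space has 5 × 5 × 5 × 5 × 3 = 1875 candidates, which is why an exhaustive search is offered as an opt-in switch rather than the default.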

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126220
Approved by: https://github.com/eellison
2024-05-17 08:09:53 +00:00
45f2d09452 [Quant][Inductor] Enable lowering of qlinear-binary(-unary) fusion for X86Inductor (#122593)
**Description**
Lower the qlinear binary post op pattern to Inductor. Use post op sum (in-place) if the extra input has the same dtype as output. Otherwise, it uses binary add.

**Supported linear-binary(-unary) patterns**
```
    linear(X)   extra input
           \   /
            Add
             |
        Optional(relu)
             |
             Y

1. int8-mixed-fp32
+---+---------------+-----------+------------------------------+---------+
| # | Add type      | Quant out | Pattern                      | Post op |
+---+---------------+-----------+------------------------------+---------+
| 1 | In-/out-place | Yes       | linear + fp32 -> (relu) -> q | add     |
+---+---------------+-----------+------------------------------+---------+
| 2 | In-/out-place | No        | linear + fp32 -> (relu)      | sum     |
+---+---------------+-----------+------------------------------+---------+

2. int8-mixed-bf16
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| # | X2 dtype | Add type      | Quant out | Pattern                                          | Post op |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 1 | BF16     | In-/out-place | Yes       | linear + bf16 -> (relu) -> to_fp32 -> q          | add     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 2 | BF16     | In-/out-place | No        | linear + bf16 -> (relu)                          | sum     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 3 | FP32     | Out-place     | Yes       | linear + fp32 -> (relu) -> q                     | add     |
|   |          | In-place right|           |                                                  |         |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 4 | FP32     | Out-place     | No        | linear + fp32 -> (relu)                          | sum     |
|   |          | In-place right|           |                                                  |         |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 5 | FP32     | In-place left | Yes       | linear + fp32 -> to_bf16 -> relu -> to_fp32 -> q | add     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 6 | FP32     | In-place left | No        | linear + fp32 -> to_bf16 -> (relu)               | add     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
```
Note
(1) The positions of linear and the extra input can be swapped.
(2) We don't insert q-dq before the extra input of linear-add by recipe. But if q-dq is found at the
extra input, we don't match that pattern because we cannot match all these patterns in 3 passes.

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_add
python test/inductor/test_cpu_cpp_wrapper.py -k test_qlinear_add

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122593
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/eellison
2024-05-17 07:46:48 +00:00
2edaae436a Fix cummax and cummin lowering for empty case (#126461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126461
Approved by: https://github.com/peterbell10
2024-05-17 07:08:32 +00:00
15ca562f86 [DTensor] Turn on foreach implementation for clip_grad_norm_ for DTensor by default (#126423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126423
Approved by: https://github.com/awgu
2024-05-17 06:57:52 +00:00
f9a7033194 Refactor partitioner and clean it up (#126318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126318
Approved by: https://github.com/anijain2305
2024-05-17 06:15:00 +00:00
5756b53dd8 [XPU] call empty_cache for dynamo tests (#126377)
When running a batch of models, lacking `empty_cache()` would result in OOM for subsequent models.

This PR unifies the `empty_cache` call for both CUDA and XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126377
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire
2024-05-17 06:05:51 +00:00
9edf54df4d [dtensor] refactor view ops to use OpStrategy (#126011)
As titled. Some ops require adjustment of output shape argument. In rule-based sharding prop, global output shape was inferred in the rule (in `view_ops.py`). In strategy-based sharding prop, it is now obtained from propagated out_tensor_meta (in `sharding_prop.py`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126011
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2024-05-17 05:39:21 +00:00
a0df40f195 Add dist_pp shortcut to TORCH_LOGS (#126322)
The distributed log category already includes pipelining, since it's under the
torch.distributed umbrella.

So both TORCH_LOGS=distributed and TORCH_LOGS=dist_pp will enable PP
logs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126322
Approved by: https://github.com/kwen2501
2024-05-17 05:32:15 +00:00
4b2ae2ac33 c10d: add Collectives abstraction (#125978)
This adds a new `Collectives` API for doing distributed collective operations. This is intended to replace the [current Elastic store abstraction](https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/utils/store.py) with more performant and debuggable primitives.

Design doc: https://docs.google.com/document/d/147KcKJXEHvk1Q6tISLbJVvLejHg_1kIhBQeu-8RQxhY/edit

The standard implementation uses `StoreCollectives`, but other more performant backends will be added in a follow-up PR.

Test plan:

```
python test/distributed/test_collectives.py -v
```

This tests both functionality using multiple threads as well as timeout behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125978
Approved by: https://github.com/shuqiangzhang
2024-05-17 05:09:11 +00:00
a8c41e0678 dont pad 0 dim mm inputs (#126475)
Otherwise you get an error in constant_pad_nd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126475
Approved by: https://github.com/huydhn
ghstack dependencies: #125772, #125773, #125780
2024-05-17 05:03:27 +00:00
88582195fd [FSDP2][Test] Fix _test_clip_grad_norm (#126457)
Fixes #ISSUE_NUMBER
We need to compare ref_total_norm to total_norm.full_tensor().
Example:
```
iter_idx:0, rank:0,\
ref_total_norm=tensor(1052.5934, device='cuda:0'),\
total_norm=DTensor(local_tensor=482.0861511230469, device_mesh=DeviceMesh([0, 1]), placements=(_NormPartial(reduce_op='sum', norm_type=2.0),)),\
total_norm.full_tensor()=tensor(1052.5934, device='cuda:0')
```
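
The example above shows why the comparison needs `full_tensor()`: with a partial-norm placement, each rank's local value is only the norm of its own shard. For a finite p-norm, the global norm is the p-th root of the sum of the local norms raised to p. A minimal pure-Python sketch of that reduction, using made-up local norms (the helper name `combine_partial_norms` is hypothetical and not a PyTorch API):

```python
import math


def combine_partial_norms(local_norms, norm_type=2.0):
    """Combine per-rank p-norms into the global p-norm (finite p only)."""
    return sum(n ** norm_type for n in local_norms) ** (1.0 / norm_type)


# Hypothetical local 2-norms held by two ranks.
local_norms = [3.0, 4.0]
total = combine_partial_norms(local_norms)

# The reduced value differs from any single rank's local value,
# so a reference norm must be compared against the reduced one.
print(total, local_norms[0])
```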

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126457
Approved by: https://github.com/awgu
2024-05-17 04:29:21 +00:00
1a27e24ff5 Make inductor scheduler graph extension configurable (#125578)
This patch makes the inductor scheduler graph extension configurable.
It enables ease of debugging by changing the graph format (dot, png, etc.).

Particularly, it's very convenient to work with the graph interactively using tools like https://github.com/tintinweb/vscode-interactive-graphviz

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125578
Approved by: https://github.com/Chillee
2024-05-17 04:19:23 +00:00
da1fc85d60 Add symbolic_shape_specialization structured trace (#126450)
This is typically the information you want when diagnosing why something
overspecialized in dynamic shapes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126450
Approved by: https://github.com/albanD
2024-05-17 02:01:21 +00:00
d2f5a8ac99 [doc] expose torch.Tensor.xpu API to doc (#126383)
# Motivation
The docstring for `torch.Tensor.xpu` was added [here](d61a81a9e7/torch/_tensor_docs.py (L1434)) but is not exposed in the public docs, unlike [torch.Tensor.cuda](https://pytorch.org/docs/stable/generated/torch.Tensor.cuda.html#torch.Tensor.cuda). This PR exposes the documentation of `torch.Tensor.xpu` in the public docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126383
Approved by: https://github.com/albanD
2024-05-17 01:19:03 +00:00
776b878917 [easy] Fix typing for map_location docs in torch.load (#125473)
Currently the docs incorrectly list `Callable[[Tensor, str], Tensor]` as a possible type signature; this should be `Callable[[Storage, str], Storage]`.
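
For illustration, a callable matching the corrected signature takes a storage and a location string and returns a storage. The sketch below defines such a callable without importing torch. The function name `load_on_cpu` and the checkpoint file name in the comment are assumptions made for the example.

```python
def load_on_cpu(storage, location: str):
    """A map_location-style callable: (storage, location) -> storage.

    `storage` is the object being remapped and `location` is the tag of
    the device it was saved from (for example "cuda:0"). Returning the
    storage unchanged leaves it where it was deserialized.
    """
    return storage


# Usage sketch, assuming torch is installed and "checkpoint.pt" exists:
#   import torch
#   state = torch.load("checkpoint.pt", map_location=load_on_cpu)
```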

<img width="716" alt="Screenshot 2024-05-03 at 12 09 54 PM" src="https://github.com/pytorch/pytorch/assets/35276741/b8946f95-8297-445f-a9d9-570b8a3caab1">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125473
Approved by: https://github.com/albanD
2024-05-17 01:15:25 +00:00
697ed6f5b3 [DeviceMesh] Supported N groups in from_group (#126258)
**Overview**
This PR supports constructing an ND mesh with `from_group()` by passing in `group: List[ProcessGroup]` and `mesh: Union[torch.Tensor, "ArrayLike"]` together. The `ndim` of the device mesh returned from `from_group()` is equal to the number of `ProcessGroup`s passed. If the `ndim` is greater than 1, then the `mesh` argument is required (since there is no simple way to recover the `mesh` tensor from the process groups otherwise).

This PR also adds `mesh_dim_names` as an argument to forward to the device mesh for convenience.
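
To illustrate why the `mesh` argument is needed when `ndim > 1`: a rank only sees the members of the groups it belongs to, while the mesh fixes how all ranks are arranged. The sketch below uses plain nested lists instead of `DeviceMesh`. The function name `groups_for_rank` and the 2x4 layout are assumptions made for the example.

```python
def groups_for_rank(mesh, rank):
    """Return the ranks sharing each mesh dimension with `rank` in a 2D mesh.

    The first list varies along dim 0 (same column), the second along
    dim 1 (same row).
    """
    rows, cols = len(mesh), len(mesh[0])
    for i in range(rows):
        for j in range(cols):
            if mesh[i][j] == rank:
                dim0_group = [mesh[r][j] for r in range(rows)]
                dim1_group = list(mesh[i])
                return dim0_group, dim1_group
    raise ValueError(f"rank {rank} is not in the mesh")


# Hypothetical layout: 2 replicas (dim 0), each sharded over 4 ranks (dim 1).
mesh = [[0, 1, 2, 3],
        [4, 5, 6, 7]]
replicate_group, shard_group = groups_for_rank(mesh, 5)
print(replicate_group, shard_group)
```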

<details>
<summary> Old Approach </summary>

**Overview**
- This PR mainly adds `mesh_shape` to `from_group()` so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to `from_group()` with `mesh_shape = (replicate_dim_size, shard_dim_size)` and `from_group()` will construct subgroups for the user. (The user can then get the subgroups from the submeshes.)
    - Constructing the 2D `DeviceMesh` from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general.
- This PR also adds `mesh_dim_names` to `from_group()` so that the user can name the mesh dimensions of the constructed device mesh.

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126258
Approved by: https://github.com/wanchaol
2024-05-17 01:03:21 +00:00
1018a68e31 [export] Delete predispatch tests (#126459)
Deleting predispatch tests as we moved export to predispatch already
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126459
Approved by: https://github.com/tugsbayasgalan
2024-05-17 00:48:32 +00:00
8bb7a2f46d Fix documentation for register_fake_class (#126422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126422
Approved by: https://github.com/angelayi
2024-05-17 00:45:21 +00:00
762ce6f062 Add Lowering for FlexAttention Backwards (#125515)
# Summary
#### What does this PR do?
It enables Inductor to actually generate the fused flex attention kernel for the backward pass.

I did some other things along the way:
- Abstract out the 'build_subgraph_buffer' subroutine and make it reusable between flex attention and flex_attention backwards. In total we need to build 3 subgraphs for fwd + bwd: 1 for the fwd graph and 2 in the bwd. The FAv2 algorithm recomputes parts of the forward (more efficiently, since we already have the row_max via logsumexp), so we need to inline both the fwd graph and the joint graph in the bwds kernel.
- The backwards kernel is based on a somewhat older version of the Triton tutorial implementation. I think we should update to a newer version in a follow-up. Notably, the blocks need to be square for this to work as currently implemented. I am sure there are many opportunities for optimization.
- I didn't correctly register the decomp table + IndexMode when I landed https://github.com/pytorch/pytorch/pull/123902; this remedies that.
- The rel_bias helper func was reversed in terms of causality. I updated it and added a test specific to "future causal" attention.
- The main point in this PR that I think still needs to be worked out is the store_output call. I have it hacked up to be 'fake', but I don't think we want to land that; we likely want to just have a mutated 'dq' and a stored_output 'dk'.
- I also needed to update the `TritonTemplateKernel` to actually accept multiple subgraphs (modifications)
- I updated the benchmark to also profile bwds performance
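
The recomputation mentioned in the first bullet relies on a standard identity: once the forward pass has saved the row-wise logsumexp, the softmax probabilities can be rebuilt as `exp(score - lse)` without another max/sum reduction. A pure-Python sketch of that identity follows. It is not the Triton kernel from this PR, and the function names are illustrative.

```python
import math


def logsumexp(row):
    """Numerically stable log(sum(exp(x))) for one row of scores."""
    m = max(row)
    return m + math.log(sum(math.exp(s - m) for s in row))


def softmax_from_lse(row, lse):
    """Rebuild softmax probabilities from scores and a saved logsumexp."""
    return [math.exp(s - lse) for s in row]


scores = [1.0, 2.0, 3.0]
lse = logsumexp(scores)                 # saved by the forward pass
probs = softmax_from_lse(scores, lse)   # recomputed in the backward pass
print(probs)
```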

### Benchmark Numbers:
_The current implementation is not parallelizing over ctx length in the bwd_
FWD Speedups

| Type    |   Speedup | shape              | score_mod   | dtype          |
|---------|-----------|--------------------|-------------|----------------|
| Average |     0.991 |                    |             |                |
| Max     |     1.182 | (16, 16, 4096, 64) | noop        | torch.bfloat16 |
| Min     |     0.796 | (2, 16, 512, 256)  | head_bias   | torch.bfloat16 |

BWD Speedups

| Type    |   Speedup | shape              | score_mod   | dtype          |
|---------|-----------|--------------------|-------------|----------------|
| Average |     0.291 |                    |             |                |
| Max     |     0.652 | (8, 16, 512, 64)   | head_bias   | torch.bfloat16 |
| Min     |     0.073 | (2, 16, 4096, 128) | head_bias   | torch.bfloat16 |

<details>

<summary>Full Data</summary>

| shape               | score_mod     | dtype          |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
|---------------------|---------------|----------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| (2, 16, 512, 64)    | noop          | torch.bfloat16 |           19.936 |              19.092 |           57.851 |             193.564 |         1.044 |         0.299 |
| (2, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           19.955 |              19.497 |           57.662 |             206.278 |         1.024 |         0.280 |
| (2, 16, 512, 64)    | relative_bias | torch.bfloat16 |           19.455 |              21.297 |           57.674 |             195.219 |         0.913 |         0.295 |
| (2, 16, 512, 64)    | head_bias     | torch.bfloat16 |           19.958 |              21.289 |           57.674 |             193.859 |         0.938 |         0.298 |
| (2, 16, 512, 128)   | noop          | torch.bfloat16 |           28.157 |              28.615 |           82.831 |             454.211 |         0.984 |         0.182 |
| (2, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           28.154 |              28.444 |           83.091 |             432.083 |         0.990 |         0.192 |
| (2, 16, 512, 128)   | relative_bias | torch.bfloat16 |           28.722 |              27.897 |           83.175 |             446.789 |         1.030 |         0.186 |
| (2, 16, 512, 128)   | head_bias     | torch.bfloat16 |           28.299 |              27.673 |           83.052 |             459.179 |         1.023 |         0.181 |
| (2, 16, 512, 256)   | noop          | torch.bfloat16 |           41.167 |              50.504 |          175.019 |            1083.545 |         0.815 |         0.162 |
| (2, 16, 512, 256)   | causal_mask   | torch.bfloat16 |           41.656 |              51.933 |          175.078 |            1171.176 |         0.802 |         0.149 |
| (2, 16, 512, 256)   | relative_bias | torch.bfloat16 |           41.697 |              50.722 |          175.159 |            1097.312 |         0.822 |         0.160 |
| (2, 16, 512, 256)   | head_bias     | torch.bfloat16 |           41.690 |              52.387 |          175.184 |            1097.336 |         0.796 |         0.160 |
| (2, 16, 1024, 64)   | noop          | torch.bfloat16 |           39.232 |              37.454 |          127.847 |             612.430 |         1.047 |         0.209 |
| (2, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |           39.930 |              39.599 |          127.755 |             665.359 |         1.008 |         0.192 |
| (2, 16, 1024, 64)   | relative_bias | torch.bfloat16 |           39.417 |              41.304 |          127.902 |             614.990 |         0.954 |         0.208 |
| (2, 16, 1024, 64)   | head_bias     | torch.bfloat16 |           39.965 |              42.034 |          127.953 |             613.273 |         0.951 |         0.209 |
| (2, 16, 1024, 128)  | noop          | torch.bfloat16 |           63.964 |              71.024 |          226.510 |            1637.669 |         0.901 |         0.138 |
| (2, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |           63.843 |              72.451 |          226.750 |            1558.949 |         0.881 |         0.145 |
| (2, 16, 1024, 128)  | relative_bias | torch.bfloat16 |           64.301 |              70.487 |          226.651 |            1610.063 |         0.912 |         0.141 |
| (2, 16, 1024, 128)  | head_bias     | torch.bfloat16 |           64.033 |              71.394 |          226.676 |            1668.511 |         0.897 |         0.136 |
| (2, 16, 1024, 256)  | noop          | torch.bfloat16 |          129.348 |             141.390 |          507.337 |            4405.175 |         0.915 |         0.115 |
| (2, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          129.538 |             145.680 |          507.178 |            4768.874 |         0.889 |         0.106 |
| (2, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          129.438 |             142.782 |          507.004 |            4401.002 |         0.907 |         0.115 |
| (2, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          129.058 |             146.242 |          507.547 |            4434.251 |         0.883 |         0.114 |
| (2, 16, 4096, 64)   | noop          | torch.bfloat16 |          481.606 |             409.120 |         1440.890 |           14147.269 |         1.177 |         0.102 |
| (2, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |          480.227 |             438.847 |         1434.419 |           14973.386 |         1.094 |         0.096 |
| (2, 16, 4096, 64)   | relative_bias | torch.bfloat16 |          480.831 |             458.104 |         1432.935 |           14193.253 |         1.050 |         0.101 |
| (2, 16, 4096, 64)   | head_bias     | torch.bfloat16 |          480.749 |             452.497 |         1437.040 |           14084.869 |         1.062 |         0.102 |
| (2, 16, 4096, 128)  | noop          | torch.bfloat16 |          872.534 |             848.275 |         2600.895 |           35156.849 |         1.029 |         0.074 |
| (2, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |          872.647 |             868.279 |         2587.581 |           31919.531 |         1.005 |         0.081 |
| (2, 16, 4096, 128)  | relative_bias | torch.bfloat16 |          871.484 |             827.644 |         2593.989 |           34805.634 |         1.053 |         0.075 |
| (2, 16, 4096, 128)  | head_bias     | torch.bfloat16 |          871.422 |             856.437 |         2602.482 |           35708.591 |         1.017 |         0.073 |
| (2, 16, 4096, 256)  | noop          | torch.bfloat16 |         1904.497 |            1758.183 |         6122.416 |           66754.593 |         1.083 |         0.092 |
| (2, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         1911.174 |            1762.821 |         6113.207 |           72759.392 |         1.084 |         0.084 |
| (2, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         1911.254 |            1727.108 |         6123.530 |           66577.988 |         1.107 |         0.092 |
| (2, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         1916.977 |            1801.804 |         6118.158 |           67359.680 |         1.064 |         0.091 |
| (8, 16, 512, 64)    | noop          | torch.bfloat16 |           44.984 |              43.974 |          170.276 |             262.259 |         1.023 |         0.649 |
| (8, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           45.001 |              46.265 |          170.509 |             274.893 |         0.973 |         0.620 |
| (8, 16, 512, 64)    | relative_bias | torch.bfloat16 |           45.466 |              48.211 |          170.606 |             262.759 |         0.943 |         0.649 |
| (8, 16, 512, 64)    | head_bias     | torch.bfloat16 |           45.481 |              48.435 |          170.267 |             261.265 |         0.939 |         0.652 |
| (8, 16, 512, 128)   | noop          | torch.bfloat16 |           72.565 |              74.736 |          313.220 |             773.126 |         0.971 |         0.405 |
| (8, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           72.015 |              75.755 |          313.311 |             775.513 |         0.951 |         0.404 |
| (8, 16, 512, 128)   | relative_bias | torch.bfloat16 |           72.105 |              74.189 |          313.806 |             769.238 |         0.972 |         0.408 |
| (8, 16, 512, 128)   | head_bias     | torch.bfloat16 |           72.005 |              74.364 |          313.509 |             775.237 |         0.968 |         0.404 |
| (8, 16, 512, 256)   | noop          | torch.bfloat16 |          138.656 |             165.453 |          663.707 |            2672.067 |         0.838 |         0.248 |
| (8, 16, 512, 256)   | causal_mask   | torch.bfloat16 |          139.096 |             172.613 |          663.593 |            2926.538 |         0.806 |         0.227 |
| (8, 16, 512, 256)   | relative_bias | torch.bfloat16 |          139.500 |             168.417 |          663.938 |            2658.629 |         0.828 |         0.250 |
| (8, 16, 512, 256)   | head_bias     | torch.bfloat16 |          139.776 |             173.549 |          662.920 |            2667.266 |         0.805 |         0.249 |
| (8, 16, 1024, 64)   | noop          | torch.bfloat16 |          134.883 |             125.004 |          484.706 |            1195.254 |         1.079 |         0.406 |
| (8, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |          134.297 |             132.875 |          485.420 |            1234.953 |         1.011 |         0.393 |
| (8, 16, 1024, 64)   | relative_bias | torch.bfloat16 |          134.839 |             139.231 |          485.470 |            1198.556 |         0.968 |         0.405 |
| (8, 16, 1024, 64)   | head_bias     | torch.bfloat16 |          133.822 |             136.449 |          485.608 |            1189.198 |         0.981 |         0.408 |
| (8, 16, 1024, 128)  | noop          | torch.bfloat16 |          235.470 |             234.765 |          886.094 |            2662.944 |         1.003 |         0.333 |
| (8, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |          236.305 |             241.382 |          886.293 |            2646.984 |         0.979 |         0.335 |
| (8, 16, 1024, 128)  | relative_bias | torch.bfloat16 |          236.414 |             233.980 |          885.250 |            2642.178 |         1.010 |         0.335 |
| (8, 16, 1024, 128)  | head_bias     | torch.bfloat16 |          237.176 |             239.040 |          885.754 |            2665.242 |         0.992 |         0.332 |
| (8, 16, 1024, 256)  | noop          | torch.bfloat16 |          504.445 |             517.855 |         1978.956 |            9592.906 |         0.974 |         0.206 |
| (8, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          502.428 |             536.002 |         1978.611 |           10607.342 |         0.937 |         0.187 |
| (8, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          503.396 |             523.960 |         1977.993 |            9539.284 |         0.961 |         0.207 |
| (8, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          503.818 |             536.014 |         1980.131 |            9576.262 |         0.940 |         0.207 |
| (8, 16, 4096, 64)   | noop          | torch.bfloat16 |         1970.139 |            1674.930 |         5750.940 |           16724.134 |         1.176 |         0.344 |
| (8, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |         1959.036 |            1775.056 |         5780.512 |           17390.350 |         1.104 |         0.332 |
| (8, 16, 4096, 64)   | relative_bias | torch.bfloat16 |         1947.198 |            1773.869 |         5780.643 |           16779.699 |         1.098 |         0.345 |
| (8, 16, 4096, 64)   | head_bias     | torch.bfloat16 |         1963.935 |            1829.502 |         5780.018 |           16703.259 |         1.073 |         0.346 |
| (8, 16, 4096, 128)  | noop          | torch.bfloat16 |         3582.711 |            3362.623 |        10436.069 |           36415.565 |         1.065 |         0.287 |
| (8, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |         3581.504 |            3499.472 |        10346.869 |           36164.959 |         1.023 |         0.286 |
| (8, 16, 4096, 128)  | relative_bias | torch.bfloat16 |         3589.779 |            3337.849 |        10529.621 |           36261.696 |         1.075 |         0.290 |
| (8, 16, 4096, 128)  | head_bias     | torch.bfloat16 |         3602.265 |            3436.444 |        10458.660 |           36507.790 |         1.048 |         0.286 |
| (8, 16, 4096, 256)  | noop          | torch.bfloat16 |         7695.923 |            7126.275 |        24643.009 |          140949.081 |         1.080 |         0.175 |
| (8, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         7679.939 |            7186.252 |        24538.105 |          157156.067 |         1.069 |         0.156 |
| (8, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         7681.374 |            6994.832 |        24549.713 |          140077.179 |         1.098 |         0.175 |
| (8, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         7679.822 |            7212.278 |        24627.823 |          140675.003 |         1.065 |         0.175 |
| (16, 16, 512, 64)   | noop          | torch.bfloat16 |           80.126 |              78.291 |          333.719 |             541.165 |         1.023 |         0.617 |
| (16, 16, 512, 64)   | causal_mask   | torch.bfloat16 |           80.065 |              81.696 |          333.779 |             551.113 |         0.980 |         0.606 |
| (16, 16, 512, 64)   | relative_bias | torch.bfloat16 |           80.138 |              86.715 |          333.364 |             542.118 |         0.924 |         0.615 |
| (16, 16, 512, 64)   | head_bias     | torch.bfloat16 |           80.415 |              85.204 |          333.294 |             536.840 |         0.944 |         0.621 |
| (16, 16, 512, 128)  | noop          | torch.bfloat16 |          134.964 |             138.025 |          607.093 |            1333.102 |         0.978 |         0.455 |
| (16, 16, 512, 128)  | causal_mask   | torch.bfloat16 |          134.192 |             141.523 |          606.269 |            1424.318 |         0.948 |         0.426 |
| (16, 16, 512, 128)  | relative_bias | torch.bfloat16 |          135.711 |             138.639 |          606.283 |            1327.974 |         0.979 |         0.457 |
| (16, 16, 512, 128)  | head_bias     | torch.bfloat16 |          135.552 |             140.555 |          607.107 |            1347.370 |         0.964 |         0.451 |
| (16, 16, 512, 256)  | noop          | torch.bfloat16 |          275.113 |             315.144 |         1301.583 |            5268.153 |         0.873 |         0.247 |
| (16, 16, 512, 256)  | causal_mask   | torch.bfloat16 |          274.867 |             328.106 |         1302.513 |            5770.594 |         0.838 |         0.226 |
| (16, 16, 512, 256)  | relative_bias | torch.bfloat16 |          276.052 |             321.770 |         1302.904 |            5241.920 |         0.858 |         0.249 |
| (16, 16, 512, 256)  | head_bias     | torch.bfloat16 |          271.409 |             328.839 |         1302.142 |            5266.037 |         0.825 |         0.247 |
| (16, 16, 1024, 64)  | noop          | torch.bfloat16 |          260.489 |             237.463 |          955.884 |            1817.558 |         1.097 |         0.526 |
| (16, 16, 1024, 64)  | causal_mask   | torch.bfloat16 |          262.378 |             254.350 |          955.280 |            1843.807 |         1.032 |         0.518 |
| (16, 16, 1024, 64)  | relative_bias | torch.bfloat16 |          261.338 |             268.253 |          956.038 |            1820.036 |         0.974 |         0.525 |
| (16, 16, 1024, 64)  | head_bias     | torch.bfloat16 |          262.153 |             264.156 |          956.023 |            1810.076 |         0.992 |         0.528 |
| (16, 16, 1024, 128) | noop          | torch.bfloat16 |          476.475 |             461.413 |         1760.578 |            4306.521 |         1.033 |         0.409 |
| (16, 16, 1024, 128) | causal_mask   | torch.bfloat16 |          473.794 |             479.178 |         1761.277 |            4619.439 |         0.989 |         0.381 |
| (16, 16, 1024, 128) | relative_bias | torch.bfloat16 |          473.839 |             463.282 |         1758.692 |            4290.562 |         1.023 |         0.410 |
| (16, 16, 1024, 128) | head_bias     | torch.bfloat16 |          472.979 |             472.896 |         1763.086 |            4367.931 |         1.000 |         0.404 |
| (16, 16, 1024, 256) | noop          | torch.bfloat16 |         1014.184 |            1026.764 |         3922.997 |           19104.147 |         0.988 |         0.205 |
| (16, 16, 1024, 256) | causal_mask   | torch.bfloat16 |         1013.217 |            1039.046 |         3928.382 |           21086.281 |         0.975 |         0.186 |
| (16, 16, 1024, 256) | relative_bias | torch.bfloat16 |         1008.519 |            1015.278 |         3922.133 |           18980.652 |         0.993 |         0.207 |
| (16, 16, 1024, 256) | head_bias     | torch.bfloat16 |         1011.360 |            1047.542 |         3931.245 |           19069.172 |         0.965 |         0.206 |
| (16, 16, 4096, 64)  | noop          | torch.bfloat16 |         3929.850 |            3325.667 |        11411.704 |           23344.280 |         1.182 |         0.489 |
| (16, 16, 4096, 64)  | causal_mask   | torch.bfloat16 |         3885.262 |            3581.544 |        11390.515 |           23725.639 |         1.085 |         0.480 |
| (16, 16, 4096, 64)  | relative_bias | torch.bfloat16 |         3865.737 |            3537.308 |        11489.901 |           23406.330 |         1.093 |         0.491 |
| (16, 16, 4096, 64)  | head_bias     | torch.bfloat16 |         3880.530 |            3665.249 |        11484.411 |           23299.496 |         1.059 |         0.493 |
| (16, 16, 4096, 128) | noop          | torch.bfloat16 |         7030.306 |            6745.715 |        20621.264 |           57464.096 |         1.042 |         0.359 |
| (16, 16, 4096, 128) | causal_mask   | torch.bfloat16 |         7095.414 |            7034.385 |        20410.656 |           61660.511 |         1.009 |         0.331 |
| (16, 16, 4096, 128) | relative_bias | torch.bfloat16 |         7084.779 |            6686.497 |        20315.161 |           57243.969 |         1.060 |         0.355 |
| (16, 16, 4096, 128) | head_bias     | torch.bfloat16 |         7075.367 |            6863.305 |        20494.385 |           58481.953 |         1.031 |         0.350 |
| (16, 16, 4096, 256) | noop          | torch.bfloat16 |        15612.741 |           14297.482 |        55306.847 |          281161.865 |         1.092 |         0.197 |
| (16, 16, 4096, 256) | causal_mask   | torch.bfloat16 |        15326.592 |           14263.878 |        55227.806 |          313063.232 |         1.075 |         0.176 |
| (16, 16, 4096, 256) | relative_bias | torch.bfloat16 |        15297.963 |           14007.379 |        54558.029 |          279529.175 |         1.092 |         0.195 |
| (16, 16, 4096, 256) | head_bias     | torch.bfloat16 |        15216.160 |           14276.027 |        55081.581 |          280996.826 |         1.066 |         0.196 |

</details>
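
The speedup columns appear to be eager time divided by compiled time, so values below 1 mean the compiled kernel is slower than eager. A quick check against the first row of the full data, `(2, 16, 512, 64)` with `noop` (the helper name `speedup` is illustrative):

```python
def speedup(eager_time, compiled_time):
    """Speedup of the compiled kernel relative to eager (>1 means faster)."""
    return eager_time / compiled_time


# Times copied from the first row of the full data table.
fwd = round(speedup(19.936, 19.092), 3)   # table reports 1.044
bwd = round(speedup(57.851, 193.564), 3)  # table reports 0.299
print(fwd, bwd)
```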

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125515
Approved by: https://github.com/Chillee
2024-05-17 00:41:55 +00:00
337830f657 Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021)"
This reverts commit f060b0c6e608436997a1dc229c82ce26c1e6676f.

Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/huydhn due to Unfortunately, the new tests are still failing internally ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2116415398))
2024-05-17 00:22:40 +00:00
4a5ef0b793 Revert "[inductor][cpp] epilogue support for gemm template (#126019)"
This reverts commit 7844c202b2076ec3efa23264226f3eaef11a6fcb.

Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the dependency PR https://github.com/pytorch/pytorch/pull/124021 is going to be reverted ([comment](https://github.com/pytorch/pytorch/pull/126019#issuecomment-2116408137))
2024-05-17 00:15:00 +00:00
59ca0d8c14 Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)"
This reverts commit 927e631dc2356c0cb600dbdf9e8f84ce792a8ba1.

Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the dependency PR https://github.com/pytorch/pytorch/pull/124021 is going to be reverted ([comment](https://github.com/pytorch/pytorch/pull/126019#issuecomment-2116408137))
2024-05-17 00:15:00 +00:00
cb3b8cd0d3 Use object identity for deepcopy memo (#126126)
Copy of #126089, with some additional fixes & tests

Partial fix for #125635: previously, the deepcopy implementation would group together any tensors with any aliasing relationship and assign them to the same tensor. This was sort of good if you have two tensors `b = a.detach()`, because then if you deepcopy `list = [a, b]` to `list2 = list.deepcopy()`, then writes to `list2[0]` will also modify `list2[1]`. But for the most part, it's bad; (1) if you have `b = a.as_strided((4, 4), (16, 1), 16)`, then it'll make `b == a` in the deepcopied implementation, which is completely wrong; and (2) even if you have `b = a.detach()`, these are still initially two different tensors which become the same tensor after the old deepcopy implementation.

The new implementation only groups together tensors that have the same identity. This is a partial fix, but it's more reasonable. What changes:
* (becomes more correct): different views of the same base tensor will no longer all become equal after deepcopying
* (still kind of wrong): views won't actually alias each other after deepcopying.
* (arguably a minor regression): equivalent views of the same tensor will no longer be copied to the same tensor - so they won't alias.
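
The difference between the two memo strategies can be sketched with a toy model in pure Python. `FakeTensor` and both copy helpers below are hypothetical stand-ins, not the actual C++ implementation. They only contrast a memo keyed on the shared storage with a memo keyed on object identity.

```python
class FakeTensor:
    """Toy stand-in for a tensor: a view (offset) into a shared storage list."""

    def __init__(self, storage, offset=0):
        self.storage = storage
        self.offset = offset


def deepcopy_alias_memo(tensors):
    """Old behavior (sketch): one copy per storage, reused by every alias."""
    memo, out = {}, []
    for t in tensors:
        key = id(t.storage)
        if key not in memo:
            memo[key] = FakeTensor(list(t.storage), t.offset)
        out.append(memo[key])
    return out


def deepcopy_identity_memo(tensors):
    """New behavior (sketch): one copy per distinct tensor object."""
    memo, out = {}, []
    for t in tensors:
        key = id(t)
        if key not in memo:
            memo[key] = FakeTensor(list(t.storage), t.offset)
        out.append(memo[key])
    return out


base = [0.0] * 32
a = FakeTensor(base, offset=0)
b = FakeTensor(base, offset=16)  # a different view of the same storage

old = deepcopy_alias_memo([a, b, a])
new = deepcopy_identity_memo([a, b, a])

print(old[0] is old[1])  # distinct views collapsed into one object
print(new[0] is new[1])  # distinct views stay distinct
print(new[0] is new[2])  # the same object is still copied only once
```

In this toy model the identity-keyed copies of `a` and `b` also get separate storages, which mirrors the "views won't actually alias each other" caveat in the list above.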

BC breaking: C++ deepcopy interface changes from accepting `IValue::HashAliasedIValueMap memo` to accepting `IValue::HashIdentityIValueMap memo`. If there are objections, we can keep the old API. However, it seems likely that users generally won't try to deepcopy from C++.

Differential Revision: [D57406306](https://our.internmc.facebook.com/intern/diff/D57406306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126126
Approved by: https://github.com/ezyang
2024-05-17 00:06:26 +00:00
55628624b8 [c10d] add pg_name and pg_desc to logger (#126409)
Summary:
This should further improve our debuggability

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126409
Approved by: https://github.com/XilunWu
2024-05-16 23:56:19 +00:00
796dff7147 Import MKL via //third-party/mkl targets (#126371)
Summary:
This is a step towards upgrading the MKL library and using buckified targets rather than importing from TP2.

- Add new `//third-party/mkl:mkl_xxx` targets that are currently aliases to `third-party//IntelComposerXE:mkl_xxx`.
- Switch usage of `external_deps = [("IntelComposerXE", None, "mkl_xxx")]` to `deps = ["fbsource//third-party/mkl:mkl_xxx"]`

Note that this only changes references to `mkl_xxx` in `IntelComposerXE`, not references to "svml" or "ipp*".

Test Plan: sandcastle

Differential Revision: D57360438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126371
Approved by: https://github.com/bertmaher
2024-05-16 22:51:26 +00:00
62403b57b9 Add prefix option to CapabilityBasedPartitioner (#126382)
Summary: Add prefix arg so that users can provide the submodule name to partitioner.

Test Plan: https://fburl.com/anp/2kue4qp9

Differential Revision: D57416926

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126382
Approved by: https://github.com/SherlockNoMad
2024-05-16 22:38:07 +00:00
c226839f5c Eliminate some C++11 checks (#126308)
Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D57246912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126308
Approved by: https://github.com/Skylion007
2024-05-16 22:37:45 +00:00
f17572fcf6 add 3.12 inductor CI tests (#126218)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126218
Approved by: https://github.com/huydhn, https://github.com/desertfire
2024-05-16 22:29:24 +00:00
93524cf5ff [compiled autograd] clear compiled_autograd_verbose once test is done (#126148)
The verbose flag leaks into tests run afterwards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126148
Approved by: https://github.com/jansel
ghstack dependencies: #126144, #126146
2024-05-16 22:23:02 +00:00
cef7756c9c [inductor] Clear cache on ctx manager exit (#126146)
FIXES https://github.com/pytorch/pytorch/issues/126128.

Right now, we only clear the cache on ctx manager enter, so the state is bad unless we call fresh_inductor_cache again. This is usually fine in tests.

The compiled autograd tests hit this when going from TestCompiledAutograd -> TestAutogradWithCompiledAutograd:
TestCompiledAutograd uses the ctx manager, but TestAutogradWithCompiledAutograd doesn't.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126146
Approved by: https://github.com/jgong5, https://github.com/oulgen
ghstack dependencies: #126144
2024-05-16 22:23:02 +00:00
4cd4463c1c [compiled autograd] Fix LoggingTensor flaky test (#126144)
LoggingTensor fails consistently when the root logger level is INFO or lower.
By default, the root logger level should be WARNING.
However, triton driver initialization overwrites the root logger level to INFO, which causes flakiness: https://github.com/pytorch/pytorch/issues/126143
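
The effect can be reproduced with the standard `logging` module alone. The sketch below only simulates a library lowering the root level; how the triton driver actually does it is not shown here.

```python
import logging

root = logging.getLogger()
saved_level = root.level  # remember the level so it can be restored

root.setLevel(logging.WARNING)  # Python's default root level
info_enabled_before = root.isEnabledFor(logging.INFO)

root.setLevel(logging.INFO)  # simulates a library lowering the root level
info_enabled_after = root.isEnabledFor(logging.INFO)

root.setLevel(saved_level)  # undo the change
```

With the default level, INFO records are dropped. Once the root level is lowered, every logger without its own level starts emitting them, which is the kind of global change a test can trip over.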

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126144
Approved by: https://github.com/jansel
2024-05-16 22:23:02 +00:00
4b7eee3450 Print export warning only once in capture_pre_autograd (#126403)
Summary: Missed this in D57163341

Test Plan: CI

Differential Revision: D57442088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126403
Approved by: https://github.com/zhxchen17
2024-05-16 21:55:11 +00:00
e9719aec30 Fix strict default value in StateDictOptions (#125998)
Fixes #125992

The default value of the parameter `strict` should be `True`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125998
Approved by: https://github.com/fegin
2024-05-16 21:42:53 +00:00
f5abf28e41 [Traceable FSDP2] Use DTensor.from_local() in _from_local_no_grad when compile (#126346)
As discussed before, for now Dynamo is not able to support the DTensor constructor, and instead we have to use `DTensor.from_local()`.

This won't affect eager and it's a compile-only change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126346
Approved by: https://github.com/awgu
2024-05-16 21:37:00 +00:00
4f1a56cd42 Switched from parameter in can_cast to from_. (#126030)
Fixes #126012.

`from` is a reserved keyword in Python, thus we can't make the C++ impl available with `from` as a function parameter. This PR changes the name to `from_` and also adjusts the docs.

If we want to preserve backwards compatibility, we can leave the C++ name as-is and only fix the docs. However, `torch.can_cast(from_=torch.int, to=torch.int)` won't work then.
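
The keyword restriction is easy to check in plain Python. In this sketch, `can_cast` is a hypothetical stand-in with the same parameter names, not `torch.can_cast` itself.

```python
import keyword


def can_cast(from_, to):
    """Hypothetical stand-in; only the parameter names matter here."""
    return (from_, to)


# `from` is a reserved word, so it cannot appear as a keyword argument.
try:
    compile("can_cast(from=1, to=2)", "<example>", "eval")
    from_is_valid_kwarg = True
except SyntaxError:
    from_is_valid_kwarg = False

result = can_cast(from_=1, to=2)
```

The call spelled with `from=` fails at compile time, before any function is looked up, while the `from_=` spelling parses and runs normally.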

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126030
Approved by: https://github.com/albanD
2024-05-16 20:58:24 +00:00
82c66bc41a Make 'pytest test/inductor/test_memory_planning.py' work (#126397)
There's still another naughty direct test_* import, I'm out of patience
right now though.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126397
Approved by: https://github.com/peterbell10, https://github.com/int3
2024-05-16 20:28:20 +00:00
866ca4630c Don't install inplace_methods on MockHandler, not needed (#126398)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126398
Approved by: https://github.com/jansel, https://github.com/peterbell10
2024-05-16 20:28:05 +00:00
8f0c207e18 xpu: implement xpu serialization (#125530)
Fixes: #125529

BC-breaking note:
The deprecated "async" argument to Storage.cuda and Storage.hpu has been removed. Use non_blocking instead.

CC: @jbschlosser, @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125530
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-05-16 20:22:17 +00:00
da9bf77f0a [Dynamo] Support SET_UPDATE (#126243)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126243
Approved by: https://github.com/anijain2305, https://github.com/Skylion007, https://github.com/jansel
2024-05-16 20:05:34 +00:00
aab448e381 Remove redundant serialization code (#126249)
After https://github.com/pytorch/pytorch/pull/123308, we no longer need a separate serialization path to handle the different types that exist in the `nn_module` metadata. This PR cleans up the redundant code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126249
Approved by: https://github.com/angelayi
2024-05-16 19:22:20 +00:00
5862521ad1 [onnx.export] Cache SetGraphInputTypeReliable (#124912)
This PR is part of an effort to speed up torch.onnx.export (https://github.com/pytorch/pytorch/issues/121422).

- For each node that is processed in onnx.export, a check is run to see if all inputs are "reliable" (static shape, etc.). This value does not change, so it is much faster to cache it on the first computation. The caching is added to the ConstantMap state.
- Resolves (6) in #121422.
- Also see #123028 with a similar addition of a cache state.
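
The caching idea can be sketched generically in Python. This is not the ONNX exporter code: `ReliabilityCache` is a hypothetical class, and the lambda is a placeholder for the real per-node check.

```python
class ReliabilityCache:
    """Hypothetical cache: compute a per-node flag once, then reuse it."""

    def __init__(self, check):
        self._check = check
        self._results = {}
        self.calls = 0  # counts how often the expensive check actually runs

    def is_reliable(self, node_id):
        if node_id not in self._results:
            self.calls += 1
            self._results[node_id] = self._check(node_id)
        return self._results[node_id]


# Placeholder check; the real one inspects a node's inputs.
cache = ReliabilityCache(lambda node_id: node_id % 2 == 0)
answers = [cache.is_reliable(n) for n in (2, 3, 2, 2, 3)]
```

Five lookups trigger only two real checks, one per distinct node. Caching like this is only safe because the value does not change once computed.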

(partial fix of #121545)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124912
Approved by: https://github.com/justinchuby
2024-05-16 18:48:56 +00:00
a0429c01ad [BE][FSDP] Remove unnecessary warnings (#126365)
As title

Differential Revision: [D57419704](https://our.internmc.facebook.com/intern/diff/D57419704/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126365
Approved by: https://github.com/awgu, https://github.com/Skylion007
ghstack dependencies: #126362
2024-05-16 17:34:01 +00:00
0dd53650dd [BE][FSDP] Change the logging level to info (#126362)
As title

Differential Revision: [D57419445](https://our.internmc.facebook.com/intern/diff/D57419445/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126362
Approved by: https://github.com/awgu, https://github.com/Skylion007
2024-05-16 17:31:06 +00:00
9fbf2696d7 [AOTI][refactor] Add aoti_torch_item as a util function (#126352)
Summary: The logic has been repeated several times in the code, so it is worth writing a common util function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126352
Approved by: https://github.com/chenyang78
ghstack dependencies: #126181, #126182, #126183
2024-05-16 17:07:06 +00:00
0332b5812e [AOTI] Support InplaceBernoulliFallback in the ABI-compatible codegen (#126183)
Summary: Update the torchgen rule for inplace ops like bernoulli_, and update InplaceBernoulliFallback to codegen in the ABI-compatible mode. Fixes https://github.com/pytorch/pytorch/issues/121809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126183
Approved by: https://github.com/angelayi
ghstack dependencies: #126181, #126182
2024-05-16 17:07:06 +00:00
5792bc3c3e [AOTI] Refactor some fallback op util functions (#126182)
Summary: Move some util functions for cpp kernel naming and missing arg filling from FallbackKernel to ExternKernel, since they are useful for ExternKernel in general.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126182
Approved by: https://github.com/chenyang78
ghstack dependencies: #126181
2024-05-16 17:07:00 +00:00
c5f926ab87 [AOTI][torchgen] Support at::Generator via C shim (#126181)
Summary: Support at::Generator which is used by many random number generator ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126181
Approved by: https://github.com/chenyang78
2024-05-16 17:06:53 +00:00
a55d63659a Add 2nd shard to ROCm trunk workflow for core distributed UTs (#121716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121716
Approved by: https://github.com/ezyang, https://github.com/huydhn
2024-05-16 16:50:02 +00:00
f155ed6bf2 [ROCm] amax hipblaslt integration (#125921)
AMAX is coming as part of rocm6.2. This code adds that functionality

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125921
Approved by: https://github.com/eqy, https://github.com/lezcano
2024-05-16 16:40:31 +00:00
14d8e3aec0 Add distributed/_tensor/test_attention to ROCM_BLOCKLIST (#126336)
Fixes #125504
Fixes #126252
Fixes #126296
Fixes #126330

This PR doesn't really fix the RingAttentionTest tests for ROCm, but explicitly adds the whole test file to ROCM_BLOCKLIST to get a clean signal on ROCm distributed CI. We will enable these tests in a follow-up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126336
Approved by: https://github.com/huydhn, https://github.com/pruthvistony
2024-05-16 16:38:09 +00:00
91bf952d10 Fix aarch64 debug build with GCC (#126290)
By working around GCC's quirks in instantiating templates that require immediate values.
Provide alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`)

Fixes https://github.com/pytorch/pytorch/issues/126283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126290
Approved by: https://github.com/atalman, https://github.com/seemethere
2024-05-16 13:41:45 +00:00
ab07867084 [FSDP2] Supported set_all_reduce_gradients=False for HSDP (#126166)
**Context**
For FSDP, gradient accumulation across microbatches has two flavors: (1) reduce-scatter or (2) no reduce-scatter. (1) incurs the collective per microbatch backward but saves gradient memory (storing the sharded gradients), while (2) avoids the communication but uses more gradient memory (storing the unsharded gradients).
- FSDP2 offers (1) without any intervention. The user should simply make sure to run the optimizer step after `K` microbatches for `K > 1`.
- FSDP2 offers (2) via `module.set_requires_gradient_sync()` (e.g. `module.set_requires_gradient_sync(is_last_microbatch)`).

For HSDP, since we reduce-scatter and then all-reduce, we have additional flexibility and get three flavors: (1) reduce-scatter and all-reduce, (2) reduce-scatter but no all-reduce, and (3) no reduce-scatter and no all-reduce. This PR adds support for (2).
- FSDP2 offers (1) without any intervention like mentioned above.
- FSDP2 offers (3) via `module.set_requires_gradient_sync()` like mentioned above.
- FSDP2 offers (2) via `module.set_requires_all_reduce()` similar to `set_requires_gradient_sync()`.
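
The three HSDP flavors can be summarized as a small decision function. This is a sketch of the behavior described above, not FSDP2 code: `hsdp_collectives` is a hypothetical helper whose two arguments mirror the values passed to `set_requires_gradient_sync()` and `set_requires_all_reduce()`.

```python
def hsdp_collectives(requires_gradient_sync=True, requires_all_reduce=True):
    """Return which gradient collectives run for one microbatch backward."""
    if not requires_gradient_sync:
        return []  # flavor (3): no reduce-scatter and no all-reduce
    if not requires_all_reduce:
        return ["reduce_scatter"]  # flavor (2): reduce-scatter only
    return ["reduce_scatter", "all_reduce"]  # flavor (1): the default


# Flavor (2) during accumulation: all-reduce only on the last microbatch.
num_microbatches = 3
schedule = [
    hsdp_collectives(requires_all_reduce=(i == num_microbatches - 1))
    for i in range(num_microbatches)
]
```

With three microbatches, the first two run only the reduce-scatter and the last one also runs the all-reduce.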

**Overview**
For HSDP, to reduce-scatter but not all-reduce during gradient accumulation, the user can do something like:
```
for microbatch_idx, microbatch in enumerate(microbatches):
    is_last_microbatch = microbatch_idx == len(microbatches) - 1
    model.set_requires_all_reduce(is_last_microbatch)
    # Run forward/backward
```

This PR also makes the minor change of making the `recurse: bool` argument in these setter methods keyword-only.

**Developer Notes**
We choose to implement this by saving the partial reduce output to the `FSDPParamGroup` for simplicity, where we assume that the set of parameters that receive gradients does not change across microbatches. An alternative would be to view into the partial reduce output per parameter and save the view to each parameter. We prefer to avoid this alternative for now because it introduces more complexity to do extra viewing when saving the partial reduce output to each parameter, accumulating into them, and accumulating back to the last microbatch's reduce output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126166
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #126067, #126070, #126161
2024-05-16 12:29:22 +00:00
c2f8c75129 [Reopen] Upgrade submodule oneDNN to v3.4.2 (#126137)
Reopen of https://github.com/pytorch/pytorch/pull/122472

## Improvements
This upgrade fixes the following issues:
- https://github.com/pytorch/pytorch/issues/120982

This upgrade brings the following new features:
- Introduced memory descriptor serialization API. This API is needed to support freezing on CPU in AOTInductor (https://github.com/pytorch/pytorch/issues/114450)

## Validation results on CPU
Original results with oneDNN v3.4.1 are here: https://github.com/pytorch/pytorch/pull/122472#issue-2201602846

Need to rerun validation and update results.

Co-authored-by: Sunita Nadampalli <nadampal@amazon.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126137
Approved by: https://github.com/jgong5, https://github.com/snadampal, https://github.com/atalman
2024-05-16 12:00:16 +00:00
691af57fbc Fix broken link of scikit-learn (#120972)
The link is broken in https://pytorch.org/docs/main/community/design.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120972
Approved by: https://github.com/Skylion007
2024-05-16 11:46:34 +00:00
4333e122d4 [Traceable FSDP2] Add all_gather_into_tensor out variant (#126334)
This PR adds `torch.ops._c10d_functional.all_gather_into_tensor_out`.

It's important for tracing FSDP2, because FSDP2 pre-allocates the output buffer of AllGather, and makes input buffer an alias of the output buffer, and expects both of them to be used to achieve lower memory usage. If we don't preserve this behavior and instead functionalize the AllGather op, AllGather op will then create a brand-new output buffer (instead of reusing), thus significantly increasing the memory usage.

The expectation is that we will "re-inplace" the AllGather op by switching to the out variant in Inductor post-grad stage via an FX pass, so this API is not expected to be directly used by users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126334
Approved by: https://github.com/yifuwang, https://github.com/wanchaol
2024-05-16 10:27:06 +00:00
d61a81a9e7 Fix lint failures coming from #126035 (#126378)
MYPY somehow shows lots of local failures for me.  The issue is tracked in https://github.com/pytorch/pytorch/issues/126361.  This is only to keep trunk sane.  These two lines were added by #126035 as an attempt to fix lint there, but didn't seem to help.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126378
Approved by: https://github.com/kit1980
2024-05-16 06:05:47 +00:00
0716f75cfb Revert "Add Lowering for FlexAttention Backwards (#125515)"
This reverts commit 95b9e981c3ab68fc17f78b8a6bbfd9569745ae4c.

Reverted https://github.com/pytorch/pytorch/pull/125515 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the newly added test runs out of memory 95b9e981c3 ([comment](https://github.com/pytorch/pytorch/pull/125515#issuecomment-2114084869))
2024-05-16 05:52:13 +00:00
cdcba4dee5 Revert "Fix lint failures coming from #126035 (#126378)"
This reverts commit 5fa1f4c6e46d92482d99614c06b6e288cc8d6c8d.

Reverted https://github.com/pytorch/pytorch/pull/126378 on behalf of https://github.com/huydhn due to Trying to add yet another lint fix from https://hud.pytorch.org/pr/pytorch/pytorch/126357 and will reland this ([comment](https://github.com/pytorch/pytorch/pull/126378#issuecomment-2114060547))
2024-05-16 05:32:19 +00:00
58378f1224 [Doc] Add deprecated autocast comments for doc (#126062)
# Motivation
We generalize a device-agnostic API `torch.amp.autocast` in [#125103](https://github.com/pytorch/pytorch/pull/125103).  After that,
- `torch.cpu.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cpu', args...)`, and
- `torch.cuda.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cuda', args...)`

in both eager mode and JIT mode.
Based on this, we would like to deprecate `torch.cpu.amp.autocast` and `torch.cuda.amp.autocast` and **strongly recommend** that developers use the device-agnostic `torch.amp.autocast` API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126062
Approved by: https://github.com/eqy, https://github.com/albanD
2024-05-16 05:26:43 +00:00
08aa704d0c [1/N] Non-Tensor: Scalar Support: Enable aot compile to support aten operations with scalar input like alpha (#124177)
Some operations have a scalar input parameter, like `torch.add(a, b, alpha=2.0)`.  Currently, the aot compile does not support such a case because it requires the signature of the captured graph to align with the operation's signature. This means that some inputs in the captured graph may be scalars (float, int, bool, etc.). This breaks the assumption of `compile_fx_aot` that all the example inputs are tensors - 0f6ce45bcb/torch/_inductor/compile_fx.py (L1048)

This PR intends to support such cases by allowing a non-aligned signature and filtering out the non-Tensor parameters.

Captured graph for `torch.add(a, b, alpha=2.0)`

```
opcode         name      target           args              kwargs
-------------  --------  ---------------  ----------------  --------------
placeholder    arg0_1    arg0_1           ()                {}
placeholder    arg1_1    arg1_1           ()                {}
call_function  add       aten.add.Tensor  (arg0_1, arg1_1)  {'alpha': 2.0}
output         output_1  output           ((add,),)         {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124177
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/jgong5
2024-05-16 05:15:55 +00:00
5fa1f4c6e4 Fix lint failures coming from #126035 (#126378)
MYPY somehow shows lots of local failures for me.  The issue is tracked in https://github.com/pytorch/pytorch/issues/126361.  This is only to keep trunk sane.  These two lines were added by #126035 as an attempt to fix lint there, but didn't seem to help.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126378
Approved by: https://github.com/kit1980
2024-05-16 05:12:27 +00:00
e661a42428 [Add sliding window attention bias] (#126061)
Summary:
This PR implements sliding window and updates "aten._flash_attention_forward/_flash_attention_backward" to expose the window_size_left and window_size_right arguments. With this kwarg added we can dispatch to the FAv2 impl if the necessary constraints are met.

These arguments will eventually be provided to "aten.sdpa_flash" but for now they are needed when called by xformers in their effort to directly use the PyTorch FAv2 impl instead of building their own.

Test Plan:
Use the default aten.sdpa_flash tests since we've added optional arguments set to the previous default value: -1, /*window_size_left*/

Using buck2 build --flagfile fbcode//mode/dev-nosan fbcode//caffe2/caffe2/fb/predictor/tests:inference_context_test

Differential Revision: D56938087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126061
Approved by: https://github.com/drisspg, https://github.com/desertfire
2024-05-16 04:50:47 +00:00
8dc6f455bd [ez] fix exported diff mismatch (#126357)
Fixes the following issue:
D55803461 differs from the exported PR: #123658

⚠️ this PR needs to be skipped on diff train!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126357
Approved by: https://github.com/huydhn, https://github.com/fegin
2024-05-16 04:49:48 +00:00
6e6e44bdcc Generate runtime asserts when propagate real tensor is used (#126287)
This means that propagate real tensor is no longer unsound: if the
route we took at compile time diverges from the one taken at runtime,
you will get a runtime assert.

Also add structured trace logs for these.

Also fix bug where xreplace with int range is not guaranteed to return
a sympy expression.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126287
Approved by: https://github.com/Skylion007
2024-05-16 04:45:57 +00:00
c860df5a9d [c10d] Add an option for NAN check on every collective (#125726)
Summary:
The NaN check is done through a device-side assert, without needing to
copy from GPU to CPU.
Test Plan:
Unit test for collectives that should experience run time error

(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (38f5143e)]$  python
test/distributed/test_c10d_nccl.py ProcessGroupNCCLTest.test_nan_assert
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)`
failed.
[rank0]:[E507 17:31:56.885473996 Utils.cu:30] CUDA error during
checkForNan: device-side assert triggered

/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)`
failed.
[rank1]:[E507 17:31:56.128961534 Utils.cu:30] CUDA error during
checkForNan: device-side assert triggered

.
----------------------------------------------------------------------
Ran 1 test in 7.723s

OK

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125726
Approved by: https://github.com/kwen2501
2024-05-16 04:35:15 +00:00
0214711f05 Add mode to MemoryDep to track atomic accumulates (#123223)
And allow fusion of buffers where writes are only atomic accumulates.
This allows fusing of ops like

  _unsafe_index_put(_unsafe_index_put(a, ...), ...)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123223
Approved by: https://github.com/peterbell10
2024-05-16 04:34:09 +00:00
d0dfcd2c34 fix the device type for with_comms decorator (#125798)
Found by @yifuwang: it looks like we are wrongly using
self.device_type="cuda" for the gloo backend, which is triggering some
flakiness, e.g. https://github.com/pytorch/pytorch/issues/125366

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125798
Approved by: https://github.com/yifuwang
2024-05-16 03:40:19 +00:00
bcc8d25e47 [dynamo] Delete extra testing of cpp guard manager (#126343)
CPP guard manager has been on for a few weeks now. This separate testing was part of the phase-in, when the cpp guard manager was not yet enabled. Now it is not needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126343
Approved by: https://github.com/williamwen42
ghstack dependencies: #126303, #126316, #126314, #126327
2024-05-16 03:30:38 +00:00
95b9e981c3 Add Lowering for FlexAttention Backwards (#125515)
# Summary
#### What does this PR do?
It enables Inductor to actually generate the fused flex attention kernel for the backwards

I did some other things along the way:
- Abstract out the 'build_subgraph_buffer' subroutine and make it reusable between flex attention and flex_attention backwards. In total we need to build 3 subgraphs for fwd + bwd: 1 for the fwd graph and then 2 in the bwd. The FAv2 algorithm recomputes parts of the forward (more efficiently since we already have the row_max via logsumexp), therefore we need to inline both the fwd graph and the joint graph in the bwds kernel.
- The version of the backwards kernel is from a somewhat older version of the triton tutorial implementation. I think that we should update in a follow up to a newer version. Notably the blocks need to be square for this to work as currently implemented. I am sure there are many opportunities for optimization.
- I didn't correctly register the decomp table + IndexMode when I landed https://github.com/pytorch/pytorch/pull/123902; this remedies that.
- The rel_bias helper func was reversed in terms of causality. I updated it and then added a test specific to "future causal" attention.
- The main point in this PR that I think still needs to be worked out is the store_output call. I have it hacked up to be 'fake', but I don't think we want to land that; we likely want to just have a mutated 'dq' and a stored_output 'dk'.
- I also needed to update the `TritonTemplateKernel` to actually accept multiple subgraphs (modifications)
- I updated the benchmark to also profile bwds performance

### Benchmark Numbers:
_The current implementation is not parallelizing over ctx length in the bwd_
FWD Speedups

| Type    |   Speedup | shape              | score_mod   | dtype          |
|---------|-----------|--------------------|-------------|----------------|
| Average |     0.991 |                    |             |                |
| Max     |     1.182 | (16, 16, 4096, 64) | noop        | torch.bfloat16 |
| Min     |     0.796 | (2, 16, 512, 256)  | head_bias   | torch.bfloat16 |

BWD Speedups

| Type    |   Speedup | shape              | score_mod   | dtype          |
|---------|-----------|--------------------|-------------|----------------|
| Average |     0.291 |                    |             |                |
| Max     |     0.652 | (8, 16, 512, 64)   | head_bias   | torch.bfloat16 |
| Min     |     0.073 | (2, 16, 4096, 128) | head_bias   | torch.bfloat16 |
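
The speedup columns appear to be the ratio of eager time to compiled time; that definition is an assumption, but it reproduces the reported values. A quick check using the first row of the full data table:

```python
# Timings for shape (2, 16, 512, 64) with score_mod "noop", copied from the
# first row of the full data table.
fwd_eager, fwd_compiled = 19.936, 19.092
bwd_eager, bwd_compiled = 57.851, 193.564


def speedup(eager_time, compiled_time):
    """Assumed definition: values above 1 mean the compiled kernel is faster."""
    return eager_time / compiled_time


fwd_speedup = round(speedup(fwd_eager, fwd_compiled), 3)
bwd_speedup = round(speedup(bwd_eager, bwd_compiled), 3)
```

Under this reading, a BWD average of 0.291 means the compiled backward kernel takes roughly 3.4x as long as eager on average.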

<details>

<summary>Full Data</summary>

| shape               | score_mod     | dtype          |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
|---------------------|---------------|----------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| (2, 16, 512, 64)    | noop          | torch.bfloat16 |           19.936 |              19.092 |           57.851 |             193.564 |         1.044 |         0.299 |
| (2, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           19.955 |              19.497 |           57.662 |             206.278 |         1.024 |         0.280 |
| (2, 16, 512, 64)    | relative_bias | torch.bfloat16 |           19.455 |              21.297 |           57.674 |             195.219 |         0.913 |         0.295 |
| (2, 16, 512, 64)    | head_bias     | torch.bfloat16 |           19.958 |              21.289 |           57.674 |             193.859 |         0.938 |         0.298 |
| (2, 16, 512, 128)   | noop          | torch.bfloat16 |           28.157 |              28.615 |           82.831 |             454.211 |         0.984 |         0.182 |
| (2, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           28.154 |              28.444 |           83.091 |             432.083 |         0.990 |         0.192 |
| (2, 16, 512, 128)   | relative_bias | torch.bfloat16 |           28.722 |              27.897 |           83.175 |             446.789 |         1.030 |         0.186 |
| (2, 16, 512, 128)   | head_bias     | torch.bfloat16 |           28.299 |              27.673 |           83.052 |             459.179 |         1.023 |         0.181 |
| (2, 16, 512, 256)   | noop          | torch.bfloat16 |           41.167 |              50.504 |          175.019 |            1083.545 |         0.815 |         0.162 |
| (2, 16, 512, 256)   | causal_mask   | torch.bfloat16 |           41.656 |              51.933 |          175.078 |            1171.176 |         0.802 |         0.149 |
| (2, 16, 512, 256)   | relative_bias | torch.bfloat16 |           41.697 |              50.722 |          175.159 |            1097.312 |         0.822 |         0.160 |
| (2, 16, 512, 256)   | head_bias     | torch.bfloat16 |           41.690 |              52.387 |          175.184 |            1097.336 |         0.796 |         0.160 |
| (2, 16, 1024, 64)   | noop          | torch.bfloat16 |           39.232 |              37.454 |          127.847 |             612.430 |         1.047 |         0.209 |
| (2, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |           39.930 |              39.599 |          127.755 |             665.359 |         1.008 |         0.192 |
| (2, 16, 1024, 64)   | relative_bias | torch.bfloat16 |           39.417 |              41.304 |          127.902 |             614.990 |         0.954 |         0.208 |
| (2, 16, 1024, 64)   | head_bias     | torch.bfloat16 |           39.965 |              42.034 |          127.953 |             613.273 |         0.951 |         0.209 |
| (2, 16, 1024, 128)  | noop          | torch.bfloat16 |           63.964 |              71.024 |          226.510 |            1637.669 |         0.901 |         0.138 |
| (2, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |           63.843 |              72.451 |          226.750 |            1558.949 |         0.881 |         0.145 |
| (2, 16, 1024, 128)  | relative_bias | torch.bfloat16 |           64.301 |              70.487 |          226.651 |            1610.063 |         0.912 |         0.141 |
| (2, 16, 1024, 128)  | head_bias     | torch.bfloat16 |           64.033 |              71.394 |          226.676 |            1668.511 |         0.897 |         0.136 |
| (2, 16, 1024, 256)  | noop          | torch.bfloat16 |          129.348 |             141.390 |          507.337 |            4405.175 |         0.915 |         0.115 |
| (2, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          129.538 |             145.680 |          507.178 |            4768.874 |         0.889 |         0.106 |
| (2, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          129.438 |             142.782 |          507.004 |            4401.002 |         0.907 |         0.115 |
| (2, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          129.058 |             146.242 |          507.547 |            4434.251 |         0.883 |         0.114 |
| (2, 16, 4096, 64)   | noop          | torch.bfloat16 |          481.606 |             409.120 |         1440.890 |           14147.269 |         1.177 |         0.102 |
| (2, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |          480.227 |             438.847 |         1434.419 |           14973.386 |         1.094 |         0.096 |
| (2, 16, 4096, 64)   | relative_bias | torch.bfloat16 |          480.831 |             458.104 |         1432.935 |           14193.253 |         1.050 |         0.101 |
| (2, 16, 4096, 64)   | head_bias     | torch.bfloat16 |          480.749 |             452.497 |         1437.040 |           14084.869 |         1.062 |         0.102 |
| (2, 16, 4096, 128)  | noop          | torch.bfloat16 |          872.534 |             848.275 |         2600.895 |           35156.849 |         1.029 |         0.074 |
| (2, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |          872.647 |             868.279 |         2587.581 |           31919.531 |         1.005 |         0.081 |
| (2, 16, 4096, 128)  | relative_bias | torch.bfloat16 |          871.484 |             827.644 |         2593.989 |           34805.634 |         1.053 |         0.075 |
| (2, 16, 4096, 128)  | head_bias     | torch.bfloat16 |          871.422 |             856.437 |         2602.482 |           35708.591 |         1.017 |         0.073 |
| (2, 16, 4096, 256)  | noop          | torch.bfloat16 |         1904.497 |            1758.183 |         6122.416 |           66754.593 |         1.083 |         0.092 |
| (2, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         1911.174 |            1762.821 |         6113.207 |           72759.392 |         1.084 |         0.084 |
| (2, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         1911.254 |            1727.108 |         6123.530 |           66577.988 |         1.107 |         0.092 |
| (2, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         1916.977 |            1801.804 |         6118.158 |           67359.680 |         1.064 |         0.091 |
| (8, 16, 512, 64)    | noop          | torch.bfloat16 |           44.984 |              43.974 |          170.276 |             262.259 |         1.023 |         0.649 |
| (8, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           45.001 |              46.265 |          170.509 |             274.893 |         0.973 |         0.620 |
| (8, 16, 512, 64)    | relative_bias | torch.bfloat16 |           45.466 |              48.211 |          170.606 |             262.759 |         0.943 |         0.649 |
| (8, 16, 512, 64)    | head_bias     | torch.bfloat16 |           45.481 |              48.435 |          170.267 |             261.265 |         0.939 |         0.652 |
| (8, 16, 512, 128)   | noop          | torch.bfloat16 |           72.565 |              74.736 |          313.220 |             773.126 |         0.971 |         0.405 |
| (8, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           72.015 |              75.755 |          313.311 |             775.513 |         0.951 |         0.404 |
| (8, 16, 512, 128)   | relative_bias | torch.bfloat16 |           72.105 |              74.189 |          313.806 |             769.238 |         0.972 |         0.408 |
| (8, 16, 512, 128)   | head_bias     | torch.bfloat16 |           72.005 |              74.364 |          313.509 |             775.237 |         0.968 |         0.404 |
| (8, 16, 512, 256)   | noop          | torch.bfloat16 |          138.656 |             165.453 |          663.707 |            2672.067 |         0.838 |         0.248 |
| (8, 16, 512, 256)   | causal_mask   | torch.bfloat16 |          139.096 |             172.613 |          663.593 |            2926.538 |         0.806 |         0.227 |
| (8, 16, 512, 256)   | relative_bias | torch.bfloat16 |          139.500 |             168.417 |          663.938 |            2658.629 |         0.828 |         0.250 |
| (8, 16, 512, 256)   | head_bias     | torch.bfloat16 |          139.776 |             173.549 |          662.920 |            2667.266 |         0.805 |         0.249 |
| (8, 16, 1024, 64)   | noop          | torch.bfloat16 |          134.883 |             125.004 |          484.706 |            1195.254 |         1.079 |         0.406 |
| (8, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |          134.297 |             132.875 |          485.420 |            1234.953 |         1.011 |         0.393 |
| (8, 16, 1024, 64)   | relative_bias | torch.bfloat16 |          134.839 |             139.231 |          485.470 |            1198.556 |         0.968 |         0.405 |
| (8, 16, 1024, 64)   | head_bias     | torch.bfloat16 |          133.822 |             136.449 |          485.608 |            1189.198 |         0.981 |         0.408 |
| (8, 16, 1024, 128)  | noop          | torch.bfloat16 |          235.470 |             234.765 |          886.094 |            2662.944 |         1.003 |         0.333 |
| (8, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |          236.305 |             241.382 |          886.293 |            2646.984 |         0.979 |         0.335 |
| (8, 16, 1024, 128)  | relative_bias | torch.bfloat16 |          236.414 |             233.980 |          885.250 |            2642.178 |         1.010 |         0.335 |
| (8, 16, 1024, 128)  | head_bias     | torch.bfloat16 |          237.176 |             239.040 |          885.754 |            2665.242 |         0.992 |         0.332 |
| (8, 16, 1024, 256)  | noop          | torch.bfloat16 |          504.445 |             517.855 |         1978.956 |            9592.906 |         0.974 |         0.206 |
| (8, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          502.428 |             536.002 |         1978.611 |           10607.342 |         0.937 |         0.187 |
| (8, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          503.396 |             523.960 |         1977.993 |            9539.284 |         0.961 |         0.207 |
| (8, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          503.818 |             536.014 |         1980.131 |            9576.262 |         0.940 |         0.207 |
| (8, 16, 4096, 64)   | noop          | torch.bfloat16 |         1970.139 |            1674.930 |         5750.940 |           16724.134 |         1.176 |         0.344 |
| (8, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |         1959.036 |            1775.056 |         5780.512 |           17390.350 |         1.104 |         0.332 |
| (8, 16, 4096, 64)   | relative_bias | torch.bfloat16 |         1947.198 |            1773.869 |         5780.643 |           16779.699 |         1.098 |         0.345 |
| (8, 16, 4096, 64)   | head_bias     | torch.bfloat16 |         1963.935 |            1829.502 |         5780.018 |           16703.259 |         1.073 |         0.346 |
| (8, 16, 4096, 128)  | noop          | torch.bfloat16 |         3582.711 |            3362.623 |        10436.069 |           36415.565 |         1.065 |         0.287 |
| (8, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |         3581.504 |            3499.472 |        10346.869 |           36164.959 |         1.023 |         0.286 |
| (8, 16, 4096, 128)  | relative_bias | torch.bfloat16 |         3589.779 |            3337.849 |        10529.621 |           36261.696 |         1.075 |         0.290 |
| (8, 16, 4096, 128)  | head_bias     | torch.bfloat16 |         3602.265 |            3436.444 |        10458.660 |           36507.790 |         1.048 |         0.286 |
| (8, 16, 4096, 256)  | noop          | torch.bfloat16 |         7695.923 |            7126.275 |        24643.009 |          140949.081 |         1.080 |         0.175 |
| (8, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         7679.939 |            7186.252 |        24538.105 |          157156.067 |         1.069 |         0.156 |
| (8, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         7681.374 |            6994.832 |        24549.713 |          140077.179 |         1.098 |         0.175 |
| (8, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         7679.822 |            7212.278 |        24627.823 |          140675.003 |         1.065 |         0.175 |
| (16, 16, 512, 64)   | noop          | torch.bfloat16 |           80.126 |              78.291 |          333.719 |             541.165 |         1.023 |         0.617 |
| (16, 16, 512, 64)   | causal_mask   | torch.bfloat16 |           80.065 |              81.696 |          333.779 |             551.113 |         0.980 |         0.606 |
| (16, 16, 512, 64)   | relative_bias | torch.bfloat16 |           80.138 |              86.715 |          333.364 |             542.118 |         0.924 |         0.615 |
| (16, 16, 512, 64)   | head_bias     | torch.bfloat16 |           80.415 |              85.204 |          333.294 |             536.840 |         0.944 |         0.621 |
| (16, 16, 512, 128)  | noop          | torch.bfloat16 |          134.964 |             138.025 |          607.093 |            1333.102 |         0.978 |         0.455 |
| (16, 16, 512, 128)  | causal_mask   | torch.bfloat16 |          134.192 |             141.523 |          606.269 |            1424.318 |         0.948 |         0.426 |
| (16, 16, 512, 128)  | relative_bias | torch.bfloat16 |          135.711 |             138.639 |          606.283 |            1327.974 |         0.979 |         0.457 |
| (16, 16, 512, 128)  | head_bias     | torch.bfloat16 |          135.552 |             140.555 |          607.107 |            1347.370 |         0.964 |         0.451 |
| (16, 16, 512, 256)  | noop          | torch.bfloat16 |          275.113 |             315.144 |         1301.583 |            5268.153 |         0.873 |         0.247 |
| (16, 16, 512, 256)  | causal_mask   | torch.bfloat16 |          274.867 |             328.106 |         1302.513 |            5770.594 |         0.838 |         0.226 |
| (16, 16, 512, 256)  | relative_bias | torch.bfloat16 |          276.052 |             321.770 |         1302.904 |            5241.920 |         0.858 |         0.249 |
| (16, 16, 512, 256)  | head_bias     | torch.bfloat16 |          271.409 |             328.839 |         1302.142 |            5266.037 |         0.825 |         0.247 |
| (16, 16, 1024, 64)  | noop          | torch.bfloat16 |          260.489 |             237.463 |          955.884 |            1817.558 |         1.097 |         0.526 |
| (16, 16, 1024, 64)  | causal_mask   | torch.bfloat16 |          262.378 |             254.350 |          955.280 |            1843.807 |         1.032 |         0.518 |
| (16, 16, 1024, 64)  | relative_bias | torch.bfloat16 |          261.338 |             268.253 |          956.038 |            1820.036 |         0.974 |         0.525 |
| (16, 16, 1024, 64)  | head_bias     | torch.bfloat16 |          262.153 |             264.156 |          956.023 |            1810.076 |         0.992 |         0.528 |
| (16, 16, 1024, 128) | noop          | torch.bfloat16 |          476.475 |             461.413 |         1760.578 |            4306.521 |         1.033 |         0.409 |
| (16, 16, 1024, 128) | causal_mask   | torch.bfloat16 |          473.794 |             479.178 |         1761.277 |            4619.439 |         0.989 |         0.381 |
| (16, 16, 1024, 128) | relative_bias | torch.bfloat16 |          473.839 |             463.282 |         1758.692 |            4290.562 |         1.023 |         0.410 |
| (16, 16, 1024, 128) | head_bias     | torch.bfloat16 |          472.979 |             472.896 |         1763.086 |            4367.931 |         1.000 |         0.404 |
| (16, 16, 1024, 256) | noop          | torch.bfloat16 |         1014.184 |            1026.764 |         3922.997 |           19104.147 |         0.988 |         0.205 |
| (16, 16, 1024, 256) | causal_mask   | torch.bfloat16 |         1013.217 |            1039.046 |         3928.382 |           21086.281 |         0.975 |         0.186 |
| (16, 16, 1024, 256) | relative_bias | torch.bfloat16 |         1008.519 |            1015.278 |         3922.133 |           18980.652 |         0.993 |         0.207 |
| (16, 16, 1024, 256) | head_bias     | torch.bfloat16 |         1011.360 |            1047.542 |         3931.245 |           19069.172 |         0.965 |         0.206 |
| (16, 16, 4096, 64)  | noop          | torch.bfloat16 |         3929.850 |            3325.667 |        11411.704 |           23344.280 |         1.182 |         0.489 |
| (16, 16, 4096, 64)  | causal_mask   | torch.bfloat16 |         3885.262 |            3581.544 |        11390.515 |           23725.639 |         1.085 |         0.480 |
| (16, 16, 4096, 64)  | relative_bias | torch.bfloat16 |         3865.737 |            3537.308 |        11489.901 |           23406.330 |         1.093 |         0.491 |
| (16, 16, 4096, 64)  | head_bias     | torch.bfloat16 |         3880.530 |            3665.249 |        11484.411 |           23299.496 |         1.059 |         0.493 |
| (16, 16, 4096, 128) | noop          | torch.bfloat16 |         7030.306 |            6745.715 |        20621.264 |           57464.096 |         1.042 |         0.359 |
| (16, 16, 4096, 128) | causal_mask   | torch.bfloat16 |         7095.414 |            7034.385 |        20410.656 |           61660.511 |         1.009 |         0.331 |
| (16, 16, 4096, 128) | relative_bias | torch.bfloat16 |         7084.779 |            6686.497 |        20315.161 |           57243.969 |         1.060 |         0.355 |
| (16, 16, 4096, 128) | head_bias     | torch.bfloat16 |         7075.367 |            6863.305 |        20494.385 |           58481.953 |         1.031 |         0.350 |
| (16, 16, 4096, 256) | noop          | torch.bfloat16 |        15612.741 |           14297.482 |        55306.847 |          281161.865 |         1.092 |         0.197 |
| (16, 16, 4096, 256) | causal_mask   | torch.bfloat16 |        15326.592 |           14263.878 |        55227.806 |          313063.232 |         1.075 |         0.176 |
| (16, 16, 4096, 256) | relative_bias | torch.bfloat16 |        15297.963 |           14007.379 |        54558.029 |          279529.175 |         1.092 |         0.195 |
| (16, 16, 4096, 256) | head_bias     | torch.bfloat16 |        15216.160 |           14276.027 |        55081.581 |          280996.826 |         1.066 |         0.196 |

</details>
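
The last two columns of the table are consistent with a plain ratio of the first timing to the second timing in each pass. A minimal sketch of that arithmetic, assuming the four timing columns are ordered forward baseline, forward candidate, backward baseline, backward candidate (the `speedup` helper and the row labels are hypothetical; the numbers are copied from two rows above):

```python
def speedup(baseline_time: float, candidate_time: float) -> float:
    """Return baseline / candidate; values above 1.0 mean the candidate is faster."""
    return baseline_time / candidate_time


# (fwd baseline, fwd candidate, bwd baseline, bwd candidate), copied from the table.
rows = {
    "(2, 16, 4096, 64) noop": (481.606, 409.120, 1440.890, 14147.269),
    "(8, 16, 512, 256) noop": (138.656, 165.453, 663.707, 2672.067),
}

# Round to three decimals, matching the precision used in the table.
ratios = {
    name: (round(speedup(fwd_a, fwd_b), 3), round(speedup(bwd_a, bwd_b), 3))
    for name, (fwd_a, fwd_b, bwd_a, bwd_b) in rows.items()
}

for name, (fwd_ratio, bwd_ratio) in ratios.items():
    print(f"{name}: fwd {fwd_ratio:.3f}, bwd {bwd_ratio:.3f}")
```

Read this way, a forward ratio near 1.0 means the two forward paths are roughly on par, while the much smaller backward ratios mean the second backward timing is several times slower than the first.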

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125515
Approved by: https://github.com/Chillee
2024-05-16 03:14:27 +00:00
ae6fdfa539 Revert "Initial implementation of AdaRound (#126153)"
This reverts commit 175c18af818804ba8ef433c3eb8488d1a3d1dd9d.

Reverted https://github.com/pytorch/pytorch/pull/126153 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the lint failure is legit because there are more than one lint issues, torch/optim/asgd.py is just the last one ([comment](https://github.com/pytorch/pytorch/pull/126153#issuecomment-2113902522))
2024-05-16 02:34:49 +00:00
e3c5d1b7d7 Revert "[optim] Fix: wrong ASGD implementation (#125440)"
This reverts commit 2c5ad9a3d7ea79ca897aec153a401f4b9175a717.

Reverted https://github.com/pytorch/pytorch/pull/125440 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it looks like there is a linter failure coming from this change ([comment](https://github.com/pytorch/pytorch/pull/125440#issuecomment-2113833108))
2024-05-16 02:12:29 +00:00
175c18af81 Initial implementation of AdaRound (#126153)
Summary:
This is an implementation of AdaRound from the paper https://arxiv.org/abs/2004.10568

This algorithm is going to be used by multiple people, hence we need to make it an official implementation.

Differential Revision: D57227565

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126153
Approved by: https://github.com/jerryzh168
2024-05-16 02:09:18 +00:00
927e631dc2 [inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)
As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #126019
2024-05-16 02:05:49 +00:00
059b68fbdf [DeviceMesh] Fix hash and eq not match (#123572)
Fixes #121799

We fix the DeviceMesh hash so that it agrees with equality: two meshes are considered equal if they have the same mesh and the same parent_mesh.
Examples can be found here: https://github.com/pytorch/pytorch/issues/121799
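
The general rule behind this fix is that objects that compare equal must also hash equal. A minimal sketch of that contract, assuming a hypothetical `MeshSketch` class that is not the real DeviceMesh and carries only the two fields named above:

```python
class MeshSketch:
    """Hypothetical stand-in for a device mesh; not the real DeviceMesh class."""

    def __init__(self, mesh, parent_mesh=None):
        self.mesh = tuple(mesh)          # flat tuple of ranks, hashable
        self.parent_mesh = parent_mesh   # another MeshSketch or None

    def _key(self):
        # One key drives both __eq__ and __hash__, so they cannot disagree.
        return (self.mesh, self.parent_mesh)

    def __eq__(self, other):
        if not isinstance(other, MeshSketch):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


parent = MeshSketch(range(8))
a = MeshSketch([0, 1], parent)
b = MeshSketch([0, 1], parent)
orphan = MeshSketch([0, 1])

print(a == b, hash(a) == hash(b), len({a, b}))
print(a == orphan)
```

Deriving both methods from a single key is a design choice that keeps dictionary and set lookups consistent with `==`.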

This is also needed to unblock https://github.com/pytorch/pytorch/pull/123394

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123572
Approved by: https://github.com/xunnanxu, https://github.com/wanchaol, https://github.com/yoyoyocmu
2024-05-16 02:00:45 +00:00
1876f0fec1 [dynamo][nn module guards] Use TENSOR_MATCH, and not ID_MATCH, for numpy tensors (#126246)
Fixes speech_transformer regression here - https://hud.pytorch.org/benchmark/torchbench/inductor_no_cudagraphs?startTime=Tue%2C%2007%20May%202024%2019%3A22%3A54%20GMT&stopTime=Tue%2C%2014%20May%202024%2019%3A22%3A54%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=main&lCommit=02093b6c6ae1046368e2500881d0bb5880873386&rBranch=main&rCommit=b24ad7eab55eaf660893dddae949fc714e434338

Thanks to @eellison  and @bdhirsh for isolating the regression to nn module guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126246
Approved by: https://github.com/jansel
ghstack dependencies: #126203
2024-05-16 01:57:59 +00:00
315389bfed Revert "Remove deprecated _aminmax operator (#125995)"
This reverts commit 0116ffae7f94f35a2c712e186a0b371959b68c64.

Reverted https://github.com/pytorch/pytorch/pull/125995 on behalf of https://github.com/huydhn due to Sorry for reverting your change but we need to reland this after I get rid of all usage of _aminmax internally in Meta ([comment](https://github.com/pytorch/pytorch/pull/125995#issuecomment-2113769497))
2024-05-16 01:45:37 +00:00
6dca1e639b [TEST][Dynamo] fix test_deviceguard.py (#126240)
The `test_device_guard.py` was improperly set up, so there were failures on multi-GPU machines. By design the `DeviceGuard` should keep `idx` the same even after it was applied.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126240
Approved by: https://github.com/jansel
2024-05-16 01:44:42 +00:00
7844c202b2 [inductor][cpp] epilogue support for gemm template (#126019)
As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
2024-05-16 01:42:29 +00:00
6065a4d46e Revert "Switched from parameter in can_cast to from_. (#126030)"
This reverts commit 06d6bb4ebabc64433224970024ada1781508197d.

Reverted https://github.com/pytorch/pytorch/pull/126030 on behalf of https://github.com/huydhn due to Sorry for reverting your change but i need to revert it to avoid a diff train conflict with https://github.com/pytorch/pytorch/pull/125995.  Please help rebase and I will reland the change ([comment](https://github.com/pytorch/pytorch/pull/126030#issuecomment-2113757469))
2024-05-16 01:42:23 +00:00
5efad4ebc1 [inductor] [FX graph cache] Ignore unbacked symints in guards expression (#126251)
Summary: Found a unit test that was causing an assertion failure during an attempt to use unbacked symints in the guards expression, but it turns out unbacked symints can't affect guards anyway, so we can just filter them out. Also in this diff: test_torchinductor_dynamic_shapes.py was not configured to exercise the codecache because the TestCase setUp method was inadvertently skipping the setUp of the immediate parent class.
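
The summary does not say how the parent `setUp` came to be skipped. One common way this happens is a `super()` call that names the wrong class, sketched below with hypothetical test classes (none of these names come from the PyTorch test suite):

```python
import unittest


class BaseCase(unittest.TestCase):
    def setUp(self):
        super().setUp()
        self.setup_log = ["BaseCase"]


class CacheEnabledCase(BaseCase):
    """Hypothetical parent whose setUp switches a feature on."""

    def setUp(self):
        super().setUp()
        self.setup_log.append("CacheEnabledCase")


class SkipsParentSetUp(CacheEnabledCase):
    def setUp(self):
        # Bug: the MRO lookup starts *after* CacheEnabledCase,
        # so CacheEnabledCase.setUp never runs.
        super(CacheEnabledCase, self).setUp()

    def test_noop(self):
        pass


class RunsParentSetUp(CacheEnabledCase):
    def setUp(self):
        # Zero-argument super() starts at the immediate parent.
        super().setUp()

    def test_noop(self):
        pass


def setup_log_for(case_cls):
    """Instantiate a test case, run only its setUp, and return the log."""
    case = case_cls("test_noop")
    case.setUp()
    return case.setup_log


print(setup_log_for(SkipsParentSetUp))
print(setup_log_for(RunsParentSetUp))
```

The tests in the buggy class still pass, which is why this kind of mistake tends to go unnoticed until something depends on the skipped configuration.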

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126251
Approved by: https://github.com/peterbell10
2024-05-16 01:35:41 +00:00
bd63300bae [dynamo][inline-inbuilt-nn-modules] Add and update test_modules.py for inlining work (#126327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126327
Approved by: https://github.com/williamwen42
ghstack dependencies: #126303, #126316, #126314
2024-05-16 01:35:09 +00:00
7aa068f350 [dynamo][inline-inbuilt-nn-modules] Change test to not depend on id of mod instance (#126314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126314
Approved by: https://github.com/williamwen42
ghstack dependencies: #126303, #126316
2024-05-16 01:35:09 +00:00
0f8380dd65 [Inductor][Flex-attention] Make num_head support dynamic (#126342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126342
Approved by: https://github.com/drisspg
2024-05-16 01:33:53 +00:00
f9d107af66 [optim] add fused_adagrad support for CPU device (#124905)
Add fused_adagrad kernel support for CPU.

## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c
https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_adagrad time: 0.2500 seconds
_fused_adagrad time: 0.0933 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_adagrad time: 2.8819 seconds
_fused_adagrad time: 1.7591 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-05-16 01:11:51 +00:00
51e9bb8783 [Export] Allow ExportedProgram to take empty decomp table (#126142)
**As title.**
Still, `ep.run_decompositions()` will use `core_aten_decompositions()` by default. Cases like `ep.run_decompositions(get_decompositions([]))` will use an empty table, and go with [`aot_autograd_decompositions`](04877dc430/torch/_functorch/aot_autograd.py (L456-459)) only.

**Motivation**
We didn't have a clean way to pass in an empty decomp table. Since we've made `pre_dispatch` export the default and `ep.run_decompositions` still uses `aot_export_module(..., pre_dispatch=False)`, allowing an empty table makes it easier to run a blank control.

**Testing**
CI
Also looked through all the references in fbcode. The only concern I have is whether we should update [this example](04877dc430/torch/onnx/_internal/exporter.py (L817)) or not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126142
Approved by: https://github.com/angelayi
2024-05-16 00:31:23 +00:00
b3f1882d17 [easy][dynamo][inline-inbuilt-nn-modules] Change test to check for params (#126316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126316
Approved by: https://github.com/williamwen42
ghstack dependencies: #126303
2024-05-16 00:20:58 +00:00
06d6bb4eba Switched from parameter in can_cast to from_. (#126030)
Fixes #126012.

`from` is a reserved keyword in Python, thus we can't make the C++ impl available with `from` as function parameter. This PR changes the name to `from_` and also adjusts the docs.

If we want to preserve backwards compatibility, we can leave the C++ name as-is and only fix the docs. However, `torch.can_cast(from_=torch.int, to=torch.int)` won't work then.
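
The keyword problem can be reproduced in plain Python without PyTorch. A minimal sketch, assuming a hypothetical `can_cast_stub` that only echoes its arguments and is not the real `torch.can_cast`:

```python
import keyword


def can_cast_stub(from_, to):
    """Hypothetical stand-in; it only echoes its arguments."""
    return (from_, to)


def is_valid_call(source: str) -> bool:
    """Return True if `source` parses as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


from_is_keyword = keyword.iskeyword("from")
old_spelling_ok = is_valid_call("can_cast_stub(from=1, to=2)")
new_spelling_ok = is_valid_call("can_cast_stub(from_=1, to=2)")

print(from_is_keyword, old_spelling_ok, new_spelling_ok)
```

The call with `from=` is rejected by the parser before any function is looked up, so no Python-level signature can make that spelling work.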

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126030
Approved by: https://github.com/albanD
2024-05-16 00:09:54 +00:00
3ae118204e Make propagate_real_tensor more safe (#126281)
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7228787720582401/

There are a few improvements here, which luckily fix some xfails:

* In general, it can be unsafe to call operations on Tensors under a `no_dispatch()` mode that is purely trying to disable ambient modes, because this ALSO disables tensor subclass handling. So we test to see if there is a tensor subclass and don't propagate real tensors if that's the case. Another acceptable outcome might be to try to only disable the ambient fake tensor mode; this would help us propagate real tensors through more exotic tensor types, but I'm not going to do it until someone asks for it.
* We're graph breaking for wrapped tensors too late. Pull it up earlier so we do it before we try to muck around with the real tensor.
* I noticed that occasionally when I do `storage.copy_(real_storage)`, the sizes mismatch. Careful code reading suggests that I should just copy in the real data when the tensor was initially allocated, so that's what I do now, eliminating the need for a storage copy.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126281
Approved by: https://github.com/Skylion007
2024-05-15 23:57:02 +00:00
b2d9b80fba Also remove compile_time_strobelight_meta frame when generating stack (#126289)
I think I also need to fix this in fbcode, leaving that for future work.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126289
Approved by: https://github.com/yanboliang
2024-05-15 23:55:37 +00:00
9c9d0c2fab Add VariableTracker.debug_repr (#126299)
Now you can print arbitrary values at compile time with
comptime.print()

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126299
Approved by: https://github.com/jansel
ghstack dependencies: #126292
2024-05-15 23:55:29 +00:00
a7af53cec1 [FSDP2] support fully_shard(model_on_meta, cpu_offload) (#126305)
Support fully_shard(model_on_meta, cpu_offload) when fully_shard is placed outside of `torch.device("meta")`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126305
Approved by: https://github.com/awgu
ghstack dependencies: #126267
2024-05-15 23:29:23 +00:00
bcdd0b11ca [dynamo][inline-inbuilt-nn-modules] Bug fix - Only unspecialized nn modules (#126303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126303
Approved by: https://github.com/mlazos, https://github.com/laithsakka
2024-05-15 23:23:12 +00:00
5cab7a7662 [dynamo] fix https://github.com/pytorch/pytorch/issues/93624 (#125945)
Fixes https://github.com/pytorch/pytorch/issues/93624 but also requires https://github.com/jcmgray/autoray/issues/20 to be fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125945
Approved by: https://github.com/jansel
ghstack dependencies: #125882, #125943
2024-05-15 23:22:06 +00:00
56a89fcc08 [dynamo] graph break on issubclass call with non-const args (#125943)
Fixes https://github.com/pytorch/pytorch/issues/125942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125943
Approved by: https://github.com/jansel
ghstack dependencies: #125882
2024-05-15 23:22:06 +00:00
100e3c1205 [dynamo] graph break on const dict KeyError (#125882)
Fixes https://github.com/pytorch/pytorch/issues/125866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125882
Approved by: https://github.com/jansel
2024-05-15 23:22:06 +00:00
b5432ad5ab Fix triton codegen main do_bench_gpu import error (#126213)
Summary:
Encountered module import error when running triton kernel file.

The cause seems to be D57215950 which changed "do_bench" to "do_bench_gpu" for torch._inductor.runtime.runtime_utils

However, in the codegen, instead we have "from triton.testing import do_bench", so the line below should be reverted back to "do_bench".

Test Plan:
LOGLEVEL=DEBUG TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=0 CUDA_VISIBLE_DEVICES=5 TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT='/home/adelesun/mts_profiling/outputs/profile_output.txt' TORCH_LOGS='+inductor,+schedule,output_code' TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_CACHE_DIR='/home/adelesun/mts_profiling/code' TORCHINDUCTOR_ENABLED_METRIC_TABLES=kernel_metadata buck2 run mode/opt                 -c=python.package_style=inplace                 -c fbcode.enable_gpu_sections=true                 -c fbcode.platform=platform010                 -c fbcode.nvcc_arch=v100,a100,h100                 -c fbcode.split-dwarf=true                 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark                 --  --local-model /home/adelesun/mts_profiling/inputs/offsite_cvr_model_526372970_793.input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR 2>&1 | tee /home/adelesun/mts_profiling/outputs/benchmark_output.txt

bento console --kernel=aetk --file=/home/adelesun/mts_profiling/code/op/copmbxfunzmywemwmg66lnlcx4apvn2f2vsi3glgisausgfvit4g.py

file ran successfully

Differential Revision: D57345619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126213
Approved by: https://github.com/shunting314
2024-05-15 22:56:15 +00:00
2c5ad9a3d7 [optim] Fix: wrong ASGD implementation (#125440)
> previous: Originally, the variables `new_eta` and `new_mu` would be constructed `len(grouped_mus)` times, but each of their values is the same and won't be changed. Therefore, it can be simplified using Python list multiplication, which only constructs one tensor.

- [x] Incorrect assumption that every param will have the same step.
- [x] Different implementation between `foreach=True` and `foreach=False`.
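
The first item can be illustrated without PyTorch. A minimal sketch, assuming the usual ASGD schedules `eta = lr / (1 + lambd * lr * step) ** alpha` and `mu = 1 / max(1, step - t0)` (recalled from the optimizer's documentation, not taken from this commit); the helper names and hyperparameter values are hypothetical:

```python
def asgd_eta(lr, lambd, alpha, step):
    """Assumed ASGD learning-rate schedule for a single parameter."""
    return lr / ((1 + lambd * lr * step) ** alpha)


def asgd_mu(t0, step):
    """Assumed ASGD averaging coefficient for a single parameter."""
    return 1 / max(1, step - t0)


# Three parameters in one group that have taken different numbers of steps.
steps = [1, 10, 100]
lr, lambd, alpha, t0 = 0.01, 1e-4, 0.75, 5

# List multiplication repeats one value, which is only right if all steps match.
shared_etas = [asgd_eta(lr, lambd, alpha, steps[0])] * len(steps)

# Computing per parameter respects each parameter's own step count.
per_param_etas = [asgd_eta(lr, lambd, alpha, s) for s in steps]
per_param_mus = [asgd_mu(t0, s) for s in steps]

print(shared_etas)
print(per_param_etas)
print(per_param_mus)
```

When every parameter shares the same step the two approaches agree, which is why the shortcut looks harmless until step counts diverge.
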
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125440
Approved by: https://github.com/janeyx99
2024-05-15 22:52:15 +00:00
eqy
5af4b49285 Remove expected failure in test_eager_transforms.py (#125883)
Seems to be supported now

CC @tinglvv @nWEIdia @Aidyn-A

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125883
Approved by: https://github.com/Chillee, https://github.com/Aidyn-A
2024-05-15 22:12:07 +00:00
0ca8bf4b41 Enable UFMT on test/test_datapipe.py (#124994)
Part of: #123062

Ran lintrunner on:

- `test/test_datapipe.py`

Detail:

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Co-authored-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124994
Approved by: https://github.com/mikaylagawarecki
2024-05-15 21:58:35 +00:00
cyy
18cbaf6dbf Remove Caffe2 python code (#126035)
Follows the recent changes of Caffe2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126035
Approved by: https://github.com/r-barnes, https://github.com/Skylion007
2024-05-15 21:51:11 +00:00
ad7316b4c2 [CI] Add AMP models in inductor cpu smoketest for performance (#125830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125830
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/huydhn, https://github.com/desertfire, https://github.com/atalman
2024-05-15 21:46:58 +00:00
f0d34941dd Improve Storage copy_ size mismatch error message (#126280)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126280
Approved by: https://github.com/mikaylagawarecki
2024-05-15 21:14:59 +00:00
d15920a7d0 Warn SDPA users about dropout behavior (#126294)
Fixes #124464
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126294
Approved by: https://github.com/mikaylagawarecki, https://github.com/drisspg
2024-05-15 20:58:23 +00:00
31d22858e9 [onnx.export] Avoid unnecessary copy of debug_names (#123026)
This PR is part of an effort to speed up torch.onnx.export (#121422).

- The `auto debug_names = ` incurs a copy, whereas `const auto& debug_names` does not.
- However, this one requires us to be careful, since calls to `setDebugName` change `debug_names` and invalidate the `exist_name` iterator. So if we simply change `auto` to `const auto&`, then between that line and `find` we have corrupted the iterator by calling `output[i]->setDebugName`. This change aims to be functionally equivalent to the original, which is why we first get the Value pointer, then call `output[i]->setDebugName`, and finally call `setDebugName` on the found value. It may be functionally fine to simply call `output[i]->setDebugName` first and then do the `find` and the second `setDebugName`, but this would not be identical to the current behavior.
- Resolves (2) in #121422.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123026
Approved by: https://github.com/justinchuby
2024-05-15 20:58:18 +00:00
90461d4986 [dynamo] Detect monkeypatching on nn module forward method (#126203)
An alternative was https://github.com/pytorch/pytorch/pull/124975. Though it was safer because it was adding guards for every inlined function, it was causing guard overhead for a few models of > 20%.  The overhead of this PR is minimal for the common unpatched case.
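
As a rough illustration of why the unpatched case can stay cheap, here is a plain-Python sketch of detecting a monkeypatched `forward`. It is not Dynamo's guard logic; `Module`, `_ORIGINAL_FORWARD` and `is_forward_patched` are hypothetical names used only for this example.

```python
class Module:
    """Hypothetical stand-in for an nn.Module subclass (no torch needed)."""

    def forward(self, x):
        return x + 1


# Snapshot of the class-level forward, taken before any patching happens.
_ORIGINAL_FORWARD = Module.forward


def is_forward_patched(mod):
    """Return True if `forward` was replaced on the instance or on the class."""
    # Instance-level patch: `mod.forward = fn` is stored in the instance dict.
    if "forward" in vars(mod):
        return True
    # Class-level patch: the class attribute is no longer the snapshot.
    return type(mod).forward is not _ORIGINAL_FORWARD


plain = Module()
patched = Module()
patched.forward = lambda x: x * 2  # monkeypatch on the instance only
```

For an unpatched module the check is one dict lookup plus one identity comparison, which is the kind of cost the message calls minimal.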

Fixes an internal issue - [fb.workplace.com/groups/1075192433118967/permalink/1411067766198097](https://fb.workplace.com/groups/1075192433118967/permalink/1411067766198097/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126203
Approved by: https://github.com/ezyang
2024-05-15 20:41:13 +00:00
c8130dfe84 [FSDP2] allow meta tensors during loading state dict and cpu offloading (#126267)
unit test: ``pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py``

With meta init and CPU offloading, we have meta tensors after `model.load_state_dict(assign=True, strict=False)`. This PR avoids calling `.cpu` on meta tensors, which would otherwise be a runtime error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126267
Approved by: https://github.com/awgu
2024-05-15 20:35:36 +00:00
d74c89fb10 2 rocm shards on trunk.yml (#125933)
after test removal for windows cpu + avx related configs, it's going to be the long pole for trunk

Just checked: without rocm, avg tts for trunk was 2.5 hrs last week; with rocm it's about 3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125933
Approved by: https://github.com/ZainRizvi
2024-05-15 20:22:14 +00:00
d2b2727d66 Fix public api allowlist logical merge conflict (#126321)
Skip the newly added bad API from https://github.com/pytorch/pytorch/pull/126212 to keep CI green.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126321
Approved by: https://github.com/ezyang
2024-05-15 20:21:39 +00:00
e2d18228fe [DCP] overwrites existing checkpoint by default (#125877)
Checks for existing checkpoints and overwrites, based on an `overwrite` flag
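
A minimal sketch of the "check, then overwrite unless told otherwise" behavior, using only the standard library. This is not the DCP API: `save_checkpoint` and its signature are hypothetical, and only the `overwrite` flag name and its default-on behavior come from this commit.

```python
import os
import tempfile


def save_checkpoint(path: str, payload: bytes, overwrite: bool = True) -> None:
    """Write `payload` to `path`; refuse to clobber an existing file unless allowed."""
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(f"checkpoint already exists: {path}")
    with open(path, "wb") as f:
        f.write(payload)


with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "model.ckpt")
    save_checkpoint(ckpt, b"v1")
    save_checkpoint(ckpt, b"v2")  # default: existing checkpoint is overwritten
    try:
        save_checkpoint(ckpt, b"v3", overwrite=False)
        refused = False
    except FileExistsError:
        refused = True
    with open(ckpt, "rb") as f:
        final = f.read()
```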

Differential Revision: [D57186174](https://our.internmc.facebook.com/intern/diff/D57186174/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125877
Approved by: https://github.com/fegin
2024-05-15 20:12:52 +00:00
b659506d82 Parametrize test_dim_reduction (#126292)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126292
Approved by: https://github.com/Skylion007
2024-05-15 19:55:37 +00:00
2086f91c4c Revert "Fix aarch64 debug build with GCC (#126290)"
This reverts commit a961e1ac83bf8831768c5a04eb7c4c18df8988d5.

Reverted https://github.com/pytorch/pytorch/pull/126290 on behalf of https://github.com/malfet due to Indeed lint is broken :/ ([comment](https://github.com/pytorch/pytorch/pull/126290#issuecomment-2113332757))
2024-05-15 19:45:57 +00:00
2978f07d0e [FSDP] Fixed docs for inter/intra node PG helpers (#126288)
1. This fixes an issue where we had 9 ranks in one node and 7 in the other.
2. This makes the notation more explicit that `[0, 7]` is `[0, 1, ..., 7]`.
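
For reference, the intended layout can be sketched in plain Python. The 2 nodes x 8 ranks shape is an assumption inferred from the "9 and 7" remark above, and both helper names are hypothetical.

```python
def intra_node_groups(world_size, ranks_per_node):
    """Ranks that share a node: [0, 1, ..., 7], [8, 9, ..., 15], ..."""
    return [
        list(range(start, start + ranks_per_node))
        for start in range(0, world_size, ranks_per_node)
    ]


def inter_node_groups(world_size, ranks_per_node):
    """Ranks with the same local index on every node: [0, 8], [1, 9], ..."""
    return [
        list(range(local, world_size, ranks_per_node))
        for local in range(ranks_per_node)
    ]


intra = intra_node_groups(16, 8)
inter = inter_node_groups(16, 8)
```

Every intra-node group here has exactly 8 ranks, so the explicit form of `[0, 7]` is the first entry of `intra`.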

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126288
Approved by: https://github.com/weifengpy
2024-05-15 19:45:10 +00:00
af9acc4168 Fix public binding to actually traverse modules (#126103)
The current call passes in `['/actual/path']` to os.walk, which is a string pointing to no path and thus silently leads to an empty traversal.
There is an unused function just above that handles that, so I guess this is what was supposed to be called.
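
The silent failure is easy to reproduce with the standard library alone. In the sketch below, how the list ended up as a string is an assumption (it is simply stringified here); the point is that `os.walk` swallows the lookup error unless `onerror` is supplied.

```python
import os

# A stringified list: a valid str, but not a path that exists.
bogus = str(["/actual/path"])

# By default os.walk ignores the scandir error, so nothing is yielded.
silent = list(os.walk(bogus))


def fail_loudly(err):
    """onerror callback that re-raises instead of hiding the problem."""
    raise err


try:
    list(os.walk(bogus, onerror=fail_loudly))
    raised = False
except OSError:
    raised = True
```

Passing an `onerror` callback like this is one way to make such a mistake visible in a test rather than letting it pass vacuously.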

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126103
Approved by: https://github.com/suo
2024-05-15 19:36:03 +00:00
a961e1ac83 Fix aarch64 debug build with GCC (#126290)
By working around GCC's quirks in instantiating templates that require immediate values.
Provide an alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`)

Fixes https://github.com/pytorch/pytorch/issues/126283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126290
Approved by: https://github.com/atalman, https://github.com/seemethere
2024-05-15 19:02:21 +00:00
196661255f Enable UFMT format on test/test_utils.py (#125996)
Fixes some files in #123062

Run lintrunner on files:
test/test_utils.py

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125996
Approved by: https://github.com/ezyang
2024-05-15 18:22:57 +00:00
44efeac24e Beef up error message for pending assert failure (#126212)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126212
Approved by: https://github.com/Skylion007
2024-05-15 18:22:53 +00:00
26f6f98364 Forward fix failures for torch.export switch to predispatch (#126081)
Summary:
Fixes:
- executorch test
- torchrec test

Test Plan: CI

Differential Revision: D57282304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126081
Approved by: https://github.com/angelayi
2024-05-15 18:13:06 +00:00
0d49c5cb06 Skip padding cost of fusible/planable inputs (#125780)
For mm inputs which are not inputs of the graph, assume that we can memory plan them in the aten.cat and exclude the padding cost in the benchmarking comparison. Technically we also have to do a small amount of 0s writing, but that should be relatively small and encompassed in the weighting of the padding time by `1.1`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125780
Approved by: https://github.com/shunting314
ghstack dependencies: #125772, #125773
2024-05-15 18:05:53 +00:00
4fb5d69b3b Reland '[Inductor] GEMM shape padding improvements (#118522)' (#125773)
Relanding just the pad in a single pass portion of [the pr](https://github.com/pytorch/pytorch/pull/118522). Not including
the transpose logic:

This was previously accepted and reviewed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125773
Approved by: https://github.com/shunting314
ghstack dependencies: #125772
2024-05-15 17:34:41 +00:00
a91311e7c2 [easy] Remove aot_config from pre_compile returns, rename fw_metadata in post_compile (#125854)
This field never changes so pre_compile doesn't need to return it again: remove it just for a cleaner refactor.

As @aorenste points out, the fw_metadata passed to post_compile is actually the fw_metadata after all wrappers' pre_compile calls have run. I want to make this clear in the code, so I renamed the arg in post_compile.

Wrappers that need the exact metadata that they were passed in pre_compile need to save that fw_metadata properly themselves.

Currently, wrappers come in two categories:

1. Wrappers that modify fw_metadata, but then never use fw_metadata in post compile
2. Wrappers that never modify fw_metadata, and only consume the "final" fw_metadata.

So none of the behaviors will change for the existing wrappers. That said, it might be useful to define a "SimpleCompilerWrapper" subclass which guarantees it does not modify fw_metadata. I'll do that in a separate PR.
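
A toy model of the ordering described above, in plain Python. Every name here (`FwMetadata`, `MutatingWrapper`, `SnapshotWrapper`, `compile_with`, the `runtime_metadata` argument) is hypothetical and the real signatures differ; the sketch only shows why a wrapper that needs its own view of the metadata has to save it in pre_compile.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class FwMetadata:
    """Hypothetical stand-in for the forward metadata object."""
    tags: list = field(default_factory=list)


class MutatingWrapper:
    """Category 1: edits fw_metadata in pre_compile, never reads it later."""

    def pre_compile(self, fn, fw_metadata):
        fw_metadata.tags.append("mutated")
        return fn, fw_metadata

    def post_compile(self, compiled_fn, runtime_metadata):
        return compiled_fn


class SnapshotWrapper:
    """Saves the metadata it was handed, then also sees the final one."""

    def pre_compile(self, fn, fw_metadata):
        self.seen = copy.deepcopy(fw_metadata)  # private snapshot
        return fn, fw_metadata

    def post_compile(self, compiled_fn, runtime_metadata):
        # This is the metadata after every wrapper's pre_compile has run.
        self.final = runtime_metadata
        return compiled_fn


def compile_with(wrappers, fn, fw_metadata):
    """Run all pre_compile hooks, then all post_compile hooks in reverse."""
    for w in wrappers:
        fn, fw_metadata = w.pre_compile(fn, fw_metadata)
    for w in reversed(wrappers):
        fn = w.post_compile(fn, fw_metadata)
    return fn


snap = SnapshotWrapper()
compiled = compile_with([snap, MutatingWrapper()], lambda x: x, FwMetadata())
```

After this runs, `snap.seen` still has no tags while `snap.final` carries the tag added by the later wrapper.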

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125854
Approved by: https://github.com/aorenste, https://github.com/bdhirsh
2024-05-15 17:23:47 +00:00
44e47d5bd0 [onnx.export] Avoid linear loop over symbol_dim_map (#123029)
This PR is part of an effort to speed up torch.onnx.export (#121422).

- Doing a reverse look-up in `symbol_dim_map` incurs a linear cost in number of symbols. This happens for each node, so incurs a quadratic cost to the whole export.
- Add a reverse look-up `dim_symbol_map` that is kept in parallel of `symbol_dim_map`. This avoids a linear time look-up, which creates a quadratic export time complexity.
- This is a highly pragmatic solution. If someone more familiar with the code base has a better solution, I'm interested to hear about it.
- Resolves (9) in #121422.
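
The idea of keeping a reverse map in parallel can be sketched as follows. The actual change is in C++; this Python version is an illustration only, the class and method names are hypothetical, and it assumes the symbol-to-dim relation is one-to-one.

```python
class SymbolDimMap:
    """Forward and reverse dicts kept in sync so both lookups are O(1)."""

    def __init__(self):
        self._symbol_to_dim = {}
        self._dim_to_symbol = {}

    def set(self, symbol, dim):
        # Drop the stale reverse entry if this symbol was already mapped.
        if symbol in self._symbol_to_dim:
            self._dim_to_symbol.pop(self._symbol_to_dim[symbol], None)
        self._symbol_to_dim[symbol] = dim
        self._dim_to_symbol[dim] = symbol

    def dim_for(self, symbol):
        return self._symbol_to_dim.get(symbol)

    def symbol_for(self, dim):
        # Previously this direction would need a linear scan over all symbols.
        return self._dim_to_symbol.get(dim)


table = SymbolDimMap()
table.set("s0", "batch")
table.set("s1", "seq")
```

The cost is that every write has to touch both dicts, which is the bookkeeping the message calls "kept in parallel".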

(partial fix of #121422)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123029
Approved by: https://github.com/justinchuby
2024-05-15 17:22:30 +00:00
490d72e4e6 CMake: Improve check and report of Magma (#117858)
- Only search for magma if it is used (GPU builds)
- Don't report it was not found when it isn't searched for
- Don't report if magma is disabled (currently: "MAGMA not found. Compiling without MAGMA support" is reported)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117858
Approved by: https://github.com/malfet
2024-05-15 17:18:22 +00:00
f91cae461d [Dynamo] SizeVariable supports hasattr (#126222)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126222
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
2024-05-15 17:16:36 +00:00
c1dc8bb858 [DTensor] Turn on foreach implementation of optimizer for DTensor by default (#123394)
Append DTensor to the optimizer `_foreach_supported_types` and turn on foreach implementation of optimizer for DTensor if not specified by the users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123394
Approved by: https://github.com/wanchaol
2024-05-15 16:45:42 +00:00
4ab2c399be Faster int8 quantized (#125704)
Or my journey to learn how to write fast Metal kernels (more details would be posted [here](https://github.com/malfet/llm_experiments/tree/main/metal-perf) )

Using gpt-fast as a benchmark (by running `python generate.py --checkpoint_path checkpoints/stories110M/model_int8.pth --device mps`)

Before the change, on M2 Pro I get 50 tokens per sec
After adding a very naive
```metal
template<typename T>
kernel void int8pack_mm(
    constant T                 * A              [[buffer(0)]],
    constant char              * B              [[buffer(1)]],
    constant T                 * scales         [[buffer(2)]],
    device   T                 * outputData     [[buffer(3)]],
    constant uint3             & sizes          [[buffer(4)]],
    uint                         thread_index   [[thread_position_in_grid]]) {
    const uint lda = sizes.y;
    const uint ldc = sizes.z;
    const uint m = thread_index / sizes.z; // 0..sizes.x-1
    const uint n = thread_index % sizes.z; // 0..sizes.z-1
    constant T *A_ptr = A + m * lda;
    constant char *B_ptr = B + n * lda;

    float rc = 0.0;
    for(uint k = 0; k < sizes.y;  k++) {
      const auto a_val = float(A_ptr[k]);
      const auto b_val = float(B_ptr[k]);
      rc += a_val * b_val;
    }
    outputData[thread_index] = T(rc * float(scales[n]));
}
```
Perf dropped down to a sad 15 tokens per second.
Replacing inner loop with vectorized operations
```metal
    float rc = 0.0;
    for(uint k = 0; k < sizes.y/4;  k++) {
      const auto a_val = float4(A_ptr[k]);
      const auto b_val = float4(B_ptr[k]);
      rc += dot(a_val, b_val);
    }
```
Perf jumps back up to 53 tokens per second, but it's a bit of a lie when it comes to llama2-7B perf.

The next step in unlocking the performance was to replace the 1D grid with a 2D one, but limit the thread group size to a single row, which results in much better data locality (which unfortunately is not observable with `stories110M` anymore, as its small model size and Python runtime overhead hide the perf gain).

There were several unsuccessful attempts at caching inputs in thread-local memory or using `float4x4` to speed up computation. But the key to unlocking the perf was a comment in 631dfbe673/mlx/backend/metal/kernels/gemv.metal (L184)
which hinted at exploiting both SIMD groups and thread local caches, which resulted in 5x jump in performance compared to initial vectorization approach and 3x perf jump in end-to-end llama7b test
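
For checking results, the arithmetic of the naive kernel above can be written as a slow pure-Python reference. This is a sketch derived from the indexing in that kernel (`A` is M x K, `B` is N x K, one scale per output column); the function name is made up for this example.

```python
def int8pack_mm_reference(A, B, scales):
    """Compute out[m][n] = scales[n] * sum_k A[m][k] * B[n][k].

    A: M x K list of floats, B: N x K list of int8 values, scales: N floats.
    """
    M, K, N = len(A), len(A[0]), len(B)
    out = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0.0  # accumulate in float, like `rc` in the kernel
            for k in range(K):
                acc += float(A[m][k]) * float(B[n][k])
            out[m][n] = acc * scales[n]
    return out


result = int8pack_mm_reference(
    [[1.0, 2.0]],        # A: 1 x 2
    [[3, 4], [-1, 0]],   # B: 2 x 2, row n holds the weights for output column n
    [0.5, 2.0],          # per-column scales
)
```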
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125704
Approved by: https://github.com/mikekgfb
2024-05-15 16:39:24 +00:00
719a8f42bf Foward fix lint after #125747 (#126295)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126295
Approved by: https://github.com/atalman
2024-05-15 16:37:48 +00:00
9689532106 [CI] 3 procs non cuda (#125932)
Too lazy to figure out actual time reduction here, I'll figure it out later.  Also I'd rather get an average of a couple of runs on trunk rather than just this one PR
Things got faster. Source? Trust me bro

* rel to https://github.com/pytorch/pytorch/pull/125598

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125932
Approved by: https://github.com/ZainRizvi
2024-05-15 16:18:36 +00:00
718bb9016f Revert "[Memory Snapshot] Add recordAnnotations to capture record_function annotations (#124179)"
This reverts commit 187aeaeabf612824c2d0e9be72f80ce6612760d4.

Reverted https://github.com/pytorch/pytorch/pull/124179 on behalf of https://github.com/clee2000 due to test_tensorexpr.py::TestTensorExprFuser::test_simple_add is causing a segfault https://github.com/pytorch/pytorch/actions/runs/9097383783/job/25007155440 187aeaeabf, test was skipped due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/124179#issuecomment-2112948246))
2024-05-15 16:11:47 +00:00
f9dda37a74 [export] Cover more cases to copy tensor conversions. (#125628)
Summary:
Previously we tried to convert all .to() calls to to_copy in the graph; now some users report that other methods like .float() are not covered: https://github.com/pytorch/PiPPy/issues/1104#issuecomment-2093352734

I think fundamentally .float() should look similar to .to() in export, and this diff tries to expand the coverage of the tensor conversion methods here.

Test Plan: buck run mode/opt caffe2/test:test_export -- -r float_conversion

Differential Revision: D56951634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125628
Approved by: https://github.com/tugsbayasgalan
2024-05-15 15:50:21 +00:00
c53e0ac7ba [Inductor] Generalize new introduced device-bias code. (#126261)
We found some Inductor test case failures when enabling Inductor UT for Intel GPU. The root cause is newly introduced device-biased Inductor code from recent community PRs, which causes different behaviors between Intel GPU and CUDA. This PR generalizes this code to align the behaviors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126261
Approved by: https://github.com/EikanWang, https://github.com/peterbell10
2024-05-15 15:05:07 +00:00
ba3cd6e463 Enable UFMT on test/test_fake_tensor.py, test/test_flop_counter.py and some files (#125747)
Part of: #123062

Ran lintrunner on:

- test/test_fake_tensor.py
- test/test_flop_counter.py
- test/test_function_schema.py
- test/test_functional_autograd_benchmark.py
- test/test_functional_optim.py
- test/test_functionalization_of_rng_ops.py

Detail:

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125747
Approved by: https://github.com/malfet
2024-05-15 14:50:14 +00:00
187aeaeabf [Memory Snapshot] Add recordAnnotations to capture record_function annotations (#124179)
Summary: Add new traceEvents into Memory Snapshot for record_function annotations. These will capture both the profiler's step annotation as well as user annotations.

Test Plan:
CI

New Snapshot Generated:
devvm2184.cco0.facebook.com.Apr_19_13_27_14.3072800.snapshot.pickle

Snippet of Snapshot device_traces show `ProfilerStep#0`, and `## forward ##` annotations:
```
[[{'action': 'user_defined',
   'addr': 0,
   'size': 0,
   'stream': 0,
   'time_us': 1713558427168556,
   'frames': [{'name': 'START', 'filename': 'ProfilerStep#0', 'line': 0}]},
  {'action': 'user_defined',
   'addr': 0,
   'size': 0,
   'stream': 0,
   'time_us': 1713558427168738,
   'frames': [{'name': 'END', 'filename': 'ProfilerStep#0', 'line': 0}]},
  {'action': 'user_defined',
   'addr': 0,
   'size': 0,
   'stream': 0,
   'time_us': 1713558427168865,
   'frames': [{'name': 'START', 'filename': 'ProfilerStep#1', 'line': 0}]},
  {'action': 'user_defined',
   'addr': 0,
   'size': 0,
   'stream': 0,
   'time_us': 1713558427168920,
   'frames': [{'name': 'START', 'filename': '## forward ##', 'line': 0}]},
  {'action': 'alloc',
   'addr': 140166073581568,
   'size': 3211264,
   'stream': 0,
   'time_us': 1713558427172978,
   'frames': [{'name': '_conv_forward',
     'filename': '/mnt/xarfuse/uid-416185/235d4caf-seed-nspid4026531836_cgpid32884718-ns-4026531840/torch/nn/modules/conv
```
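
Given a snapshot shaped like the snippet above, the annotations can be pulled out with a few lines of Python. This sketch assumes only the fields visible in that snippet (`action`, `time_us`, and a first frame whose `name` is START/END and whose `filename` is the label); the helper name and the trimmed sample data are illustrative.

```python
def annotation_events(device_traces):
    """Return (time_us, phase, label) for each record_function annotation."""
    events = []
    for trace in device_traces:  # one list of entries per device
        for entry in trace:
            if entry["action"] != "user_defined":
                continue  # skip alloc/free and other allocator events
            frame = entry["frames"][0]
            events.append((entry["time_us"], frame["name"], frame["filename"]))
    return events


# Trimmed-down copy of the entries shown in the snippet.
sample = [[
    {"action": "user_defined", "time_us": 1713558427168556,
     "frames": [{"name": "START", "filename": "ProfilerStep#0", "line": 0}]},
    {"action": "user_defined", "time_us": 1713558427168738,
     "frames": [{"name": "END", "filename": "ProfilerStep#0", "line": 0}]},
    {"action": "alloc", "time_us": 1713558427172978,
     "frames": [{"name": "_conv_forward", "filename": "conv.py", "line": 0}]},
]]

events = annotation_events(sample)
step0_duration_us = events[1][0] - events[0][0]
```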

Differential Revision: D55941362

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124179
Approved by: https://github.com/zdevito
2024-05-15 14:19:40 +00:00
ee8c1550d6 [AOTI][torchgen] Add a few more fallback ops (#126013)
Summary: They appear in some unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126013
Approved by: https://github.com/chenyang78
ghstack dependencies: #125962
2024-05-15 12:56:07 +00:00
563aa3e035 [AOTI][torchgen] Update NativeFunctionsGroup mapping (#125962)
Summary: When looking up which backend call to use for a fallback op (see get_backend_index_for_aoti), sometimes we need to search for a NativeFunction's structured delegate. The previous str:NativeFunctionsGroup dict missed some cases, such as aten.index.Tensor, and that's why aten.index.Tensor was specified in the fallback_ops list but no C shim entry was generated for it. This PR uses a more robust OperatorName:NativeFunctionsGroup mapping.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125962
Approved by: https://github.com/chenyang78
2024-05-15 12:56:07 +00:00
a0aaf56114 Don't assert about pending when we are peeking (#126239)
Internal xref https://fb.workplace.com/groups/6829516587176185/posts/7211398545654652/

In particular, when we're collecting forward metadata, we aren't going
to discharge any of the pending, so we'll be continuously collecting
more and more pending symbols that we may not be able to resolve.  This
is fine.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126239
Approved by: https://github.com/lezcano
2024-05-15 12:18:34 +00:00
8f30f367d0 [CUDA] [CI] Add cu124 docker images (#125944)
Fixes issues encountered in https://github.com/pytorch/pytorch/pull/121956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125944
Approved by: https://github.com/atalman
2024-05-15 09:52:38 +00:00
f060b0c6e6 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Similar template abstractions as the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`. The MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`) requiring `N` to be a multiple of register blocking while allows the static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm` which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since this is an initial implementation, we are still working on further performance improvements in follow-up PRs, including optimizations in kernels as well as fusions. Compared to the ATen kernels, which are implemented with MKL, perf gains are observed only on a select number of models. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details.

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x
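
The split between cache blocking in the template and a micro-kernel underneath (item 2 above) can be pictured with a small pure-Python sketch. This is not the generated C++ and omits prepacking, threading and vectorization; `micro_gemm`, `blocked_gemm` and the block sizes `Mc`, `Nc`, `Kc` are hypothetical names chosen for the example.

```python
def micro_gemm(A, B, C, m0, m1, n0, n1, k0, k1):
    """Innermost kernel: accumulate one (m, n, k) block of A @ B into C."""
    for i in range(m0, m1):
        for j in range(n0, n1):
            acc = C[i][j]
            for k in range(k0, k1):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def blocked_gemm(A, B, Mc=2, Nc=2, Kc=2):
    """Outer loops walk cache-sized blocks and delegate to the micro-kernel."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, Mc):
        for n0 in range(0, N, Nc):
            for k0 in range(0, K, Kc):
                micro_gemm(
                    A, B, C,
                    m0, min(m0 + Mc, M),
                    n0, min(n0 + Nc, N),
                    k0, min(k0 + Kc, K),
                )
    return C


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
B = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
C = blocked_gemm(A, B)
```

The block sizes deliberately do not divide the matrix dimensions evenly here, so the `min(...)` edge handling is exercised.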

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-15 08:14:51 +00:00
79655a1321 Add force_disable_caches to the docs (#126184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126184
Approved by: https://github.com/msaroufim
2024-05-15 07:16:08 +00:00
2d35b4564a [audio hash update] update the pinned audio hash (#126248)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126248
Approved by: https://github.com/pytorchbot
2024-05-15 05:45:16 +00:00
03467b3fed Add a few "warm start" smoketest runs to CI (#125955)
Summary:
Not sure which to choose, so my criteria were:
1) We care about huggingface as part of internal milestones
2) This handful of models seems to particularly benefit from caching
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125955
Approved by: https://github.com/desertfire
ghstack dependencies: #125917, #125953
2024-05-15 05:32:06 +00:00
c87c39d935 [benchmarking] Suppress csv creation on cold-start phase of --warm-start-latency (#125953)
Summary: It seems that most (all?) of our utilities for examining benchmark output expect single-line entries per benchmark. The way the --warm-start-latency flag is currently implemented, it means that we'll see two entries for every benchmark run (one for the warm-up run and one for the actual run). This PR adds a --disable-output flag that we can use for the first run to suppress populating the csv. This way, the existing utilities like `benchmarks/dynamo/check_accuracy.py` will function without any changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125953
Approved by: https://github.com/desertfire
ghstack dependencies: #125917
2024-05-15 05:32:06 +00:00
9f0d3f71c9 Adjust number of repeats when using --warm-start-latency benchmark flag (#125917)
Summary: In --warm-start-latency mode, we can just perform the cache-warmup run once instead of whatever was provided with --repeat

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125917
Approved by: https://github.com/desertfire
2024-05-15 05:32:06 +00:00
0dedc1aff2 Update CUDA out of memory mesage with private pool info (#124673)
Fixes https://github.com/pytorch/pytorch/issues/121932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124673
Approved by: https://github.com/eellison, https://github.com/eqy
2024-05-15 05:30:47 +00:00
5178baefa9 use statically known instead of suppress guard for ddp stride propagation (#126234)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126234
Approved by: https://github.com/ezyang
2024-05-15 05:21:55 +00:00
e74a6f487a [Inductor] Skip test_nll_loss_backward for intel GPU. (#126157)
Skip this test case because the Triton `mask_load` behavior is not aligned with CUDA. We submitted issue #126173 to elaborate on the root cause. We intend to skip this case for XPU first, as we need some time to fix the issue and fully validate an update of the Triton commit pin for Intel GPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126157
Approved by: https://github.com/EikanWang, https://github.com/peterbell10, https://github.com/desertfire
2024-05-15 05:16:07 +00:00
FEI
b950217f19 Support third-party devices emit a range for each autograd operator (#125822)
Fixes #125752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125822
Approved by: https://github.com/aaronenyeshi
2024-05-15 05:06:24 +00:00
cyy
bdea4904c1 Add some type annotations to python stream and event classes (#126171)
For recent device agnostic code changes, we need type hinting on the parent classes for better tooling support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126171
Approved by: https://github.com/ezyang
2024-05-15 04:58:07 +00:00
7dfd2949d7 Add missing type uint16, uint32, and uint64 to TensorHash in LTC. (#125972)
If I do:

```
xla_device = xm.xla_device()
xla_tensor_0 = torch.tensor(42, dtype=torch.uint32).to(xla_device)
```

I got the error:

```
RuntimeError: false INTERNAL ASSERT FAILED at "/ansible/pytorch/torch/csrc/lazy/core/hash.h":139, please report a bug to PyTorch. Unsupported scalar type:UInt16
```

This PR intends to fix this issue.
The data type can be found in pytorch/c10/core/ScalarType.h.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125972
Approved by: https://github.com/JackCaoG
2024-05-15 04:57:08 +00:00
dfab69fdf1 [Inductor] Flex attention supports dynamic shape (#125994)
## static shapes perf
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|----------------|
| Average |     0.692 |              |             |             |             |            |             |                |
| Max     |     0.855 |           16 |          16 |        4096 |        4096 |         64 | head_bias   | torch.bfloat16 |
| Min     |     0.419 |            8 |          16 |         512 |         512 |        256 | noop        | torch.bfloat16 |
```
## dynamic shapes perf
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     0.670 |              |             |             |             |            |               |                |
| Max     |     0.864 |           16 |          16 |        4096 |        4096 |         64 | relative_bias | torch.bfloat16 |
| Min     |     0.376 |            8 |          16 |         512 |         512 |        256 | relative_bias | torch.bfloat16 |
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125994
Approved by: https://github.com/Chillee
2024-05-15 04:43:24 +00:00
1485621ccb [BE] Abstract out strings to top of file (#125640)
Summary:
Move const strings to top of file. This is in preparation of tooling to
make use of shared constants (e.g. version string). A non-functional change.
Ideally we want these const strings to be available from both C++ and Python - but I haven't figured out how to correctly share things in PyTorch. I'll do this in a subsequent change.

Test Plan:
python test/distributed/test_c10d_nccl.py NCCLTraceTest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125640
Approved by: https://github.com/wconstab
2024-05-15 03:38:30 +00:00
24c30096e3 Set dtype when copying empty tensor (#126124)
Summary: Forward fix D57251348

Test Plan: `buck2 test 'fbcode//mode/dev' fbcode//executorch/kernels/test:aten_op_copy_test`

Differential Revision: D57304360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126124
Approved by: https://github.com/bdhirsh
2024-05-15 03:25:07 +00:00
51ed4c46cf [Dynamo] Supports torch._C._is_any_autocast_enabled (#126196)
Fixes #126026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126196
Approved by: https://github.com/anijain2305
2024-05-15 03:16:13 +00:00
314ba13f01 Support trace_subgraph in _MakefxTracer (#125363)
Adds trace_subgraph to _MakefxTracer; the motivation is in https://github.com/pytorch/pytorch/pull/122972. Also migrates all existing usage of reenter_make_fx to the new sub-tracer. Previously, the torch function mode for creating torch_fn metadata wouldn't be re-entered when we're in ProxyTensorMode (since it's inside of __torch_function__). This PR reconstructs the torch function mode based on the parent tracer's config and re-enters the torch function mode so the metadata is shown in the graph.

**Test Plan:**
Existing tests. We have a bunch of make_fx tests for cond, map and while_loop. Also remove expected failure for torch_fn since reenter_make_fx is able to re-construct torch function modes.

Also fixes https://github.com/pytorch/pytorch/issues/124643

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125363
Approved by: https://github.com/Chillee
ghstack dependencies: #125267
2024-05-15 03:12:24 +00:00
73d8c10f13 Refactor make_fx to better support hop subgraph tracing (#125267)
Code movement + minor rewrites. We extract the states of make_fx out and encapsulate them into a _MakefxTracer class. This allows us to create a new make_fx_tracer when tracing subgraphs, the actual logic for tracing subgraph is in the next diff.

Test Plan:
Existing tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125267
Approved by: https://github.com/Chillee
2024-05-15 03:12:24 +00:00
470723faea [pipelining] Add manual pipeline stage (#126123)
Add `ManualPipelineStage` under `_PipelineStage.py`

Fix some type hints since `args_recv_info` can contain more than one RecvInfo. Previously the hint was `Tuple[InputInfo]` which meant it is a tuple of size 1. This is different from `List[InputInfo]` which can contain any number of items. I needed to update to `Tuple[InputInfo, ...]` to make the number of items flexible.
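
The difference between the two hints can be checked with the standard `typing` module. In this sketch `int` stands in for `InputInfo`, and the alias and function names are made up for illustration.

```python
from typing import Tuple, get_args

OneItem = Tuple[int]         # a tuple of exactly one int
AnyLength = Tuple[int, ...]  # a tuple of any number of ints


def total(values: Tuple[int, ...]) -> int:
    """Accepts tuples of any length, which Tuple[int] would not describe."""
    return sum(values)


one_item_args = get_args(OneItem)
any_length_args = get_args(AnyLength)
```

A type checker would reject `(1, 2, 3)` for `OneItem` but accept it for `AnyLength`; at runtime the only visible difference is the trailing `Ellipsis` in the type arguments.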

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126123
Approved by: https://github.com/kwen2501
2024-05-15 00:55:15 +00:00
dccb5cf7ca Allow for trailing 'a' in sm_arch (#126185)
# Summary
I was getting
``` Shell
File "/home/drisspg/meta/pytorch/torch/cuda/__init__.py", line 312, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: invalid literal for int() with base 10: '90a'
```
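
The error comes from calling `int()` on a suffix such as `90a`. One way to tolerate the trailing letter is sketched below; this is not the actual patch, and `parse_sm_arch` and the `sm_NN` input format are assumptions made for the example.

```python
import re


def parse_sm_arch(arch: str) -> int:
    """Turn 'sm_90a' or 'sm_80' into the integer compute capability.

    Hypothetical helper: accepts one optional trailing lowercase letter.
    """
    match = re.fullmatch(r"sm_(\d+)[a-z]?", arch)
    if match is None:
        raise ValueError(f"unrecognized arch string: {arch!r}")
    return int(match.group(1))


hopper = parse_sm_arch("sm_90a")
ampere = parse_sm_arch("sm_80")
```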

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126185
Approved by: https://github.com/Skylion007
2024-05-15 00:16:42 +00:00
92eb1731d4 [torch/distributed] Bugfix: wait for all child procs to exit before c… (#125969)
Observed Problem
---------------------

When `torchrun` has finished running the main trainer function (aka entrypoint/user function) successfully, I noticed that sometimes it SIGTERMS the child processes. Then `torchrun` exits successfully.

This results in misleading warning log messages towards the end of the job like the one below:

```
W0510 14:52:48.185934  672413 api.py:513] Closing process 675171 via signal SIGTERM
W0510 14:52:48.185984  672413 api.py:513] Closing process 675172 via signal SIGTERM
W0510 14:52:48.186013  672413 api.py:513] Closing process 675174 via signal SIGTERM
# <---- ^^^ ??? everything runs successfully but child still SIGTERM'ed? ^^^ --->

I0510 14:52:48.229119  672413 api.py:877] [main] worker group successfully finished. Waiting 300 seconds for other agents to finish.
I0510 14:52:48.229161  672413 api.py:922] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish
I0510 14:52:48.229395  672413 api.py:936] Done waiting for other agents. Elapsed: 0.0001709461212158203 seconds
I0510 14:52:48.257544  672413 dynamic_rendezvous.py:1131] The node 'localhost_672413_0' has closed the rendezvous 'torchrun_qpfd'.
I0510 14:52:48.568198  672413 distributed.py:200] Deleting temp log directory: /tmp/torchrun_udgp8zoq
I0510 14:52:48.568989  672413 distributed.py:202] Finished running `main`
```

Root Cause
------------------

I noticed that this was due to the incorrect usage of `torch.multiprocessing.ProcessContext.join()` in `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext`.

`torch.multiprocessing.ProcessContext.join()` does not actually wait for ALL child procs to exit, but rather waits for **at-least-one** child proc to exit. If only a subset of the child procs have exited, it returns `False` and if all child procs have exited it returns `True`.

`torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` was assuming that `torch.multiprocessing.ProcessContext.join()` blocks indefinitely until all child procs have exited.

Fix
---------

The fix is simple: keep calling `pc.join()` in a loop until it returns `True`.

> **NOTE**: the indefinite blocking is NOT an issue, since by the time `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` calls `pc.join()` it has already done all the checking to validate that the entrypoint functions either returned successfully or that one of them has failed. So we are really just waiting for the unix process to exit after running the entrypoint function.

> **NOTE**: since `pc.join()` already blocks until at-least-one child proc exits, there is no need to add a polling interval in the body of the loop and the debug logging will show at most `nproc_per_node` times so no log spamming is observed.
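
The loop described above can be sketched as follows. `FakeProcessContext` is a hypothetical stand-in that only mimics the `join()` semantics described in this message (return after at least one child exits, `True` only once all have exited); it is not the real `torch.multiprocessing.ProcessContext`, and `wait_for_all` is an illustrative name:

```python
class FakeProcessContext:
    """Hypothetical stand-in that mimics the join() semantics described above."""

    def __init__(self, num_procs: int) -> None:
        self.remaining = num_procs

    def join(self, timeout=None) -> bool:
        # Each call "observes" one more child exiting.
        if self.remaining > 0:
            self.remaining -= 1
        # True only when every child has exited.
        return self.remaining == 0


def wait_for_all(pc) -> int:
    # Keep joining until all children are gone; no sleep is needed because
    # join() itself blocks until at least one child exits.
    calls = 0
    while True:
        calls += 1
        if pc.join():
            return calls
```

With three children the loop calls `join()` three times, which matches the note that the loop body runs at most `nproc_per_node` times.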

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125969
Approved by: https://github.com/d4l3k
2024-05-15 00:13:08 +00:00
e5cce35c21 Remove use of USE_C10D (#126120)
As per https://github.com/pytorch/pytorch/blob/main/torch/CMakeLists.txt#L271, USE_DISTRIBUTED and USE_C10D are equivalent. I was cleaning up this usage in another PR, so I am cleaning it up here as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126120
Approved by: https://github.com/aaronenyeshi
2024-05-15 00:00:26 +00:00
fd48fb9930 Revert "[CUDA] [CI] Add cu124 docker images (#125944)"
This reverts commit 5fb4a766b88bcf633a23610bd66de0f3020f7c66.

Reverted https://github.com/pytorch/pytorch/pull/125944 on behalf of https://github.com/nWEIdia due to test failure seems related 5fb4a766b8 https://github.com/pytorch/pytorch/actions/runs/9085206167/job/24972040039 ([comment](https://github.com/pytorch/pytorch/pull/125944#issuecomment-2111321724))
2024-05-14 23:29:26 +00:00
b6d8b256e6 Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021)"
This reverts commit 037615b989b37b1bf5eff0c031055fc8d1fbe5ae.

Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor.test_unbacked_symints.TestUnbackedSymintsCPU::test_autotuning_cpu ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2111318883))
2024-05-14 23:26:15 +00:00
c1aa05f80c [easy][dynamo] Use disable_dynamo for torch.manual_seed (#126192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126192
Approved by: https://github.com/yanboliang
ghstack dependencies: #126191
2024-05-14 23:20:32 +00:00
c6f3f1d239 [reland][dynamo][disable] Move disable impl to its own __call__ method (#126191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126191
Approved by: https://github.com/yoyoyocmu, https://github.com/yanboliang, https://github.com/fegin
2024-05-14 23:20:32 +00:00
41fabbd93f Fanatically correct real tensor cloning for propagate_real_tensors (#126175)
Internal xref:
https://fb.workplace.com/groups/6829516587176185/posts/7211398545654652/

Previously I did it in a crappy way using clone_input in the callback,
but this results in tensors that don't have quite the same
size/stride/storage offset and there was an internal test case where
not having completely accurate information was causing a downstream
problem in propagation.  So now I make real tensors as similar to their
fake equivalents as much as possible.  Though... I don't bother with
autograd lol.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126175
Approved by: https://github.com/albanD
2024-05-14 23:14:17 +00:00
328b75d1a0 Enable epilogue fusion benchmarking internally (#125455)
Differential Revision: [D56920738](https://our.internmc.facebook.com/intern/diff/D56920738)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125455
Approved by: https://github.com/Chillee
2024-05-14 23:06:29 +00:00
e046c59e5b [export] handle aliased/unused params for unflattening (#125758)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125758

Aliased and unused params are currently an issue for strict-mode export. For a model like this:
```
import torch
from torch import nn

class Model(nn.Module):  # class name is illustrative only
    def __init__(self):
        # ...
        self.alpha = nn.Parameter(torch.randn(4))
        self.beta = self.alpha
        self.gamma = self.alpha
    def forward(self, x):
        return x + self.beta
```
Dynamo will trace only 1 parameter (beta) and assign a dynamo name (e.g. `L__self___beta`) which can be difficult to match to the correct FQN in the original eager module. This leads to export graph signature potentially having the incorrect target FQN for the parameter, leading to downstream issues unflattening (the parameter may be assigned to the wrong target attribute, mismatching the relevant placeholder node in the unflattened module).

This handles aliasing issues by assigning all tensors present in the state dict as module attributes, even if they're unused. Still, only the used tensors will appear in the graph's forward pass.

Another existing issue is that weight sharing is not maintained in unflattening (all params/buffers are re-cloned). This is handled by checking tensor ids too.
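
A small sketch of the id-based check, using plain Python objects instead of tensors so it runs without PyTorch. `group_aliases` is a hypothetical helper and the FQNs are just the names from the example model above; the real unflattening code is not shown here:

```python
from collections import defaultdict


def group_aliases(state: dict) -> list:
    # Group fully qualified names by the identity of the object they refer to,
    # so names that alias the same parameter end up in the same group.
    groups = defaultdict(list)
    for fqn, value in state.items():
        groups[id(value)].append(fqn)
    return sorted(sorted(names) for names in groups.values())


shared = object()  # stands in for the single shared parameter
other = object()   # stands in for an unrelated buffer
state = {"alpha": shared, "beta": shared, "gamma": shared, "bias": other}
aliases = group_aliases(state)
```

Here `aliases` has one group for `alpha`, `beta` and `gamma` and a separate group for `bias`, so the three aliased names can be pointed at one object rather than three clones.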
Approved by: https://github.com/zhxchen17
2024-05-14 23:00:46 +00:00
4d063c8e8a Do not print escape characters in xdoctest logs (#126219)
By invoking make with `vt100` terminal settings
Test Plan:
[Before](https://github.com/pytorch/pytorch/actions/runs/9086391859/job/24972547633)
```
2024-05-14T21:50:09.0459741Z reading sources... [ 57%] generated/torch.func.stack_module_state .. generated/torch.gradient
2024-05-14T21:50:09.2204992Z reading sources... [ 59%] generated/torch.greater .. generated/torch.jit.ignore
2024-05-14T21:50:09.9598581Z reading sources... [ 61%] generated/torch.jit.interface .. generated/torch.linalg.multi_dot
2024-05-14T21:50:10.5383853Z reading sources... [ 64%] generated/torch.linalg.norm .. generated/torch.moveaxis
```
[After](https://github.com/pytorch/pytorch/actions/runs/9086780396/job/24973727737?pr=126219)
```
2024-05-14T22:27:22.9388802Z reading sources... [ 57%] generated/torch.func.stack_module_state .. generated/torch.gradient
2024-05-14T22:27:23.5874407Z reading sources... [ 59%] generated/torch.greater .. generated/torch.jit.ignore
2024-05-14T22:27:23.7649947Z reading sources... [ 61%] generated/torch.jit.interface .. generated/torch.linalg.multi_dot
2024-05-14T22:27:24.3492981Z reading sources... [ 64%] generated/torch.linalg.norm .. generated/torch.moveaxis
2024-05-14T22:27:24.9723946Z reading sources... [ 66%] generated/torch.movedim .. generated/torch.nn.AdaptiveLogSoftmaxWithLoss
```
Fixes https://github.com/pytorch/pytorch/issues/123166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126219
Approved by: https://github.com/clee2000
2024-05-14 22:45:55 +00:00
b522e65056 Check pointer for null before deref in Aten/native/sparse (#126163)
Fixes #126162

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126163
Approved by: https://github.com/ezyang
2024-05-14 21:55:41 +00:00
bbdbfe3661 Reland add write_record_metadata to PyTorchFileWriter (#126087)
Reland of https://github.com/pytorch/pytorch/pull/125184 with compiler warning fixed by extending `m_pWrite` rather than adding `m_pSeek` to miniz API

Differential Revision: [D57287327](https://our.internmc.facebook.com/intern/diff/D57287327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126087
Approved by: https://github.com/albanD
2024-05-14 21:48:44 +00:00
1ba852c1dc Fix torch elastic test SimpleElasticAgentTest.test_restart_workers br… (#126002)
Failure Info:
```bash
(pt) betterman@bjys1009:/projs/framework/betterman/code/pytorch_new/test/distributed/elastic/agent/server/test$ pytest api_test.py -k test_restart_workers
=============================================================================================================================================== test session starts ================================================================================================================================================
platform linux -- Python 3.10.8, pytest-8.1.1, pluggy-1.4.0
rootdir: /projs/framework/betterman/code/pytorch_new
configfile: pytest.ini
plugins: hypothesis-6.15.0, rerunfailures-14.0, flakefinder-1.1.0, xdist-3.3.1
collecting 1 item
/projs/framework/betterman/code/pytorch_new/test/distributed/elastic/agent/server/test/api_test.py:123: PytestCollectionWarning: cannot collect test class 'TestAgent' because it has a __init__ constructor (from: test/distributed/elastic/agent/server/test/api_test.py)
  class TestAgent(SimpleElasticAgent):
collected 29 items / 28 deselected / 1 selected
Running 1 items in this shard

api_test.py F                                                                                                                                                                                                                                                                                                [100%]

===================================================================================================================================================== FAILURES =====================================================================================================================================================
___________________________________________________________________________________________________________________________________ SimpleElasticAgentTest.test_restart_workers ____________________________________________________________________________________________________________________________________
Traceback (most recent call last):
  File "/usr/local/python3.10/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/usr/local/python3.10/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/usr/local/python3.10/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/projs/framework/betterman/code/pytorch_new/test/distributed/elastic/agent/server/test/api_test.py", line 368, in test_restart_workers
    agent._restart_workers(worker_group)
  File "/projs/framework/betterman/code/pytorch_new/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/projs/framework/betterman/code/pytorch_new/torch/distributed/elastic/agent/server/api.py", line 728, in _restart_workers
    self._stop_workers(worker_group, is_restart=True)
TypeError: TestAgent._stop_workers() got an unexpected keyword argument 'is_restart'
============================================================================================================================================= short test summary info ==============================================================================================================================================
FAILED [0.0054s] api_test.py::SimpleElasticAgentTest::test_restart_workers - TypeError: TestAgent._stop_workers() got an unexpected keyword argument 'is_restart'
========================================================================================================================================= 1 failed, 28 deselected in 7.37s =========================================================================================================================================
```
Caused by #124819.
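
The failure mode can be reproduced in isolation with a sketch like the one below. The class names are hypothetical simplifications of the classes in the traceback; only the `_stop_workers(..., is_restart=...)` call shape is taken from it:

```python
class SimpleAgent:
    def _stop_workers(self, worker_group, is_restart=False):
        raise NotImplementedError

    def _restart_workers(self, worker_group):
        # The base class now passes the new keyword argument.
        self._stop_workers(worker_group, is_restart=True)


class StaleTestAgent(SimpleAgent):
    # Override still uses the old signature, so the call above fails.
    def _stop_workers(self, worker_group):
        pass


class FixedTestAgent(SimpleAgent):
    # Override accepts the new keyword argument.
    def _stop_workers(self, worker_group, is_restart=False):
        self.stopped = (worker_group, is_restart)
```

Calling `_restart_workers` on `StaleTestAgent` raises the same kind of `TypeError` as in the log, while `FixedTestAgent` works.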

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126002
Approved by: https://github.com/ezyang
2024-05-14 21:36:24 +00:00
3a58d40b93 [Profiler] Clean up deprecated use_cuda by default (#126180)
Summary: Should not be setting use_cuda by default anymore, since it is deprecated. Instead it will be set via use_device="cuda".

Test Plan:
CI and ran locally:

Before:
```
[INFO: pytorch_resnet_integration_test.py:  196]: step: 80, peak allocated GPU mem: 3.17GB, peak active GPU mem: 3.17GB, peak reserved GPU mem: 3.39GB.
/data/users/aaronshi/fbsource/buck-out/v2/gen/fbcode/277373c3e83d278c/kineto/libkineto/fb/integration_tests/__pytorch_resnet_integration_test__/pytorch_resnet_integration_test#link-tree/torch/autograd/profiler.py:215: UserWarning:

The attribute `use_cuda` will be deprecated soon, please use ``use_device = 'cuda'`` instead.

  Log file: /tmp/libkineto_activities_812639.json
  Trace start time: 2024-05-14 08:44:50  Trace duration: 500ms
  Warmup duration: 5s
  Max GPU buffer size: 128MB
  Enabled activities: cpu_op,user_annotation,gpu_user_annotation,gpu_memcpy,gpu_memset,kernel,external_correlation,cuda_runtime,cuda_driver,cpu_instant_event,python_function,xpu_runtime,privateuse1_runtime,privateuse1_driver
  Manifold bucket: gpu_traces
  Manifold object: tree/traces/clientAPI/0/1715701483/devvm2184.cco0/libkineto_activities_812639.json
  Trace compression enabled: 1
  TTL in seconds: 31536000 (365 days)
INFO:2024-05-14 08:44:43 812639:812639 CuptiActivityProfiler.cpp:971] Enabling GPU tracing
```

After:
```
[INFO: pytorch_resnet_integration_test.py:  196]: step: 80, peak allocated GPU mem: 3.17GB, peak active GPU mem: 3.17GB, peak reserved GPU mem: 3.39GB.
  Log file: /tmp/libkineto_activities_903554.json
  Trace start time: 2024-05-14 09:05:47  Trace duration: 500ms
  Warmup duration: 5s
  Max GPU buffer size: 128MB
  Enabled activities: cpu_op,user_annotation,gpu_user_annotation,gpu_memcpy,gpu_memset,kernel,external_correlation,cuda_runtime,cuda_driver,cpu_instant_event,python_function,xpu_runtime,privateuse1_runtime,privateuse1_driver
  Manifold bucket: gpu_traces
  Manifold object: tree/traces/clientAPI/0/1715702740/devvm2184.cco0/libkineto_activities_903554.json
  Trace compression enabled: 1
  TTL in seconds: 31536000 (365 days)
INFO:2024-05-14 09:05:40 903554:903554 CuptiActivityProfiler.cpp:971] Enabling GPU tracing
```

Differential Revision: D57337445

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126180
Approved by: https://github.com/davidberard98
2024-05-14 21:23:31 +00:00
534c34b320 Fix copy-pasted docs, reversing the load and save description (#125993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125993
Approved by: https://github.com/kwen2501, https://github.com/fegin
2024-05-14 21:14:16 +00:00
2973c9bb88 [export] add SchemaCheckMode testing for pre-dispatch export, OpInfo (#125481)
This adds a new dispatch mode, PreDispatchSchemaCheckMode, built on top of SchemaCheckMode, used for verifying op schemas for functionalization for PreDispatch IR. More specifically, the mode runs in eager mode on concrete inputs, checking if op schemas incorrectly claim to be functional, but are aliasing or mutating. This mode is pushed to the pre-dispatch mode stack, and run before decompositions.

Current testing is hooked up to OpInfo, containing 1103 tests on 600 unique ops. Below is a list of ops that fail testing. One caveat is we only raise errors on ops that claim to be functional - if an op schema admits aliasing or mutating but fails testing for the other, it still may decompose further and become functional.

List of failed ops:
```
aten.atleast_1d.default
aten.atleast_2d.default
aten.atleast_3d.default
aten.cartesian_prod.default
aten.conj_physical.default
aten.alpha_dropout.default
aten.feature_dropout.default
aten.feature_alpha_dropout.default
aten.unsafe_chunk.default
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125481
Approved by: https://github.com/tugsbayasgalan
2024-05-14 21:07:21 +00:00
534ddfa619 Move compute unbacked bindings call to track_tensor_tree (#126168)
This ensures we hit it in all the HOP proxy tensor implementations

Fixes https://github.com/pytorch/pytorch/issues/125869

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126168
Approved by: https://github.com/ydwu4
2024-05-14 21:05:05 +00:00
54131ecb25 Remove redundant spaces in CMakeLists.txt (#126042)
Fixes #126023

```diff
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 79db67e735..924721d2e6 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -281,8 +281,8 @@ if(NOT DEFINED USE_VULKAN)
 endif()

 option(USE_SLEEF_FOR_ARM_VEC256 "Use sleef for arm" OFF)
-option(USE_SOURCE_DEBUG_ON_MOBILE "Enable " ON)
-option(USE_LITE_INTERPRETER_PROFILER "Enable " ON)
+option(USE_SOURCE_DEBUG_ON_MOBILE "Enable" ON)
+option(USE_LITE_INTERPRETER_PROFILER "Enable" ON)
 option(USE_VULKAN_FP16_INFERENCE "Vulkan - Use fp16 inference" OFF)
 option(USE_VULKAN_RELAXED_PRECISION "Vulkan - Use relaxed precision math in the kernels (mediump)" OFF)
 # option USE_XNNPACK: try to enable xnnpack by default.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126042
Approved by: https://github.com/r-barnes
2024-05-14 21:04:49 +00:00
7ed67cdbcc Add compile time smoketest for foreach (#126136)
Fixes [T175425693](https://www.internalfb.com/intern/tasks/?t=175425693)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126136
Approved by: https://github.com/yanboliang
2024-05-14 21:00:55 +00:00
a8eac0efa8 fix: unknown CMake command "check_function_exists" (#126165)
When building pytorch with OpenBLAS on windows I ran into this CMake issue:

```
CMake Error at cmake/Modules/FindLAPACK.cmake:137 (check_function_exists):
  Unknown CMake command "check_function_exists".
Call Stack (most recent call first):
  cmake/Dependencies.cmake:1745 (find_package)
  CMakeLists.txt:708 (include)
```

Similarly described here: https://discuss.pytorch.org/t/cmake-with-error-by-compiling-on-windows-with-mingw32-make/159140

This PR fixes this issue by adding:

```
include(CheckFunctionExists)
```

to the offending CMake file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126165
Approved by: https://github.com/ezyang
2024-05-14 20:54:06 +00:00
4a8db9d45b [dynamo] reset grad state in aotdispatch test, add failing trace functional tensor test to dynamo (#126113)
Workaround for https://github.com/pytorch/pytorch/issues/125568.

We could add additional global state to reset (e.g. autocast?) or move this setup/teardown to a more general place.

Also added a minimal repro for the linked issue - will investigate in a followup PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126113
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2024-05-14 20:42:49 +00:00
f6a00a8032 [inductor] Add abs to index_propagation (#124616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124616
Approved by: https://github.com/lezcano
ghstack dependencies: #124119
2024-05-14 20:14:53 +00:00
c30ea3387b [inductor] Improve stability of scaled softmax (#124119)
This adds a pattern which replaces:
```python
   scale(x) - scale(x).amax(dim, keepdim=True)
```
with
```python
   scale(x - x.amax(dim, keepdim=True))
```
where `scale` can be either multiplication or division by a scalar,
or a tensor that is broadcast in the `dim` dimension.

We can find this pattern inside of the decomposed graph of:
```python
F.softmax(scale(x), dim=dim)
```

This both reduces the chance of hitting the `fma` issue and avoids
recomputing `scale(x)` inside and outside the reduction, which may be
significant if we can remove an extra division.
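
A quick numerical check of the rewrite in plain Python, under the assumption that `scale` is multiplication by a positive scalar `s` (for that case the max of `s * x` is `s * max(x)`, so the two forms agree mathematically). The function names are illustrative and this is not the Inductor pattern itself:

```python
import math


def softmax_original(xs, s):
    # softmax(scale(x)): subtract the max of the scaled values.
    scaled = [s * x for x in xs]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_rewritten(xs, s):
    # Rewritten form: subtract the max first, then scale once.
    m = max(xs)
    exps = [math.exp(s * (x - m)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


xs = [0.5, -1.0, 3.0, 2.0]
a = softmax_original(xs, 0.125)
b = softmax_rewritten(xs, 0.125)
```

The two results agree to floating-point tolerance. In the rewritten form the largest element maps to exactly `exp(0)`, because `x - x.amax()` is exactly zero there before scaling.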

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124119
Approved by: https://github.com/lezcano
2024-05-14 20:14:53 +00:00
352a893b0c Fast standalone symbolize for unwinding (#123966)
We've had issues using addr2line. Certain versions of CentOS ship a
version of it with a performance regression that makes it very slow,
and even normally it is not that fast, taking several seconds even when parallelized
for a typical memory trace dump.

Folly Symbolize and LLVMSymbolize are fast, but using them requires PyTorch to take a dependency on those libraries, and given the number of environments we run stuff in, we end up hitting cases where we fall back to the slow addr2line behavior.

This adds a standalone symbolizer to PyTorch similar to the unwinder which has
no external dependencies and is ~20x faster than addr2line for unwinding PyTorch frames.

I've tested this on some memory profiling runs using all combinations of {gcc, clang} x {dwarf4, dwarf5} and it seems to do a good job at getting line numbers and function names right. It is also careful to route all reads of library data through the `CheckedLexer` object, which ensures it is not reading out of bounds of the section. Errors are routed through UnwindError so that those exceptions get caught and we produce a ?? frame rather than crash. I also added a fuzz test which gives all our symbolizer options random addresses in the process to make sure they do not crash.
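
The bounds-checking idea can be illustrated with a short Python sketch. The real symbolizer is C++; the class below only borrows the `CheckedLexer` and `UnwindError` names from the description, and `read_u32` and `symbolize` are hypothetical methods made up for this example:

```python
import struct


class UnwindError(Exception):
    pass


class CheckedLexer:
    """Reads from a byte section and refuses to read outside of it."""

    def __init__(self, data: bytes, offset: int = 0) -> None:
        self.data = data
        self.pos = offset

    def read_u32(self) -> int:
        end = self.pos + 4
        # Every read is validated against the section bounds first.
        if self.pos < 0 or end > len(self.data):
            raise UnwindError("read out of bounds")
        (value,) = struct.unpack_from("<I", self.data, self.pos)
        self.pos = end
        return value


def symbolize(section: bytes, offset: int) -> str:
    # Errors become a "??" frame instead of a crash.
    try:
        return hex(CheckedLexer(section, offset).read_u32())
    except UnwindError:
        return "??"
```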

Differential Revision: [D56828968](https://our.internmc.facebook.com/intern/diff/D56828968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123966
Approved by: https://github.com/ezyang, https://github.com/aaronenyeshi
2024-05-14 19:39:17 +00:00
5fb4a766b8 [CUDA] [CI] Add cu124 docker images (#125944)
Fixes issues encountered in https://github.com/pytorch/pytorch/pull/121956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125944
Approved by: https://github.com/atalman
2024-05-14 19:38:10 +00:00
ed327876f5 [codemod] c10:optional -> std::optional (#126135)
Generated by running the following from PyTorch root:
```
find . -regex ".*\.\(cpp\|h\|cu\|hpp\|cc\|cxx\)$" | grep -v "build/" | xargs -n 50 -P 4 perl -pi -e 's/c10::optional/std::optional/'
```

`c10::optional` is just an alias for `std::optional`. This removes usages of that alias in preparation for eliminating it entirely.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126135
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/albanD, https://github.com/aaronenyeshi
2024-05-14 19:35:51 +00:00
b55f57b7af [codemod][lowrisk] Remove extra semi colon from caffe2/c10/core/SymNodeImpl.h (#123055)
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123055
Approved by: https://github.com/Skylion007
2024-05-14 19:35:29 +00:00
023f05cfe6 Allow symbols to reach conv_layout stride argument #125829 (#126116)
https://github.com/pytorch/pytorch/pull/125829 was reverted. I rebased it; the error could have been a merge error,
because it is not reproducible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126116
Approved by: https://github.com/anijain2305
2024-05-14 19:22:16 +00:00
0e6462f69a [pipelining] Consolidate test models into a registry (#126114)
Resolves https://github.com/pytorch/PiPPy/issues/1062.

Also added a gradient equivalence test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126114
Approved by: https://github.com/H-Huang
ghstack dependencies: #125729, #125975
2024-05-14 19:11:54 +00:00
38b8b614a2 [ROCm] Implement forward AD for miopen_batch_norm (#125069)
Implements forward automatic differentiation support for miopen_batch_norm and unskips the associated unit tests. Also fixes a class of functorch-related unit tests that failed a contiguous tensor assertion in BatchNorm_miopen.cpp. The solution was to only route tensors with at least 3 dimensions to miopen_batch_norm. The same restriction already existed in the cudnn path, which is why the tests in question only failed on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125069
Approved by: https://github.com/jeffdaily, https://github.com/andrewor14
2024-05-14 19:09:50 +00:00
1a28f731dc [optim] Merge the pyi files into py files of optimizer (#125452)
Continue the work of pytorch/pytorch#125153
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125452
Approved by: https://github.com/janeyx99
2024-05-14 18:24:50 +00:00
a00a99e801 [profiler] Report strides in json trace (#125851)
We already collect strides, we just don't report them anywhere.

Note: this depends on concrete input collection being enabled, which I think is currently not the case internally.

Differential Revision: [D57165421](https://our.internmc.facebook.com/intern/diff/D57165421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125851
Approved by: https://github.com/Chillee, https://github.com/aaronenyeshi
2024-05-14 18:24:24 +00:00
50c3d58734 [onnx.export] Cache AllGraphInputsStatic (#123028)
This PR is part of an effort to speed up torch.onnx.export (#121422).

- The inputs (dynamic inputs and constants) do not change as nodes are added, and the value is expensive to re-compute for every node. So we cache it and avoid computing it for every node. Open to an entirely different solution as well.
- Resolves (5) in #121422.
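
The caching idea, sketched in Python with a toy graph. The actual change lives in the C++ ONNX exporter; `ToyGraph`, its fields and the `scans` counter are hypothetical and exist only to show that the expensive scan runs once:

```python
class ToyGraph:
    def __init__(self, inputs) -> None:
        self.inputs = list(inputs)  # fixed for the lifetime of the graph
        self.nodes = []
        self._all_inputs_static = None  # cache, filled on first use
        self.scans = 0  # counts how often the expensive scan runs

    def all_inputs_static(self) -> bool:
        if self._all_inputs_static is None:
            self.scans += 1
            self._all_inputs_static = all(i["static"] for i in self.inputs)
        return self._all_inputs_static

    def add_node(self, name: str) -> bool:
        self.nodes.append(name)
        # Each node still asks the question, but the answer is cached.
        return self.all_inputs_static()


graph = ToyGraph([{"static": True}, {"static": True}])
results = [graph.add_node(f"node{i}") for i in range(100)]
```

This is only valid because the inputs never change after construction; a graph whose inputs can change would need to invalidate the cache.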

(partial fix of #121545)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123028
Approved by: https://github.com/justinchuby
2024-05-14 18:19:04 +00:00
3cba50e478 [quant] Make per_group and per_token quant match torch.fake_quantize (#125781)
Summary: Follow-up to https://github.com/pytorch/ao/pull/229.
This resolves the difference between `input.div(scales)` and
`input.mul(1.0 / scales)`, which results in small numerical
discrepancies on some inputs.
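
A tiny illustration of why the two forms can disagree, using Python floats (float64) rather than tensors. `find_mismatches` is a hypothetical helper; the tensors in the tests likely use a different dtype, so the specific values that differ will not be the same:

```python
def find_mismatches(limit: int) -> list:
    # Collect (x, s) pairs where dividing and multiplying by the reciprocal
    # round to different floating-point results.
    mismatches = []
    for x in range(1, limit):
        for s in range(1, limit):
            if x / s != x * (1.0 / s):
                mismatches.append((x, s))
    return mismatches


pairs = find_mismatches(10)
```

The reciprocal `1.0 / s` is rounded once and the product is rounded again, whereas the division is rounded only once, so the two can land on adjacent floats.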

Test Plan:
python test/test_quantization.py TestQuantizedTensor.test_decomposed_quantize_per_channel_group
python test/test_quantization.py TestQuantizedTensor.test_decomposed_quantize_per_token

Reviewers: jerryzh168

Subscribers: jerryzh168, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125781
Approved by: https://github.com/jerryzh168
2024-05-14 18:18:54 +00:00
3892e86c94 [FSDP2] Changed grad acc test to use data parallel ref model (#126161)
This simplifies the test a bit.

**Context**
Option 1: Ref model is data parallel. Each rank's ref model receives local batch. We manually all-reduce gradients and divide them by world size to match DDP/FSDP semantics.
Option 2: Ref model is not data parallel. Each rank's ref model receives the same global batch. We manually divide the ref model's gradients by world size to match DDP/FSDP semantics. (Note that all ranks have the same ref model and same global batch.)

All of our other unit tests are written following Option 1, which is simpler and a more direct comparison to what our claimed semantics are. This PR switches the gradient accumulation test from following Option 2 to following Option 1.
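
A toy check that the two options yield the same reference gradient, in plain Python. It assumes a scalar model `w * x` and a loss that is summed over samples, which are simplifications chosen for this sketch and not necessarily what the test uses:

```python
def grad_sum_sq_error(w, batch):
    # d/dw of sum_i (w * x_i - t_i)^2 = sum_i 2 * (w * x_i - t_i) * x_i
    return sum(2 * (w * x - t) * x for x, t in batch)


def option1(w, global_batch, world_size):
    # Each rank sees its local shard; gradients are all-reduced (summed)
    # and divided by world size.
    per_rank = len(global_batch) // world_size
    local_grads = [
        grad_sum_sq_error(w, global_batch[r * per_rank:(r + 1) * per_rank])
        for r in range(world_size)
    ]
    return sum(local_grads) / world_size


def option2(w, global_batch, world_size):
    # Every rank sees the whole global batch and divides by world size.
    return grad_sum_sq_error(w, global_batch) / world_size


batch = [(1, 2), (2, 1), (3, 5), (4, 0), (5, 7), (6, 3), (7, 9), (8, 4)]
g1 = option1(3, batch, world_size=4)
g2 = option2(3, batch, world_size=4)
```

With a mean-reduced loss the scaling would differ, so the equality shown here depends on the summed-loss assumption.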

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126161
Approved by: https://github.com/wanchaol
ghstack dependencies: #126067, #126070
2024-05-14 18:15:38 +00:00
4ded666535 [FSDP2] Factored out MLPStack to de-dup code (#126070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126070
Approved by: https://github.com/wanchaol
ghstack dependencies: #126067
2024-05-14 18:13:51 +00:00
48f98bcdfc [TD] Enable test removal on most default configs + distributed CUDA for everyone (#125931)
yolo

Add the longest jobs in pull:
* default cpu configs
* non sm86 cuda
* distributed cuda for everyone

Still excluding
* slow, inductor, rocm, onnx, mac, dynamo
* distributed cpu
* windows cuda
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125931
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-05-14 17:35:12 +00:00
db3b38202b Improve dead code elimination of unnecessary int arguments (#126074)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126074
Approved by: https://github.com/lezcano
ghstack dependencies: #125325, #125915
2024-05-14 17:22:30 +00:00
9df2f8687f cprofile every compile id [x/y] to keep consistent with tlparse (#125659)
This PR moves the cprofile decorator to keep it consistent with `torch_inductor_stats` logging. The change is needed by fbcode diffs that enable profiling in internal e2e jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125659
Approved by: https://github.com/ezyang
2024-05-14 17:09:28 +00:00
2e4d011195 [FSDP2] Used CommDebugMode in grad acc test (#126067)
+9/-27 lines -- very nice :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126067
Approved by: https://github.com/wanchaol
2024-05-14 16:43:37 +00:00
20aa7cc678 Revert "[c10d] Add an option for NAN check on every collective (#125726)"
This reverts commit 6db32710074f0944305b2d1e4571bb4ce571bf6a.

Reverted https://github.com/pytorch/pytorch/pull/125726 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the new test is failing on both multigpu and rocm distributed, i.e. c712b0f8a3 ([comment](https://github.com/pytorch/pytorch/pull/125726#issuecomment-2110646075))
2024-05-14 16:26:34 +00:00
aac215a824 SymInt-ify unsqueeze_copy (#125976)
Fixes https://github.com/pytorch/pytorch/issues/125853

I only half-know how to code c++ so please lmk if I did templating incorrectly 🙈
The reason I used a template is because the `InferUnsqueezeGeometryResult` struct gets used in a couple of other places, like for unsqueeze_quantized, but I wasn't sure if I should symint-ify those too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125976
Approved by: https://github.com/larryliu0820, https://github.com/ezyang
2024-05-14 15:58:52 +00:00
ed76079af3 Revert "Remove Caffe2 python code (#126035)"
This reverts commit 9a1bf39c6629e27cad281393059244791b82a166.

Reverted https://github.com/pytorch/pytorch/pull/126035 on behalf of https://github.com/jeanschmidt due to Seems to have introduced lint error: Error: Module 'onnx' has no attribute 'numpy_helper' ([comment](https://github.com/pytorch/pytorch/pull/126035#issuecomment-2110570863))
2024-05-14 15:47:33 +00:00
d1f254dce8 Add a cache mechanism to accelerate torch.compile-for-eager (#116368)
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.

In this PR, we are trying to enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we will store a persistent config to cache the kernel information for the aten operation.

The persistent config consists of two parts - meta_info and kernel_path.

- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.

When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is the key, and the kernel library is the value.

Currently, this PR only supports static shape to guard the kernel.

Take a `mul` as an example.
```python
import time
from typing import Any

import torch

class MulKernel:
  def __init__(self) -> None:
    pass

  def __call__(self, *args: Any, **kwargs: Any) -> Any:
    with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
      opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={
          "aot_inductor.eager_mode": True,
          "aot_inductor.eager_op_name": "mul_Tensor"
        }
      )
      return opt_fn(*args, **kwargs)

torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")

_, overload_names = torch._C._jit_get_operation("aten::mul")
# Register the compiled kernel for each overload of aten::mul.
for overload_name in overload_names:
  schema = torch._C._get_schema("aten::mul", overload_name)
  reg_name = schema.name
  if schema.overload_name:
    reg_name = f"{reg_name}.{schema.overload_name}"
  torch_compile_op_lib_impl.impl(
    reg_name,
    MulKernel(),
    "CUDA",
    compile_mode=True)

device = "cuda"  # matches the "CUDA" dispatch key used for registration above
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul
# Warm up
for _ in range(warm_up_iter):
    fn(a, b)

# Collect performance
beg = time.time()
for _ in range(iter):
    fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.

```json
[
    {
        "meta_info": [
            {
                "is_symbolic": false,
                "device_type": "cuda",
                "dtype": "torch.float32",
                "sizes": [1024, 1024],
                "strides": [1024, 1]
            },
            {
                "is_symbolic": false,
                "device_type": "cuda",
                "dtype": "torch.float32",
                "sizes": [1024, 1024],
                "strides": [1024, 1]
            }
        ],
        "kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
    }
]
```
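The lookup described above (meta_info as key, kernel library as value) can be sketched in plain Python. This is only an illustration of the idea, not the `kernel_holder` implementation: the helper names `meta_key`, `build_cache_map`, and `lookup` are hypothetical, and the kernel path is a placeholder.

```python
import json

# Hypothetical helpers; only the JSON layout is taken from the config above.
def meta_key(tensor_meta):
    """Turn one tensor's metadata dict into a hashable tuple."""
    return (
        tensor_meta["is_symbolic"],
        tensor_meta["device_type"],
        tensor_meta["dtype"],
        tuple(tensor_meta["sizes"]),
        tuple(tensor_meta["strides"]),
    )

def build_cache_map(config_text):
    """Map the metadata of all inputs to the path of the compiled kernel."""
    cache = {}
    for entry in json.loads(config_text):
        key = tuple(meta_key(m) for m in entry["meta_info"])
        cache[key] = entry["kernel_path"]
    return cache

def lookup(cache, input_metas):
    """Return the cached kernel path, or None on a cache miss."""
    return cache.get(tuple(meta_key(m) for m in input_metas))

tensor_meta = {
    "is_symbolic": False,
    "device_type": "cuda",
    "dtype": "torch.float32",
    "sizes": [1024, 1024],
    "strides": [1024, 1],
}
config_text = json.dumps([
    {
        "meta_info": [tensor_meta, tensor_meta],
        "kernel_path": "/tmp/example_kernel.so",  # placeholder, not a real artifact
    }
])
cache = build_cache_map(config_text)

# Same static shapes: cache hit. Different sizes: cache miss.
other_meta = dict(tensor_meta, sizes=[512, 512], strides=[512, 1])
hit = lookup(cache, [tensor_meta, tensor_meta])
miss = lookup(cache, [other_meta, other_meta])
```

Because the key includes sizes and strides, any change in shape is a miss, which matches the static-shape-only guard mentioned above.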

Performance-wise, we measured mul.Tensor through torch.compile over 10000 runs (end to end); the data is as follows. We will collect more data once dynamic shapes are supported.

- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
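To put these totals in perspective, the short calculation below derives per-call latency and relative overhead from the four numbers listed above. It assumes each total covers the 10000 timed iterations; the variable names are illustrative only.

```python
runs = 10000
# Totals in milliseconds, copied from the list above.
timings_ms = {
    "eager": 266.11,
    "no_cache": 3455.54,
    "cache_miss": 3555.3,
    "cache_hit": 267.12,
}

# Average latency per call, in microseconds.
per_call_us = {name: total * 1000 / runs for name, total in timings_ms.items()}

# Relative overhead of a cache hit compared with eager execution.
hit_overhead = timings_ms["cache_hit"] / timings_ms["eager"] - 1

# Slowdown of a cache miss compared with eager execution.
miss_slowdown = timings_ms["cache_miss"] / timings_ms["eager"]

print(f"eager per call: {per_call_us['eager']:.2f} us")
print(f"cache-hit overhead: {hit_overhead:.2%}")
print(f"cache-miss slowdown: {miss_slowdown:.1f}x")
```

Under that assumption a cache hit costs well under 1% more than eager, while a miss is more than an order of magnitude slower.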

Hardware:

- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10

Software:

- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0

Differential Revision: [D57216427](https://our.internmc.facebook.com/intern/diff/D57216427)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116368
Approved by: https://github.com/jansel, https://github.com/atalman
2024-05-14 15:43:48 +00:00
b3a8a3cbab Fix typos in torch._dynamo.config.py (#126150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126150
Approved by: https://github.com/Skylion007
2024-05-14 14:27:35 +00:00
680a568721 Fix typo in HistogramKernel.cpp (#126156)
Fix typo in HistogramKernel.cpp

```diff
diff --git a/aten/src/ATen/native/cpu/HistogramKernel.cpp b/aten/src/ATen/native/cpu/HistogramKernel.cpp
index 196bfd5647..0505271f6a 100644
--- a/aten/src/ATen/native/cpu/HistogramKernel.cpp
+++ b/aten/src/ATen/native/cpu/HistogramKernel.cpp
@@ -100,7 +100,7 @@ void histogramdd_cpu_contiguous(Tensor& hist, const TensorList& bin_edges,

     TensorAccessor<const input_t, 2> accessor_in = input.accessor<const input_t, 2>();

-    /* Constructs a c10::optional<TensorAccessor> containing an accessor iff
+    /* Constructs a c10::optional<TensorAccessor> containing an accessor if
      * the optional weight tensor has a value.
      */
     const auto accessor_wt = weight.has_value()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126156
Approved by: https://github.com/r-barnes
2024-05-14 14:26:35 +00:00
cyy
9a1bf39c66 Remove Caffe2 python code (#126035)
Follows the recent changes of Caffe2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126035
Approved by: https://github.com/r-barnes, https://github.com/Skylion007
2024-05-14 14:23:46 +00:00
9641a8db25 [optim] deprecate LRScheduler.print_lr (#126105)
Fixes #99270

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126105
Approved by: https://github.com/janeyx99
2024-05-14 14:13:03 +00:00
37596769d8 Autocast vdot (#125697)
Fixes https://github.com/pytorch/pytorch/issues/125544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125697
Approved by: https://github.com/jbschlosser
2024-05-14 12:05:02 +00:00
556e4ec6c9 [FSDP] Add device in pin_memory argument (#119878)
Add device to pin_memory argument to support other backends like HPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119878
Approved by: https://github.com/awgu
2024-05-14 10:30:00 +00:00
9dec41b684 add avx512 specialization for vec_shuffle_down (#125147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125147
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/peterbell10
2024-05-14 08:26:13 +00:00
8bf9e99cea [pytorch][cuda] Some speedup on depth wise convolution 2D forward (#125362)
This PR does a few things:

- Adds a generic implementation for `conv_depthwise2d` when the filter size is non-standard. This implementation is faster because it doesn't do edge condition checks inside the innermost loops. We avoid the checks by calculating the boundaries ahead of the loop.
- Hints to nvcc to minimize the register usage so that we squeeze out more memory bandwidth
- Adds filter size 5 as a common size where we can use the template implementation to improve unrolling and generate more efficient code

The implementation doesn't completely fix the issue described in https://github.com/pytorch/pytorch/issues/18631. For that we need to rewrite the kernel using the suggestions described in the issue chat. This PR uses the same order of accessing the tensor as before but just removes overhead instructions in the inner loops to get the speedup.
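The idea of moving edge checks out of the innermost loop can be illustrated with a 1D analogue in plain Python. This is a sketch of the technique only, not the CUDA kernel from this PR; both function names are hypothetical.

```python
def conv1d_checked(x, w, pad):
    """Reference version: every tap is bounds-checked inside the inner loop."""
    n, k = len(x), len(w)
    out = [0.0] * (n + 2 * pad - k + 1)
    for o in range(len(out)):
        acc = 0.0
        for j in range(k):
            i = o - pad + j
            if 0 <= i < n:  # edge condition check on every iteration
                acc += x[i] * w[j]
        out[o] = acc
    return out

def conv1d_hoisted(x, w, pad):
    """Same result, but the valid tap range is computed before the inner loop."""
    n, k = len(x), len(w)
    out = [0.0] * (n + 2 * pad - k + 1)
    for o in range(len(out)):
        start = o - pad
        j_lo = max(0, -start)      # first tap that lands inside the input
        j_hi = min(k, n - start)   # one past the last tap inside the input
        acc = 0.0
        for j in range(j_lo, j_hi):  # no bounds check needed here
            acc += x[start + j] * w[j]
        out[o] = acc
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w = [1.0, 0.0, -1.0, 2.0, 0.5]
checked = conv1d_checked(x, w, pad=2)
hoisted = conv1d_hoisted(x, w, pad=2)
```

Both versions visit the same taps in the same order, so they produce identical results; the second simply does the boundary arithmetic once per output element.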

Before:

```
conv2d-performance:
         B      C      iH      iW    kH    kW  native (cpu)  conv2d (cuda)  conv2d-fp16 (cuda)
0      8.0   64.0  1024.0  1008.0   5.0   5.0    149.052643      24.982176            3.236192
1      8.0   64.0  1008.0  1008.0   5.0   5.0    150.810333      24.643536            3.237760
2      4.0   48.0   720.0   539.0   6.0   1.0     15.747776       2.636320            1.788672
3      4.0  120.0   379.0   283.0   6.0   1.0     12.234080       1.791712            1.231360
4      4.0   32.0   713.0   532.0   6.0   1.0     10.362272       1.731584            1.170544
5      4.0    3.0   712.0   542.0  31.0  31.0     24.965248       3.406304            4.165440
6      4.0  120.0   379.0   288.0   1.0   6.0     10.772512       1.215616            0.939936
7   1024.0  384.0     1.0   928.0   1.0   3.0     60.051582       7.594256            2.861344
8      4.0   24.0   687.0   512.0   6.0   1.0     10.231536       1.196704            0.818432
9     96.0   96.0   112.0   112.0   5.0   5.0     21.025631       5.110096            0.715520
10    96.0   80.0    56.0    56.0   5.0   5.0      9.730064       1.016080            0.207424
11    64.0  128.0    64.0    84.0   3.0   3.0     18.759552       0.616736            0.200832
12    16.0  960.0     7.0     7.0   5.0   5.0      0.274880       0.020288            0.014688
13    16.0   64.0   112.0   112.0   3.0   3.0      6.425696       0.189088            0.053728
```

After:

```
         B      C      iH      iW    kH    kW  native (cpu)  conv2d (cuda)  conv2d-fp16 (cuda)
0      8.0   64.0  1024.0  1008.0   5.0   5.0    122.534370      12.915648            3.269936
1      8.0   64.0  1008.0  1008.0   5.0   5.0    126.026978      12.826848            3.236608
2      4.0   48.0   720.0   539.0   6.0   1.0     14.488160       1.803424            1.794368
3      4.0  120.0   379.0   283.0   6.0   1.0     11.556304       1.251200            1.240736
4      4.0   32.0   713.0   532.0   6.0   1.0      9.737841       1.186240            1.174128
5      4.0    3.0   712.0   542.0  31.0  31.0     19.394785       2.017056            2.310368
6      4.0  120.0   379.0   288.0   1.0   6.0      9.586752       0.828736            0.843712
7   1024.0  384.0     1.0   928.0   1.0   3.0     48.939903       5.529312            2.860768
8      4.0   24.0   687.0   512.0   6.0   1.0     13.474000       0.831920            0.825280
9     96.0   96.0   112.0   112.0   5.0   5.0     15.439168       2.611616            0.724864
10    96.0   80.0    56.0    56.0   5.0   5.0      5.991968       0.520352            0.207456
11    64.0  128.0    64.0    84.0   3.0   3.0      9.381472       0.609680            0.202832
12    16.0  960.0     7.0     7.0   5.0   5.0      0.265504       0.015680            0.014496
13    16.0   64.0   112.0   112.0   3.0   3.0      2.384832       0.187168            0.053280
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125362
Approved by: https://github.com/ezyang
2024-05-14 07:27:02 +00:00
1370f3a00d [inductor] make mm template work with non-contiguous input (#126106)
Fix https://github.com/pytorch/pytorch/issues/125437 .

The Triton matmul template does not work well with non-contiguous inputs and causes misaligned memory access. This happens for both the inductor matmul template and the triton.ops.matmul op. This PR avoids adding `tl.multiple_of` and `tl.max_contiguous` if the input tensors are not contiguous, which works around the issue. We'll follow up and try to figure out the root cause in the GH issue.

The if/else added to the template should be resolved at compile time, so by itself it does not cause any perf hit.
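As a rough illustration of the condition involved, the sketch below decides from sizes and strides whether a tensor is row-major contiguous, and only then allows alignment hints. It is plain Python under stated assumptions: `is_contiguous` and `alignment_hints_enabled` are hypothetical helpers, not the inductor template code.

```python
def is_contiguous(sizes, strides):
    """Row-major contiguity check from sizes and strides (in elements)."""
    expected = 1
    for size, stride in zip(reversed(sizes), reversed(strides)):
        # The stride of a size-1 dimension does not affect the memory layout.
        if size != 1 and stride != expected:
            return False
        expected *= size
    return True

def alignment_hints_enabled(a_layout, b_layout):
    """Hypothetical gate: emit contiguity hints only if both inputs are contiguous."""
    return is_contiguous(*a_layout) and is_contiguous(*b_layout)

row_major = ([1024, 1024], [1024, 1])
transposed = ([1024, 1024], [1, 1024])  # e.g. the result of a transpose

hints_for_plain = alignment_hints_enabled(row_major, row_major)
hints_for_transposed = alignment_hints_enabled(row_major, transposed)
```

Since the layout is known when the kernel is generated, a gate like this costs nothing at run time.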

Test:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --accuracy --only BertForMaskedLM --training
```
Previously this failed with a misaligned memory access; now it passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126106
Approved by: https://github.com/htyu
2024-05-14 07:21:53 +00:00
60b00b4b4d [CI] Upgrade intel support packages for XPU (#125655)
upgrade intel basekit package to 0.5 for XPU
Works for #114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125655
Approved by: https://github.com/EikanWang, https://github.com/chuanqi129, https://github.com/atalman
2024-05-14 06:50:23 +00:00
c312cd8890 add simple test for nccl metadata (#125317)
Add a few test cases to verify the newly added NCCL metadata in profiler events.
The test looks at `record_param_comms` blocks such as the following:
```
{
  "ph": "X",
  "cat": "cpu_op",
  "name": "record_param_comms",
  "pid": 2840966,
  "tid": 2844581,
  "ts": 2424859.045,
  "dur": 203.866,
  "args": {
    "Collective name": "allreduce",
    "Process Group Description": "default_pg",
    "dtype": "Float",
    "In msg nelems": 100,
    "Global rank start": 0,
    "Group size": 2,
    "Process Group Ranks": "[0, 1]",
    "Record function id": 0,
    "Out msg nelems": 100,
    "Global rank stride": 1,
    "Process Group Name": "0",
  }
}
```
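A check of this kind can be sketched as a filter over the trace events followed by a test for the expected argument keys. This is a simplified illustration, not the actual test in `test_distributed_spawn.py`; the helper names and the `REQUIRED_ARGS` list are assumptions based on the block above.

```python
# Keys taken from the "args" section of the event shown above.
REQUIRED_ARGS = (
    "Collective name",
    "Process Group Description",
    "dtype",
    "In msg nelems",
    "Out msg nelems",
    "Group size",
    "Process Group Ranks",
    "Process Group Name",
)

def nccl_events(trace_events):
    """Select the record_param_comms events from a list of trace events."""
    return [e for e in trace_events if e.get("name") == "record_param_comms"]

def missing_metadata(event):
    """Return the required metadata keys that the event does not carry."""
    args = event.get("args", {})
    return [key for key in REQUIRED_ARGS if key not in args]

events = [
    {"ph": "X", "cat": "cpu_op", "name": "aten::add", "args": {}},
    {
        "ph": "X",
        "cat": "cpu_op",
        "name": "record_param_comms",
        "args": {
            "Collective name": "allreduce",
            "Process Group Description": "default_pg",
            "dtype": "Float",
            "In msg nelems": 100,
            "Out msg nelems": 100,
            "Group size": 2,
            "Process Group Ranks": "[0, 1]",
            "Process Group Name": "0",
        },
    },
]

selected = nccl_events(events)
problems = [missing_metadata(e) for e in selected]
```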

## Unit test

```
>$ touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler

test_ddp_profiling_torch_profiler (__main__.TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler) ... NCCL version 2.20.5+cuda12.0
STAGE:2024-05-01 16:41:15 2840966:2840966 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-01 16:41:15 2840965:2840965 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-01 16:41:17 2840965:2840965 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-01 16:41:17 2840966:2840966 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-01 16:41:17 2840966:2840966 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-01 16:41:17 2840965:2840965 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-01 16:41:18 2840966:2840966 ActivityProfilerController.cpp:316] Completed Stage: Warm STAGE:2024-05-01 16:41:18 2840965:2840965 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-01 16:41:18 2840965:2840965 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-01 16:41:18 2840966:2840966 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-01 16:41:18 2840965:2840965 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-01 16:41:18 2840966:2840966 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
Trace saved to /tmp/tmpvwivp7mo.json
Trace saved to /tmp/tmpvwvsc1fy.json
ok
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125317
Approved by: https://github.com/LucasLLC, https://github.com/kwen2501
2024-05-14 06:20:50 +00:00
b805d3cbcb Modify device check in capturable optimizer to support more devices (#124919)
Fixes #124830

Modify the device check in the capturable optimizer to support more devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124919
Approved by: https://github.com/janeyx99
2024-05-14 05:56:00 +00:00
e0e9d3ed79 make sure device mesh can be imported from torch.distributed (#126119)
as titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126119
Approved by: https://github.com/kwen2501, https://github.com/anijain2305
2024-05-14 05:00:48 +00:00
2ae65b72ff [dtensor] early return for _split_tensor (#125810)
As titled: if `_split_tensor` does not require padding, or the tensor is evenly
sharded on the dim, there is no need to calculate padding and we can simply return early.

This is to avoid some unnecessary CPU operations
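The early-return pattern can be sketched on plain integers. This is a hypothetical helper operating on a dimension size, not DTensor's `_split_tensor`; it assumes chunks of ceiling-divided size, with trailing chunks padded up to that size.

```python
def split_sizes(dim_size, num_chunks):
    """Return (chunk_sizes, pad_sizes) for sharding one dimension."""
    if dim_size % num_chunks == 0:
        # Evenly sharded: skip the padding computation entirely.
        return [dim_size // num_chunks] * num_chunks, [0] * num_chunks

    full = -(-dim_size // num_chunks)  # ceiling division
    sizes = []
    remaining = dim_size
    for _ in range(num_chunks):
        size = min(full, remaining)
        sizes.append(size)
        remaining -= size
    # Each chunk is padded up to the full chunk size.
    pads = [full - size for size in sizes]
    return sizes, pads

even = split_sizes(8, 4)
uneven = split_sizes(10, 4)
```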

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125810
Approved by: https://github.com/wz337
2024-05-14 04:59:27 +00:00
bdaa9b2981 [Dynamo] Wrap set as SetVariable and support isdisjoint by polyfill (#126046)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126046
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-05-14 04:56:06 +00:00
bc9587778c update pointwise cat heuristics (#125772)
Fix for https://github.com/pytorch/pytorch/issues/122871. There are two cases where we emit pointwise cat:

- fusing into a pointwise use
- horizontally fusing copy_ kernels

The regression I looked into previously was due to being overly aggressive in the latter case. I've updated the logic there so that we only emit the horizontal fusion in the case where there are no reductions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125772
Approved by: https://github.com/Chillee
2024-05-14 04:46:27 +00:00
d0f3ae8e67 [Doc] Update Intel GPU Support on README (#126001)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126001
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/gujinghui, https://github.com/EikanWang
2024-05-14 04:42:58 +00:00
812534d27e Skip two LR schedulers with eager memory leaks in compiled optim tests (#126133)
SequentialLR and ChainedLR leak memory, so disable these two schedulers until https://github.com/pytorch/pytorch/issues/126131 is fixed.

Re-enables
https://github.com/pytorch/pytorch/issues/125925
https://github.com/pytorch/pytorch/issues/125924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126133
Approved by: https://github.com/yanboliang, https://github.com/aorenste
2024-05-14 04:42:34 +00:00
9a2beb862d Permit trivial solves for floating point equality in ShapeEnv (#125915)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125915
Approved by: https://github.com/lezcano
ghstack dependencies: #125325
2024-05-14 04:10:01 +00:00
2ba102f689 Implement native support for float inputs in Dynamo and ShapeEnv (#125325)
The big idea is that floats are treated as Tensors on input/output to the FX graph, but on the inside, we immediately call item() on the synthetic Tensor and record regular float operations on it. Canonicalization to Tensor operations will happen in a standalone FX pass. This behavior is enabled by setting the `specialize_float` config variable to False.

The generated graph looks like this for the test `test_unspec_float_output`:

```
 def forward(self, L_x_: "f32[3]", L_y_: "f32[]"):
     l_x_ = L_x_
     l_y_ = L_y_

     # File: /data/users/ezyang/a/pytorch/test/dynamo/test_unspec.py:511 in f, code: return x + 1, y * 2
     add: "f32[3]" = l_x_ + 1;  l_x_ = None
     item: "Sym(zf0)" = l_y_.item();  l_y_ = None
     mul: "Sym(2*zf0)" = item * 2;  item = None
     scalar_tensor: "f32[]" = torch.scalar_tensor(mul);  mul = None
     return (add, scalar_tensor)
```

The ingredients:

* **torch/_dynamo/variables/builder.py** When `specialize_float` is False, we wrap float literals with `wrap_symfloat`. This is an unholy mashup of `wrap_symint` and `wrap_unspecialized_primitive`. The overall strategy is that we first generate a tensor argument (because that's what we want to show up in the FX graph), but then immediately call item() on the tensor argument to get a SymNodeVariable, which we will do the rest of the tracing with. Importantly, this SymNodeVariable is backed with the source of the original float: this means we can guard on the resulting value (something we could NOT do with UnspecializedPythonVariable). This has to be done manually, because if you literally call item() on the tensor, you will end up with an unbacked float. There is a bit of copy-paste from wrap_symint and wrap_unspecialized_primitive which we can try to factor out, but this really is its own thing and you should review every line of code in the function.
* **torch/fx/experimental/symbolic_shapes.py** We now can generate guards on float inputs, and these guards are handled inside of ShapeEnv. So we need to be able to allocate (backed!) float symbols, and produce guards for them. Fairly straightforward generalization.
* **torch/_dynamo/codegen.py** I also need to maintain the invariant that there are no float outputs to the FX graph. I chose to do this at codegen time. When we detect a SymNodeVariable on the return stack for a float, we on the fly convert it (via `as_tensor`) to a TensorVariable, which is the true output. We then special case the output bytecode to call item() on it again. The tensor conversion is memoized on SymNodeVariable since we typically run the code generation process twice.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125325
Approved by: https://github.com/lezcano, https://github.com/jansel
2024-05-14 04:10:01 +00:00
04877dc430 Update context manager for cudnn (#126122)
# Summary
Updates the context manager to support the cudnn backend.

These results were collected using CUDA toolkit 12.3 and cuDNN 9.0.0.

## H100 Numbers
 _power limited_
 ``` Markdown
+--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   CUDNN Time (µs) |   Speedup (CUDNN/Flash) |
+==============+===================+=========+============+===================+===================+=========================+
|            1 |              4096 |      32 |         64 |           665.053 |           498.59  |                 1.33387 |
+--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+
|            1 |              4096 |      16 |        128 |           591.225 |           323.828 |                 1.82574 |
+--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+
|            1 |              8192 |      32 |         64 |          2579.77  |          1933.34  |                 1.33436 |
+--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+
|            1 |              8192 |      16 |        128 |          2297.4   |          1211.33  |                 1.89659 |
+--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+
|            1 |             16384 |      32 |         64 |         10178.2   |          7619.18  |                 1.33587 |
+--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+
|            1 |             16384 |      16 |        128 |          9093.51  |          4725.03  |                 1.92454 |
+--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+
|            1 |             32768 |      32 |         64 |         39893.1   |         29850.6   |                 1.33643 |
+--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+
|            1 |             32768 |      16 |        128 |         36160.9   |         18615.9   |                 1.94247 |
+--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+
|            1 |             65536 |      32 |         64 |        157965     |        116794     |                 1.35251 |
+--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+
|            1 |             65536 |      16 |        128 |        142039     |         73102.1   |                 1.94303 |
+--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+
|            1 |            131072 |      32 |         64 |        621100     |        465143     |                 1.33529 |
+--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+
|            1 |            131072 |      16 |        128 |        556142     |        289776     |                 1.91922 |
+--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+
```
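The speedup column can be reproduced from the two timing columns: it is the Flash time divided by the cuDNN time, so values above 1 mean cuDNN is faster. The check below uses three rows copied from the H100 table; the tuple layout is just for illustration.

```python
rows = [
    # (seq_len, heads, head_dim, flash_us, cudnn_us, reported_speedup)
    (4096, 32, 64, 665.053, 498.59, 1.33387),
    (4096, 16, 128, 591.225, 323.828, 1.82574),
    (131072, 16, 128, 556142, 289776, 1.91922),
]

recomputed = []
for seq_len, heads, head_dim, flash_us, cudnn_us, reported in rows:
    speedup = flash_us / cudnn_us  # > 1 means cuDNN took less time than Flash
    recomputed.append(speedup)
    print(f"S={seq_len} H={heads} D={head_dim}: {speedup:.5f} (table: {reported})")
```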

## A100 Numbers
```Markdown
+--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   CUDNN Time (µs) |   Flex Time (µs) |   Speedup (CUDNN/Flash) |   Speedup (Flex/Flash) |
+==============+===================+=========+============+===================+===================+==================+=========================+========================+
|            1 |              4096 |      32 |         64 |           799.391 |           836.327 |          981.234 |                0.955836 |               0.814679 |
+--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+
|            1 |              4096 |      16 |        128 |           750.131 |           806.964 |          944.766 |                0.929572 |               0.793986 |
+--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+
|            1 |              8192 |      32 |         64 |          3211.84  |          3234.41  |         3803.09  |                0.993022 |               0.844534 |
+--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+
|            1 |              8192 |      16 |        128 |          2984.2   |          3164.66  |         3626.79  |                0.942979 |               0.822821 |
+--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+
|            1 |             16384 |      32 |         64 |         12630.6   |         12673.1   |        14900.6   |                0.996643 |               0.847653 |
+--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+
|            1 |             16384 |      16 |        128 |         11722.7   |         12499.4   |        13763.5   |                0.937862 |               0.851725 |
+--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+
|            1 |             32768 |      32 |         64 |         50068.3   |         51061.2   |        60094     |                0.980556 |               0.833167 |
+--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+
|            1 |             32768 |      16 |        128 |         46283.6   |         49708.7   |        55336.7   |                0.931096 |               0.836399 |
+--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+
|            1 |             65536 |      32 |         64 |        203124     |        203083     |       239618     |                1.0002   |               0.847701 |
+--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+
|            1 |             65536 |      16 |        128 |        187326     |        198364     |       221912     |                0.944355 |               0.844145 |
+--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+
|            1 |            131072 |      32 |         64 |        816813     |        827419     |       978836     |                0.987182 |               0.834473 |
+--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+
|            1 |            131072 |      16 |        128 |        749693     |        845463     |       905696     |                0.886725 |               0.827754 |
+--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+
```

## Script
``` Python
import os
import torch
from typing import Callable
from torch.nn.attention import SDPBackend, sdpa_kernel
from itertools import product
from tqdm import tqdm
from tabulate import tabulate

os.environ["TORCH_CUDNN_SDPA_ENABLED"] = "1"

causal = False

from triton.testing import do_bench
from torch.nn.functional import scaled_dot_product_attention as sdpa

def benchmark_torch_function_in_microseconds(func: Callable, *args, **kwargs) -> float:
    # warmup
    for _ in range(5):
        func(*args, **kwargs)
    return do_bench(lambda: func(*args, **kwargs)) * 1e3

def run_attention_test(backend_name, backend_enum):
    results = []
    batch_sizes = [1]
    seq_lengths = [4096, 8192, 16384, 32768, 65536, 131072]

    torch.cuda.empty_cache()
    for b, s in tqdm(product(batch_sizes, seq_lengths), total=len(batch_sizes) * len(seq_lengths), desc=backend_name):
        for h, d in zip((32, 16), (64, 128)):
            q, k, v = torch.randn(
                b, s, h * d * 3, dtype=torch.bfloat16, device="cuda", requires_grad=False
            ).chunk(3, dim=-1)
            q = q.view(b, -1, h, d).transpose(1, 2)
            k = k.view(b, -1, h, d).transpose(1, 2)
            v = v.view(b, -1, h, d).transpose(1, 2)
            with torch.no_grad(), sdpa_kernel(backend_enum):
                time = benchmark_torch_function_in_microseconds(sdpa, q, k, v, is_causal=False)
            results.append((backend_name, b, s, h, d, time))
    return results

flash_results = run_attention_test("Flash Attention", SDPBackend.FLASH_ATTENTION)
cudnn_results = run_attention_test("CUDNN Attention", SDPBackend.CUDNN_ATTENTION)

# Combine results for comparison
combined_results = []
for flash, cudnn in zip(flash_results, cudnn_results):
    speedup = flash[5] / cudnn[5]
    combined_results.append(
        (flash[1], flash[2], flash[3], flash[4], flash[5], cudnn[5], speedup)
    )

# Tabulate the results
headers = [
    "Batch Size",
    "Sequence Length",
    "Heads",
    "Head Dim",
    "Flash Time (µs)",
    "CUDNN Time (µs)",
    "Speedup (CUDNN/Flash)",
]
table = tabulate(combined_results, headers, tablefmt="grid")
print(table)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126122
Approved by: https://github.com/cpuhrsch
2024-05-14 03:34:19 +00:00
aeb9934bda [AOTI] Fix a problem in https://github.com/pytorch/pytorch/pull/125730 (#126110)
Summary: `generate_c_shim_extern_kernel_call` needs to handle tensor args wrapped with wrap_with_raii_handle_if_needed, to fix some internal test failures

Differential Revision: D57293873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126110
Approved by: https://github.com/huydhn
2024-05-14 02:16:04 +00:00
71467abc44 Changes to compile with 3.13 (#126033)
This is mainly:
- Fix refcount access macro
- Hide all the Dynamo code that needs update as usual
- Add _PyWeakref_ClearRef as an extern provided by CPython. Including the pycore header that defines it would require raw c include shenanigans that I don't think are worth it.
This allows building with both the regular and nogil versions of CPython. Both

Note that this requires the 3.13 branch at least past [d3094744d40de2deefbda9b1996d5029c9ebf0b0](d3094744d4) which we need for mimalloc include and weakref function being exposed.

The debug-only issues in pybind11 with PyMem_MALLOC vs PyObject_MALLOC being mismatched should be resolved by updating either pybind11 or CPython. @colesbury I can send a PR to ifdef the proper use in pybind if you think that this is the best solution here?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126033
Approved by: https://github.com/colesbury
2024-05-14 02:14:57 +00:00
ef7d8ad6af Use source code hash instead of torch version (#126092)
Differential Revision: [D57289808](https://our.internmc.facebook.com/intern/diff/D57289808/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126092
Approved by: https://github.com/masnesral, https://github.com/jansel
2024-05-14 01:53:31 +00:00
3c4058cf18 Add master cache disable switch for inductor (#126084)
Fixes #125699

Differential Revision: [D57284558](https://our.internmc.facebook.com/intern/diff/D57284558/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126084
Approved by: https://github.com/jansel
2024-05-14 01:19:28 +00:00
c712b0f8a3 [export] Fix runtime assertions to add call_function (#125878)
Fixes [internal issue](https://www.internalfb.com/intern/everpaste/?handle=GJCK9xUNpYXovnEBAHfuJ7vQLxZnbsIXAAAB)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125878
Approved by: https://github.com/ezyang
2024-05-14 00:57:50 +00:00
6a5acd91c3 add shape check for rrelu_with_noise (#122870)
Fix https://github.com/pytorch/pytorch/issues/121094.
Add a shape check for rrelu_with_noise that verifies the input tensor and the noise tensor have the same shape.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122870
Approved by: https://github.com/mingfeima, https://github.com/ezyang
2024-05-14 00:12:00 +00:00
6db3271007 [c10d] Add an option for NAN check on every collective (#125726)
Summary:
The NaN check is done through a device-side assert, so no copy from GPU to CPU is needed.
Test Plan:
Unit test for collectives that should experience run time error

(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (38f5143e)]$  python
test/distributed/test_c10d_nccl.py ProcessGroupNCCLTest.test_nan_assert
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)`
failed.
[rank0]:[E507 17:31:56.885473996 Utils.cu:30] CUDA error during
checkForNan: device-side assert triggered

/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)`
failed.
[rank1]:[E507 17:31:56.128961534 Utils.cu:30] CUDA error during
checkForNan: device-side assert triggered

.
----------------------------------------------------------------------
Ran 1 test in 7.723s

OK

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125726
Approved by: https://github.com/kwen2501
2024-05-14 00:05:41 +00:00
1e47c7b11b [inductor] enable software pipelining on AMD devices (#125858)
Summary:
per-AMD, software pipelining is enabled by setting `num_stages=0`, and should provide a nice perf boost for GEMMs. caveat is that `num_stages=1` is preferred for instances of back-to-back GEMMs, but take `num_stages=0` as the better default.

wait to land until triton upstream lands in OSS, pipelining does not work well on the fork

Test Plan: n/a

Reviewed By: xw285cornell, yoyoyocmu

Differential Revision: D56221447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125858
Approved by: https://github.com/pragupta, https://github.com/yoyoyocmu
2024-05-13 22:36:23 +00:00
ec7f2b2626 [DCP] adds type safety to str filtering in EmptyStateDict (#126082)
[DCP] adds type safety to str filtering in EmptyStateDict

Differential Revision: [D57281133](https://our.internmc.facebook.com/intern/diff/D57281133/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126082
Approved by: https://github.com/fegin, https://github.com/wz337, https://github.com/Skylion007
2024-05-13 22:13:05 +00:00
bd3cbdba2f Revert "[optim] add fused_adagrad support for CPU device (#124905)"
This reverts commit 1c3fe8403365db3cc9b75524ae742e3027b745e2.

Reverted https://github.com/pytorch/pytorch/pull/124905 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing distributed multigpu test in trunk 1c3fe84033 ([comment](https://github.com/pytorch/pytorch/pull/124905#issuecomment-2108777063))
2024-05-13 20:53:22 +00:00
36e6f3b339 [caffe2] Make all get_backtrace() implementations lazy (#125750) (#126064)
Summary:

#125682 (D56586844) added support for lazy symbolization to `Error` and adopted it for internal use cases; this commit adopts it for `get_backtrace()` as well.

Test Plan:
Sandcastle and GH CI.

NOTE: This is a resubmit of D56881683, a spurious copypasted line in the Android implementation broke the build, but this was not surfaced by diff tests.

Reproed the breakage with
```
$ fbpython scripts/build_android_app/build_android_app.py --buck-config-files='@//fbandroid/mode/have_libgflags @//fbandroid/mode/static_linking @//xplat/langtech/mobile/android_opt_buck_config_with_et_boltnn' --build-target='fbsource//xplat/langtech/mobile:transcribe_binAndroid-android-arm64'
```
Verified that the fixed diff builds successfully.

Differential Revision: D57275456

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126064
Approved by: https://github.com/ezyang
2024-05-13 20:17:41 +00:00
c098cd0cbb Eliminate a C++11 code pattern in pimpl.h (#126069)
Test Plan: Sandcastle

Differential Revision: D57224687

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126069
Approved by: https://github.com/Skylion007
2024-05-13 19:01:13 +00:00
b9e7b35912 Remove caffe2 from more build files (#125898)
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125898
Approved by: https://github.com/Skylion007
2024-05-13 18:37:59 +00:00
b620231378 Fix nested fqn discovery (#125957)
I think I missed some fix!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125957
Approved by: https://github.com/sanketpurandare, https://github.com/janeyx99
2024-05-13 18:24:56 +00:00
9e1826deff [torchbind] Add inductor support (#123709)
Example inductor generated python code: [P1245776497](https://www.internalfb.com/phabricator/paste/view/P1245776497)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123709
Approved by: https://github.com/eellison
2024-05-13 18:18:17 +00:00
4d8fa7df40 Fix four misspellings of "its" in documentation (#125681)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125681
Approved by: https://github.com/Skylion007, https://github.com/svekars
2024-05-13 18:14:09 +00:00
7f1d5aba93 [FSDP] Use generic device handle instead of cuda (#121620)
In FSDP _optim_utils.py  Use generic device handle instead of cuda
to support other backends

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121620
Approved by: https://github.com/awgu, https://github.com/wz337
2024-05-13 18:07:08 +00:00
3183d65ac0 use shutil.which in _find_cuda_home (#126060)
Replace `subprocess.check_output` call with `shutil.which`, similarly to how this is done in `_find_rocm_home`
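
The idea of locating a toolkit through `shutil.which` can be sketched as follows. This is a minimal illustration, not the actual `_find_cuda_home` code: the function name `cuda_home_from_nvcc` and its `path` parameter are assumptions made for the example, and it assumes `nvcc` lives in `<cuda_home>/bin`.

```python
import os
import shutil
from typing import Optional


def cuda_home_from_nvcc(path: Optional[str] = None) -> Optional[str]:
    """Guess the CUDA install root from the location of `nvcc`.

    `path` is a hypothetical override of the search path (defaults to PATH).
    """
    # shutil.which returns the full path of the executable, or None if absent,
    # without spawning a subprocess.
    nvcc_path = shutil.which("nvcc", path=path)
    if nvcc_path is None:
        return None
    # Assumed layout: <cuda_home>/bin/nvcc, so go up two directory levels.
    return os.path.dirname(os.path.dirname(nvcc_path))
```

Unlike shelling out to `which`, this returns `None` instead of raising when the binary is missing, and it behaves the same on platforms without a `which` command.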
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126060
Approved by: https://github.com/r-barnes
2024-05-13 17:38:17 +00:00
637074983e [inductor] Make load_mask() codegen deterministic (#126017)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126017
Approved by: https://github.com/shunting314
2024-05-13 17:36:52 +00:00
82edc8b5d5 [NT] Make NestedTensor register as having symbolic sizes/strides (#124687)
Fixes #123698

This PR makes TensorImpl::has_symbolic_sizes_strides return true for NestedTensors.

1. It passes in the actual sizes when we call `_make_wrapper_subclass` - this is the change that makes the subclass register as `has_symbolic_sizes_strides() == True`
2. It adds a field to `_make_wrapper_subclass` where an explicit `numel` can be provided. This allows us to skip the numel computation for the storage, which previously failed due to arithmetic on NestedInts.
3. Implements `aten::numel` for NJT - this is separate from the overridden numel in `make_wrapper_subclass` for now. Note also that this means that we leave `dispatch_sizes_strides_policy="sizes"`, so that we call into the custom `numel` implementation (as well as `sizes` and `strides`), because `numel` cannot currently be computed from `sizes` for NJT.

Note also that this depends on #121361, because calling TensorImpl::set_sizes_and_strides() tries to clone the sizes into the tensor, which means that we need `clone` to be implemented on NestedInt.

Differential Revision: [D57225736](https://our.internmc.facebook.com/intern/diff/D57225736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124687
Approved by: https://github.com/albanD
2024-05-13 16:50:25 +00:00
96bdb7a0fb in test_foreach.py patch KINETO_LOG_LEVEL to silence profiler log (#126048)
as per title, `patch.dict` the env var in favor of cleaner logs.
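
A minimal sketch of the `patch.dict` pattern, for illustration only. The helper names are hypothetical, and the value `"5"` is an assumption chosen for the example rather than the value used in the test file.

```python
import os
from typing import Optional
from unittest.mock import patch


def current_kineto_log_level() -> Optional[str]:
    """Read the variable the way a consumer of the environment would."""
    return os.environ.get("KINETO_LOG_LEVEL")


def run_with_quiet_profiler() -> Optional[str]:
    # patch.dict sets the variable only inside the with-block and restores
    # the previous environment (including "unset") on exit.
    with patch.dict(os.environ, {"KINETO_LOG_LEVEL": "5"}):  # "5" is illustrative
        return current_kineto_log_level()
```

Because the original environment is restored automatically, the override cannot leak into other tests in the same process.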

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126048
Approved by: https://github.com/janeyx99
2024-05-13 15:31:56 +00:00
7899034282 [fbcode] remove xcode_public_headers_symlinks (#125966)
Summary: These attributes do nothing in Buck 2, we can remove them.

Test Plan:
```
$ buck2 uquery //... > /dev/null
```

Differential Revision: D57169445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125966
Approved by: https://github.com/malfet
2024-05-13 15:06:35 +00:00
56b271fd7a STRONG_CONSTEXPR -> constexpr (#125872)
Test Plan: Sandcastle

Differential Revision: D57158890

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125872
Approved by: https://github.com/Skylion007
2024-05-13 14:07:26 +00:00
f0c8b93487 Make wrapIndexOnce check async, avoid DtoH sync on index_put_ (#125952)
Internal xref: https://fb.workplace.com/groups/1075192433118967/posts/1427156211255919/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125952
Approved by: https://github.com/lezcano
2024-05-13 13:28:45 +00:00
c0b7b56cf4 [xla hash update] update the pinned xla hash (#126052)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126052
Approved by: https://github.com/pytorchbot
2024-05-13 12:36:51 +00:00
afda6685ae fixed typo in documentation (#125974)
Summary: Fixed typo in documentation. Trying to get familiar with the PR workflow for contributing to PyTorch.

Test Plan: None

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125974
Approved by: https://github.com/ezyang
2024-05-13 04:37:51 +00:00
1c3fe84033 [optim] add fused_adagrad support for CPU device (#124905)
Support fused_adagrad kernel for CPU.

## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c
https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_adagrad time: 0.2500 seconds
_fused_adagrad time: 0.0933 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_adagrad time: 2.8819 seconds
_fused_adagrad time: 1.7591 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-05-13 01:16:20 +00:00
cyy
4b88a5bd0b Remove AnalyzeTemporaryDtors from clang-tidy config (#125985)
Remove AnalyzeTemporaryDtors from clang-tidy config which is not used in newer releases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125985
Approved by: https://github.com/Skylion007
2024-05-12 21:44:33 +00:00
34910f87f0 [BE]: Update ruff to v0.4.4 (#125031)
Update ruff version to 0.4.2. This version mostly has bugfixes for the new parser and also updates the f-string rule to be able to apply more fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125031
Approved by: https://github.com/albanD, https://github.com/malfet
2024-05-12 20:02:37 +00:00
ae9a4fa63c [ROCm] enforce ROCM_VERSION >= 6.0 (#125646)
Remove any code relying on ROCM_VERSION < 6.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125646
Approved by: https://github.com/albanD, https://github.com/eqy
2024-05-12 18:01:28 +00:00
cyy
0116ffae7f Remove deprecated _aminmax operator (#125995)
It has been deprecated for a long time.

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125995
Approved by: https://github.com/ezyang
2024-05-12 17:50:17 +00:00
037615b989 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Template abstractions similar to those of the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with a constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix is prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since this is an initial implementation, we are still working on further performance improvements in follow-up PRs, including kernel optimizations and fusions. Perf gains are observed only on a select number of models compared to the ATen kernels, which are implemented with MKL. The gains are more obvious with dynamic shapes, since MKL only supports packed gemm for static shapes. Details are below.

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x
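
The split between cache blocking in the template and the micro-kernel described in item 2 can be illustrated with a toy pure-Python version. This is only a sketch of the loop structure: the names `micro_gemm` and `blocked_gemm` and the block sizes are assumptions for the example, and it does no weight prepacking, threading, or vectorization.

```python
from typing import List

Matrix = List[List[float]]


def micro_gemm(a: Matrix, b: Matrix, c: Matrix,
               m0: int, m1: int, n0: int, n1: int, k0: int, k1: int) -> None:
    """Accumulate the product of one A block and one B block into a C block."""
    for i in range(m0, m1):
        for j in range(n0, n1):
            acc = c[i][j]
            for k in range(k0, k1):
                acc += a[i][k] * b[k][j]
            c[i][j] = acc


def blocked_gemm(a: Matrix, b: Matrix, mc: int = 2, nc: int = 2, kc: int = 2) -> Matrix:
    """Outer loops walk cache-sized blocks; the inner work goes to micro_gemm."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, mc):
        for n0 in range(0, n, nc):
            for k0 in range(0, k, kc):
                micro_gemm(a, b, c,
                           m0, min(m0 + mc, m),
                           n0, min(n0 + nc, n),
                           k0, min(k0 + kc, k))
    return c
```

In a real implementation the outer block loops are where thread decomposition would apply, and the micro-kernel is the part that would be specialized per CPU architecture.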

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-12 07:46:44 +00:00
02093b6c6a Keep track of ViewMeta with symbolic inputs. (#125876)
Fix: #125387

This PR helps keep track of whether an instantiated `ViewMeta` has symbolic values as
input or not. This is used for checking whether we use the AOTAutograd `ViewMeta`-replay
execution path, e.g. it doesn't support tensors that have `ViewMeta` with symbolic inputs.

In summary, the changes are:

- Add the field `ViewMeta::has_symbolic_inputs` and make it a required constructor
parameter
- Add the field `FunctionalTensorWrapper::is_symbolic_` and the method
`FunctionalTensorWrapper::maybe_mark_symbolic`
    - Marks a `FunctionalTensorWrapper` as symbolic iff any of its `ViewMeta` have
    symbolic inputs
- Add the plumbing of `FunctionalTensorWrapper::is_symbolic` to the Python API
- Codegen the computation of `ViewMeta::has_symbolic_inputs` for each view operation
- Use the AOTAutograd `ViewMeta`-replay path if:
    - `target_functional_tensor` is not `None`; and
    - `target_functional_tensor` is not symbolic (instead of using a functorch config)
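
The rule in the list above can be modeled in a few lines of Python. This is a hedged sketch, not the C++ or AOTAutograd code: `ViewMetaSketch`, `FunctionalTensorSketch`, and `can_use_view_replay` are hypothetical stand-ins for the real types and decision point.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ViewMetaSketch:
    # Stand-in for ViewMeta::has_symbolic_inputs.
    has_symbolic_inputs: bool


@dataclass
class FunctionalTensorSketch:
    view_metas: List[ViewMetaSketch] = field(default_factory=list)

    @property
    def is_symbolic(self) -> bool:
        # Symbolic iff any recorded view has symbolic inputs.
        return any(meta.has_symbolic_inputs for meta in self.view_metas)


def can_use_view_replay(target: Optional[FunctionalTensorSketch]) -> bool:
    # Replay path requires a functional tensor that is not symbolic.
    return target is not None and not target.is_symbolic
```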

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125876
Approved by: https://github.com/ezyang
2024-05-12 01:41:06 +00:00
6ffc94fa62 Fix cpp node instance check (#125875)
Mostly visible when calling multi_grad_hook and thus using this to test it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125875
Approved by: https://github.com/jackiexu1992, https://github.com/ezyang
2024-05-11 21:31:12 +00:00
07d6ab5aa2 [pipelining] Add pipeline schedules (#125975)
1. Add pipeline schedules:
- GPipe
- 1F1B
- Interleaved 1F1B
- LoopedBFS

2. Add basic forward and backward tests:
test_schedule.py
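
To make the difference between the first two schedules concrete, here is a small sketch of the per-stage action order. It is not the code added by this PR: the function names are hypothetical, and the 1F1B warmup count (`n_stages - stage - 1`) is an assumption that follows the commonly described formulation.

```python
from typing import List, Tuple

Action = Tuple[str, int]  # ("F" or "B", microbatch index)


def gpipe_order(n_micro: int) -> List[Action]:
    """GPipe: run every forward microbatch, then every backward."""
    return [("F", i) for i in range(n_micro)] + [("B", i) for i in range(n_micro)]


def one_f_one_b_order(stage: int, n_stages: int, n_micro: int) -> List[Action]:
    """1F1B: warm up with forwards, alternate F and B, then drain backwards."""
    # Assumed warmup depth: earlier stages run more forwards before their
    # first backward can arrive.
    warmup = min(n_micro, n_stages - stage - 1)
    order: List[Action] = [("F", i) for i in range(warmup)]
    fwd, bwd = warmup, 0
    while fwd < n_micro:  # steady state: one forward, one backward
        order.append(("F", fwd))
        fwd += 1
        order.append(("B", bwd))
        bwd += 1
    while bwd < n_micro:  # cooldown: remaining backwards
        order.append(("B", bwd))
        bwd += 1
    return order
```

For example, `one_f_one_b_order(0, 4, 8)` starts with three warmup forwards before interleaving, while the last stage alternates from its first microbatch.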

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125975
Approved by: https://github.com/wconstab
ghstack dependencies: #125729
2024-05-11 21:17:53 +00:00
f19e07b056 Memoize local_scalar_dense calls, refactor all memos (#125623)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125623
Approved by: https://github.com/eellison
2024-05-11 21:12:35 +00:00
0935b3d794 [dynamo] Turn on guard_nn_modules (#125202)
Turning on guard_nn_modules adds a large number of guards, so we are bound to take a perf hit, but the hit is small. These are the numbers:

![image](https://github.com/pytorch/pytorch/assets/13822661/c8793906-c8c7-432b-9af4-4594713067be)

First we observe that compared to Python guards, C++ guards give around 6x speedup. This reduces the total time spent in guards. This is shown in the last column (cpp_guards/inductor_optimized_latency). The worst model is around 1.61%, with most of the models below 1%. I think this is good enough signal to turn the config on.

One might also wonder how much guard slowdown occurs with `guard_nn_modules=True`. This is the table
![image](https://github.com/pytorch/pytorch/assets/13822661/932a885b-1c03-424b-8405-5bc8fd35dd39)

For most models, the guard overhead with nn module guards is under 2x. There are a few outliers, where the slowdown is really high and for those models we spend 1%-2% time in C++ guards as shown in first table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125202
Approved by: https://github.com/ezyang
2024-05-11 19:28:24 +00:00
0dda3389e5 [AOTI][torchgen] Minor improvements to C shim torchgen (#125928)
Summary: Make some improvements to https://github.com/pytorch/pytorch/pull/125589
* Add a .default suffix to default ops in fallback_ops.py, to make it clear that those are OpOverload.
* Update warnings and comments based on feedback on https://github.com/pytorch/pytorch/pull/125589
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125928
Approved by: https://github.com/angelayi
ghstack dependencies: #125291, #125730, #125731
2024-05-11 18:12:46 +00:00
2df114e6be [AOTI] Fix 'int' object is not subscriptable (#125731)
Summary: for https://github.com/pytorch/pytorch/issues/117369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125731
Approved by: https://github.com/chenyang78
ghstack dependencies: #125291, #125730
2024-05-11 18:12:46 +00:00
cyy
3f11958d39 Remove FFMPEG from CI scripts (#125546)
Because FFMPEG was solely used by Caffe2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125546
Approved by: https://github.com/r-barnes, https://github.com/kit1980, https://github.com/albanD, https://github.com/malfet, https://github.com/seemethere
2024-05-11 16:46:13 +00:00
d49abf039a Revert "update pointwise cat heuristics (#125772)"
This reverts commit d19d932183f265f5108e6cc30f514d88060a67d7.

Reverted https://github.com/pytorch/pytorch/pull/125772 on behalf of https://github.com/izaitsevfb due to Fails numerical stability test for aps model, see D57215900 ([comment](https://github.com/pytorch/pytorch/pull/125772#issuecomment-2105932504))
2024-05-11 15:27:44 +00:00
d5470749bc Revert "[dynamo][disable] Move disable impl to its own __call__ method (#125486)"
This reverts commit d474d79420dbb0c0ba7e203d63d953afcbb595a4.

Reverted https://github.com/pytorch/pytorch/pull/125486 on behalf of https://github.com/izaitsevfb due to Fails internal tests, see D57216402 ([comment](https://github.com/pytorch/pytorch/pull/125486#issuecomment-2105925702))
2024-05-11 15:01:58 +00:00
a174c536f8 GPT-fast benchmark: adding memory bandwidth and use A100-40GB as target (#125881)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125881
Approved by: https://github.com/Chillee
2024-05-11 10:46:54 +00:00
b24ad7eab5 Enable dynamo traced test_param_group_with_lrscheduler_goes_right_direction (#124544)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124544
Approved by: https://github.com/janeyx99
ghstack dependencies: #125825, #125826
2024-05-11 06:29:59 +00:00
e72ef4f22a Fix capturable enablement conditions (#125826)
Only enable capturable if state hasn't been initialized and all parameters are on CUDA.
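
The condition can be written as a small predicate. This is a sketch under assumptions: `ParamSketch` and `should_enable_capturable` are hypothetical names, and the device is modeled as a plain string rather than a `torch.device`.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ParamSketch:
    device: str  # e.g. "cuda:0" or "cpu"; stands in for a tensor's device


def should_enable_capturable(params: List[ParamSketch], state: Dict) -> bool:
    # Both conditions from the commit message must hold.
    state_uninitialized = len(state) == 0
    all_on_cuda = all(p.device.startswith("cuda") for p in params)
    return state_uninitialized and all_on_cuda
```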

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125826
Approved by: https://github.com/anijain2305
ghstack dependencies: #125825
2024-05-11 06:29:59 +00:00
b833fc0ecb Tighten fallback conditions for compiled optim (#125825)
Since we now will support `capturable=False` when it's valid, narrow the eager fallback conditions to the cases where `compile` will fail. The lone case here is when the user deletes the capturable flag; `state_steps` are on cuda and `capturable` is `False`. Because a cuda tensor is not supported in the `value` kwarg for foreach ops this results in an error.

The fallback wrapper is changed to check the device of `state_steps` if `capturable=False`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125825
Approved by: https://github.com/janeyx99
2024-05-11 06:29:51 +00:00
1115a25c36 Add obc counter for TS migration. (#125986)
Summary: Since table caffe2_pytorch_usage_stats only has 1 day retention which renders it useless for TS migration purposes, we want to build a lightweight counter mechanism to collect usage data about torch jit APIs which can monitor the usage decline in the long term.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D57216847

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125986
Approved by: https://github.com/gmagogsfm
2024-05-11 05:14:02 +00:00
7e92a2c1c9 Revert "Allow symbols to reach conv_layout stride argument (#125829)"
This reverts commit 013722bcb89b9f450d03ce2bd3ed81db6a89d97d.

Reverted https://github.com/pytorch/pytorch/pull/125829 on behalf of https://github.com/malfet due to Broke inductor tests, see https://github.com/pytorch/pytorch/actions/runs/9028121462/job/24809113861 ([comment](https://github.com/pytorch/pytorch/pull/125829#issuecomment-2105545503))
2024-05-11 04:43:36 +00:00
e9c5f1cb80 [MPS] Improve _int4pack_mm (#125983)
By dispatching it as a 2D kernel, which improves data locality and bumps perf from 5.9 to 6.6 tokens per sec on M2 Pro

And other minor cleanups
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125983
Approved by: https://github.com/mikekgfb
2024-05-11 04:40:40 +00:00
9f4bb4d6bc Enable UFMT format on test/test_throughput_benchmark.py test/test_type_hints.py test/test_type_info.py (#125906)
Fixes some files in https://github.com/pytorch/pytorch/issues/123062

Run lintrunner on files:
test/test_throughput_benchmark.py
test/test_type_hints.py
test/test_type_info.py

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125906
Approved by: https://github.com/shink, https://github.com/soulitzer, https://github.com/malfet
2024-05-11 04:32:01 +00:00
9dee3ef919 Ingest gpt-fast benchmark results from S3 to Rockset (#125891)
A follow-up of https://github.com/pytorch/pytorch/pull/125450, this extends the `tools/stats/upload_dynamo_perf_stats.py` script to upload arbitrary benchmark results in CSV format.

* Upload gpt-fast benchmarks to a new Rockset collection `benchmarks/oss_ci_benchmark`.  The file is in the following format:
```
$ cat test/test-reports/gpt_fast_benchmark.csv
name,mode,target,actual,percentage
Llama-2-7b-chat-hf,bfloat16,104,104.754128,100.73%
```
* The CSV output needs to be kept in `test/test-reports` directory.
* Re-use the existing `.github/workflows/upload-test-stats.yml` workflow
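
As an illustration of the CSV format shown above, the sketch below parses it into typed records. It is not the upload script itself: `parse_benchmark_csv` is a hypothetical helper, and converting `percentage` to a float is a choice made for the example.

```python
import csv
import io
from typing import Dict, List, Union

SAMPLE = """name,mode,target,actual,percentage
Llama-2-7b-chat-hf,bfloat16,104,104.754128,100.73%
"""


def parse_benchmark_csv(text: str) -> List[Dict[str, Union[str, float]]]:
    """Turn the benchmark CSV into one dict per row with numeric fields."""
    rows: List[Dict[str, Union[str, float]]] = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "name": row["name"],
            "mode": row["mode"],
            "target": float(row["target"]),
            "actual": float(row["actual"]),
            # Strip the trailing percent sign before converting.
            "percentage": float(row["percentage"].rstrip("%")),
        })
    return rows
```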

### Testing

Run the commands manually

```
(py3.11) huydo@huydo-mbp pytorch % python3 -m tools.stats.upload_artifacts --workflow-run-id 9026179545 --workflow-run-attempt 1 --repo "pytorch/pytorch"
Using temporary directory: /var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmp6eug3cdz
Downloading test-jsons-runattempt1-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmp6eug3cdz/test-jsons-runattempt1-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip to s3://gha-artifacts/pytorch/pytorch/9026179545/1/artifact/test-jsons-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip
Downloading test-reports-runattempt1-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmp6eug3cdz/test-reports-runattempt1-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip to s3://gha-artifacts/pytorch/pytorch/9026179545/1/artifact/test-reports-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip

(py3.11) huydo@huydo-mbp pytorch % python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id 9026179545 --workflow-run-attempt 1 --repo "pytorch/pytorch" --head-branch "ciflow/inductor-micro-benchmark/125891" --rockset-collection oss_ci_benchmark --rockset-workspace benchmarks --match-filename "^gpt_fast_benchmark"
Using temporary directory: /var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmp8xr4sdxk
Downloading test-reports-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip
Extracting test-reports-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip to unzipped-test-reports-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212
Processing gpt_fast_benchmark from test-reports-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip
Writing 3 documents to Rockset
Done!
```

Also run a sanity check on ingesting inductor benchmark results:

```
(py3.11) huydo@huydo-mbp pytorch % python -m tools.stats.upload_dynamo_perf_stats --workflow-run-id 8997654356 --workflow-run-attempt 1 --repo pytorch/pytorch --head-branch main --rockset-collection torch_dynamo_perf_stats --rockset-workspace inductor --match-filename "^inductor_"
...
Writing 4904 documents to Rockset
Done!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125891
Approved by: https://github.com/yanboliang
2024-05-11 04:16:36 +00:00
Bin
c1690a3e12 Fix the link to torch.compiler_custom_backends. (#125865)
Tiny fix. Fixes #119272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125865
Approved by: https://github.com/soulitzer
2024-05-11 04:13:44 +00:00
0a9c6e92f8 Skip test_memory_format_nn_BatchNorm2d in inductor (#125970)
Skipping the test in the context of https://github.com/pytorch/pytorch/issues/125967 until the issue is root caused and fixed properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125970
Approved by: https://github.com/clee2000
2024-05-11 04:11:18 +00:00
da7ced6e8c S390x binaries (#120398)
Allow building nightly, rc and release binaries for s390x.

This PR implements building binaries, but publishing part is currently missing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120398
Approved by: https://github.com/huydhn
2024-05-11 02:32:25 +00:00
d8708a35f6 [pipelining] Add _PipelineStage runtime (#125729)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125729
Approved by: https://github.com/wconstab
2024-05-11 01:59:18 +00:00
c6e5d0d2e6 Revert "Memoize local_scalar_dense calls, refactor all memos (#125623)"
This reverts commit fcbf2b61e6f40048ef0e6d77360c86771956cc9c.

Reverted https://github.com/pytorch/pytorch/pull/125623 on behalf of https://github.com/malfet due to Broke ROCM, see https://github.com/pytorch/pytorch/actions/runs/9026074378/job/24804583041 ([comment](https://github.com/pytorch/pytorch/pull/125623#issuecomment-2105444091))
2024-05-11 01:58:39 +00:00
01fb9676b8 Enable UFMT format on test/license.py test/logging.py (#125737)
Fixes some files in #123062

Run lintrunner on files:
test/license.py test/logging.py

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125737
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-05-11 01:52:35 +00:00
a5c93a6899 Speed up _extract_graph_with_inputs_outputs (#125937)
_extract_graph_with_inputs_outputs() does membership testing on the input nodes but often that collection is a list so the test is O(n).  Ensure it's a set before looping over all the nodes.

This change speeds up the internal repro (D57090987) by about 18%:
before:
```
708.88user 15.86system 12:16.19elapsed 98%CPU (0avgtext+0avgdata 12898628maxresident)k
0inputs+91968outputs (3major+3532970minor)pagefaults 0swaps
```
after:
```
583.39user 15.98system 10:10.11elapsed 98%CPU (0avgtext+0avgdata 12895108maxresident)k
0inputs+87488outputs (4major+3374582minor)pagefaults 0swaps
```
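
The underlying pattern, converting the collection to a set once before a loop of membership tests, can be sketched like this. The function name `keep_non_inputs` and the use of strings for nodes are assumptions for the example, not the partitioner code.

```python
from typing import Iterable, List


def keep_non_inputs(all_nodes: Iterable[str], input_nodes: Iterable[str]) -> List[str]:
    """Return the nodes that are not inputs, preserving order."""
    # Build the set once: each `in` check is then O(1) on average,
    # instead of O(n) when input_nodes is a list.
    input_set = set(input_nodes)
    return [node for node in all_nodes if node not in input_set]
```

With n graph nodes and m inputs, this turns an O(n * m) scan into roughly O(n + m).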

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125937
Approved by: https://github.com/oulgen, https://github.com/anijain2305
2024-05-11 00:20:39 +00:00
cyy
4457cd9a30 [Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987)
This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following #124701.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124987
Approved by: https://github.com/malfet
2024-05-11 00:03:52 +00:00
31946c10d0 Add missing parameter doc of Adagrad (#125886)
Add the missing documentation for `initial_accumulator_value` parameter in Adagrad, and update the algorithm description in the documentation (adjusted to reflect the implementation).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125886
Approved by: https://github.com/janeyx99
2024-05-10 22:55:22 +00:00
ee804d256b Revert "[caffe2] Make all get_backtrace() implementations lazy (#125750)"
This reverts commit cc4da72b47ef63b7c448f0de4cdbdd792e9195ea.

Reverted https://github.com/pytorch/pytorch/pull/125750 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/125750#issuecomment-2105285301))
2024-05-10 21:23:10 +00:00
cyy
45628e3b66 Remove Caffe2 python (#125143)
This PR tries to decompose https://github.com/pytorch/pytorch/pull/122527 into a smaller one. Caffe2 python build scripts were removed and some tensorboard code using Caffe2 was removed too.
Note that this was inspired by, and co-developed with, @r-barnes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125143
Approved by: https://github.com/r-barnes, https://github.com/albanD
2024-05-10 21:15:43 +00:00
b08072f645 [CI] Relax per proc memory by a little bit, mark a test as serial (#125960)
test failure is here https://github.com/pytorch/pytorch/actions/runs/9036789873/job/24836020415

* OOMs etc rel to https://github.com/pytorch/pytorch/pull/125598
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125960
Approved by: https://github.com/huydhn
2024-05-10 21:11:39 +00:00
c61bfd24c1 [PT2] Register fake impl for quantized embedding bag ops (#125884)
Summary: Register fake impl for quantized embedding bag ops (e.g. quantized::embedding_bag_4bit_rowwise_offsets) and bypass registration if it has been registered.

Test Plan:
Before:
```
NotImplementedError: quantized::embedding_bag_4bit_rowwise_offsets: attempted to run this operator with Meta tensors, but there was no fake impl or Meta kernel registered
```
See context here -
https://fb.workplace.com/groups/1075192433118967/permalink/1423106614994212/

After:
Snapshot was published successfully with PT2Archive.
```
AIMP_DISABLE_PRUNING=false  fdb buck2 run mode/opt-split-dwarf -c python.package_style=inplace -c fbcode.enable_gpu_sections=true  lego/scripts:lego_cli -- debug-locally --model_entity_id 545861329  --config_version 14 --publish_context OFFLINE_PUBLISH    --lego_pipeline aiplatform.modelstore.model_generation.lego.lego_pipeline_builder.gmpp_lego_pipeline --gmpp_config '{"gmpp_pipeline_descriptor": "aiplatform.modelstore.model_generation.v1.ads_pipelines.aimp_pyper_pipeline.model_generation_pipeline", "worker_process_number":24, "worker_thread_per_process_number": 12, "use_work_assignment": true}' --publish_config_overrides '{"gpu_inference_options": "{\"submodules_to_lower\": []}"}'  2>&1 | tee ./gmpp_lc_aimp.txt
```

Reviewed By: ydwu4

Differential Revision: D57172944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125884
Approved by: https://github.com/ydwu4
2024-05-10 21:11:22 +00:00
538877d204 [AOTI] Fix convolution_backward (#125730)
Summary: for https://github.com/pytorch/pytorch/issues/125922

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125730
Approved by: https://github.com/chenyang78
ghstack dependencies: #125291
2024-05-10 20:13:34 +00:00
aca0807101 [AOTI] Use random inputs to autotune the backward pass (#125291)
Summary: This is for JIT Inductor with cpp wrapper, fixing https://github.com/pytorch/pytorch/issues/117367.

In the backward pass, we don't have real inputs with which to run the backward and autotune kernels. We have 3 options here: 1) use random tensor inputs; 2) store the forward outputs and feed them to backward (non-trivial because of parameter re-ordering); 3) autotune each kernel with random inputs in a subprocess (similar to select_algorithm). This PR uses the easiest one, option 1. Option 3 is where we are going as the next step, which will simplify the cpp wrapper codegen for the CUDA backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125291
Approved by: https://github.com/chenyang78, https://github.com/angelayi
2024-05-10 20:13:34 +00:00
9e85d3d830 Add "accurate" FlopCounter implementations for NestedTensor SDPA kernels (#125776)
This adds implementations for:
* _flash_attention_forward
* _efficient_attention_forward
* _flash_attention_backward
* _efficient_attention_backward

These flop counts are implemented as follows:
* Unbind the batch elements
* Calculate flops individually for each element in the batch
* Sum the final result

This means that we are accessing the concrete sequence lengths (which could be slow, and may trigger a GPU/CPU sync); but, the FLOP numbers will vary with the sparsity of the NestedTensor - more accurate than if we just assumed we padded everything.
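
The "unbind, count each element, sum" steps can be sketched in plain Python. This is an illustration under stated assumptions, not the FlopCounter code: the function names are hypothetical, a matmul of shapes (m x k) and (k x n) is counted as 2 * m * n * k FLOPs, softmax is ignored, and each sample is self-attention with equal query and key/value lengths.

```python
def sdpa_forward_flops(num_heads, seq_q, seq_kv, head_dim, value_dim):
    """FLOPs for one attention forward: Q @ K^T followed by scores @ V."""
    # Assumed convention: an (m x k) @ (k x n) matmul costs 2 * m * n * k FLOPs.
    qk = 2 * num_heads * seq_q * seq_kv * head_dim
    av = 2 * num_heads * seq_q * seq_kv * value_dim
    return qk + av


def nested_flops(seq_lens, num_heads, head_dim):
    """Count each batch element with its own length, then sum."""
    return sum(
        sdpa_forward_flops(num_heads, s, s, head_dim, head_dim)
        for s in seq_lens
    )


def padded_flops(seq_lens, num_heads, head_dim):
    """Count as if every element were padded to the longest sequence."""
    longest = max(seq_lens)
    per_sample = sdpa_forward_flops(num_heads, longest, longest, head_dim, head_dim)
    return len(seq_lens) * per_sample


# Hypothetical ragged batch: the padded estimate is never smaller.
seq_lens = [16, 64, 128]
print(nested_flops(seq_lens, num_heads=8, head_dim=32))
print(padded_flops(seq_lens, num_heads=8, head_dim=32))
```

The gap between the two numbers grows with how ragged the batch is, which is why the per-element count is the more accurate one.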

Differential Revision: [D57120139](https://our.internmc.facebook.com/intern/diff/D57120139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125776
Approved by: https://github.com/Chillee
2024-05-10 19:49:37 +00:00
4dad988822 Revert "Remove vision packages from CI scripts (#125546)"
This reverts commit f42ea14c3f795082138421fcef90d24f64c6fd35.

Reverted https://github.com/pytorch/pytorch/pull/125546 on behalf of https://github.com/huydhn due to I think we are using vision in inductor tests with their various models there ([comment](https://github.com/pytorch/pytorch/pull/125546#issuecomment-2105174723))
2024-05-10 19:43:23 +00:00
0e853327cb Implement wrappers for aot_dedup and aot_synthetic_base (#125764)
It's kind of gross that aot_synthetic_base requires storing the *old* fw_metadata's InputInfo, but it is what it is. After this change, aot_dispatch_base's runtime wrappers should all be implemented. After this, I'll start working on aot_dispatch_autograd's remaining runtime wrapping changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125764
Approved by: https://github.com/bdhirsh
ghstack dependencies: #125610
2024-05-10 19:33:35 +00:00
c520929c83 add typing in torch.optim.lr_scheduler (#125556)
Merge torch/optim/lr_scheduler.pyi into torch/optim/lr_scheduler.py
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125556
Approved by: https://github.com/janeyx99
2024-05-10 19:28:00 +00:00
59f2e716cc Test foreach functions with all dtypes except qints (#125527)
Set `dtypes` and the others to all dtypes except qints, with some required xfails

Related to #124726.

Co-authored-by: janeyx99 <janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125527
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-05-10 18:56:37 +00:00
10c17b13d7 fix cudnn attention check (#122391)
CUDNN attention is not limited to the packed QKV layout, which has limited sequence length support (seq_len <= 512) and head dim requirements. It also supports a more generic "arbitrary sequence length with flash attention", as stated in `Transformer Engine`: 8e672ff075/transformer_engine/common/fused_attn/fused_attn.cpp (L126)

More about "fused flash attention" in CUDNN graph api: https://docs.nvidia.com/deeplearning/cudnn/developer/graph-api.html#fused-flash-attention-fprop

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122391
Approved by: https://github.com/eqy, https://github.com/drisspg
2024-05-10 18:52:38 +00:00
bef7d650c4 [CI] 3 procs on sm86 (#125598)
yolo
iirc the a10g/sm86 runners have ~21 GB of space, so we can increase parallelism on it to 3.  This results in about 6GB CUDA mem per proc.  The previous calculation + 2 procs resulted in about 8 GB

Also fixes the calc for per proc memory, assuming that CUDA context + anything else take a little under 1GB of space (previous calc was .11 on about 7.5 - 8 GB  <= .9GB)
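
One way to read the numbers quoted above, as a rough sketch rather than the CI script's actual formula: split the total evenly across processes, then subtract a fixed per-process overhead. The function name and the 0.9 GB overhead value are assumptions made for illustration.

```python
def per_proc_memory_gb(total_gb, num_procs, overhead_gb):
    """Memory left for each process after reserving a fixed overhead.

    Assumed model: even split of the total, minus a per-process
    allowance for the CUDA context and anything else.
    """
    return total_gb / num_procs - overhead_gb


# ~21 GB total, 3 procs, a little under 1 GB of overhead per proc.
print(round(per_proc_memory_gb(21, 3, 0.9), 1))
```

With these inputs the result lands near the "about 6GB" figure in the description.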

Times on main are about 1.9-2.5hr per shard
This commit is around 1.6-2hr per shard

Risks: increase in flaky tests due to OOM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125598
Approved by: https://github.com/huydhn
2024-05-10 18:48:43 +00:00
ff98731803 Speedup convert<float>(Vectorized<half>::loadu(ptr, 8)) on ARM (#125889)
By replacing `vdupq_n_f16(0)` with simple `std::memset`

Otherwise Apple's clang fails to dead-code eliminate that instruction, which results in a slower codepath.

I.e. the following [snippet](https://godbolt.org/z/c757TaM1Y) (that mimics vec library parts)
```cpp
#include <arm_neon.h>
#include <tuple>
#include <cstring>

struct Foo {
  Foo() = default;
  Foo(float16x8x2_t v) : values(v) {}
  operator float16x8x2_t() const { return values; }
  float16x8x2_t values;
};

struct Bar {
  Bar() = default;
  Bar(float32x4x2_t v) : values(v) {}
  Bar(float32x4_t val0, float32x4_t val1) : values{val0, val1} {}
  inline void store(float *ptr) {
    vst1q_f32(ptr, values.val[0]);
    vst1q_f32(ptr + 4, values.val[1]);
  }
  float32x4x2_t values;
};

inline Foo loadu(const void* ptr, int64_t count) {
  if (count == 16) {
    return vld1q_f16_x2(reinterpret_cast<const float16_t*>(ptr));
  } else if (count == 8) {
    Foo res;
    res.values.val[0] = vld1q_f16(reinterpret_cast<const float16_t*>(ptr));
    //res.values.val[1] = vdupq_n_f16(0);
    std::memset(&res.values.val[1], 0, sizeof(res.values.val[1]));
    return res;
  }
  float16_t tmp_values[16];
  for (auto i = 0; i < 16; ++i) {
    tmp_values[i] = 0;
  }
  std::memcpy(
    tmp_values,
    reinterpret_cast<const float16_t*>(ptr),
    count * sizeof(float16_t));
  return vld1q_f16_x2(reinterpret_cast<const float16_t*>(tmp_values));
}

inline std::tuple<Bar, Bar> convert_half_float(const Foo& a) {
  float16x8x2_t arr = a;
  float16x8_t x = arr.val[0];
  float16x8_t y = arr.val[1];
  float32x4_t x1 = vcvt_f32_f16(vget_low_f16(x));
  float32x4_t x2 = vcvt_f32_f16(vget_high_f16(x));
  float32x4_t y1 = vcvt_f32_f16(vget_low_f16(y));
  float32x4_t y2 = vcvt_f32_f16(vget_high_f16(y));
  return { Bar(x1, x2), Bar(y1, y2) };

}

inline Bar cvt(const Foo& x) {
  auto rc = convert_half_float(x);
  return std::get<0>(rc);
}

void convert(const float16_t* inp, float* outp) {
    for(auto idx = 0; idx < 1024; idx += 8) {
        auto tmp0 = loadu(inp + idx, 8);
        auto tmp1 = cvt(tmp0);
        tmp1.store(outp + idx);
    }
}

Foo load8(const float16_t* inp) {
    return loadu(inp, 8);
}
```
if compiled with `-O3 -fno-unsafe-math-optimizations` produces
```asm
convert(half const*, float*):
0000000000000000	add	x8, x1, #0x10
0000000000000004	mov	x9, #-0x8
0000000000000008	ldr	q0, [x0], #0x10                 ; Latency: 4
000000000000000c	fcvtl	v1.4s, v0.4h                    ; Latency: 2
0000000000000010	fcvtl2	v0.4s, v0.8h                    ; Latency: 2
0000000000000014	stp	q1, q0, [x8, #-0x10]            ; Latency: 4
0000000000000018	add	x9, x9, #0x8
000000000000001c	add	x8, x8, #0x20
0000000000000020	cmp	x9, #0x3f8
0000000000000024	b.lo	0x8
0000000000000028	ret
load8(half const*):
000000000000002c	ldr	q0, [x0]                        ; Latency: 4
0000000000000030	movi.2d	v1, #0000000000000000           ; Latency: 2
0000000000000034	ret
```
but with `vdupq_n_f16` the same code yielded
```asm
convert(half const*, float*):
0000000000000000	add	x8, x1, #0x10
0000000000000004	mov	x9, #-0x8
0000000000000008	ldr	q0, [x0], #0x10                 ; Latency: 4
000000000000000c	scvtf	s1, wzr                         ; Latency: 10
0000000000000010	fcvt	h1, s1                          ; Latency: 4
0000000000000014	fcvtl	v1.4s, v0.4h                    ; Latency: 2
0000000000000018	fcvtl2	v0.4s, v0.8h                    ; Latency: 2
000000000000001c	stp	q1, q0, [x8, #-0x10]            ; Latency: 4
0000000000000020	add	x9, x9, #0x8
0000000000000024	add	x8, x8, #0x20
0000000000000028	cmp	x9, #0x3f8
000000000000002c	b.lo	0x8
0000000000000030	ret
load8(half const*):
0000000000000034	scvtf	s1, wzr                         ; Latency: 10
0000000000000038	ldr	q0, [x0]                        ; Latency: 4
000000000000003c	fcvt	h1, s1                          ; Latency: 4
0000000000000040	dup.8h	v1, v1[0]                       ; Latency: 7
0000000000000044	ret
```
(see `scvtf` completely eliminated from `convert` code and replaced with faster `movi.2d` in `load8`)
Fixes https://github.com/pytorch/pytorch/issues/125735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125889
Approved by: https://github.com/desertfire
2024-05-10 18:18:30 +00:00
f25c7c9699 functionalize storage resizing, minimal ppFSDP traceable forward (#122434)
More details further down, but first a more high-level description of "how do we functionalize storage resizing"

Today, dynamo converts `param.untyped_storage().resize_(x)` calls that it sees from fsdp into a custom op, `ops.inductor.resize_storage_bytes_(x)`

So given this setup, there are 3 main cases that I think we want to handle:

(1) graph input starts with a real storage size, gets resized down to zero in the graph
(2) graph input starts with 0 storage size, gets resized up in the graph
(3) graph input starts with 0 storage size, gets resized up and used in some compute, then resized back down to 0

For case (1) we need to emit a `resize_storage_bytes_` at the end of the graph, similar to how we emit `copy_()` for data mutations.

For case (2), we need to emit a `resize_storage_bytes_` in the graph, and we **also** need to emit a `copy_()` (the input had its storage resized up, and filled in with data, which we need to reflect as an input mutation)

For case (3), the net effect is that the input had no data on entry and exit of the function, so we don't need to emit any mutable ops in the end of the graph.
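
The three cases can be summarized as a small decision function. This is a sketch only: `epilogue_ops` is a hypothetical helper, not AOTAutograd code, and it looks solely at the storage size of a resized graph input on entry and on exit.

```python
def epilogue_ops(size_on_entry, size_on_exit):
    """Return the mutable ops to emit for a graph input that was resized.

    Sizes are storage sizes in bytes at graph entry and graph exit.
    """
    if size_on_entry > 0 and size_on_exit == 0:
        # Case (1): real storage resized down to zero inside the graph.
        return ["resize_storage_bytes_"]
    if size_on_entry == 0 and size_on_exit > 0:
        # Case (2): zero storage resized up and filled in with data.
        return ["resize_storage_bytes_", "copy_"]
    if size_on_entry == 0 and size_on_exit == 0:
        # Case (3): resized up, used, resized back down; nothing to emit.
        return []
    # Assumption: a resize that neither starts nor ends at zero is rejected.
    raise NotImplementedError("resize must start or end with zero storage")


print(epilogue_ops(1024, 0))
print(epilogue_ops(0, 1024))
print(epilogue_ops(0, 0))
```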

The main thing to call out is that we need to write a functionalization rule for `resize_storage_bytes_` (`FunctionalTensorWrapper::storage_resize_()`), and this rule actually does very little. We would like to **not** emit any new ops in the graph (like say, a functional resize op). Instead, we should expect / rely on the fact that any resize up will be immediately followed by a `copy_()`/`foreach_copy_`/`out=` op that will fill in the data of the tensor. So `FunctionalTensor` can temporarily live in a state where its data is invalid, until the `x.copy_(y)` "updates" its data with the new tensor.

So effectively, all that this rule does is:

(1) it stores metadata on the storage, indicating that the tensor was resized, as well as the updated storage size. We need this info in AOTAutograd, so it knows whether to emit a mutable resize_() op in the graph epilogue

(2) There is also a corner case: if we are resizing down to zero, but our tensor had **previously** had a zero size storage, then we update `value_` to point to the original value of the tensor. The reason this seems safe is because if we have a zero storage sized tensor `x`, and we resize it up, use it in some compute, resize it back down to zero, and use it somewhere, we would want the functional version of this code to use the original `x` after the second resize. For FSDP, this is important because we end up saving parameters (graph inputs) for backward, and we want to make sure that the thing we save (and the output to the forward graph) is the original, zero-storage-sized parameter, and not the "version 2" of the parameter after the first resize_()
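
The bookkeeping in (1) and (2) can be modeled with a toy class. Everything here is a hypothetical stand-in written for illustration (the class, its fields, and its methods are not the `FunctionalStorageImpl` API), and strings stand in for tensors.

```python
class FunctionalStorageSketch:
    """Toy model of the per-storage resize metadata described above."""

    def __init__(self, base, size):
        self.original_base = base  # value of the tensor on graph entry
        self.value = base          # current functional value
        self.original_size = size
        self.curr_size = size
        self.was_resized = False

    def resize(self, new_size):
        # (1) Record that a resize happened, plus the updated size.
        self.was_resized = True
        # (2) Corner case: resizing down to zero when the storage
        # originally had zero size restores the original value.
        if new_size == 0 and self.original_size == 0:
            self.value = self.original_base
        self.curr_size = new_size

    def update_value(self, new_value):
        # Models `x.copy_(y)` filling in data after a resize up.
        self.value = new_value

    def needs_epilogue_resize(self):
        # Resizes that cancel each other out need no mutable op at the end.
        return self.was_resized and self.curr_size != self.original_size


storage = FunctionalStorageSketch("param_v0", 0)
storage.resize(64)
storage.update_value("all_gather_buffer")
storage.resize(0)
print(storage.value, storage.needs_epilogue_resize())
```

After resizing up and back down, the sketch points at the original value again and reports that no epilogue resize is needed.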

I think a good order to look at changes in this PR would be:

(1) `test_aotdispatch.py` shows the 3 main cases I focused on as well as the expected functionalized graphs

(2) In `FunctionalStorageImpl.h/cpp`, I had to add a notion of "original base", and "original/curr_size". The first is so I can re-use the zero-size tensor after multiple resizes, and the second is so I can tell in AOTAutograd whether any resizes canceled each other out into a no-op

(3) FunctionalTensorWrapper.h/cpp has the new resize functionalization rule + some extra utils

(4) `_functorch/_autograd`: the main changes in this folder were around adding the logic at trace-time to detect when we need to put a resize_() in the graph. I also have some assertions to check that any inputs that experience storage resizing will **always be in the graph** and not the opaque epilogue, and I also limited the resize_() mutation case so that you can only ever start with zero storage, or end with zero storage (you can't do e.g. `torch.ones(2).storage().resize_(3)`), and banned it on tensor subclasses

(5) `fake_tensor.py`/`meta_utils.py`: we now need to be able to fakeify tensors with zero storage, so I added a quick version of it in meta_utils.py. This also.. has ramifications for fake tensor caching that I need to fix (include the storage size on the cache key, maybe?)

------------------

This PR subsumes https://github.com/pytorch/pytorch/pull/120971.

This PR is enough to **almost** get a simple ppFSDP forward pass tracing with a functionalized resize_() properly. It also attempts to do the updated version from @jansel, where we don't have any notion of `resize_()` in the graph at all, post functionalization. It would probably be good to test it with @yf225 's FSDP changes, and see how many of the FX passes it allows us to remove. I think that in theory, it should allow us to remove all FX passes that affect the forward graph / partitioner, **except** the one that forces views to be recomputed in the backward (more details below).

There are a few things worth calling out:

(1) failed attempt at functionalizing `aten.copy_()`. I originally wanted to get a version that takes these operations:
```
param.storage().resize_(all_gather_size)
param.copy_(all_gather_buffer)
out = aten.matmul(param, param)
```
and functionalizes them into:
```
out = aten.matmul(all_gather_buffer, all_gather_buffer)
```

This would involve getting functionalization to turn `x.copy_(y)` into a giant no-op that just returns `y`. Unfortunately, we can't actually do this in a reasonable way within functionalization (instead, there's a functional `aten.copy` in the graph - see the test case graph expecttest for details). Why? In order for that transformation to be safe, `x` and `y` need to have the same metadata. However, it's possible for `x` and `y` to be subclasses of different types. This is not something we can easily tell from within functionalization, and would be a layering violation. So for now I'm leaving it to downstream code to optimize away the `aten.copy` (this is already the case today, so I think inductor can handle this)

(2) The forward doesn't **actually** run successfully in this PR (see the `assertRaisesRegex` in the test). Why?

The final forward graph looks like this:
```
def forward(self, primals_1, primals_2):
    _foreach_copy = torch.ops.aten._foreach_copy.default([primals_1], [primals_2]);  primals_2 = None
    getitem = _foreach_copy[0];  _foreach_copy = None
    mm = torch.ops.aten.mm.default(getitem, getitem);  getitem = None
    t_1 = torch.ops.aten.t.default(primals_1);  primals_1 = None
    return [mm, t_1]
```

Where `primals_1` starts out as a secretly-zero-storage-size parameter, and gets resized up and back down within the forward (these are functionalized away).

Importantly, the matmul happens on the result of the `foreach_copy`, **but** the activation that we save for backward (`t_1`) is the result of transposing the **original parameter** (the zero-storage-size param). This is exactly the optimization in fsdp that allows us to have good peak memory usage.

The problem is that the min-cut partitioner decides to save `t_1` for backward. Running this code in eager breaks, because the kernel for `aten.permute(x)` is not happy when `x` has secretly-zero-sized-storage.

The real problem here is that in eager mode the `permute` kernel runs during the backward, after backward hooks have properly resized the saved activation. Here, we are running the transpose in the forward.

One option would be to turn off the checks in our view kernels and allow them to work on zero-storage-sized tensors, which feels pretty bad. Another option is to tweak the partitioner (or use one of Will's FX passes) to force the partitioner to not save views for backward, and allow the views to be recomputed in the backward. This seems kind of silly, but is also probably harmless.

(3) The backward is still broken. To be fair, this issue is pretty separable from "functionalizing storage resize calls", and can be fixed later (either by a real fix to our tracing infra, or via another hacky FX pass). This problem is described in more detail at issue (8) of my PR description in https://github.com/pytorch/pytorch/pull/120971

(4) I only added support for "full graph" resizing: basically, the limited case where a param starts with zero storage size, and gets resized up and back down. I think we can add support for the graph break case, but I think we can keep that add-on separate from this PR unless we need it immediately. I also added asserts so we should fail loudly when we hit this case

(5) I have a change to FakeTensor creation when inputs have zero storage size that.. is probably ok. But I also removed FakeTensor caching on view ops, which I probably need to fix before I can land this PR

(6) I added a notion of "original_base" to `FunctionalStorageImpl`. More details are in the comments, but my rationale for this was that we basically need it to ensure that autograd saves the **original**, zero-storage-sized param for backward, after resizing up and back down

(7) I had to update our eager kernels for `aten.copy` and `aten._foreach_copy`, to handle the case where the `self` argument has secretly-zero-storage. Inductor can probably generate correct code for this case, but we need these ops to work properly in this situation for the `aot_eager` backend to do the right thing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122434
Approved by: https://github.com/jansel
2024-05-10 18:09:10 +00:00
cyy
f42ea14c3f Remove vision packages from CI scripts (#125546)
Because they were solely used by Caffe2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125546
Approved by: https://github.com/r-barnes, https://github.com/kit1980, https://github.com/albanD
2024-05-10 17:53:48 +00:00
d7fe3c4123 [RELAND] Switch default behavior of export IR to be predispatch (#125860)
This PR switches export IR from aot-dispatch to pre-dispatch IR.

**What is pre-dispatch IR and why should you care?**

Currently the default IR returned by torch.export can contain only functional ATen operators after ALL pytorch dispatcher decompositions (for example, CompositeImplicitAutograd) run.

In contrast, pre-dispatch IR refers to an IR that can contain all functional ATen operators (i.e., not just from the core subset), before any decomposition happens, as well as operators that manipulate autograd state. Pre-dispatch IR closely resembles eager PyTorch computation, but is still functional and serializable by torch.export. As a result:

* You can train the pre-dispatch IR in eager mode as the IR contains necessary information for the autograd engine to automatically generate a backward graph.
* You can write sound graph transformations more easily as the IR is functional.
* Since it is an ATen IR, it is still normalized. For example, torch.add has multiple overloads, but aten.add.Tensor is unique in this IR.

If you want to get the core aten IR out of torch.export, you will need to:
```
ep = torch.export.export(M(), inputs)
ep_for_core_aten = ep.run_decompositions()
```

Differential Revision: [D57172986](https://our.internmc.facebook.com/intern/diff/D57172986)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125860
Approved by: https://github.com/zhxchen17
2024-05-10 17:36:53 +00:00
4996a3fda3 [BE][Easy] Remove usage of deprecated ast.Str, ast.Ellipsis and ast.NameConstant (#125912)
`ast.Str`, `ast.Ellipsis`, and `ast.NameConstant` are deprecated in Python 3.8 and will be removed in Python 3.14. Replace them with `ast.Constant`.
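
As an illustration of the replacement pattern (a sketch, not code taken from this PR): with `ast.Constant`, the node class no longer says what kind of literal it is, so code that used to check for `ast.Str` checks the type of `node.value` instead. The helper name `string_literals` is hypothetical.

```python
import ast


def string_literals(source):
    """Collect every string literal in `source` using ast.Constant."""
    tree = ast.parse(source)
    return [
        node.value
        for node in ast.walk(tree)
        # ast.Constant covers str, bytes, numbers, None, True/False and
        # Ellipsis, so inspect the value's type rather than the node class.
        if isinstance(node, ast.Constant) and isinstance(node.value, str)
    ]


print(string_literals("x = 'a'; y = ...; z = None; w = 'b'"))
```

The same approach applies to the other deprecated classes: `node.value is Ellipsis` replaces an `ast.Ellipsis` check, and `node.value in (None, True, False)` replaces an `ast.NameConstant` check.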

Ref: https://docs.python.org/3/library/ast.html#node-classes

> **Changed in version 3.8:** Class [ast.Constant](https://docs.python.org/3/library/ast.html#ast.Constant) is now used for all constants.
>
> **Deprecated since version 3.8:** Old classes ast.Num, ast.Str, ast.Bytes, ast.NameConstant and ast.Ellipsis are still available, but they will be removed in future Python releases. In the meantime, instantiating them will return an instance of a different class.

CI log: https://github.com/metaopt/torchopt/actions/runs/9031146681/job/24816802280?pr=216#step:11:6706
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125912
Approved by: https://github.com/soulitzer
2024-05-10 17:35:35 +00:00
53a64e446f STRONG_NODISCARD -> [[nodiscard]] (#125873)
Test Plan: Sandcastle

Differential Revision: D57158864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125873
Approved by: https://github.com/Skylion007
2024-05-10 17:10:53 +00:00
4630 changed files with 95885 additions and 233318 deletions

View File

@ -0,0 +1,5 @@
0.6b
manylinux_2_17
rocm6
04b5df8c8123f90cba3ede7e971e6fbc6040d506
3db6ecbc915893ff967abd6e1b43bd5f54949868873be60dc802086c3863e648

View File

@ -84,16 +84,16 @@ fi
# CMake 3.18 is needed to support CUDA17 language variant
CMAKE_VERSION=3.18.5
_UCX_COMMIT=00bcc6bb18fc282eb160623b4c0d300147f579af
_UCC_COMMIT=7cb07a76ccedad7e56ceb136b865eb9319c258ea
_UCX_COMMIT=7bb2722ff2187a0cad557ae4a6afa090569f83fb
_UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b
# It's annoying to rename jobs every time you want to rewrite a
# configuration, so we hardcode everything here rather than do it
# from scratch
case "$image" in
pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9)
CUDA_VERSION=12.1.1
CUDNN_VERSION=8
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.0
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
@ -105,9 +105,23 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks)
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)
CUDA_VERSION=12.1.1
CUDNN_VERSION=8
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.0
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
@ -120,9 +134,54 @@ case "$image" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda11.8-cudnn8-py3-gcc9)
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.1.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.1-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.1.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.0
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9)
CUDA_VERSION=11.8.0
CUDNN_VERSION=8
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
@ -134,9 +193,37 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9)
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.0
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)
CUDA_VERSION=12.1.1
CUDNN_VERSION=8
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.0
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
@ -226,7 +313,7 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
BASEKIT_VERSION=2024.0.0-49522
XPU_VERSION=0.5
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
@ -243,10 +330,10 @@ case "$image" in
DOCS=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda11.8-cudnn8-py3.8-clang12)
pytorch-linux-jammy-cuda11.8-cudnn9-py3.8-clang12)
ANACONDA_PYTHON_VERSION=3.8
CUDA_VERSION=11.8
CUDNN_VERSION=8
CUDNN_VERSION=9
CLANG_VERSION=12
PROTOBUF=yes
DB=yes
@ -293,7 +380,7 @@ case "$image" in
ANACONDA_PYTHON_VERSION=3.9
CONDA_CMAKE=yes
;;
pytorch-linux-jammy-cuda11.8-cudnn8-py3.9-linter)
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter)
ANACONDA_PYTHON_VERSION=3.9
CUDA_VERSION=11.8
CONDA_CMAKE=yes
@ -360,7 +447,7 @@ tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
#when using cudnn version 8 install it separately from cuda
if [[ "$image" == *cuda* && ${OS} == "ubuntu" ]]; then
IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-cudnn${CUDNN_VERSION}-devel-ubuntu${UBUNTU_VERSION}"
if [[ ${CUDNN_VERSION} == 8 ]]; then
if [[ ${CUDNN_VERSION} == 9 ]]; then
IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}"
fi
fi
@ -403,7 +490,7 @@ docker build \
--build-arg "DOCS=${DOCS}" \
--build-arg "INDUCTOR_BENCHMARKS=${INDUCTOR_BENCHMARKS}" \
--build-arg "EXECUTORCH=${EXECUTORCH}" \
--build-arg "BASEKIT_VERSION=${BASEKIT_VERSION}" \
--build-arg "XPU_VERSION=${XPU_VERSION}" \
--build-arg "ACL=${ACL:-}" \
--build-arg "SKIP_SCCACHE_INSTALL=${SKIP_SCCACHE_INSTALL:-}" \
--build-arg "SKIP_LLVM_SRC_BUILD_INSTALL=${SKIP_LLVM_SRC_BUILD_INSTALL:-}" \
@ -412,7 +499,7 @@ docker build \
"$@" \
.
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn8-devel-ubuntu18.04-rc`,
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn9-devel-ubuntu18.04-rc`,
# for this case we will set UBUNTU_VERSION to `18.04-rc` so that the Dockerfile could
# find the correct image. As a result, here we have to replace the
# "$UBUNTU_VERSION" == "18.04-rc"

View File

@ -62,7 +62,7 @@ RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV and ffmpeg
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
@ -77,6 +77,9 @@ RUN rm install_rocm.sh
COPY ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh
RUN rm install_rocm_magma.sh
COPY ./common/install_amdsmi.sh install_amdsmi.sh
RUN bash ./install_amdsmi.sh
RUN rm install_amdsmi.sh
ENV PATH /opt/rocm/bin:$PATH
ENV PATH /opt/rocm/hcc/bin:$PATH
ENV PATH /opt/rocm/hip/bin:$PATH
@ -110,6 +113,13 @@ COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt
# Install AOTriton (Early fail)
COPY ./aotriton_version.txt aotriton_version.txt
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN ["/bin/bash", "-c", "./install_aotriton.sh /opt/rocm && rm -rf install_aotriton.sh aotriton_version.txt common_utils.sh"]
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH

View File

@ -1 +1 @@
bbe6246e37d8aa791c67daaf9d9d61b26c9ccfdc
01cbe5045a6898c9a925f01435c8277b2fe6afcc

View File

@ -1,6 +1,6 @@
set -euo pipefail
readonly version=v23.08
readonly version=v24.04
readonly src_host=https://review.mlplatform.org/ml
readonly src_repo=ComputeLibrary

View File

@ -0,0 +1,5 @@
#!/bin/bash
set -ex
cd /opt/rocm/share/amd_smi && pip install .

View File

@ -0,0 +1,23 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
TARBALL='aotriton.tar.bz2'
# This read command always returns with exit code 1
read -d "\n" VER MANYLINUX ROCMBASE PINNED_COMMIT SHA256 < aotriton_version.txt || true
ARCH=$(uname -m)
AOTRITON_INSTALL_PREFIX="$1"
AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}.tar.bz2"
cd "${AOTRITON_INSTALL_PREFIX}"
# Must use -L to follow redirects
curl -L --retry 3 -o "${TARBALL}" "${AOTRITON_URL}"
ACTUAL_SHA256=$(sha256sum "${TARBALL}" | cut -d " " -f 1)
if [ "${SHA256}" != "${ACTUAL_SHA256}" ]; then
echo -n "Error: The SHA256 of downloaded tarball is ${ACTUAL_SHA256},"
echo " which does not match the expected value ${SHA256}."
exit 1
fi
tar xf "${TARBALL}" && rm -rf "${TARBALL}"
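
The checksum comparison in the script above can be sketched as a small reusable function. This is a minimal illustration, not the script's actual code: the name `verify_sha256` is hypothetical, and it assumes `sha256sum` from GNU coreutils is on the PATH. It returns a non-zero status on mismatch so that the calling build step fails early.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): compare a file's
# SHA256 digest against an expected value.
verify_sha256() {
  local file="$1" expected="$2" actual
  # sha256sum prints "<digest>  <filename>"; keep only the digest.
  actual="$(sha256sum "${file}" | cut -d ' ' -f 1)"
  if [ "${expected}" != "${actual}" ]; then
    echo "Error: SHA256 of ${file} is ${actual}, expected ${expected}." >&2
    return 1
  fi
}

# Example usage (file name and digest variable are placeholders):
#   verify_sha256 aotriton.tar.bz2 "${SHA256}" || exit 1
```

Returning a status instead of exiting inside the function leaves the decision to abort with the caller.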


@ -3,7 +3,7 @@
set -ex
install_ubuntu() {
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn8-devel-ubuntu18.04-rc`,
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn9-devel-ubuntu18.04-rc`,
# for this case we will set UBUNTU_VERSION to `18.04-rc` so that the Dockerfile could
# find the correct image. As a result, here we have to check for
# "$UBUNTU_VERSION" == "18.04"*


@ -1,20 +1,18 @@
#!/bin/bash
if [[ ${CUDNN_VERSION} == 8 ]]; then
if [[ -n "${CUDNN_VERSION}" ]]; then
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn
pushd tmp_cudnn
if [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.9.2.26_cuda12-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.7.0.84_cuda11-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.7.0/local_installers/11.8/${CUDNN_NAME}.tar.xz
if [[ ${CUDA_VERSION:0:2} == "12" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda12-archive"
elif [[ ${CUDA_VERSION:0:2} == "11" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda11-archive"
else
echo "Unsupported CUDA version ${CUDA_VERSION}"
exit 1
fi
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz
tar xf ${CUDNN_NAME}.tar.xz
cp -a ${CUDNN_NAME}/include/* /usr/local/cuda/include/
cp -a ${CUDNN_NAME}/lib/* /usr/local/cuda/lib64/
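
The version dispatch in this hunk picks the cuDNN archive from the CUDA major version. A minimal sketch of the same selection as a function follows, assuming the 9.1.0.70 archive names shown in the hunk. The function name `cudnn_archive_name` is hypothetical, and a `case` pattern stands in for the `${CUDA_VERSION:0:2}` substring test.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): map a CUDA version
# string such as "12.4" to the cuDNN archive name used in the hunk above.
cudnn_archive_name() {
  local cuda_version="$1"
  case "${cuda_version}" in
    12.*) echo "cudnn-linux-x86_64-9.1.0.70_cuda12-archive" ;;
    11.*) echo "cudnn-linux-x86_64-9.1.0.70_cuda11-archive" ;;
    *)
      # Report unsupported versions on stderr and signal failure.
      echo "Unsupported CUDA version ${cuda_version}" >&2
      return 1
      ;;
  esac
}

# Example usage:
#   CUDNN_NAME="$(cudnn_archive_name "${CUDA_VERSION}")" || exit 1
```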


@ -5,9 +5,14 @@ set -ex
# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && cd tmp_cusparselt
if [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.5.2.1-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[1-4]$ ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ "${TARGETARCH}" = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.5.2.1-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then
CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.4.0.7-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz


@ -30,10 +30,10 @@ pip_install \
pip_install coloredlogs packaging
pip_install onnxruntime==1.17.0
pip_install onnx==1.15.0
pip_install onnxruntime==1.18
pip_install onnx==1.16.0
# pip_install "onnxscript@git+https://github.com/microsoft/onnxscript@3e869ef8ccf19b5ebd21c10d3e9c267c9a9fa729" --no-deps
pip_install onnxscript==0.1.0.dev20240315 --no-deps
pip_install onnxscript==0.1.0.dev20240523 --no-deps
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/


@ -6,9 +6,6 @@ ver() {
printf "%3d%03d%03d%03d" $(echo "$1" | tr '.' ' ');
}
# Map ROCm version to AMDGPU version
declare -A AMDGPU_VERSIONS=( ["5.0"]="21.50" ["5.1.1"]="22.10.1" ["5.2"]="22.20" )
install_ubuntu() {
apt-get update
if [[ $UBUNTU_VERSION == 18.04 ]]; then
@ -26,31 +23,14 @@ install_ubuntu() {
apt-get install -y libc++1
apt-get install -y libc++abi1
if [[ $(ver $ROCM_VERSION) -ge $(ver 4.5) ]]; then
# Add amdgpu repository
UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'`
local amdgpu_baseurl
if [[ $(ver $ROCM_VERSION) -ge $(ver 5.3) ]]; then
amdgpu_baseurl="https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu"
else
amdgpu_baseurl="https://repo.radeon.com/amdgpu/${AMDGPU_VERSIONS[$ROCM_VERSION]}/ubuntu"
fi
echo "deb [arch=amd64] ${amdgpu_baseurl} ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list
fi
ROCM_REPO="ubuntu"
if [[ $(ver $ROCM_VERSION) -lt $(ver 4.2) ]]; then
ROCM_REPO="xenial"
fi
if [[ $(ver $ROCM_VERSION) -ge $(ver 5.3) ]]; then
ROCM_REPO="${UBUNTU_VERSION_NAME}"
fi
# Add amdgpu repository
UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'`
echo "deb [arch=amd64] https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list
# Add rocm repository
wget -qO - http://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -
local rocm_baseurl="http://repo.radeon.com/rocm/apt/${ROCM_VERSION}"
echo "deb [arch=amd64] ${rocm_baseurl} ${ROCM_REPO} main" > /etc/apt/sources.list.d/rocm.list
echo "deb [arch=amd64] ${rocm_baseurl} ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/rocm.list
apt-get update --allow-insecure-repositories
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \
@ -59,7 +39,8 @@ install_ubuntu() {
rocm-libs \
rccl \
rocprofiler-dev \
roctracer-dev
roctracer-dev \
amd-smi-lib
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.1) ]]; then
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated rocm-llvm-dev
@ -68,29 +49,18 @@ install_ubuntu() {
# precompiled miopen kernels added in ROCm 3.5, renamed in ROCm 5.5
# search for all unversioned packages
# if search fails it will abort this script; use true to avoid case where search fails
if [[ $(ver $ROCM_VERSION) -ge $(ver 5.5) ]]; then
MIOPENHIPGFX=$(apt-cache search --names-only miopen-hip-gfx | awk '{print $1}' | grep -F -v . || true)
if [[ "x${MIOPENHIPGFX}" = x ]]; then
echo "miopen-hip-gfx package not available" && exit 1
else
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated ${MIOPENHIPGFX}
fi
MIOPENHIPGFX=$(apt-cache search --names-only miopen-hip-gfx | awk '{print $1}' | grep -F -v . || true)
if [[ "x${MIOPENHIPGFX}" = x ]]; then
echo "miopen-hip-gfx package not available" && exit 1
else
MIOPENKERNELS=$(apt-cache search --names-only miopenkernels | awk '{print $1}' | grep -F -v . || true)
if [[ "x${MIOPENKERNELS}" = x ]]; then
echo "miopenkernels package not available" && exit 1
else
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated ${MIOPENKERNELS}
fi
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated ${MIOPENHIPGFX}
fi
# ROCm 6.0 had a regression where journal_mode was enabled on the kdb files resulting in permission errors at runtime
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.0) ]]; then
for kdb in /opt/rocm/share/miopen/db/*.kdb
do
sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"
done
fi
for kdb in /opt/rocm/share/miopen/db/*.kdb
do
sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"
done
# Cleanup
apt-get autoclean && apt-get clean
@ -107,25 +77,19 @@ install_centos() {
yum install -y epel-release
yum install -y dkms kernel-headers-`uname -r` kernel-devel-`uname -r`
if [[ $(ver $ROCM_VERSION) -ge $(ver 4.5) ]]; then
# Add amdgpu repository
local amdgpu_baseurl
if [[ $OS_VERSION == 9 ]]; then
amdgpu_baseurl="https://repo.radeon.com/amdgpu/${AMDGPU_VERSIONS[$ROCM_VERSION]}/rhel/9.0/main/x86_64"
else
if [[ $(ver $ROCM_VERSION) -ge $(ver 5.3) ]]; then
amdgpu_baseurl="https://repo.radeon.com/amdgpu/${ROCM_VERSION}/rhel/7.9/main/x86_64"
else
amdgpu_baseurl="https://repo.radeon.com/amdgpu/${AMDGPU_VERSIONS[$ROCM_VERSION]}/rhel/7.9/main/x86_64"
fi
fi
echo "[AMDGPU]" > /etc/yum.repos.d/amdgpu.repo
echo "name=AMDGPU" >> /etc/yum.repos.d/amdgpu.repo
echo "baseurl=${amdgpu_baseurl}" >> /etc/yum.repos.d/amdgpu.repo
echo "enabled=1" >> /etc/yum.repos.d/amdgpu.repo
echo "gpgcheck=1" >> /etc/yum.repos.d/amdgpu.repo
echo "gpgkey=http://repo.radeon.com/rocm/rocm.gpg.key" >> /etc/yum.repos.d/amdgpu.repo
# Add amdgpu repository
local amdgpu_baseurl
if [[ $OS_VERSION == 9 ]]; then
amdgpu_baseurl="https://repo.radeon.com/amdgpu/${ROCM_VERSION}/rhel/9.0/main/x86_64"
else
amdgpu_baseurl="https://repo.radeon.com/amdgpu/${ROCM_VERSION}/rhel/7.9/main/x86_64"
fi
echo "[AMDGPU]" > /etc/yum.repos.d/amdgpu.repo
echo "name=AMDGPU" >> /etc/yum.repos.d/amdgpu.repo
echo "baseurl=${amdgpu_baseurl}" >> /etc/yum.repos.d/amdgpu.repo
echo "enabled=1" >> /etc/yum.repos.d/amdgpu.repo
echo "gpgcheck=1" >> /etc/yum.repos.d/amdgpu.repo
echo "gpgkey=http://repo.radeon.com/rocm/rocm.gpg.key" >> /etc/yum.repos.d/amdgpu.repo
local rocm_baseurl="http://repo.radeon.com/rocm/yum/${ROCM_VERSION}"
echo "[ROCm]" > /etc/yum.repos.d/rocm.repo
@ -143,33 +107,23 @@ install_centos() {
rocm-libs \
rccl \
rocprofiler-dev \
roctracer-dev
roctracer-dev \
amd-smi-lib
# precompiled miopen kernels; search for all unversioned packages
# if search fails it will abort this script; use true to avoid case where search fails
if [[ $(ver $ROCM_VERSION) -ge $(ver 5.5) ]]; then
MIOPENHIPGFX=$(yum -q search miopen-hip-gfx | grep miopen-hip-gfx | awk '{print $1}'| grep -F kdb. || true)
if [[ "x${MIOPENHIPGFX}" = x ]]; then
echo "miopen-hip-gfx package not available" && exit 1
else
yum install -y ${MIOPENHIPGFX}
fi
MIOPENHIPGFX=$(yum -q search miopen-hip-gfx | grep miopen-hip-gfx | awk '{print $1}'| grep -F kdb. || true)
if [[ "x${MIOPENHIPGFX}" = x ]]; then
echo "miopen-hip-gfx package not available" && exit 1
else
MIOPENKERNELS=$(yum -q search miopenkernels | grep miopenkernels- | awk '{print $1}'| grep -F kdb. || true)
if [[ "x${MIOPENKERNELS}" = x ]]; then
echo "miopenkernels package not available" && exit 1
else
yum install -y ${MIOPENKERNELS}
fi
yum install -y ${MIOPENHIPGFX}
fi
# ROCm 6.0 had a regression where journal_mode was enabled on the kdb files resulting in permission errors at runtime
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.0) ]]; then
for kdb in /opt/rocm/share/miopen/db/*.kdb
do
sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"
done
fi
for kdb in /opt/rocm/share/miopen/db/*.kdb
do
sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"
done
# Cleanup
yum clean all
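
The `ver` helper at the top of this script turns a dotted version into a fixed-width number so that `-ge` and `-lt` compare versions numerically. The sketch below copies that one-liner and shows why it matters for versions such as 5.10 versus 5.9. The comparison at the end is illustrative and not taken from the script.

```shell
#!/bin/bash
# Same approach as the script's ver(): each dot-separated component is
# zero-padded to three digits; missing components are filled with 0.
ver() {
  # shellcheck disable=SC2046  # word splitting of the components is intended
  printf "%3d%03d%03d%03d" $(echo "$1" | tr '.' ' ')
}

# Numeric comparison treats 5.10 as newer than 5.9,
# whereas a plain string comparison would order them the other way.
if [ "$(ver 5.10)" -gt "$(ver 5.9)" ]; then
  echo "5.10 is newer than 5.9"
fi
```

The scheme assumes at most four components, each below 1000, which holds for the ROCm versions referenced in the script.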


@ -15,7 +15,7 @@ conda_reinstall() {
if [ -n "${ROCM_VERSION}" ]; then
TRITON_REPO="https://github.com/openai/triton"
TRITON_TEXT_FILE="triton-rocm"
elif [ -n "${BASEKIT_VERSION}" ]; then
elif [ -n "${XPU_VERSION}" ]; then
TRITON_REPO="https://github.com/intel/intel-xpu-backend-for-triton"
TRITON_TEXT_FILE="triton-xpu"
else


@ -5,8 +5,7 @@ set -ex
install_ubuntu() {
apt-get update
apt-get install -y --no-install-recommends \
libopencv-dev \
libavcodec-dev
libopencv-dev
# Cleanup
apt-get autoclean && apt-get clean
@ -19,8 +18,7 @@ install_centos() {
yum --enablerepo=extras install -y epel-release
yum install -y \
opencv-devel \
ffmpeg-devel
opencv-devel
# Cleanup
yum clean all


@ -3,10 +3,7 @@ set -xe
# Intel® software for general purpose GPU capabilities.
# Refer to https://dgpu-docs.intel.com/releases/LTS_803.29_20240131.html
# Intel® oneAPI Base Toolkit (version 2024.0.0) has been updated to include functional and security updates.
# Refer to https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html
# Refer to https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html
# Users should update to the latest version as it becomes available
@ -17,14 +14,16 @@ function install_ubuntu() {
# Set up the repository. To do this, download the key to the system keyring
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key \
| gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
wget -qO - https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor --output /usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg
# Add the signed entry to APT sources and configure the APT client to use the Intel repository
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified" \
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] \
https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified" \
| tee /etc/apt/sources.list.d/intel-gpu-jammy.list
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" \
| tee /etc/apt/sources.list.d/oneAPI.list
echo "deb [signed-by=/usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg] \
https://apt.repos.intel.com/intel-for-pytorch-gpu-dev all main" \
| tee /etc/apt/sources.list.d/intel-for-pytorch-gpu-dev.list
# Update the packages list and repository index
apt-get update
@ -40,11 +39,11 @@ function install_ubuntu() {
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo
# Development Packages
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
# Install Intel® oneAPI Base Toolkit
if [ -n "$BASEKIT_VERSION" ]; then
apt-get install intel-basekit=$BASEKIT_VERSION -y
# Install Intel Support Packages
if [ -n "$XPU_VERSION" ]; then
apt-get install -y intel-for-pytorch-gpu-dev-${XPU_VERSION}
else
apt-get install intel-basekit -y
apt-get install -y intel-for-pytorch-gpu-dev
fi
# Cleanup


@ -56,7 +56,7 @@ RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV and ffmpeg
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
@ -139,7 +139,7 @@ COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
ARG CUDNN_VERSION
ARG CUDA_VERSION
COPY ./common/install_cudnn.sh install_cudnn.sh
RUN if [ "${CUDNN_VERSION}" -eq 8 ]; then bash install_cudnn.sh; fi
RUN if [ -n "${CUDNN_VERSION}" ]; then bash install_cudnn.sh; fi
RUN rm install_cudnn.sh
# Install CUSPARSELT
@ -152,6 +152,7 @@ RUN rm install_cusparselt.sh
RUN if [ -h /usr/local/cuda-11.6/cuda-11.6 ]; then rm /usr/local/cuda-11.6/cuda-11.6; fi
RUN if [ -h /usr/local/cuda-11.7/cuda-11.7 ]; then rm /usr/local/cuda-11.7/cuda-11.7; fi
RUN if [ -h /usr/local/cuda-12.1/cuda-12.1 ]; then rm /usr/local/cuda-12.1/cuda-12.1; fi
RUN if [ -h /usr/local/cuda-12.4/cuda-12.4 ]; then rm /usr/local/cuda-12.4/cuda-12.4; fi
USER jenkins
CMD ["bash"]


@ -53,7 +53,7 @@ RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV and ffmpeg
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
@ -78,6 +78,11 @@ ENV MAGMA_HOME /opt/rocm/magma
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
# Install amdsmi
COPY ./common/install_amdsmi.sh install_amdsmi.sh
RUN bash ./install_amdsmi.sh
RUN rm install_amdsmi.sh
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
@ -100,6 +105,13 @@ COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt
# Install AOTriton
COPY ./aotriton_version.txt aotriton_version.txt
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN ["/bin/bash", "-c", "./install_aotriton.sh /opt/rocm && rm -rf install_aotriton.sh aotriton_version.txt common_utils.sh"]
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH


@ -62,7 +62,7 @@ RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_d
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
# Install XPU Dependencies
ARG BASEKIT_VERSION
ARG XPU_VERSION
COPY ./common/install_xpu.sh install_xpu.sh
RUN bash ./install_xpu.sh && rm install_xpu.sh
@ -83,7 +83,7 @@ RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV and ffmpeg
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi


@ -80,7 +80,7 @@ RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV and ffmpeg
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi


@ -44,15 +44,7 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda11* ]]; then
fi
fi
if [[ ${BUILD_ENVIRONMENT} == *"caffe2"* ]]; then
echo "Caffe2 build is ON"
export BUILD_CAFFE2=ON
fi
if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then
export ATEN_THREADING=TBB
export USE_TBB=1
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
if [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export ATEN_THREADING=NATIVE
fi
@ -294,6 +286,9 @@ else
fi
WERROR=1 python setup.py bdist_wheel
else
if [[ "$BUILD_ENVIRONMENT" == *xla* ]]; then
source .ci/pytorch/install_cache_xla.sh
fi
python setup.py bdist_wheel
fi
pip_install_whl "$(echo dist/*.whl)"
@ -335,7 +330,7 @@ else
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
mkdir -p "$CUSTOM_OP_BUILD"
pushd "$CUSTOM_OP_BUILD"
cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPYTHON_EXECUTABLE="$(which python)" \
cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPython_EXECUTABLE="$(which python)" \
-DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"
make VERBOSE=1
popd
@ -348,7 +343,7 @@ else
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
mkdir -p "$JIT_HOOK_BUILD"
pushd "$JIT_HOOK_BUILD"
cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPYTHON_EXECUTABLE="$(which python)" \
cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPython_EXECUTABLE="$(which python)" \
-DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"
make VERBOSE=1
popd
@ -360,7 +355,7 @@ else
python --version
mkdir -p "$CUSTOM_BACKEND_BUILD"
pushd "$CUSTOM_BACKEND_BUILD"
cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPYTHON_EXECUTABLE="$(which python)" \
cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPython_EXECUTABLE="$(which python)" \
-DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"
make VERBOSE=1
popd


@ -6,4 +6,4 @@ source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
echo "Testing pytorch docs"
cd docs
make doctest
TERM=vt100 make doctest


@ -0,0 +1,37 @@
#!/bin/bash
# Script for installing sccache on the xla build job, which uses xla's docker
# image and doesn't have sccache installed on it. This is mostly copied from
# .ci/docker/install_cache.sh. Changes are: removing checks that will always
# return the same thing (e.g. checks for ROCm or CUDA), changing the path
# where sccache is installed, and not changing /etc/environment.
set -ex
install_binary() {
echo "Downloading sccache binary from S3 repo"
curl --retry 3 https://s3.amazonaws.com/ossci-linux/sccache -o /tmp/cache/bin/sccache
}
mkdir -p /tmp/cache/bin
mkdir -p /tmp/cache/lib
export PATH="/tmp/cache/bin:$PATH"
install_binary
chmod a+x /tmp/cache/bin/sccache
function write_sccache_stub() {
# Unset LD_PRELOAD for ps because of asan + ps issues
# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90589
# shellcheck disable=SC2086
# shellcheck disable=SC2059
printf "#!/bin/sh\nif [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then\n exec sccache $(which $1) \"\$@\"\nelse\n exec $(which $1) \"\$@\"\nfi" > "/tmp/cache/bin/$1"
chmod a+x "/tmp/cache/bin/$1"
}
write_sccache_stub cc
write_sccache_stub c++
write_sccache_stub gcc
write_sccache_stub g++
write_sccache_stub clang
write_sccache_stub clang++
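
For readability, the wrapper that `write_sccache_stub` generates with a single `printf` can also be written with a here-document. This is a sketch under assumptions rather than the script's code: `write_stub` and its directory argument are hypothetical, and `command -v` replaces `which` for locating the real compiler.

```shell
#!/bin/bash
# Hypothetical helper: write a wrapper for compiler "$2" into directory "$1".
write_stub() {
  local dir="$1" name="$2" real
  # Resolve the real compiler before the stub directory shadows it in PATH.
  real="$(command -v "${name}")" || return 1
  cat > "${dir}/${name}" <<EOF
#!/bin/sh
# Route the call through sccache unless sccache itself is the parent process.
if [ "\$(env -u LD_PRELOAD ps -p \$PPID -o comm=)" != sccache ]; then
  exec sccache "${real}" "\$@"
else
  exec "${real}" "\$@"
fi
EOF
  chmod a+x "${dir}/${name}"
}

# Example usage (directory is a placeholder):
#   write_stub /tmp/cache/bin gcc
```

The parent-process check keeps the stub from recursing when sccache invokes the compiler through the same PATH entry.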

View File

@ -18,6 +18,7 @@ time python test/run_test.py --verbose -i distributed/test_c10d_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_nccl
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_nccl
time python test/run_test.py --verbose -i distributed/test_cuda_p2p
time python test/run_test.py --verbose -i distributed/test_store
time python test/run_test.py --verbose -i distributed/test_pg_wrapper
time python test/run_test.py --verbose -i distributed/rpc/cuda/test_tensorpipe_agent
@ -50,6 +51,9 @@ time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_ra
# FSDP2 tests
time python test/run_test.py --verbose -i distributed/_composable/fsdp/test_fully_shard_training -- -k test_2d_mlp_with_nd_mesh
# Pipelining composability tests
time python test/run_test.py --verbose -i distributed/pipelining/test_composability.py
# Other tests
time python test/run_test.py --verbose -i test_cuda_primary_ctx
time python test/run_test.py --verbose -i test_optim -- -k test_forloop_goes_right_direction_multigpu


@ -264,6 +264,18 @@ elif [[ $TEST_CONFIG == 'nogpu_AVX512' ]]; then
export ATEN_CPU_CAPABILITY=avx2
fi
# temp workarounds for https://github.com/pytorch/pytorch/issues/126692, remove when fixed
if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then
pushd test
CUDA_VERSION=$(python -c "import torch; print(torch.version.cuda)")
if [ "$CUDA_VERSION" == "12.4" ]; then
ISCUDA124="cu124"
else
ISCUDA124=""
fi
popd
fi
test_python_legacy_jit() {
time python test/run_test.py --include test_jit_legacy test_jit_fuser_legacy --verbose
assert_git_not_dirty
@ -326,6 +338,7 @@ test_inductor_distributed() {
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_frozen.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py -k test_clip_grad_norm_2d --verbose
python test/run_test.py -i distributed/fsdp/test_fsdp_tp_integration.py -k test_fsdp_tp_integration --verbose
# this runs on both single-gpu and multi-gpu instance. It should be smart about skipping tests that aren't supported
@ -350,10 +363,20 @@ test_inductor() {
test_inductor_cpp_wrapper_abi_compatible() {
export TORCHINDUCTOR_ABI_COMPATIBLE=1
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
echo "Testing Inductor cpp wrapper mode with TORCHINDUCTOR_ABI_COMPATIBLE=1"
# cpu stack allocation causes segfault and needs more investigation
TORCHINDUCTOR_STACK_ALLOCATION=0 python test/run_test.py --include inductor/test_cpu_cpp_wrapper
PYTORCH_TESTING_DEVICE_ONLY_FOR="" python test/run_test.py --include inductor/test_cpu_cpp_wrapper
python test/run_test.py --include inductor/test_cuda_cpp_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/timm_models.py --device cuda --accuracy --amp \
--training --inductor --disable-cudagraphs --only vit_base_patch16_224 \
--output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/inductor_timm_training.csv"
}
# "Global" flags for inductor benchmarking controlled by TEST_CONFIG
@ -515,16 +538,16 @@ test_single_dynamo_benchmark() {
--output "$TEST_REPORTS_DIR/${name}_${suite}.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/${name}_$suite.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${TEST_CONFIG}_${name}.csv"
--expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/${TEST_CONFIG}_${name}.csv"
python benchmarks/dynamo/check_graph_breaks.py \
--actual "$TEST_REPORTS_DIR/${name}_$suite.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${TEST_CONFIG}_${name}.csv"
--expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/${TEST_CONFIG}_${name}.csv"
fi
}
test_inductor_micro_benchmark() {
TEST_REPORTS_DIR=$(pwd)/test/test-micro-reports
python benchmarks/gpt_fast/benchmark.py
TEST_REPORTS_DIR=$(pwd)/test/test-reports
python benchmarks/gpt_fast/benchmark.py --output "${TEST_REPORTS_DIR}/gpt_fast_benchmark.csv"
}
test_dynamo_benchmark() {
@ -542,7 +565,11 @@ test_dynamo_benchmark() {
test_single_dynamo_benchmark "dashboard" "$suite" "$shard_id" "$@"
else
if [[ "${TEST_CONFIG}" == *cpu_inductor* ]]; then
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --float32 "$@"
if [[ "${TEST_CONFIG}" == *freezing* ]]; then
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --float32 --freezing "$@"
else
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --float32 "$@"
fi
elif [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --bfloat16 "$@"
else
@ -556,12 +583,16 @@ test_inductor_torchbench_smoketest_perf() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
# smoke test the cpp_wrapper mode
TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy --bfloat16 \
--inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_smoketest.csv"
# Test some models in the cpp wrapper mode
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only llama --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only moco --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_smoketest.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/inductor_torchbench_inference.csv"
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --float16 --training \
--batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" --only hf_Bert \
@ -576,7 +607,13 @@ test_inductor_torchbench_smoketest_perf() {
# https://github.com/pytorch/pytorch/actions/runs/7158691360/job/19491437314,
# and thus we lower its threshold to reduce flakiness. If this continues to be a problem,
# we switch to use some other model.
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.9
# Use 4.7 for cuda 12.4, change back to 4.9 after fixing https://github.com/pytorch/pytorch/issues/126692
if [ "$CUDA_VERSION" == "12.4" ]; then
THRESHOLD=4.7
else
THRESHOLD=4.9
fi
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t $THRESHOLD
# Check memory compression ratio for a few models
for test in hf_Albert timm_vision_transformer; do
@ -588,6 +625,15 @@ test_inductor_torchbench_smoketest_perf() {
"$TEST_REPORTS_DIR/inductor_training_smoketest_$test.csv" \
--expected benchmarks/dynamo/expected_ci_perf_inductor_torchbench.csv
done
# Perform some "warm-start" runs for a few huggingface models.
for test in AlbertForQuestionAnswering AllenaiLongformerBase DistilBertForMaskedLM DistillGPT2 GoogleFnet YituTechConvBert; do
python benchmarks/dynamo/huggingface.py --accuracy --training --amp --inductor --device cuda --warm-start-latency \
--only $test --output "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/inductor_huggingface_training.csv"
done
}
test_inductor_torchbench_cpu_smoketest_perf(){
@ -671,7 +717,6 @@ test_aten() {
${SUDO} ln -sf "$TORCH_LIB_DIR"/libmkldnn* "$TEST_BASE_DIR"
${SUDO} ln -sf "$TORCH_LIB_DIR"/libnccl* "$TEST_BASE_DIR"
${SUDO} ln -sf "$TORCH_LIB_DIR"/libtorch* "$TEST_BASE_DIR"
${SUDO} ln -sf "$TORCH_LIB_DIR"/libtbb* "$TEST_BASE_DIR"
ls "$TEST_BASE_DIR"
aten/tools/run_tests.sh "$TEST_BASE_DIR"
@ -696,21 +741,6 @@ test_without_numpy() {
popd
}
# pytorch extensions require including torch/extension.h which includes all.h
# which includes utils.h which includes Parallel.h.
# So you can call for instance parallel_for() from your extension,
# but the compilation will fail because of Parallel.h has only declarations
# and definitions are conditionally included Parallel.h(see last lines of Parallel.h).
# I tried to solve it #39612 and #39881 by including Config.h into Parallel.h
# But if Pytorch is built with TBB it provides Config.h
# that has AT_PARALLEL_NATIVE_TBB=1(see #3961 or #39881) and it means that if you include
# torch/extension.h which transitively includes Parallel.h
# which transitively includes tbb.h which is not available!
if [[ "${BUILD_ENVIRONMENT}" == *tbb* ]]; then
sudo mkdir -p /usr/include/tbb
sudo cp -r "$PWD"/third_party/tbb/include/tbb/* /usr/include/tbb
fi
test_libtorch() {
local SHARD="$1"
@ -724,7 +754,6 @@ test_libtorch() {
ln -sf "$TORCH_LIB_DIR"/libc10* "$TORCH_BIN_DIR"
ln -sf "$TORCH_LIB_DIR"/libshm* "$TORCH_BIN_DIR"
ln -sf "$TORCH_LIB_DIR"/libtorch* "$TORCH_BIN_DIR"
ln -sf "$TORCH_LIB_DIR"/libtbb* "$TORCH_BIN_DIR"
ln -sf "$TORCH_LIB_DIR"/libnvfuser* "$TORCH_BIN_DIR"
export CPP_TESTS_DIR="${TORCH_BIN_DIR}"
@ -861,7 +890,6 @@ test_rpc() {
# test reporting process to function as expected.
ln -sf "$TORCH_LIB_DIR"/libtorch* "$TORCH_BIN_DIR"
ln -sf "$TORCH_LIB_DIR"/libc10* "$TORCH_BIN_DIR"
ln -sf "$TORCH_LIB_DIR"/libtbb* "$TORCH_BIN_DIR"
CPP_TESTS_DIR="${TORCH_BIN_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_cpp_rpc
}
@ -1269,6 +1297,10 @@ elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHAR
elif [[ "${TEST_CONFIG}" == *dynamo* && $SHARD_NUMBER -gt 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
install_torchvision
test_dynamo_shard "${SHARD_NUMBER}"
elif [[ "${BUILD_ENVIRONMENT}" == *rocm* && -n "$TESTS_TO_INCLUDE" ]]; then
install_torchvision
test_python_shard "$SHARD_NUMBER"
test_aten
elif [[ "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
test_without_numpy
install_torchvision
@ -1298,10 +1330,6 @@ elif [[ "${BUILD_ENVIRONMENT}" == *-mobile-lightweight-dispatch* ]]; then
test_libtorch
elif [[ "${TEST_CONFIG}" = docs_test ]]; then
test_docs_test
elif [[ "${BUILD_ENVIRONMENT}" == *rocm* && -n "$TESTS_TO_INCLUDE" ]]; then
install_torchvision
test_python
test_aten
elif [[ "${BUILD_ENVIRONMENT}" == *xpu* ]]; then
install_torchvision
test_python


@ -96,8 +96,13 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then
conda install \${EXTRA_CONDA_FLAGS} -y "\$pkg" --offline
)
elif [[ "$PACKAGE_TYPE" != libtorch ]]; then
pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"
retry pip install -q numpy protobuf typing-extensions
if [[ "\$BUILD_ENVIRONMENT" != *s390x* ]]; then
pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"
retry pip install -q numpy protobuf typing-extensions
else
pip install "\$pkg"
retry pip install -q numpy protobuf typing-extensions
fi
fi
if [[ "$PACKAGE_TYPE" == libtorch ]]; then
pkg="\$(ls /final_pkgs/*-latest.zip)"


@ -76,8 +76,8 @@ TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)
# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for all the wheel builds, hence append TRITON_CONSTRAINT
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
# Only linux Python < 3.12 are supported wheels for triton
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.12'"
# Only linux Python < 3.13 are supported wheels for triton
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.13'"
TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton.txt)


@ -61,6 +61,7 @@ readability-simplify-subscript-expr,
readability-string-compare,
'
HeaderFilterRegex: '^(aten/|c10/|torch/).*$'
AnalyzeTemporaryDtors: false
WarningsAsErrors: '*'
CheckOptions:
misc-header-include-cycle.IgnoredFilesList: 'format.h;ivalue.h;custom_class.h;Dict.h;List.h'
...


@ -1,9 +1,12 @@
self-hosted-runner:
labels:
# GitHub hosted x86 Linux runners
- linux.20_04.4x
- linux.20_04.16x
- linux.large
# Repo-specific LF hosted ARC runners
- linux.large.arc
# Organization-wide AWS Linux Runners
- linux.large
- linux.2xlarge
- linux.4xlarge
- linux.12xlarge
@ -13,17 +16,34 @@ self-hosted-runner:
- linux.8xlarge.nvidia.gpu
- linux.16xlarge.nvidia.gpu
- linux.g5.4xlarge.nvidia.gpu
# Organization-wide AWS Linux Runners on Linux Foundation account
- lf.linux.large
- lf.linux.2xlarge
- lf.linux.4xlarge
- lf.linux.12xlarge
- lf.linux.24xlarge
- lf.linux.arm64.2xlarge
- lf.linux.4xlarge.nvidia.gpu
- lf.linux.8xlarge.nvidia.gpu
- lf.linux.16xlarge.nvidia.gpu
- lf.linux.g5.4xlarge.nvidia.gpu
# Repo-specific IBM hosted S390x runner
- linux.s390x
# Organization-wide AWS Windows runners
- windows.4xlarge.nonephemeral
- windows.8xlarge.nvidia.gpu
- windows.8xlarge.nvidia.gpu.nonephemeral
- windows.g5.4xlarge.nvidia.gpu
- bm-runner
# Organization-wide AMD hosted MI300 runners
- linux.rocm.gpu
# Repo-specific Apple hosted runners
- macos-m1-ultra
- macos-m2-14
# Org-wide AWS `mac2.metal` runners (2020 Mac mini hardware powered by Apple silicon M1 processors)
- macos-m1-stable
- macos-m1-13
- macos-m1-14
- macos-12-xl
- macos-12
- macos12.3-m1
# GitHub-hosted MacOS runners
- macos-latest-xlarge
- macos-13-xlarge
- macos-14-xlarge


@ -66,7 +66,8 @@ runs:
command: |
set -eux
# PyYAML 6.0 doesn't work with MacOS x86 anymore
python3 -m pip install requests==2.26.0 pyyaml==6.0.1
# This must run on Python-3.7 (AmazonLinux2) so can't use requests==2.32.2
python3 -m pip install requests==2.27.1 pyyaml==6.0.1
- name: Parse ref
id: parse-ref


@ -35,7 +35,7 @@ runs:
"${DOCKER_IMAGE}"
)
if [[ "${GPU_ARCH_TYPE}" != "rocm" && "${BUILD_ENVIRONMENT}" != "linux-aarch64-binary-manywheel" ]]; then
if [[ "${GPU_ARCH_TYPE}" != "rocm" && "${BUILD_ENVIRONMENT}" != "linux-aarch64-binary-manywheel" && "${BUILD_ENVIRONMENT}" != "linux-s390x-binary-manywheel" ]]; then
# Propagate download.pytorch.org IP to container. This is only needed on Linux non aarch64 runner
grep download.pytorch.org /etc/hosts | docker exec -i "${container_name}" bash -c "/bin/cat >> /etc/hosts"
fi
@ -44,3 +44,12 @@ runs:
# Generate test script
docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh"
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh"
- name: Cleanup docker
if: always() && env.BUILD_ENVIRONMENT == 'linux-s390x-binary-manywheel'
shell: bash
run: |
# on s390x stop the container for clean worker stop
# ignore expansion of "docker ps -q" since it could be empty
# shellcheck disable=SC2046
docker stop $(docker ps -q) || true


@ -1 +1 @@
ea437b31ce316ea3d66fe73768c0dcb94edb79ad
b829e936f7cc61b48149f5f957a451a38bf2a178


@ -1 +1 @@
e3fc03314dab5f44e3ed9ccbba6c15fbca3285cd
6f0b61e5d782913a0fc7743812f2a8e522189111

.github/lf-canary-scale-config.yml

@ -0,0 +1,154 @@
# Defines runner types that will be provisioned by LF Self-hosted
# runners for pytorch/pytorch-canary and their labels.
#
# Runners listed here will be available as self hosted runners.
# Configuration is directly pulled from the main branch.
#
# Default values:
#
# runner_types:
# runner_label: # label to specify in the Github Actions workflow
# instance_type: m4.large
# os: linux
# max_available: 20
# disk_size: 50
# is_ephemeral: true
runner_types:
lf.c.linux.12xlarge:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: false
max_available: 1000
os: linux
lf.c.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 30
os: linux
lf.c.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 30
os: linux
lf.c.linux.12xlarge.ephemeral:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: true
max_available: 300
os: linux
lf.c.linux.16xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 30
os: linux
lf.c.linux.24xlarge:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: false
max_available: 250
os: linux
lf.c.linux.2xlarge:
disk_size: 150
instance_type: c5.2xlarge
is_ephemeral: false
max_available: 3120
os: linux
lf.c.linux.4xlarge:
disk_size: 150
instance_type: c5.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
lf.c.linux.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 520
os: linux
lf.c.linux.8xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.8xlarge
is_ephemeral: false
max_available: 400
os: linux
lf.c.linux.g4dn.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 50
os: linux
lf.c.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 30
os: linux
lf.c.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 20
os: linux
lf.c.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.12xlarge
is_ephemeral: false
max_available: 150
os: linux
lf.c.linux.g5.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 1200
os: linux
lf.c.linux.large:
disk_size: 15
instance_type: c5.large
is_ephemeral: false
os: linux
lf.c.linux.arm64.2xlarge:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: false
max_available: 200
os: linux
lf.c.linux.arm64.m7g.2xlarge:
disk_size: 256
instance_type: m7g.2xlarge
is_ephemeral: false
max_available: 20
os: linux
lf.c.windows.4xlarge:
disk_size: 256
instance_type: c5d.4xlarge
is_ephemeral: true
max_available: 420
os: windows
lf.c.windows.4xlarge.nonephemeral:
disk_size: 256
instance_type: c5d.4xlarge
is_ephemeral: false
max_available: 420
os: windows
lf.c.windows.8xlarge.nvidia.gpu:
disk_size: 256
instance_type: p3.2xlarge
is_ephemeral: true
max_available: 150
os: windows
lf.c.windows.8xlarge.nvidia.gpu.nonephemeral:
disk_size: 256
instance_type: p3.2xlarge
is_ephemeral: false
max_available: 150
os: windows
lf.c.windows.g5.4xlarge.nvidia.gpu:
disk_size: 256
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 250
os: windows

.github/lf-scale-config.yml

@ -0,0 +1,154 @@
# Defines runner types that will be provisioned by LF Self-hosted
# runners for pytorch/pytorch and their labels.
#
# Runners listed here will be available as self hosted runners.
# Configuration is directly pulled from the main branch.
#
# Default values:
#
# runner_types:
# runner_label: # label to specify in the Github Actions workflow
# instance_type: m4.large
# os: linux
# max_available: 20
# disk_size: 50
# is_ephemeral: true
runner_types:
lf.linux.12xlarge:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: false
max_available: 1000
os: linux
lf.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 30
os: linux
lf.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 30
os: linux
lf.linux.12xlarge.ephemeral:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: true
max_available: 300
os: linux
lf.linux.16xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 30
os: linux
lf.linux.24xlarge:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: false
max_available: 250
os: linux
lf.linux.2xlarge:
disk_size: 150
instance_type: c5.2xlarge
is_ephemeral: false
max_available: 3120
os: linux
lf.linux.4xlarge:
disk_size: 150
instance_type: c5.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
lf.linux.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 520
os: linux
lf.linux.8xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.8xlarge
is_ephemeral: false
max_available: 400
os: linux
lf.linux.g4dn.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 50
os: linux
lf.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 30
os: linux
lf.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 20
os: linux
lf.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.12xlarge
is_ephemeral: false
max_available: 150
os: linux
lf.linux.g5.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 1200
os: linux
lf.linux.large:
disk_size: 15
instance_type: c5.large
is_ephemeral: false
os: linux
lf.linux.arm64.2xlarge:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: false
max_available: 200
os: linux
lf.linux.arm64.m7g.2xlarge:
disk_size: 256
instance_type: m7g.2xlarge
is_ephemeral: false
max_available: 20
os: linux
lf.windows.4xlarge:
disk_size: 256
instance_type: c5d.4xlarge
is_ephemeral: true
max_available: 420
os: windows
lf.windows.4xlarge.nonephemeral:
disk_size: 256
instance_type: c5d.4xlarge
is_ephemeral: false
max_available: 420
os: windows
lf.windows.8xlarge.nvidia.gpu:
disk_size: 256
instance_type: p3.2xlarge
is_ephemeral: true
max_available: 150
os: windows
lf.windows.8xlarge.nvidia.gpu.nonephemeral:
disk_size: 256
instance_type: p3.2xlarge
is_ephemeral: false
max_available: 150
os: windows
lf.windows.g5.4xlarge.nvidia.gpu:
disk_size: 256
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 250
os: windows
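
The header comment of the scale config lists default values, and at least one entry (`lf.linux.large`) omits `max_available`. A minimal sketch of how such defaults could be layered under per-runner overrides follows. The helper name `with_defaults` is hypothetical, and whether the provisioning service actually merges defaults this way is an assumption; only the default values and the `lf.linux.large` fields are taken from the file above.

```python
from typing import Any, Dict

# Defaults copied from the "Default values" comment at the top of the config.
DEFAULTS: Dict[str, Any] = {
    "instance_type": "m4.large",
    "os": "linux",
    "max_available": 20,
    "disk_size": 50,
    "is_ephemeral": True,
}


def with_defaults(runner_types: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Return a copy of runner_types where missing keys fall back to DEFAULTS.

    Hypothetical helper: per-runner values win over the defaults.
    """
    return {
        label: {**DEFAULTS, **overrides}
        for label, overrides in runner_types.items()
    }


# The lf.linux.large entry as it appears in the config (no max_available).
runner_types = {
    "lf.linux.large": {
        "disk_size": 15,
        "instance_type": "c5.large",
        "is_ephemeral": False,
        "os": "linux",
    }
}

resolved = with_defaults(runner_types)
print(resolved["lf.linux.large"])
```

Under this reading, `lf.linux.large` would be capped at the default of 20 instances while keeping its explicit 15 GB disk.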


@ -245,6 +245,7 @@
- torch/xpu/**
- test/xpu/**
- third_party/xpu.txt
- .ci/docker/ci_commit_pins/triton-xpu.txt
approved_by:
- EikanWang
- jgong5


@ -1,6 +1,5 @@
tracking_issue: 24422
ciflow_tracking_issue: 64124
TD_rollout_issue: 123120
ciflow_push_tags:
- ciflow/binaries
- ciflow/binaries_conda
@ -9,6 +8,7 @@ ciflow_push_tags:
- ciflow/inductor
- ciflow/inductor-perf-compare
- ciflow/inductor-micro-benchmark
- ciflow/inductor-cu124
- ciflow/linux-aarch64
- ciflow/mps
- ciflow/nightly
@ -20,7 +20,6 @@ ciflow_push_tags:
- ciflow/xpu
- ciflow/torchbench
retryable_workflows:
- lint
- pull
- trunk
- linux-binary


@ -10,6 +10,6 @@ lintrunner==0.10.7
ninja==1.10.0.post1
nvidia-ml-py==11.525.84
pyyaml==6.0
requests==2.31.0
requests==2.32.2
rich==10.9.0
rockset==1.0.3


@ -4,6 +4,5 @@ mkl-include=2022.1.0
ninja=1.10.2
numpy=1.23.3
pyyaml=6.0
requests=2.31.0
setuptools=68.2.2
typing-extensions=4.3.0
typing-extensions=4.9.0


@ -3,6 +3,5 @@ cmake=3.22.1
ninja=1.10.2
numpy=1.23.3
pyyaml=6.0
requests=2.31.0
setuptools=68.2.2
typing-extensions=4.3.0
typing-extensions=4.9.0


@ -2,7 +2,7 @@ numpy=1.22.3
pyyaml=6.0
setuptools=61.2.0
cmake=3.22.*
typing-extensions=4.3.0
typing-extensions=4.9.0
dataclasses=0.8
pip=22.2.2
pillow=10.0.1


@ -4,7 +4,7 @@ numpy=1.21.2
pyyaml=5.3
setuptools=46.0.0
cmake=3.22.*
typing-extensions=4.3.0
typing-extensions=4.9.0
dataclasses=0.8
pip=22.2.2
pillow=10.0.1


@ -2,6 +2,7 @@
import os
import re
from datetime import datetime
from functools import lru_cache
from pathlib import Path
from typing import Any, Callable, Dict, List, Set
@ -187,6 +188,17 @@ def get_recent_prs() -> Dict[str, Any]:
return prs_by_branch_base
@lru_cache(maxsize=1)
def get_open_prs() -> List[Dict[str, Any]]:
return paginate_graphql(
GRAPHQL_OPEN_PRS,
{"owner": "pytorch", "repo": "pytorch"},
lambda data: False,
lambda res: res["data"]["repository"]["pullRequests"]["nodes"],
lambda res: res["data"]["repository"]["pullRequests"]["pageInfo"],
)
def get_branches_with_magic_label_or_open_pr() -> Set[str]:
pr_infos: List[Dict[str, Any]] = paginate_graphql(
GRAPHQL_NO_DELETE_BRANCH_LABEL,
@ -196,15 +208,7 @@ def get_branches_with_magic_label_or_open_pr() -> Set[str]:
lambda res: res["data"]["repository"]["label"]["pullRequests"]["pageInfo"],
)
pr_infos.extend(
paginate_graphql(
GRAPHQL_OPEN_PRS,
{"owner": "pytorch", "repo": "pytorch"},
lambda data: False,
lambda res: res["data"]["repository"]["pullRequests"]["nodes"],
lambda res: res["data"]["repository"]["pullRequests"]["pageInfo"],
)
)
pr_infos.extend(get_open_prs())
# Get the most recent PR for each branch base (group gh together)
branch_bases = set()
@ -270,5 +274,41 @@ def delete_branches() -> None:
delete_branch(git_repo, branch)
def delete_old_ciflow_tags() -> None:
# Deletes ciflow tags if they are associated with a closed PR or a specific
# commit. Lightweight tags don't have information about the date they were
# created, so we can't check how old they are. The script just assumes that
# ciflow tags should be deleted regardless of creation date.
git_repo = GitRepo(str(REPO_ROOT), "origin", debug=True)
def delete_tag(tag: str) -> None:
print(f"Deleting tag {tag}")
ESTIMATED_TOKENS[0] += 1
delete_branch(git_repo, f"refs/tags/{tag}")
tags = git_repo._run_git("tag").splitlines()
open_pr_numbers = [x["number"] for x in get_open_prs()]
for tag in tags:
try:
if ESTIMATED_TOKENS[0] > 400:
print("Estimated tokens exceeded, exiting")
break
if not tag.startswith("ciflow/"):
continue
re_match_pr = re.match(r"^ciflow\/.*\/(\d{5,6})$", tag)
re_match_sha = re.match(r"^ciflow\/.*\/([0-9a-f]{40})$", tag)
if re_match_pr:
pr_number = int(re_match_pr.group(1))
if pr_number in open_pr_numbers:
continue
delete_tag(tag)
elif re_match_sha:
delete_tag(tag)
except Exception as e:
print(f"Failed to check tag {tag}: {e}")
if __name__ == "__main__":
delete_branches()
delete_old_ciflow_tags()
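
The tag-selection rule in `delete_old_ciflow_tags` can be isolated from the git and GraphQL calls. The sketch below reuses the two regular expressions from the diff; the function name `tags_to_delete` and the sample tag names are made up for illustration, and the token budget check is left out.

```python
import re
from typing import Iterable, List, Set

# Same patterns as in the diff: a trailing 5-6 digit PR number, or a 40-char SHA.
PR_TAG = re.compile(r"^ciflow\/.*\/(\d{5,6})$")
SHA_TAG = re.compile(r"^ciflow\/.*\/([0-9a-f]{40})$")


def tags_to_delete(tags: Iterable[str], open_pr_numbers: Set[int]) -> List[str]:
    """Return the ciflow tags that the cleanup rule would delete.

    Hypothetical helper: PR tags are kept only while the PR is open,
    SHA tags are always deleted, everything else is left alone.
    """
    doomed: List[str] = []
    for tag in tags:
        if not tag.startswith("ciflow/"):
            continue
        pr_match = PR_TAG.match(tag)
        if pr_match:
            if int(pr_match.group(1)) not in open_pr_numbers:
                doomed.append(tag)
        elif SHA_TAG.match(tag):
            doomed.append(tag)
    return doomed


sample_tags = [
    "v2.3.0",                        # not a ciflow tag
    "ciflow/trunk/128368",           # PR tag, PR still open in this example
    "ciflow/trunk/124926",           # PR tag, PR closed in this example
    "ciflow/inductor/" + "a" * 40,   # SHA tag
    "ciflow/misc/abc",               # matches neither pattern
]
result = tags_to_delete(sample_tags, open_pr_numbers={128368})
print(result)
```

Passing the open PR numbers as a set keeps the membership test constant-time; the script in the diff builds a list, which behaves the same but scans linearly for every tag.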

.github/scripts/docathon-label-sync.py

@ -0,0 +1,52 @@
import os
import re
import sys
from github import Github
def main() -> None:
token = os.environ.get("GITHUB_TOKEN")
repo_owner = "pytorch"
repo_name = "pytorch"
pull_request_number = int(sys.argv[1])
g = Github(token)
repo = g.get_repo(f"{repo_owner}/{repo_name}")
pull_request = repo.get_pull(pull_request_number)
pull_request_body = pull_request.body
# PR without description
if pull_request_body is None:
return
# get issue number from the PR body
if not re.search(r"#\d{1,6}", pull_request_body):
print("The pull request does not mention an issue.")
return
issue_number = int(re.findall(r"#(\d{1,6})", pull_request_body)[0])
issue = repo.get_issue(issue_number)
issue_labels = issue.labels
docathon_label_present = any(
label.name == "docathon-h1-2024" for label in issue_labels
)
# if the issue has a docathon label, add all labels from the issue to the PR.
if not docathon_label_present:
print("The 'docathon-h1-2024' label is not present in the issue.")
return
pull_request_labels = pull_request.get_labels()
pull_request_label_names = [label.name for label in pull_request_labels]
issue_label_names = [label.name for label in issue_labels]
labels_to_add = [
label for label in issue_label_names if label not in pull_request_label_names
]
if not labels_to_add:
print("The pull request already has the same labels.")
return
pull_request.add_to_labels(*labels_to_add)
print("Labels added to the pull request!")
if __name__ == "__main__":
main()
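
The two decisions the label-sync script makes (which issue a PR body refers to, and which labels are still missing) do not need a GitHub connection to reason about. Below is a minimal sketch of both, assuming plain strings in place of PyGithub label objects; the helper names `first_issue_number` and `labels_to_add` are hypothetical.

```python
import re
from typing import List, Optional

DOCATHON_LABEL = "docathon-h1-2024"


def first_issue_number(body: Optional[str]) -> Optional[int]:
    """Return the first '#<digits>' reference in a PR body, if any."""
    if body is None:  # PR without a description
        return None
    match = re.search(r"#(\d{1,6})", body)
    return int(match.group(1)) if match else None


def labels_to_add(issue_labels: List[str], pr_labels: List[str]) -> List[str]:
    """Return issue labels missing from the PR, but only for docathon issues."""
    if DOCATHON_LABEL not in issue_labels:
        return []
    return [label for label in issue_labels if label not in pr_labels]


print(first_issue_number("Fixes #12345, related to #7"))
print(labels_to_add([DOCATHON_LABEL, "easy"], ["easy"]))
```

Note that, as in the script, only the first `#NNN` reference is considered, so a PR body that mentions an unrelated issue first would be matched against that one.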


@ -19,7 +19,7 @@ CUDA_ARCHES = ["11.8", "12.1", "12.4"]
CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.1": "12.1.1", "12.4": "12.4.0"}
CUDA_ARCHES_CUDNN_VERSION = {"11.8": "8", "12.1": "8", "12.4": "8"}
CUDA_ARCHES_CUDNN_VERSION = {"11.8": "9", "12.1": "9", "12.4": "9"}
ROCM_ARCHES = ["6.0", "6.1"]
@ -31,12 +31,18 @@ CPU_CXX11_ABI_ARCH = ["cpu-cxx11-abi"]
CPU_AARCH64_ARCH = ["cpu-aarch64"]
CPU_S390X_ARCH = ["cpu-s390x"]
CUDA_AARCH64_ARCH = ["cuda-aarch64"]
PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"11.8": (
"nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | " # noqa: B950
"nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu11==8.7.0.84; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
@ -49,7 +55,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | " # noqa: B950
"nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | "
@ -62,7 +68,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==8.9.7.29; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | "
@ -130,6 +136,10 @@ def arch_type(arch_version: str) -> str:
return "cpu-cxx11-abi"
elif arch_version in CPU_AARCH64_ARCH:
return "cpu-aarch64"
elif arch_version in CPU_S390X_ARCH:
return "cpu-s390x"
elif arch_version in CUDA_AARCH64_ARCH:
return "cuda-aarch64"
else: # arch_version should always be "cpu" in this case
return "cpu"
@ -149,6 +159,8 @@ WHEEL_CONTAINER_IMAGES = {
"cpu": f"pytorch/manylinux-builder:cpu-{DEFAULT_TAG}",
"cpu-cxx11-abi": f"pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-{DEFAULT_TAG}",
"cpu-aarch64": f"pytorch/manylinuxaarch64-builder:cpu-aarch64-{DEFAULT_TAG}",
"cpu-s390x": f"pytorch/manylinuxs390x-builder:cpu-s390x-{DEFAULT_TAG}",
"cuda-aarch64": f"pytorch/manylinuxaarch64-builder:cuda12.4-{DEFAULT_TAG}",
}
CONDA_CONTAINER_IMAGES = {
@ -205,7 +217,9 @@ def translate_desired_cuda(gpu_arch_type: str, gpu_arch_version: str) -> str:
"cpu": "cpu",
"cpu-aarch64": "cpu",
"cpu-cxx11-abi": "cpu-cxx11-abi",
"cpu-s390x": "cpu",
"cuda": f"cu{gpu_arch_version.replace('.', '')}",
"cuda-aarch64": "cu124",
"rocm": f"rocm{gpu_arch_version}",
}.get(gpu_arch_type, gpu_arch_version)
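
For reference, the lookup above can be exercised on its own. The following sketch copies the mapping shown in this hunk into a standalone function, so the two new entries (`cpu-s390x` and `cuda-aarch64`) can be checked without importing the rest of the module; anything beyond the mapping itself is illustrative.

```python
def translate_desired_cuda(gpu_arch_type: str, gpu_arch_version: str) -> str:
    """Standalone copy of the mapping shown in the hunk above."""
    return {
        "cpu": "cpu",
        "cpu-aarch64": "cpu",
        "cpu-cxx11-abi": "cpu-cxx11-abi",
        "cpu-s390x": "cpu",
        "cuda": f"cu{gpu_arch_version.replace('.', '')}",
        "cuda-aarch64": "cu124",
        "rocm": f"rocm{gpu_arch_version}",
    }.get(gpu_arch_type, gpu_arch_version)


for arch_type, version in [
    ("cpu-s390x", ""),
    ("cuda", "12.4"),
    ("cuda-aarch64", ""),
    ("rocm", "6.1"),
]:
    print(arch_type, "->", translate_desired_cuda(arch_type, version))
```

The `cuda-aarch64` entry is pinned to `cu124` regardless of `gpu_arch_version`, which matches the matrix code passing an empty version string for that arch.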
@ -286,11 +300,11 @@ def generate_libtorch_matrix(
"libtorch_variant": libtorch_variant,
"libtorch_config": abi_version if os == "windows" else "",
"devtoolset": abi_version if os != "windows" else "",
"container_image": LIBTORCH_CONTAINER_IMAGES[
(arch_version, abi_version)
]
if os != "windows"
else "",
"container_image": (
LIBTORCH_CONTAINER_IMAGES[(arch_version, abi_version)]
if os != "windows"
else ""
),
"package_type": "libtorch",
"build_name": f"libtorch-{gpu_arch_type}{gpu_arch_version}-{libtorch_variant}-{abi_version}".replace(
".", "_"
@ -306,8 +320,8 @@ def generate_wheels_matrix(
python_versions: Optional[List[str]] = None,
) -> List[Dict[str, str]]:
package_type = "wheel"
if os == "linux" or os == "linux-aarch64":
# NOTE: We only build manywheel packages for x86_64 and aarch64 linux
if os == "linux" or os == "linux-aarch64" or os == "linux-s390x":
# NOTE: We only build manywheel packages for x86_64, aarch64, and s390x Linux
package_type = "manywheel"
if python_versions is None:
@ -323,22 +337,36 @@ def generate_wheels_matrix(
elif os == "linux-aarch64":
# Only want the one arch as the CPU type is different and
# uses different build/test scripts
arches = ["cpu-aarch64"]
arches = ["cpu-aarch64", "cuda-aarch64"]
elif os == "linux-s390x":
# Only want the one arch as the CPU type is different and
# uses different build/test scripts
arches = ["cpu-s390x"]
ret: List[Dict[str, str]] = []
for python_version in python_versions:
for arch_version in arches:
gpu_arch_type = arch_type(arch_version)
# Disable py3.12 builds for ROCm because of triton dependency
# on llnl-hatchet, which doesn't have py3.12 wheels available
if gpu_arch_type == "rocm" and python_version == "3.12":
continue
gpu_arch_version = (
""
if arch_version == "cpu"
or arch_version == "cpu-cxx11-abi"
or arch_version == "cpu-aarch64"
or arch_version == "cpu-s390x"
or arch_version == "cuda-aarch64"
else arch_version
)
# 12.1 linux wheels require PYTORCH_EXTRA_INSTALL_REQUIREMENTS to install
if arch_version in ["12.4", "12.1", "11.8"] and os == "linux":
if (
arch_version in ["12.4", "12.1", "11.8"]
and os == "linux"
or arch_version == "cuda-aarch64"
):
ret.append(
{
"python_version": python_version,
@ -347,10 +375,16 @@ def generate_wheels_matrix(
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"devtoolset": "",
"devtoolset": (
"cxx11-abi" if arch_version == "cuda-aarch64" else ""
),
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
"pytorch_extra_install_requirements": PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version], # fmt: skip
"pytorch_extra_install_requirements": (
PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version] # fmt: skip
if os != "linux-aarch64"
else ""
),
"build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}".replace( # noqa: B950
".", "_"
),
@ -365,17 +399,19 @@ def generate_wheels_matrix(
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"devtoolset": "cxx11-abi"
if arch_version == "cpu-cxx11-abi"
else "",
"devtoolset": (
"cxx11-abi" if arch_version == "cpu-cxx11-abi" else ""
),
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
"build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}".replace(
".", "_"
),
"pytorch_extra_install_requirements":
PYTORCH_EXTRA_INSTALL_REQUIREMENTS["12.1"] # fmt: skip
if os != "linux" else "",
"pytorch_extra_install_requirements": (
PYTORCH_EXTRA_INSTALL_REQUIREMENTS["12.1"] # fmt: skip
if os != "linux"
else ""
),
}
)
return ret
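
The new condition in `generate_wheels_matrix` mixes `and` and `or` without inner parentheses. Because `and` binds tighter than `or` in Python, it reads as "(CUDA arch on linux) or cuda-aarch64". A small sketch makes the grouping explicit; the function names and the `os_name` parameter are hypothetical stand-ins for the inline expression and the `os` argument.

```python
CUDA_VERSIONS = ["12.4", "12.1", "11.8"]


def takes_cuda_branch(arch_version: str, os_name: str) -> bool:
    """Explicitly parenthesized form of the condition in the diff."""
    return (
        arch_version in CUDA_VERSIONS and os_name == "linux"
    ) or arch_version == "cuda-aarch64"


def takes_cuda_branch_as_written(arch_version: str, os_name: str) -> bool:
    """Same condition with the grouping left to operator precedence."""
    return (
        arch_version in CUDA_VERSIONS
        and os_name == "linux"
        or arch_version == "cuda-aarch64"
    )


cases = [
    ("12.1", "linux"),
    ("12.1", "windows"),
    ("cuda-aarch64", "linux-aarch64"),
    ("cpu", "linux"),
    ("cpu-s390x", "linux-s390x"),
]
for arch_version, os_name in cases:
    print(arch_version, os_name, takes_cuda_branch(arch_version, os_name))
```

Both forms agree on every input; the parenthesized version only documents the intent. In the diff, the `cuda-aarch64` case then gets an empty `pytorch_extra_install_requirements` because the dictionary lookup is guarded by `os != "linux-aarch64"`.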


@ -5,11 +5,11 @@ import sys
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Dict, Iterable, List, Literal, Set
from typing_extensions import TypedDict # Python 3.11+
import generate_binary_build_matrix # type: ignore[import]
import jinja2
from typing_extensions import TypedDict # Python 3.11+
Arch = Literal["windows", "linux", "macos"]
@ -60,7 +60,7 @@ class BinaryBuildWorkflow:
branches: str = "nightly"
# Mainly for macos
cross_compile_arm64: bool = False
macos_runner: str = "macos-12-xl"
macos_runner: str = "macos-14-xlarge"
def __post_init__(self) -> None:
if self.abi_version:
@ -95,6 +95,7 @@ class OperatingSystem:
MACOS = "macos"
MACOS_ARM64 = "macos-arm64"
LINUX_AARCH64 = "linux-aarch64"
LINUX_S390X = "linux-s390x"
LINUX_BINARY_BUILD_WORFKLOWS = [
@ -156,7 +157,7 @@ LINUX_BINARY_SMOKE_WORKFLOWS = [
package_type="manywheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
arches=["11.8", "12.1"],
arches=["11.8", "12.1", "12.4"],
python_versions=["3.8"],
),
branches="main",
@ -284,7 +285,7 @@ MACOS_BINARY_BUILD_WORKFLOWS = [
libtorch_variants=["shared-with-deps"],
),
cross_compile_arm64=False,
macos_runner="macos-13-xlarge",
macos_runner="macos-14-xlarge",
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH},
isolated_workflow=True,
@ -297,7 +298,7 @@ MACOS_BINARY_BUILD_WORKFLOWS = [
OperatingSystem.MACOS_ARM64
),
cross_compile_arm64=False,
macos_runner="macos-13-xlarge",
macos_runner="macos-14-xlarge",
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},
isolated_workflow=True,
@ -307,7 +308,7 @@ MACOS_BINARY_BUILD_WORKFLOWS = [
os=OperatingSystem.MACOS_ARM64,
package_type="conda",
cross_compile_arm64=False,
macos_runner="macos-13-xlarge",
macos_runner="macos-14-xlarge",
build_configs=generate_binary_build_matrix.generate_conda_matrix(
OperatingSystem.MACOS_ARM64
),
@ -332,6 +333,20 @@ AARCH64_BINARY_BUILD_WORKFLOWS = [
),
]
S390X_BINARY_BUILD_WORKFLOWS = [
BinaryBuildWorkflow(
os=OperatingSystem.LINUX_S390X,
package_type="manywheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX_S390X
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},
isolated_workflow=True,
),
),
]
def main() -> None:
jinja_env = jinja2.Environment(
@ -350,6 +365,10 @@ def main() -> None:
jinja_env.get_template("linux_binary_build_workflow.yml.j2"),
AARCH64_BINARY_BUILD_WORKFLOWS,
),
(
jinja_env.get_template("linux_binary_build_workflow.yml.j2"),
S390X_BINARY_BUILD_WORKFLOWS,
),
(
jinja_env.get_template("linux_binary_build_workflow.yml.j2"),
LINUX_BINARY_SMOKE_WORKFLOWS,


@ -7,7 +7,7 @@ eval "$(command conda 'shell.bash' 'hook' 2> /dev/null)"
conda activate "${CONDA_ENV}"
# Use uv to speed up lintrunner init
python3 -m pip install uv
python3 -m pip install uv==0.1.45
CACHE_DIRECTORY="/tmp/.lintbin"
# Try to recover the cached binaries


@ -18,6 +18,7 @@ PYTEST_CACHE_KEY_PREFIX = "pytest_cache"
PYTEST_CACHE_DIR_NAME = ".pytest_cache"
BUCKET = "gha-artifacts"
LASTFAILED_FILE_PATH = Path("v/cache/lastfailed")
TD_HEURISTIC_PREVIOUSLY_FAILED_ADDITIONAL = "previous_failures_additional.json"
# Temp folders
ZIP_UPLOAD = "zip-upload"
@ -191,6 +192,10 @@ def _merge_pytest_caches(
pytest_cache_dir_to_merge_from, pytest_cache_dir_to_merge_into
)
_merge_additional_failures_files(
pytest_cache_dir_to_merge_from, pytest_cache_dir_to_merge_into
)
def _merge_lastfailed_files(source_pytest_cache: Path, dest_pytest_cache: Path) -> None:
# Simple cases where one of the files doesn't exist
@ -232,3 +237,27 @@ def _merged_lastfailed_content(
del to_lastfailed[""]
return to_lastfailed
def _merge_additional_failures_files(
source_pytest_cache: Path, dest_pytest_cache: Path
) -> None:
# Simple cases where one of the files doesn't exist
source_lastfailed_file = (
source_pytest_cache / TD_HEURISTIC_PREVIOUSLY_FAILED_ADDITIONAL
)
dest_lastfailed_file = dest_pytest_cache / TD_HEURISTIC_PREVIOUSLY_FAILED_ADDITIONAL
if not source_lastfailed_file.exists():
return
if not dest_lastfailed_file.exists():
copy_file(source_lastfailed_file, dest_lastfailed_file)
return
# Both files exist, so we need to merge them
from_lastfailed = load_json_file(source_lastfailed_file)
to_lastfailed = load_json_file(dest_lastfailed_file)
merged_content = list(set(from_lastfailed + to_lastfailed))
# Save the results
write_json_file(dest_lastfailed_file, merged_content)
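
The three cases handled by `_merge_additional_failures_files` (no source, no destination, both present) can be reproduced with the standard library alone. The sketch below is not the script's code: it replaces the `copy_file`, `load_json_file` and `write_json_file` helpers with `pathlib` and `json` calls, uses a hypothetical function name, and sorts the merged list so the output is deterministic, which the original `list(set(...))` does not guarantee.

```python
import json
import tempfile
from pathlib import Path

FILE_NAME = "previous_failures_additional.json"


def merge_json_lists(source_dir: Path, dest_dir: Path) -> None:
    """Merge the JSON list in source_dir/FILE_NAME into dest_dir/FILE_NAME."""
    source_file = source_dir / FILE_NAME
    dest_file = dest_dir / FILE_NAME

    if not source_file.exists():
        return  # nothing to merge
    if not dest_file.exists():
        dest_file.write_text(source_file.read_text())  # plain copy
        return

    # Both files exist: union the entries and drop duplicates.
    merged = sorted(
        set(json.loads(source_file.read_text()) + json.loads(dest_file.read_text()))
    )
    dest_file.write_text(json.dumps(merged))


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp, "src"), Path(tmp, "dst")
    src.mkdir()
    dst.mkdir()
    (src / FILE_NAME).write_text(json.dumps(["test_a.py", "test_b.py"]))
    (dst / FILE_NAME).write_text(json.dumps(["test_b.py", "test_c.py"]))
    merge_json_lists(src, dst)
    merged_result = json.loads((dst / FILE_NAME).read_text())
    print(merged_result)
```

The set-based union assumes the file holds a flat list of hashable entries such as test file names, which is what the `from_lastfailed + to_lastfailed` expression in the diff implies.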


@ -0,0 +1,28 @@
#!/bin/bash
set -eoux pipefail
SYNC_BRANCH=fbcode/pytorch-stable-prototype
git config user.email "fake@example.com"
git config user.name "PyTorch Stable Bot"
git fetch origin main
git fetch origin "$SYNC_BRANCH"
git checkout "$SYNC_BRANCH"
for SHA in $(git log 4333e122d4b74cdf84351ed2907045c6a767b4cd..origin/main --pretty="%h" --reverse -- torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed)
do
# `git merge-base --is-ancestor` exits with code 0 if the given SHA is an ancestor, and non-0 otherwise
if git merge-base --is-ancestor "$SHA" HEAD || [[ $(git log --grep="(cherry picked from commit $SHA") ]]
then
echo "Skipping $SHA"
continue
fi
echo "Copying $SHA"
git cherry-pick -x "$SHA"
done
if [[ "${WITH_PUSH}" == true ]]; then
git push
fi


@ -773,13 +773,13 @@ class TestBypassFailures(TestCase):
# than the one on the base commit. This should still count as broken trunk
"pr_num": 104214,
"related_failure_count": 0,
"unrelated_failure_count": 1,
"flaky_or_broken_trunk": 1,
},
{
# This PR had one broken trunk failure and it used ghstack
"pr_num": 105145,
"related_failure_count": 0,
"unrelated_failure_count": 1,
"flaky_or_broken_trunk": 1,
},
{
# The failure on the merge base was retried successfully and
@ -788,20 +788,20 @@ class TestBypassFailures(TestCase):
# be used to detect broken trunk
"pr_num": 107160,
"related_failure_count": 0,
"unrelated_failure_count": 4,
"flaky_or_broken_trunk": 1,
},
{
# This PR used Dr.CI broken trunk classification
"pr_num": 111253,
"related_failure_count": 1,
"unrelated_failure_count": 2,
"flaky_or_broken_trunk": 1,
},
]
for case in test_cases:
pr_num = case["pr_num"]
related_failure_count = case["related_failure_count"]
unrelated_failure_count = case["unrelated_failure_count"]
flaky_or_broken_trunk = case["flaky_or_broken_trunk"]
pr = GitHubPR("pytorch", "pytorch", pr_num)
checks = pr.get_checkrun_conclusions()
@ -823,7 +823,7 @@ class TestBypassFailures(TestCase):
)
self.assertTrue(len(pending) == 0)
self.assertTrue(
len(failed) == unrelated_failure_count + related_failure_count
len(failed) == flaky_or_broken_trunk + related_failure_count
)
def test_ignore_current(self, *args: Any) -> None:


@ -2027,10 +2027,8 @@ def categorize_checks(
pending_checks: List[Tuple[str, Optional[str], Optional[int]]] = []
failed_checks: List[Tuple[str, Optional[str], Optional[int]]] = []
# ok_failed_checks is used with ok_failed_checks_threshold while ignorable_failed_checks
# is used to keep track of all ignorable failures when saving the merge record on Rockset
ok_failed_checks: List[Tuple[str, Optional[str], Optional[int]]] = []
ignorable_failed_checks: Dict[str, List[Any]] = defaultdict(list)
# failed_checks_categorization is used to keep track of all ignorable failures when saving the merge record on Rockset
failed_checks_categorization: Dict[str, List[Any]] = defaultdict(list)
# If required_checks is not set or empty, consider all names are relevant
relevant_checknames = [
@ -2058,36 +2056,38 @@ def categorize_checks(
continue
elif not is_passing_status(check_runs[checkname].status):
target = (
ignorable_failed_checks[classification]
failed_checks_categorization[classification]
if classification
in ("IGNORE_CURRENT_CHECK", "BROKEN_TRUNK", "FLAKY", "UNSTABLE")
else failed_checks
)
target.append((checkname, url, job_id))
if classification in ("BROKEN_TRUNK", "FLAKY", "UNSTABLE"):
ok_failed_checks.append((checkname, url, job_id))
flaky_or_broken_trunk = (
failed_checks_categorization["BROKEN_TRUNK"]
+ failed_checks_categorization["FLAKY"]
)
if ok_failed_checks:
if flaky_or_broken_trunk:
warn(
f"The following {len(ok_failed_checks)} checks failed but were likely due flakiness or broken trunk: "
+ ", ".join([x[0] for x in ok_failed_checks])
f"The following {len(flaky_or_broken_trunk)} checks failed but were likely due flakiness or broken trunk: "
+ ", ".join([x[0] for x in flaky_or_broken_trunk])
+ (
f" but this is greater than the threshold of {ok_failed_checks_threshold} so merge will fail"
if ok_failed_checks_threshold is not None
and len(ok_failed_checks) > ok_failed_checks_threshold
and len(flaky_or_broken_trunk) > ok_failed_checks_threshold
else ""
)
)
if (
ok_failed_checks_threshold is not None
and len(ok_failed_checks) > ok_failed_checks_threshold
and len(flaky_or_broken_trunk) > ok_failed_checks_threshold
):
failed_checks = failed_checks + ok_failed_checks
failed_checks = failed_checks + flaky_or_broken_trunk
# The list of ignorable_failed_checks is returned so that it can be saved into the Rockset merge record
return (pending_checks, failed_checks, ignorable_failed_checks)
# The list of failed_checks_categorization is returned so that it can be saved into the Rockset merge record
return (pending_checks, failed_checks, failed_checks_categorization)
def merge(
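
The hunk above replaces `ok_failed_checks` with a list derived from `failed_checks_categorization`, so only `BROKEN_TRUNK` and `FLAKY` failures count toward `ok_failed_checks_threshold`; `UNSTABLE` failures are still recorded but no longer counted. A minimal standalone sketch of that behavior follows. The function name `categorize_failures` and its simplified input (a list of pre-classified failures) are assumptions made for illustration, not the actual signature of `categorize_checks`.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (check name, url, job id), mirroring the tuple shape used in the diff
Check = Tuple[str, Optional[str], Optional[int]]

# Classifications that are kept out of failed_checks by default
IGNORABLE = ("IGNORE_CURRENT_CHECK", "BROKEN_TRUNK", "FLAKY", "UNSTABLE")


def categorize_failures(
    failures: List[Tuple[Check, Optional[str]]],
    ok_failed_checks_threshold: Optional[int] = None,
) -> Tuple[List[Check], Dict[str, List[Check]]]:
    """Hypothetical, simplified stand-in for the failure branch of categorize_checks."""
    failed_checks: List[Check] = []
    failed_checks_categorization: Dict[str, List[Check]] = defaultdict(list)

    for check, classification in failures:
        if classification in IGNORABLE:
            failed_checks_categorization[classification].append(check)
        else:
            failed_checks.append(check)

    # Only these two categories count toward the threshold after the change
    flaky_or_broken_trunk = (
        failed_checks_categorization["BROKEN_TRUNK"]
        + failed_checks_categorization["FLAKY"]
    )

    # Too many ignorable failures: treat them as real failures
    if (
        ok_failed_checks_threshold is not None
        and len(flaky_or_broken_trunk) > ok_failed_checks_threshold
    ):
        failed_checks = failed_checks + flaky_or_broken_trunk

    return failed_checks, failed_checks_categorization
```

With one `FLAKY`, one `BROKEN_TRUNK`, and one `UNSTABLE` failure, a threshold of 1 is exceeded by the first two alone, so both are appended to `failed_checks`; the `UNSTABLE` one stays only in the categorization dictionary.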


@ -33,6 +33,8 @@ env:
# Needed for conda builds
{%- if "aarch64" in build_environment %}
ALPINE_IMAGE: "arm64v8/alpine"
{%- elif "s390x" in build_environment %}
ALPINE_IMAGE: "docker.io/s390x/alpine"
{%- else %}
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
{%- endif %}
@ -56,8 +58,11 @@ jobs:
uses: ./.github/workflows/_binary-build-linux.yml
with:!{{ upload.binary_env_as_input(config) }}
{%- if "aarch64" in build_environment %}
runs_on: linux.arm64.2xlarge
runs_on: linux.arm64.m7g.4xlarge
ALPINE_IMAGE: "arm64v8/alpine"
{%- elif "s390x" in build_environment %}
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
{%- elif "conda" in build_environment and config["gpu_arch_type"] == "cuda" %}
runs_on: linux.24xlarge
{%- endif %}
@ -66,12 +71,17 @@ jobs:
{%- if config.pytorch_extra_install_requirements is defined and config.pytorch_extra_install_requirements|d('')|length > 0 %}
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: !{{ config.pytorch_extra_install_requirements }}
{%- endif %}
{%- if config["gpu_arch_type"] == "cuda-aarch64" %}
timeout-minutes: 420
{%- endif %}
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
{%- if config["gpu_arch_type"] != "cuda-aarch64" %}
!{{ config["build_name"] }}-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: !{{ config["build_name"] }}-build
{%- if config["gpu_arch_type"] != "rocm" %}
{%- if config["gpu_arch_type"] != "rocm" %}
uses: ./.github/workflows/_binary-test-linux.yml
with:!{{ upload.binary_env_as_input(config) }}
build_name: !{{ config["build_name"] }}
@ -79,6 +89,9 @@ jobs:
{%- if "aarch64" in build_environment %}
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
{%- elif "s390x" in build_environment %}
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
{%- elif config["gpu_arch_type"] == "rocm" %}
runs_on: linux.rocm.gpu
{%- elif config["gpu_arch_type"] == "cuda" %}
@ -88,7 +101,7 @@ jobs:
{%- endif %}
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
{%- else %}
{%- else %}
runs-on: linux.rocm.gpu
timeout-minutes: !{{ common.timeout_minutes }}
!{{ upload.binary_env(config) }}
@ -113,7 +126,8 @@ jobs:
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
{%- endif %}
{%- endif %}
{%- endif %}
{%- if branches == "nightly" %}
!{{ upload.upload_binaries(config) }}


@ -57,7 +57,11 @@
id-token: write
contents: read
{%- if has_test %}
{%- if config["gpu_arch_type"] == "cuda-aarch64" %}
needs: !{{ config["build_name"] }}-build
{%- else %}
needs: !{{ config["build_name"] }}-test
{%- endif %}
{%- else %}
needs: !{{ config["build_name"] }}-build
{%- endif %}


@ -12,10 +12,15 @@ on:
type: string
description: The build environment
runs_on:
required: false
default: linux.12xlarge
type: string
description: Hardware to run this "build"job on, linux.12xlarge or linux.arm64.2xlarge.
required: false
default: linux.12xlarge
type: string
description: Hardware to run this "build"job on, linux.12xlarge or linux.arm64.2xlarge.
timeout-minutes:
required: false
default: 210
type: number
description: timeout for the job
ALPINE_IMAGE:
required: false
type: string
@ -78,7 +83,7 @@ on:
jobs:
build:
runs-on: ${{ inputs.runs_on }}
timeout-minutes: 210
timeout-minutes: ${{ inputs.timeout-minutes }}
env:
PYTORCH_ROOT: ${{ inputs.PYTORCH_ROOT }}
BUILDER_ROOT: ${{ inputs.BUILDER_ROOT }}
@ -139,6 +144,7 @@ jobs:
run: env
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
if: inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
@ -147,12 +153,14 @@ jobs:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
no-sudo: ${{ inputs.build_environment == 'linux-aarch64-binary-manywheel' }}
no-sudo: ${{ inputs.build_environment == 'linux-aarch64-binary-manywheel' || inputs.build_environment == 'linux-s390x-binary-manywheel' }}
- name: Setup Linux
if: inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: ./.github/actions/setup-linux
- name: Chown workspace
if: inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: ./.github/actions/chown-workspace
with:
ALPINE_IMAGE: ${{ inputs.ALPINE_IMAGE }}
@ -165,7 +173,7 @@ jobs:
rm -rf "${GITHUB_WORKSPACE}"
mkdir "${GITHUB_WORKSPACE}"
if [[ ${{ inputs.build_environment }} == 'linux-aarch64-binary-manywheel' ]]; then
if [[ ${{ inputs.build_environment }} == 'linux-aarch64-binary-manywheel' ]] || [[ ${{ inputs.build_environment }} == 'linux-s390x-binary-manywheel' ]] ; then
rm -rf "${RUNNER_TEMP}/artifacts"
mkdir "${RUNNER_TEMP}/artifacts"
fi
@ -212,7 +220,7 @@ jobs:
]}
- name: Pull Docker image
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' }}
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' && inputs.build_environment != 'linux-s390x-binary-manywheel' }}
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ inputs.DOCKER_IMAGE }}
@ -254,7 +262,7 @@ jobs:
fi
- name: Chown artifacts
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' }}
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' && inputs.build_environment != 'linux-s390x-binary-manywheel' }}
shell: bash
run: |
# Ensure the working directory gets chowned back to the current user
@ -269,11 +277,20 @@ jobs:
${{ runner.temp }}/artifacts/*
- name: Teardown Linux
if: always()
if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: pytorch/test-infra/.github/actions/teardown-linux@main
- name: Chown workspace
if: always()
if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: ./pytorch/.github/actions/chown-workspace
with:
ALPINE_IMAGE: ${{ inputs.ALPINE_IMAGE }}
- name: Cleanup docker
if: always() && inputs.build_environment == 'linux-s390x-binary-manywheel'
shell: bash
run: |
# on s390x stop the container for clean worker stop
# ignore expansion of "docker ps -q" since it could be empty
# shellcheck disable=SC2046
docker stop $(docker ps -q) || true


@ -127,6 +127,7 @@ jobs:
} >> "${GITHUB_ENV} }}"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
if: inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
@ -136,12 +137,14 @@ jobs:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
no-sudo: ${{ inputs.build_environment == 'linux-aarch64-binary-manywheel' }}
no-sudo: ${{ inputs.build_environment == 'linux-aarch64-binary-manywheel' || inputs.build_environment == 'linux-s390x-binary-manywheel' }}
- name: Setup Linux
if: inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: ./.github/actions/setup-linux
- name: Chown workspace
if: inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: ./.github/actions/chown-workspace
with:
ALPINE_IMAGE: ${{ inputs.ALPINE_IMAGE }}
@ -203,7 +206,7 @@ jobs:
if: ${{ inputs.GPU_ARCH_TYPE == 'cuda' && steps.filter.outputs.is-test-matrix-empty == 'False' }}
- name: Pull Docker image
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' }}
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' && inputs.build_environment != 'linux-s390x-binary-manywheel' }}
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ inputs.DOCKER_IMAGE }}
@ -213,11 +216,11 @@ jobs:
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown Linux
if: always()
if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: pytorch/test-infra/.github/actions/teardown-linux@main
- name: Chown workspace
if: always()
if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: ./pytorch/.github/actions/chown-workspace
with:
ALPINE_IMAGE: ${{ inputs.ALPINE_IMAGE }}


@ -8,6 +8,8 @@ on:
jobs:
assign:
runs-on: ubuntu-latest
permissions:
issues: write
steps:
- name: Check for "/assigntome" in comment
uses: actions/github-script@v6
@ -26,14 +28,14 @@ jobs:
repo: context.repo.repo,
issue_number: issueNumber
});
const hasLabel = issue.labels.some(label => label.name === 'docathon-h2-2023');
const hasLabel = issue.labels.some(label => label.name === 'docathon-h1-2024');
if (hasLabel) {
if (issue.assignee !== null) {
await github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: issueNumber,
body: "The issue is already assigned. Please pick an opened and unnasigned issue with the [docathon-h2-2023 label](https://github.com/pytorch/pytorch/issues?q=is%3Aopen+is%3Aissue+label%3Adocathon-h2-2023)."
body: "The issue is already assigned. Please pick an opened and unnasigned issue with the [docathon-h1-2024 label](https://github.com/pytorch/pytorch/issues?q=is%3Aopen+is%3Aissue+label%3Adocathon-h1-2024)."
});
} else {
await github.rest.issues.addAssignees({
@ -44,7 +46,7 @@ jobs:
});
}
} else {
const commmentMessage = "This issue does not have the correct label. Please pick an opened and unnasigned issue with the [docathon-h2-2023 label](https://github.com/pytorch/pytorch/issues?q=is%3Aopen+is%3Aissue+label%3Adocathon-h2-2023)."
const commmentMessage = "This issue does not have the correct label. Please pick an opened and unnasigned issue with the [docathon-h1-2024 label](https://github.com/pytorch/pytorch/issues?q=is%3Aopen+is%3Aissue+label%3Adocathon-h1-2024)."
await github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,


@ -49,7 +49,7 @@ jobs:
{ config: "default",
shard: 1,
num_shards: 1,
runner: "macos-13-xlarge",
runner: "macos-14-xlarge",
ios_platform: "SIMULATOR",
ios_arch: "arm64",
use_lite_interpreter: ${{ inputs.use_lite_interpreter || 1 }},
@ -60,7 +60,7 @@ jobs:
{ config: "default",
shard: 1,
num_shards: 1,
runner: "macos-13-xlarge",
runner: "macos-14-xlarge",
ios_platform: "OS",
ios_arch: "arm64",
use_lite_interpreter: ${{ inputs.use_lite_interpreter || 1 }},


@ -18,6 +18,6 @@ jobs:
ROCKSET_API_KEY: ${{ secrets.ROCKSET_API_KEY }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
pip3 install requests==2.26
pip3 install requests==2.32.2
pip3 install rockset==1.0.3
python3 .github/scripts/close_nonexistent_disable_issues.py


@ -29,7 +29,7 @@ jobs:
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: '3.8'
python-version: '3.11'
architecture: x64
check-latest: false


@ -0,0 +1,30 @@
name: Docathon Labels Sync
on:
pull_request_target:
types: [opened, synchronize, edited]
branches: [main]
jobs:
check-labels:
runs-on: ubuntu-latest
permissions:
issues: write
pull-requests: write
steps:
- name: Check out the repo
uses: actions/checkout@v2
with:
fetch-depth: 1
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: 3.x
- name: Install dependencies
run: |
pip install requests==2.32.3
pip install PyGithub==2.3.0
- name: Run Python script
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: python ./.github/scripts/docathon-label-sync.py ${{ github.event.pull_request.number }}


@ -38,15 +38,19 @@ jobs:
matrix:
runner: [linux.12xlarge]
docker-image-name: [
pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9,
pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda11.8-cudnn8-py3-gcc9,
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9,
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9,
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.1-cudnn9-py3.12-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9,
pytorch-linux-focal-py3.8-clang10,
pytorch-linux-focal-py3.11-clang10,
pytorch-linux-focal-py3.12-clang10,
pytorch-linux-focal-rocm-n-1-py3,
pytorch-linux-focal-rocm-n-py3,
pytorch-linux-jammy-cuda11.8-cudnn8-py3.8-clang12,
pytorch-linux-jammy-cuda11.8-cudnn9-py3.8-clang12,
pytorch-linux-focal-py3-clang9-android-ndk-r21e,
pytorch-linux-jammy-py3.8-gcc11,
pytorch-linux-jammy-py3.8-gcc11-inductor-benchmarks,
@ -54,7 +58,7 @@ jobs:
pytorch-linux-jammy-py3-clang15-asan,
pytorch-linux-focal-py3-clang10-onnx,
pytorch-linux-focal-linter,
pytorch-linux-jammy-cuda11.8-cudnn8-py3.9-linter,
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter,
pytorch-linux-jammy-py3-clang12-executorch
]
include:


@ -149,3 +149,10 @@ jobs:
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()
validate:
needs: build
uses: pytorch/builder/.github/workflows/validate-docker-images.yml@main
with:
channel: nightly
ref: main


@ -50,11 +50,11 @@ jobs:
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
DESIRED_PYTHON: "3.8"
runs_on: linux.arm64.2xlarge
runs_on: linux.arm64.m7g.4xlarge
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_8-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cpu-aarch64-test: # Testing
@ -100,6 +100,51 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_8-cuda-aarch64-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
DESIRED_PYTHON: "3.8"
runs_on: linux.arm64.m7g.4xlarge
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_8-cuda-aarch64
build_environment: linux-aarch64-binary-manywheel
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda-aarch64-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_8-cuda-aarch64-build
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda-aarch64
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_9-cpu-aarch64-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -113,11 +158,11 @@ jobs:
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
DESIRED_PYTHON: "3.9"
runs_on: linux.arm64.2xlarge
runs_on: linux.arm64.m7g.4xlarge
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_9-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cpu-aarch64-test: # Testing
@ -163,6 +208,51 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_9-cuda-aarch64-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
DESIRED_PYTHON: "3.9"
runs_on: linux.arm64.m7g.4xlarge
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_9-cuda-aarch64
build_environment: linux-aarch64-binary-manywheel
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda-aarch64-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_9-cuda-aarch64-build
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda-aarch64
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cpu-aarch64-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -176,11 +266,11 @@ jobs:
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
DESIRED_PYTHON: "3.10"
runs_on: linux.arm64.2xlarge
runs_on: linux.arm64.m7g.4xlarge
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cpu-aarch64-test: # Testing
@ -226,6 +316,51 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cuda-aarch64-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
DESIRED_PYTHON: "3.10"
runs_on: linux.arm64.m7g.4xlarge
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64
build_environment: linux-aarch64-binary-manywheel
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda-aarch64-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_10-cuda-aarch64-build
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda-aarch64
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cpu-aarch64-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -239,11 +374,11 @@ jobs:
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
DESIRED_PYTHON: "3.11"
runs_on: linux.arm64.2xlarge
runs_on: linux.arm64.m7g.4xlarge
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cpu-aarch64-test: # Testing
@ -289,6 +424,51 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda-aarch64-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
DESIRED_PYTHON: "3.11"
runs_on: linux.arm64.m7g.4xlarge
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64
build_environment: linux-aarch64-binary-manywheel
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda-aarch64-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_11-cuda-aarch64-build
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda-aarch64
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-cpu-aarch64-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -302,11 +482,11 @@ jobs:
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
DESIRED_PYTHON: "3.12"
runs_on: linux.arm64.2xlarge
runs_on: linux.arm64.m7g.4xlarge
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cpu-aarch64-test: # Testing
@@ -351,3 +531,48 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-cuda-aarch64-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
DESIRED_PYTHON: "3.12"
runs_on: linux.arm64.m7g.4xlarge
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64
build_environment: linux-aarch64-binary-manywheel
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda-aarch64-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_12-cuda-aarch64-build
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda-aarch64
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
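
The paired `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` lines in these hunks are long enough that the single changed pin is hard to spot by eye. The following is a minimal sketch, assuming the value is always a `|`-separated list of `name==version; marker` entries as it appears here; the helper names `parse_requirements` and `changed_pins` are hypothetical and are not part of the PyTorch tooling. The sample strings are shortened to two entries taken from the lines above.

```python
from __future__ import annotations


def parse_requirements(spec: str) -> dict[str, str]:
    """Map package name -> pinned version for a '|'-separated requirement string."""
    pins: dict[str, str] = {}
    for entry in spec.split("|"):
        # Drop the environment marker that follows the first ';'.
        requirement = entry.split(";", 1)[0].strip()
        if not requirement:
            continue
        name, _, version = requirement.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def changed_pins(old: str, new: str) -> dict[str, tuple[str | None, str | None]]:
    """Return {package: (old_version, new_version)} for pins that differ."""
    old_pins = parse_requirements(old)
    new_pins = parse_requirements(new)
    return {
        name: (old_pins.get(name), new_pins.get(name))
        for name in sorted(old_pins.keys() | new_pins.keys())
        if old_pins.get(name) != new_pins.get(name)
    }


# Shortened sample: two of the entries from the removed/added lines in this diff.
MARKER = "platform_system == 'Linux' and platform_machine == 'x86_64'"
old = f"nvidia-cudnn-cu12==8.9.2.26; {MARKER} | nvidia-nccl-cu12==2.20.5; {MARKER}"
new = f"nvidia-cudnn-cu12==9.1.0.70; {MARKER} | nvidia-nccl-cu12==2.20.5; {MARKER}"

print(changed_pins(old, new))
# {'nvidia-cudnn-cu12': ('8.9.2.26', '9.1.0.70')}
```

Splitting on `;` before looking for `==` matters because the marker itself contains `==` comparisons; a package present on only one side is reported with `None` for the missing version.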


@@ -48,7 +48,7 @@ jobs:
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda11_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==8.7.0.84; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda11_8-test: # Testing
@@ -88,7 +88,7 @@ jobs:
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_1
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_1-test: # Testing
@@ -111,3 +111,43 @@ jobs:
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_8-cuda12_4-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_4
build_environment: linux-binary-manywheel
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}


@@ -174,7 +174,7 @@ jobs:
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda11_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==8.7.0.84; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda11_8-test: # Testing
@@ -237,7 +237,7 @@ jobs:
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_1
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_1-test: # Testing
@@ -300,7 +300,7 @@ jobs:
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.7.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_4-test: # Testing
@@ -690,7 +690,7 @@ jobs:
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda11_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==8.7.0.84; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda11_8-test: # Testing
@@ -753,7 +753,7 @@ jobs:
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_1
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_1-test: # Testing
@@ -816,7 +816,7 @@ jobs:
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.7.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_4-test: # Testing
@@ -1206,7 +1206,7 @@ jobs:
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda11_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==8.7.0.84; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda11_8-test: # Testing
@@ -1269,7 +1269,7 @@ jobs:
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda12_1
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_1-test: # Testing
@@ -1332,7 +1332,7 @@ jobs:
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.7.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_4-test: # Testing
@@ -1722,7 +1722,7 @@ jobs:
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda11_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==8.7.0.84; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda11_8-test: # Testing
@@ -1785,7 +1785,7 @@ jobs:
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda12_1
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_1-test: # Testing
@@ -1848,7 +1848,7 @@ jobs:
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.7.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_4-test: # Testing
@@ -2238,7 +2238,7 @@ jobs:
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda11_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==8.7.0.84; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda11_8-test: # Testing
@@ -2301,7 +2301,7 @@ jobs:
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda12_1
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_1-test: # Testing
@@ -2364,7 +2364,7 @@ jobs:
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.7.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_4-test: # Testing
@@ -2410,209 +2410,3 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-rocm6_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.0
GPU_ARCH_VERSION: 6.0
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.0-main
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-rocm6_0
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-rocm6_0-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_12-rocm6_0-build
runs-on: linux.rocm.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.0
GPU_ARCH_VERSION: 6.0
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.0-main
DESIRED_PYTHON: "3.12"
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: manywheel-py3_12-rocm6_0
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: pytorch/manylinux-builder:rocm6.0-main
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
manywheel-py3_12-rocm6_0-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_12-rocm6_0-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.0
GPU_ARCH_VERSION: 6.0
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.0-main
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-rocm6_0
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-rocm6_1-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.1-main
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-rocm6_1
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-rocm6_1-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_12-rocm6_1-build
runs-on: linux.rocm.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.1-main
DESIRED_PYTHON: "3.12"
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: manywheel-py3_12-rocm6_1
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: pytorch/manylinux-builder:rocm6.1-main
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
manywheel-py3_12-rocm6_1-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_12-rocm6_1-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.1-main
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-rocm6_1
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml

@@ -0,0 +1,353 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/linux_binary_build_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: linux-s390x-binary-manywheel
on:
push:
# NOTE: Meta Employees can trigger new nightlies using: https://fburl.com/trigger_pytorch_nightly_build
branches:
- nightly
tags:
# NOTE: Binary build pipelines should only get triggered on release candidate builds
# Release candidate tags look like: v1.11.0-rc1
- v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+
- 'ciflow/binaries/*'
- 'ciflow/binaries_wheel/*'
workflow_dispatch:
env:
# Needed for conda builds
ALPINE_IMAGE: "docker.io/s390x/alpine"
ANACONDA_USER: pytorch
AWS_DEFAULT_REGION: us-east-1
BINARY_ENV_FILE: /tmp/env
BUILD_ENVIRONMENT: linux-s390x-binary-manywheel
BUILDER_ROOT: /builder
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 0
concurrency:
group: linux-s390x-binary-manywheel-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
manywheel-py3_8-cpu-s390x-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
DESIRED_PYTHON: "3.8"
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_8-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cpu-s390x-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_8-cpu-s390x-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cpu-s390x
build_environment: linux-s390x-binary-manywheel
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cpu-s390x-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_8-cpu-s390x-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cpu-s390x
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_9-cpu-s390x-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
DESIRED_PYTHON: "3.9"
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_9-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cpu-s390x-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_9-cpu-s390x-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cpu-s390x
build_environment: linux-s390x-binary-manywheel
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cpu-s390x-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_9-cpu-s390x-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cpu-s390x
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cpu-s390x-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
DESIRED_PYTHON: "3.10"
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_10-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cpu-s390x-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_10-cpu-s390x-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cpu-s390x
build_environment: linux-s390x-binary-manywheel
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cpu-s390x-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_10-cpu-s390x-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cpu-s390x
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cpu-s390x-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
DESIRED_PYTHON: "3.11"
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_11-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cpu-s390x-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_11-cpu-s390x-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cpu-s390x
build_environment: linux-s390x-binary-manywheel
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cpu-s390x-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_11-cpu-s390x-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cpu-s390x
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-cpu-s390x-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
DESIRED_PYTHON: "3.12"
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_12-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cpu-s390x-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_12-cpu-s390x-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cpu-s390x
build_environment: linux-s390x-binary-manywheel
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cpu-s390x-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_12-cpu-s390x-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cpu-s390x
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml


@@ -34,7 +34,7 @@ concurrency:
jobs:
conda-py3_8-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
-runs-on: macos-13-xlarge
+runs-on: macos-14-xlarge
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@@ -152,7 +152,7 @@ jobs:
uses: ./.github/workflows/_binary-upload.yml
conda-py3_9-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
-runs-on: macos-13-xlarge
+runs-on: macos-14-xlarge
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@@ -270,7 +270,7 @@ jobs:
uses: ./.github/workflows/_binary-upload.yml
conda-py3_10-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
-runs-on: macos-13-xlarge
+runs-on: macos-14-xlarge
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@@ -388,7 +388,7 @@ jobs:
uses: ./.github/workflows/_binary-upload.yml
conda-py3_11-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
-runs-on: macos-13-xlarge
+runs-on: macos-14-xlarge
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@@ -506,7 +506,7 @@ jobs:
uses: ./.github/workflows/_binary-upload.yml
conda-py3_12-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
-runs-on: macos-13-xlarge
+runs-on: macos-14-xlarge
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch


@@ -34,7 +34,7 @@ concurrency:
jobs:
libtorch-cpu-shared-with-deps-cxx11-abi-build:
if: ${{ github.repository_owner == 'pytorch' }}
-runs-on: macos-13-xlarge
+runs-on: macos-14-xlarge
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch


@@ -34,7 +34,7 @@ concurrency:
jobs:
wheel-py3_8-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
-runs-on: macos-13-xlarge
+runs-on: macos-14-xlarge
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@@ -46,7 +46,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
@@ -153,7 +153,7 @@ jobs:
uses: ./.github/workflows/_binary-upload.yml
wheel-py3_9-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
-runs-on: macos-13-xlarge
+runs-on: macos-14-xlarge
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@@ -165,7 +165,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
@@ -272,7 +272,7 @@ jobs:
uses: ./.github/workflows/_binary-upload.yml
wheel-py3_10-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
-runs-on: macos-13-xlarge
+runs-on: macos-14-xlarge
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@@ -284,7 +284,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
@@ -391,7 +391,7 @@ jobs:
uses: ./.github/workflows/_binary-upload.yml
wheel-py3_11-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
-runs-on: macos-13-xlarge
+runs-on: macos-14-xlarge
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@@ -403,7 +403,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
@@ -510,7 +510,7 @@ jobs:
uses: ./.github/workflows/_binary-upload.yml
wheel-py3_12-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
-runs-on: macos-13-xlarge
+runs-on: macos-14-xlarge
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@@ -522,7 +522,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
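
Note on the `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` hunks: each value is one long string of `|`-separated requirements, each followed by an environment marker after `;`, so the actual change is hard to spot by eye. In the pairs shown here the only pin that appears to differ is `nvidia-cudnn-cu12` (8.9.2.26 versus 9.1.0.70). The following is a minimal sketch, assuming that separator convention holds for the whole string, of how the two values could be compared mechanically. The helper names `parse_pins` and `changed_pins` are hypothetical and not part of the PyTorch tooling, and the sample strings are shortened to three entries.

```python
def parse_pins(value):
    """Map package name -> pinned version for a '|'-separated requirement string."""
    pins = {}
    for entry in value.split("|"):
        # Everything after ';' is the environment marker; only the pin is needed here.
        requirement = entry.split(";", 1)[0].strip()
        if not requirement:
            continue
        name, _, version = requirement.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def changed_pins(old, new):
    """Return {package: (old_version, new_version)} for every pin that differs."""
    old_pins, new_pins = parse_pins(old), parse_pins(new)
    return {
        name: (old_pins.get(name), new_pins.get(name))
        for name in sorted(old_pins.keys() | new_pins.keys())
        if old_pins.get(name) != new_pins.get(name)
    }


# Shortened sample values (three of the eleven entries), for illustration only.
MARKER = "platform_system == 'Linux' and platform_machine == 'x86_64'"
OLD = (
    f"nvidia-cuda-nvrtc-cu12==12.1.105; {MARKER} | "
    f"nvidia-cudnn-cu12==8.9.2.26; {MARKER} | "
    f"nvidia-nccl-cu12==2.20.5; {MARKER}"
)
NEW = (
    f"nvidia-cuda-nvrtc-cu12==12.1.105; {MARKER} | "
    f"nvidia-cudnn-cu12==9.1.0.70; {MARKER} | "
    f"nvidia-nccl-cu12==2.20.5; {MARKER}"
)

print(changed_pins(OLD, NEW))
# {'nvidia-cudnn-cu12': ('8.9.2.26', '9.1.0.70')}
```

Splitting on `;` before `==` matters because the marker itself contains `==`; partitioning the whole entry on `==` first would still work for these strings, but dropping the marker first keeps the intent explicit.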


@@ -46,7 +46,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -290,7 +290,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -536,7 +536,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -782,7 +782,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -1027,7 +1027,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -1271,7 +1271,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -1517,7 +1517,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -1763,7 +1763,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -2008,7 +2008,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -2252,7 +2252,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -2498,7 +2498,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -2744,7 +2744,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -2989,7 +2989,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -3233,7 +3233,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -3479,7 +3479,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -3725,7 +3725,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -3970,7 +3970,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -4214,7 +4214,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -4460,7 +4460,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -4706,7 +4706,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash

.github/workflows/inductor-cu124.yml

@ -0,0 +1,108 @@
name: inductor-cu124
on:
push:
tags:
- ciflow/inductor-cu124/*
workflow_dispatch:
schedule:
# Run every 4 hours during the week and every 12 hours on the weekend
- cron: 45 0,4,8,12,16,20 * * 1-5
- cron: 45 4,12 * * 0,6
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
permissions: read-all
jobs:
linux-focal-cuda12_4-py3_10-gcc9-inductor-build:
# Should be synced with the one in inductor.yml, but this doesn't run inductor_timm
name: cuda12.4-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml
with:
sync-tag: linux-focal-cuda12_4-py3_10-gcc9-inductor-build
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_distributed", shard: 1, num_shards: 1, runner: "linux.g5.12xlarge.nvidia.gpu" },
{ config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_cpp_wrapper_abi_compatible", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_4-py3_10-gcc9-inductor-test:
name: cuda12.4-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_4-py3_10-gcc9-inductor-build
with:
sync-tag: linux-focal-cuda12_4-py3_10-gcc9-inductor-test
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-inductor-build.outputs.test-matrix }}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_4-py3_10-gcc9-inductor-build-gcp:
name: cuda12.4-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
{ config: "inductor_torchbench_smoketest_perf", shard: 1, num_shards: 1, runner: "linux.gcp.a100" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_4-py3_10-gcc9-inductor-test-gcp:
name: cuda12.4-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_4-py3_10-gcc9-inductor-build-gcp
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm80
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-inductor-build-gcp.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-inductor-build-gcp.outputs.test-matrix }}
use-gha: anything-non-empty-to-use-gha
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_4-py3_12-gcc9-inductor-build:
name: cuda12.4-py3.12-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.4-py3.12-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_4-py3_12-gcc9-inductor-test:
name: cuda12.4-py3.12-gcc9-sm86
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_4-py3_12-gcc9-inductor-build
with:
build-environment: linux-focal-cuda12.4-py3.12-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_4-py3_12-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_12-gcc9-inductor-build.outputs.test-matrix }}
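
As a reading aid for the `test-matrix` entries in the workflow above: each `config` is split into `num_shards` shards, and every shard index from 1 to `num_shards` is expected to appear once. The following is a minimal Python sketch of that check and is not part of the PR. The helper name `missing_shards` is hypothetical, and the sample entries are transcribed by hand from the matrix above.

```python
from collections import defaultdict


def missing_shards(entries):
    """Return {config: sorted missing shard indices} for incomplete configs.

    Hypothetical helper: each entry is a dict with "config", "shard" and
    "num_shards" keys, mirroring one line of a test-matrix `include` list.
    """
    seen = defaultdict(set)
    expected = {}
    for entry in entries:
        seen[entry["config"]].add(entry["shard"])
        # Keep the largest num_shards seen, in case entries disagree.
        expected[entry["config"]] = max(
            expected.get(entry["config"], 0), entry["num_shards"]
        )
    gaps = {}
    for config, total in expected.items():
        missing = sorted(set(range(1, total + 1)) - seen[config])
        if missing:
            gaps[config] = missing
    return gaps


# A few entries transcribed from the matrix above.
matrix = [
    {"config": "inductor", "shard": 1, "num_shards": 1},
    {"config": "inductor_torchbench", "shard": 1, "num_shards": 2},
    {"config": "inductor_torchbench", "shard": 2, "num_shards": 2},
]

print(missing_shards(matrix))  # {}
```

Dropping the last entry from `matrix` would report `{"inductor_torchbench": [2]}`, which is the kind of gap this check is meant to surface when a shard line is removed by mistake.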

@ -21,7 +21,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [

@ -18,7 +18,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [

@ -71,7 +71,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [

@ -23,7 +23,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
test-matrix: |
{ include: [

@ -44,7 +44,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
test-matrix: |
{ include: [
@ -86,7 +86,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
@ -107,6 +107,56 @@ jobs:
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_1-py3_12-gcc9-inductor-build:
name: cuda12.1-py3.12-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.12-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3.12-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_12-gcc9-inductor-test:
name: cuda12.1-py3.12-gcc9-sm86
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_1-py3_12-gcc9-inductor-build
with:
build-environment: linux-focal-cuda12.1-py3.12-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_1-py3_12-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_12-gcc9-inductor-build.outputs.test-matrix }}
linux-focal-cuda12_4-py3_10-gcc9-inductor-build:
# Should be synced with the one in inductor-periodic.yml but this only runs inductor_timm
name: cuda12.4-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml
with:
sync-tag: linux-focal-cuda12_4-py3_10-gcc9-inductor-build
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
test-matrix: |
{ include: [
{ config: "inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_4-py3_10-gcc9-inductor-test:
name: cuda12.4-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_4-py3_10-gcc9-inductor-build
with:
sync-tag: linux-focal-cuda12_4-py3_10-gcc9-inductor-test
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-inductor-build.outputs.test-matrix }}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-jammy-cpu-py3_8-gcc11-inductor-build:
name: linux-jammy-cpu-py3.8-gcc11-inductor
uses: ./.github/workflows/_linux-build.yml
@ -120,12 +170,17 @@ jobs:
{ config: "cpu_inductor_timm", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_huggingface_freezing", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
{ config: "cpu_inductor_timm_freezing", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_timm_freezing", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_torchbench_freezing", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_torchbench_freezing", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_inductor_timm", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_inductor_timm", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "inductor_torchbench_cpu_smoketest_perf", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
{ config: "inductor_torchbench_cpu_smoketest_perf", shard: 1, num_shards: 1, runner: "linux.24xl.spr-metal" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}

@ -20,7 +20,7 @@ jobs:
with:
timeout: 120
runner: linux.2xlarge
docker-image: pytorch-linux-jammy-cuda11.8-cudnn8-py3.9-linter
docker-image: pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter
# NB: A shallow checkout won't work here because calculate-docker-image requires a full checkout
# to run git rev-parse HEAD~:.ci/docker when a new image is needed
fetch-depth: 0
@ -36,7 +36,7 @@ jobs:
with:
timeout: 120
runner: linux.2xlarge
docker-image: pytorch-linux-jammy-cuda11.8-cudnn8-py3.9-linter
docker-image: pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter
# NB: A shallow checkout won't work here because calculate-docker-image requires a full checkout
# to run git rev-parse HEAD~:.ci/docker when a new image is needed
fetch-depth: 0

@ -13,29 +13,31 @@ concurrency:
permissions: read-all
jobs:
macos-13-py3-arm64-build:
name: macos-13-py3-arm64
macos-py3-arm64-build:
name: macos-py3-arm64
uses: ./.github/workflows/_mac-build.yml
with:
sync-tag: macos-py3-arm64-build
build-environment: macos-13-py3-arm64
build-environment: macos-py3-arm64
runner-type: macos-m1-stable
build-generates-artifacts: true
# To match the one pre-installed in the m1 runners
python-version: 3.9.12
# The runner macos-m2-14 is not a typo, it's a custom runner that is different
# than our AWS macos-m1-14 runners
test-matrix: |
{ include: [
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-stable" },
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-13" },
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m2-14" },
]}
macos-py3-arm64-mps-test:
name: macos-py3-arm64-mps
uses: ./.github/workflows/_mac-test-mps.yml
needs: macos-13-py3-arm64-build
needs: macos-py3-arm64-build
with:
sync-tag: macos-py3-arm64-mps-test
build-environment: macos-13-py3-arm64
build-environment: macos-py3-arm64
# Same as the build job
python-version: 3.9.12
test-matrix: ${{ needs.macos-13-py3-arm64-build.outputs.test-matrix }}
test-matrix: ${{ needs.macos-py3-arm64-build.outputs.test-matrix }}

@ -32,7 +32,7 @@ jobs:
cache: pip
- run: |
pip3 install requests==2.26 rockset==1.0.3 boto3==1.19.12
pip3 install requests==2.32.2 rockset==1.0.3 boto3==1.19.12
- name: Upload external contribution stats
uses: nick-fields/retry@v2.8.2

@ -37,6 +37,59 @@ jobs:
permissions:
id-token: write
contents: read
linux-focal-cuda12_1-py3_10-gcc9-build:
name: linux-focal-cuda12.1-py3.10-gcc9
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "nogpu_AVX512", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_10-gcc9-test:
name: linux-focal-cuda12.1-py3.10-gcc9
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-cuda12_1-py3_10-gcc9-build
- target-determination
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-build.outputs.test-matrix }}
linux-focal-cuda12_4-py3_10-gcc9-build:
name: linux-focal-cuda12.4-py3.10-gcc9
uses: ./.github/workflows/_linux-build-label.yml
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "deploy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_4-py3_10-gcc9-test:
name: linux-focal-cuda12.4-py3.10-gcc9
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-cuda12_4-py3_10-gcc9-build
- target-determination
with:
timeout-minutes: 360
build-environment: linux-focal-cuda12.4-py3.10-gcc9
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-build.outputs.test-matrix }}
parallelnative-linux-jammy-py3_8-gcc11-build:
name: parallelnative-linux-jammy-py3.8-gcc11
@ -67,7 +120,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda11.8-py3.9-gcc9
docker-image-name: pytorch-linux-focal-cuda11.8-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
{ include: [
@ -89,7 +142,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda11.8-py3.10-gcc9-debug
docker-image-name: pytorch-linux-focal-cuda11.8-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9
build-with-debug: true
test-matrix: |
{ include: [
@ -151,7 +204,7 @@ jobs:
{ config: "default",
shard: 1,
num_shards: 1,
runner: "macos-13-xlarge",
runner: "macos-14-xlarge",
ios_platform: "SIMULATOR",
ios_arch: "arm64",
use_lite_interpreter: 1,
@ -162,7 +215,7 @@ jobs:
{ config: "default",
shard: 1,
num_shards: 1,
runner: "macos-13-xlarge",
runner: "macos-14-xlarge",
ios_platform: "OS",
ios_arch: "arm64",
use_lite_interpreter: 1,

@ -237,7 +237,7 @@ jobs:
uses: ./.github/workflows/_linux-build-label.yml
with:
build-environment: linux-focal-cuda11.8-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda11.8-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "distributed", shard: 1, num_shards: 3, runner: "linux.8xlarge.nvidia.gpu" },
@ -262,7 +262,7 @@ jobs:
uses: ./.github/workflows/_linux-build-label.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
@ -297,12 +297,12 @@ jobs:
{ config: "default", shard: 1, num_shards: 1 },
]}
linux-jammy-cuda-11_8-cudnn8-py3_8-clang12-build:
name: linux-jammy-cuda11.8-cudnn8-py3.8-clang12
linux-jammy-cuda-11_8-cudnn9-py3_8-clang12-build:
name: linux-jammy-cuda11.8-cudnn9-py3.8-clang12
uses: ./.github/workflows/_linux-build-label.yml
with:
build-environment: linux-jammy-cuda11.8-cudnn8-py3.8-clang12
docker-image-name: pytorch-linux-jammy-cuda11.8-cudnn8-py3.8-clang12
build-environment: linux-jammy-cuda11.8-cudnn9-py3.8-clang12
docker-image-name: pytorch-linux-jammy-cuda11.8-cudnn9-py3.8-clang12
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
@ -361,7 +361,7 @@ jobs:
uses: ./.github/workflows/_bazel-build-test.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-bazel-test
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
cuda-version: cpu
test-matrix: |
{ include: [
@ -373,13 +373,25 @@ jobs:
uses: ./.github/workflows/_bazel-build-test.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-bazel-test
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
cuda-version: "12.1"
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_4-py3_10-gcc9-bazel-test:
name: linux-focal-cuda12.4-py3.10-gcc9-bazel-test
uses: ./.github/workflows/_bazel-build-test.yml
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9-bazel-test
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
cuda-version: "12.4"
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
]}
linux-focal-py3-clang9-android-ndk-r21e-gradle-custom-build-single:
name: linux-focal-py3-clang9-android-ndk-r21e-gradle-custom-build-single
uses: ./.github/workflows/_android-build-test.yml
@ -435,7 +447,7 @@ jobs:
uses: ./.github/workflows/_linux-build-label.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
{ include: [

@ -41,7 +41,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3-gcc9-slow-gradcheck
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
{ include: [
@ -70,7 +70,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
{ include: [

@ -0,0 +1,30 @@
name: Sync Distributed Folder
on:
#push:
# branches:
# - 'main'
# paths:
# - 'torch/distributed/**'
workflow_dispatch:
pull_request:
paths:
- '.github/scripts/sync_distributed_folder_prototype.sh'
- '.github/workflows/sync_distributed_folder_prototype.yml'
env:
WITH_PUSH: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
permissions:
contents: write
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
sync:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: .github/scripts/sync_distributed_folder_prototype.sh

@ -26,7 +26,7 @@ jobs:
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
working-directory: pytorch
- name: Use following to pull public copy of the image

@ -16,7 +16,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [

@ -34,36 +34,39 @@ jobs:
id-token: write
contents: read
linux-focal-cuda12_1-py3_10-gcc9-build:
name: linux-focal-cuda12.1-py3.10-gcc9
uses: ./.github/workflows/_linux-build.yml
linux-focal-cuda12_4-py3_10-gcc9-sm86-build:
name: linux-focal-cuda12.4-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-build-label.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
{ include: [
{ config: "nogpu_AVX512", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_10-gcc9-test:
name: linux-focal-cuda12.1-py3.10-gcc9
linux-focal-cuda12_4-py3_10-gcc9-sm86-test:
name: linux-focal-cuda12.4-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-cuda12_1-py3_10-gcc9-build
- linux-focal-cuda12_4-py3_10-gcc9-sm86-build
- target-determination
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-build.outputs.test-matrix }}
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-sm86-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-sm86-build.outputs.test-matrix }}
libtorch-linux-focal-cuda12_1-py3_7-gcc9-debug-build:
name: libtorch-linux-focal-cuda12.1-py3.7-gcc9-debug
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: libtorch-linux-focal-cuda12.1-py3.7-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
build-generates-artifacts: false
runner: linux.4xlarge
test-matrix: |
@ -77,7 +80,32 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-no-ops
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
]}
libtorch-linux-focal-cuda12_4-py3_7-gcc9-debug-build:
name: libtorch-linux-focal-cuda12.4-py3.7-gcc9-debug
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: libtorch-linux-focal-cuda12.4-py3.7-gcc9
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
build-generates-artifacts: false
runner: linux.4xlarge
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
]}
# no-ops builds test USE_PER_OPERATOR_HEADERS=0 where ATen/ops is not generated
linux-focal-cuda12_4-py3_10-gcc9-no-ops-build:
name: linux-focal-cuda12.4-py3.10-gcc9-no-ops
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9-no-ops
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
@ -94,12 +122,12 @@ jobs:
{ config: "default", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
]}
macos-13-py3-arm64-build:
name: macos-13-py3-arm64
macos-py3-arm64-build:
name: macos-py3-arm64
uses: ./.github/workflows/_mac-build.yml
with:
sync-tag: macos-py3-arm64-build
build-environment: macos-13-py3-arm64
build-environment: macos-py3-arm64
runner-type: macos-m1-stable
build-generates-artifacts: true
# To match the one pre-installed in the m1 runners
@ -114,31 +142,30 @@ jobs:
macos-py3-arm64-mps-test:
name: macos-py3-arm64-mps
uses: ./.github/workflows/_mac-test-mps.yml
needs: macos-13-py3-arm64-build
if: needs.macos-13-py3-arm64-build.outputs.build-outcome == 'success'
needs: macos-py3-arm64-build
if: needs.macos-py3-arm64-build.outputs.build-outcome == 'success'
with:
sync-tag: macos-py3-arm64-mps-test
build-environment: macos-13-py3-arm64
build-environment: macos-py3-arm64
# Same as the build job
python-version: 3.9.12
test-matrix: |
{ include: [
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-stable" },
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-13" },
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-14" },
]}
macos-13-py3-arm64-test:
name: macos-13-py3-arm64
macos-py3-arm64-test:
name: macos-py3-arm64
uses: ./.github/workflows/_mac-test.yml
needs:
- macos-13-py3-arm64-build
- macos-py3-arm64-build
- target-determination
with:
build-environment: macos-13-py3-arm64
build-environment: macos-py3-arm64
# Same as the build job
python-version: 3.9.12
test-matrix: ${{ needs.macos-13-py3-arm64-build.outputs.test-matrix }}
test-matrix: ${{ needs.macos-py3-arm64-build.outputs.test-matrix }}
win-vs2019-cpu-py3-build:
name: win-vs2019-cpu-py3
@ -192,7 +219,9 @@ jobs:
sync-tag: rocm-build
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1, runner: "linux.rocm.gpu" },
{ config: "default", shard: 1, num_shards: 2, runner: "linux.rocm.gpu" },
{ config: "default", shard: 2, num_shards: 2, runner: "linux.rocm.gpu" },
{ config: "distributed", shard: 1, num_shards: 1, runner: "linux.rocm.gpu" },
]}
linux-focal-rocm6_1-py3_8-test:
@ -208,4 +237,4 @@ jobs:
build-environment: linux-focal-rocm6.1-py3.8
docker-image: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.test-matrix }}
tests-to-include: "test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs test_autograd inductor/test_torchinductor"
tests-to-include: "test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs test_autograd inductor/test_torchinductor distributed/test_c10d_common distributed/test_c10d_nccl"
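
The two `tests-to-include` values above are space-separated lists of test modules. A small Python sketch, assuming the first line is the old value and the second the new one (the diff markers are not shown in this view), makes the difference explicit. The names `old`, `new` and `added_tests` are illustrative and not part of the PR.

```python
# Old value, transcribed from the first tests-to-include line above.
old = ("test_nn test_torch test_cuda test_ops test_unary_ufuncs "
       "test_binary_ufuncs test_autograd inductor/test_torchinductor")
# New value: the same list with two distributed test modules appended.
new = old + " distributed/test_c10d_common distributed/test_c10d_nccl"


def added_tests(before, after):
    """Return entries present in `after` but not in `before`, in order."""
    known = set(before.split())
    return [name for name in after.split() if name not in known]


print(added_tests(old, new))
# ['distributed/test_c10d_common', 'distributed/test_c10d_nccl']
```

This lines up with the `distributed` config added to the ROCm test matrix in the same file.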

@ -32,174 +32,3 @@ jobs:
echo
echo "Once the jobs are deemed stable enough (% red signal < 5% and TTS < 3h),"
echo " they can graduate and move back to pull or trunk."
#
# Experimental ARC jobs
#
llm-td:
name: before-test
uses: ./.github/workflows/llm_td_retrieval.yml
permissions:
id-token: write
contents: read
target-determination:
name: before-test
uses: ./.github/workflows/target_determination.yml
needs: llm-td
permissions:
id-token: write
contents: read
linux-jammy-py3_8-gcc11-build:
name: linux-jammy-py3.8-gcc11
uses: ./.github/workflows/_linux-build-rg.yml
with:
build-environment: linux-jammy-py3.8-gcc11
docker-image-name: pytorch-linux-jammy-py3.8-gcc11
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 2, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 3, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "docs_test", shard: 1, num_shards: 1, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "backwards_compat", shard: 1, num_shards: 1, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "distributed", shard: 1, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "distributed", shard: 2, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
]}
linux-jammy-py3_8-gcc11-test:
name: linux-jammy-py3.8-gcc11
uses: ./.github/workflows/_linux-test-rg.yml
needs:
- linux-jammy-py3_8-gcc11-build
- target-determination
with:
build-environment: linux-jammy-py3.8-gcc11
docker-image: ${{ needs.linux-jammy-py3_8-gcc11-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-py3_8-gcc11-build.outputs.test-matrix }}
linux-jammy-py3_8-gcc11-no-ops:
name: linux-jammy-py3.8-gcc11-no-ops
uses: ./.github/workflows/_linux-build-rg.yml
with:
build-environment: linux-jammy-py3.8-gcc11-no-ops
docker-image-name: pytorch-linux-jammy-py3.8-gcc11
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
]}
linux-jammy-py3_8-gcc11-pch:
name: linux-jammy-py3.8-gcc11-pch
uses: ./.github/workflows/_linux-build-rg.yml
with:
build-environment: linux-jammy-py3.8-gcc11-pch
docker-image-name: pytorch-linux-jammy-py3.8-gcc11
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
]}
linux-focal-py3_8-clang10-onnx-build:
name: linux-focal-py3.8-clang10-onnx
uses: ./.github/workflows/_linux-build-rg.yml
with:
build-environment: linux-focal-py3.8-clang10-onnx
docker-image-name: pytorch-linux-focal-py3-clang10-onnx
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 2, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
]}
linux-focal-py3_8-clang10-onnx-test:
name: linux-focal-py3.8-clang10-onnx
uses: ./.github/workflows/_linux-test-rg.yml
needs:
- linux-focal-py3_8-clang10-onnx-build
- target-determination
with:
build-environment: linux-focal-py3.8-clang10-onnx
docker-image: ${{ needs.linux-focal-py3_8-clang10-onnx-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-py3_8-clang10-onnx-build.outputs.test-matrix }}
linux-jammy-py3_10-clang15-asan-build:
name: linux-jammy-py3.10-clang15-asan
uses: ./.github/workflows/_linux-build-rg.yml
with:
build-environment: linux-jammy-py3.10-clang15-asan
docker-image-name: pytorch-linux-jammy-py3-clang15-asan
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 6, runner: "linux.4xlarge" },
{ config: "default", shard: 2, num_shards: 6, runner: "linux.4xlarge" },
{ config: "default", shard: 3, num_shards: 6, runner: "linux.4xlarge" },
{ config: "default", shard: 4, num_shards: 6, runner: "linux.4xlarge" },
{ config: "default", shard: 5, num_shards: 6, runner: "linux.4xlarge" },
{ config: "default", shard: 6, num_shards: 6, runner: "linux.4xlarge" },
]}
sync-tag: asan-build-arc
linux-focal-py3_8-clang10-build:
name: linux-focal-py3.8-clang10
uses: ./.github/workflows/_linux-build-rg.yml
with:
build-environment: linux-focal-py3.8-clang10
docker-image-name: pytorch-linux-focal-py3.8-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 2, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 3, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
]}
linux-focal-py3_8-clang10-test:
name: linux-focal-py3.8-clang10
uses: ./.github/workflows/_linux-test-rg.yml
needs:
- linux-focal-py3_8-clang10-build
- target-determination
with:
build-environment: linux-focal-py3.8-clang10
docker-image: ${{ needs.linux-focal-py3_8-clang10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-py3_8-clang10-build.outputs.test-matrix }}
linux-focal-py3_11-clang10-build:
name: linux-focal-py3.11-clang10
uses: ./.github/workflows/_linux-build-rg.yml
with:
build-environment: linux-focal-py3.11-clang10
docker-image-name: pytorch-linux-focal-py3.11-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 2, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 3, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
]}
linux-focal-py3_11-clang10-test:
name: linux-focal-py3.11-clang10
uses: ./.github/workflows/_linux-test-rg.yml
needs:
- linux-focal-py3_11-clang10-build
- target-determination
with:
build-environment: linux-focal-py3.11-clang10
docker-image: ${{ needs.linux-focal-py3_11-clang10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-py3_11-clang10-build.outputs.test-matrix }}
#
# End of Experimental ARC jobs
#

@ -28,7 +28,7 @@ jobs:
- name: Install Python Packages
run: |
pip3 install rockset==1.0.3 boto3==1.19.12 requests==2.27.1
pip3 install rockset==1.0.3 boto3==1.19.12 requests==2.32.2
- name: Create alerts
run: |

@ -2,7 +2,7 @@ name: Upload test stats
on:
workflow_run:
workflows: [pull, trunk, periodic, inductor, unstable, slow, unstable-periodic, inductor-periodic, rocm]
workflows: [pull, trunk, periodic, inductor, unstable, slow, unstable-periodic, inductor-periodic, rocm, inductor-micro-benchmark]
types:
- completed
@ -47,9 +47,10 @@ jobs:
cache: pip
- run: |
pip3 install requests==2.26 rockset==1.0.3 boto3==1.19.12
pip3 install requests==2.32.2 rockset==1.0.3 boto3==1.19.12
- name: Upload test artifacts
id: upload-s3
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
@ -94,6 +95,19 @@ jobs:
# Analyze the results from disable tests rerun and upload them to S3
python3 -m tools.stats.check_disabled_tests --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}" --repo "${REPO_FULLNAME}"
- name: Upload gpt-fast benchmark results to Rockset
if: steps.upload-s3.outcome && steps.upload-s3.outcome == 'success' && github.event.workflow_run.name == 'inductor-micro-benchmark'
env:
ROCKSET_API_KEY: ${{ secrets.ROCKSET_API_KEY }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }}
WORKFLOW_RUN_ATTEMPT: ${{ github.event.workflow_run.run_attempt }}
REPO_FULLNAME: ${{ github.event.workflow_run.repository.full_name }}
HEAD_BRANCH: ${{ github.event.workflow_run.head_branch }}
run: |
python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}" --repo "${REPO_FULLNAME}" --head-branch "${HEAD_BRANCH}" --rockset-collection oss_ci_benchmark --rockset-workspace benchmarks --match-filename "^gpt_fast_benchmark"
check-api-rate:
if: ${{ always() && github.repository_owner == 'pytorch' }}
runs-on: ubuntu-latest

@ -40,7 +40,7 @@ jobs:
cache: pip
- run: |
pip3 install requests==2.26 rockset==1.0.3 boto3==1.19.12
pip3 install requests==2.32.2 rockset==1.0.3 boto3==1.19.12
- name: Upload torch dynamo performance stats to S3
id: upload-s3
@ -68,4 +68,4 @@ jobs:
REPO_FULLNAME: ${{ github.event.workflow_run.repository.full_name }}
HEAD_BRANCH: ${{ github.event.workflow_run.head_branch }}
run: |
python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}" --repo "${REPO_FULLNAME}" --head-branch "${HEAD_BRANCH}"
python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}" --repo "${REPO_FULLNAME}" --head-branch "${HEAD_BRANCH}" --rockset-collection torch_dynamo_perf_stats --rockset-workspace inductor --match-filename "^inductor_"
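
Review note: the new `--match-filename` values (`"^inductor_"` here, `"^gpt_fast_benchmark"` in the upload-test-stats workflow) read as regular expressions anchored at the start of an artifact name. How `tools.stats.upload_dynamo_perf_stats` actually applies the flag is not shown in this diff, so the following is only a minimal sketch of such a filter. The helper name `select_artifacts` and the sample file names are hypothetical.

```python
import re


def select_artifacts(filenames, match_filename):
    """Keep only the names that the regex matches from the start.

    Hypothetical helper; it illustrates an anchored-prefix filter and is
    not taken from the PyTorch tooling.
    """
    pattern = re.compile(match_filename)
    return [name for name in filenames if pattern.match(name)]


# Made-up artifact names, for illustration only.
artifacts = [
    "inductor_torchbench_perf.csv",
    "gpt_fast_benchmark.csv",
    "test-reports.zip",
]

print(select_artifacts(artifacts, "^inductor_"))
print(select_artifacts(artifacts, "^gpt_fast_benchmark"))
```

Under this reading, each workflow uploads a disjoint set of files, which would explain why each invocation also names its own Rockset collection and workspace.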

@ -0,0 +1,43 @@
name: Upload test stats intermediate
on:
workflow_dispatch:
inputs:
workflow_id:
description: workflow_id of the run
required: true
workflow_run_attempt:
description: workflow_run_attempt of the run
required: true
jobs:
intermediate_upload_test_stats:
name: Intermediate upload test stats for ${{ inputs.workflow_id }}
runs-on: ubuntu-22.04
environment: upload-stats
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
fetch-depth: 1
submodules: false
- uses: actions/setup-python@v4
with:
python-version: '3.11'
cache: pip
- run: |
pip3 install requests==2.32.2 rockset==1.0.3 boto3==1.19.12
- name: Upload test stats
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
WORKFLOW_RUN_ID: ${{ inputs.workflow_id }}
WORKFLOW_RUN_ATTEMPT: ${{ inputs.workflow_run_attempt }}
run: |
python3 -m tools.stats.upload_test_stats_intermediate \
--workflow-run-id "${WORKFLOW_RUN_ID}" \
--workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}" \

.gitmodules

@ -2,10 +2,6 @@
ignore = dirty
path = third_party/pybind11
url = https://github.com/pybind/pybind11.git
[submodule "third_party/cub"]
ignore = dirty
path = third_party/cub
url = https://github.com/NVlabs/cub.git
[submodule "third_party/eigen"]
ignore = dirty
path = third_party/eigen
@ -22,10 +18,6 @@
ignore = dirty
path = third_party/protobuf
url = https://github.com/protocolbuffers/protobuf.git
[submodule "third_party/ios-cmake"]
ignore = dirty
path = third_party/ios-cmake
url = https://github.com/Yangqing/ios-cmake.git
[submodule "third_party/NNPACK"]
ignore = dirty
path = third_party/NNPACK
@ -50,10 +42,6 @@
ignore = dirty
path = third_party/psimd
url = https://github.com/Maratyszcza/psimd.git
[submodule "third_party/zstd"]
ignore = dirty
path = third_party/zstd
url = https://github.com/facebook/zstd.git
[submodule "third_party/cpuinfo"]
ignore = dirty
path = third_party/cpuinfo
@ -66,10 +54,6 @@
ignore = dirty
path = third_party/onnx
url = https://github.com/onnx/onnx.git
[submodule "third_party/onnx-tensorrt"]
ignore = dirty
path = third_party/onnx-tensorrt
url = https://github.com/onnx/onnx-tensorrt
[submodule "third_party/sleef"]
ignore = dirty
path = third_party/sleef
@ -86,14 +70,6 @@
ignore = dirty
path = third_party/gemmlowp/gemmlowp
url = https://github.com/google/gemmlowp.git
[submodule "third_party/QNNPACK"]
ignore = dirty
path = third_party/QNNPACK
url = https://github.com/pytorch/QNNPACK
[submodule "third_party/neon2sse"]
ignore = dirty
path = third_party/neon2sse
url = https://github.com/intel/ARM_NEON_2_x86_SSE.git
[submodule "third_party/fbgemm"]
ignore = dirty
path = third_party/fbgemm
@ -102,10 +78,6 @@
ignore = dirty
path = third_party/foxi
url = https://github.com/houseroad/foxi.git
[submodule "third_party/tbb"]
path = third_party/tbb
url = https://github.com/01org/tbb
branch = tbb_2018
[submodule "android/libs/fbjni"]
ignore = dirty
path = android/libs/fbjni
@ -152,3 +124,7 @@
[submodule "third_party/opentelemetry-cpp"]
path = third_party/opentelemetry-cpp
url = https://github.com/open-telemetry/opentelemetry-cpp.git
[submodule "third_party/cpp-httplib"]
path = third_party/cpp-httplib
url = https://github.com/yhirose/cpp-httplib.git
branch = v0.15.3
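
Review note: a quick way to confirm which submodules remain after a change like this is to parse `.gitmodules` as an INI file. The sketch below uses Python's standard `configparser` on an inline sample that copies two entries from this diff. The helper name `list_submodules` is hypothetical, and the sketch assumes every entry has `path` and `url` keys.

```python
import configparser

# Two entries copied from the diff above; a real check would read the file.
SAMPLE = """\
[submodule "third_party/cpp-httplib"]
path = third_party/cpp-httplib
url = https://github.com/yhirose/cpp-httplib.git
branch = v0.15.3
[submodule "third_party/pybind11"]
ignore = dirty
path = third_party/pybind11
url = https://github.com/pybind/pybind11.git
"""


def list_submodules(text):
    """Map each submodule path to a (url, branch) pair; branch may be None."""
    # Interpolation is disabled so a literal '%' in a URL cannot raise.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    result = {}
    for section in parser.sections():
        entry = parser[section]
        result[entry["path"]] = (entry["url"], entry.get("branch"))
    return result


for path, (url, branch) in list_submodules(SAMPLE).items():
    print(path, url, branch)
```

To inspect a checkout, replace `SAMPLE` with the contents of the repository's `.gitmodules` file.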

@ -193,6 +193,7 @@ include_patterns = [
'aten/src/ATen/*.cpp',
'aten/src/ATen/core/*.h',
'aten/src/ATen/core/*.cpp',
'aten/src/ATen/detail/*',
'aten/src/ATen/functorch/*.h',
'aten/src/ATen/functorch/*.cpp',
'c10/**/*.cpp',
@ -234,7 +235,6 @@ exclude_patterns = [
'torch/csrc/jit/serialization/import_legacy.cpp',
'torch/csrc/jit/serialization/export.cpp',
'torch/csrc/lazy/**/*',
'torch/csrc/onnx/init.cpp',
'torch/csrc/mps/**/*',
]
init_command = [
@ -1052,19 +1052,18 @@ exclude_patterns = [
'test/quantization/fx/test_numeric_suite_fx.py',
'test/quantization/fx/test_quantize_fx.py',
'test/quantization/fx/test_subgraph_rewriter.py',
'test/test_datapipe.py',
'test/test_fake_tensor.py',
'test/test_flop_counter.py',
'test/test_function_schema.py',
'test/test_functional_autograd_benchmark.py',
'test/test_functional_optim.py',
'test/test_functionalization_of_rng_ops.py',
'test/test_datapipe.py',
'test/test_futures.py',
'test/test_fx.py',
'test/test_fx_experimental.py',
'test/test_fx_passes.py',
'test/test_fx_reinplace_pass.py',
'test/test_hub.py',
'test/test_import_stats.py',
'test/test_itt.py',
'test/test_jit.py',
@ -1073,7 +1072,6 @@ exclude_patterns = [
'test/test_jit_disabled.py',
'test/test_jit_fuser.py',
'test/test_jit_fuser_legacy.py',
'test/test_jit_fuser_te.py',
'test/test_jit_legacy.py',
'test/test_jit_llga_fuser.py',
'test/test_jit_profiling.py',
@ -1081,9 +1079,7 @@ exclude_patterns = [
'test/test_jit_string.py',
'test/test_jiterator.py',
'test/test_kernel_launch_checks.py',
'test/test_license.py',
'test/test_linalg.py',
'test/test_logging.py',
'test/test_masked.py',
'test/test_maskedtensor.py',
'test/test_matmul_cuda.py',
@ -1105,13 +1101,6 @@ exclude_patterns = [
'test/test_native_mha.py',
'test/test_nestedtensor.py',
'test/test_nn.py',
'test/test_nnapi.py',
'test/test_numba_integration.py',
'test/test_numpy_interop.py',
'test/test_nvfuser_dynamo.py',
'test/test_nvfuser_frontend.py',
'test/test_openmp.py',
'test/test_optim.py',
'test/test_out_dtype_op.py',
'test/test_overrides.py',
'test/test_prims.py',
@ -1125,9 +1114,6 @@ exclude_patterns = [
'test/test_segment_reductions.py',
'test/test_serialization.py',
'test/test_set_default_mobile_cpu_allocator.py',
'test/test_shape_ops.py',
'test/test_show_pickle.py',
'test/test_sort_and_select.py',
'test/test_sparse.py',
'test/test_sparse_csr.py',
'test/test_sparse_semi_structured.py',
@ -1141,30 +1127,13 @@ exclude_patterns = [
'test/test_tensorexpr.py',
'test/test_tensorexpr_pybind.py',
'test/test_testing.py',
'test/test_throughput_benchmark.py',
'test/test_torch.py',
'test/test_transformers.py',
'test/test_type_hints.py',
'test/test_type_info.py',
'test/test_type_promotion.py',
'test/test_unary_ufuncs.py',
'test/test_utils.py',
'test/test_vulkan.py',
'test/test_xnnpack_integration.py',
'test/torch_np/numpy_test/**/*.py',
'test/typing/fail/bitwise_ops.py',
'test/typing/fail/creation_ops.py',
'test/typing/fail/random.py',
'test/typing/pass/creation_ops.py',
'test/typing/pass/math_ops.py',
'test/typing/reveal/module_list.py',
'test/typing/reveal/namedtuple.py',
'test/typing/reveal/opt_size.py',
'test/typing/reveal/size.py',
'test/typing/reveal/tensor_constructors.py',
'test/typing/reveal/tensor_copy.py',
'test/typing/reveal/tensor_sampling.py',
'test/typing/reveal/torch_optim.py',
'torch/_awaits/__init__.py',
'torch/_custom_op/__init__.py',
'torch/_custom_op/autograd.py',
@ -1563,28 +1532,6 @@ exclude_patterns = [
'torch/distributed/optim/post_localSGD_optimizer.py',
'torch/distributed/optim/utils.py',
'torch/distributed/optim/zero_redundancy_optimizer.py',
'torch/distributed/pipeline/__init__.py',
'torch/distributed/pipeline/sync/__init__.py',
'torch/distributed/pipeline/sync/_balance/__init__.py',
'torch/distributed/pipeline/sync/_balance/blockpartition.py',
'torch/distributed/pipeline/sync/_balance/profile.py',
'torch/distributed/pipeline/sync/batchnorm.py',
'torch/distributed/pipeline/sync/checkpoint.py',
'torch/distributed/pipeline/sync/copy.py',
'torch/distributed/pipeline/sync/dependency.py',
'torch/distributed/pipeline/sync/microbatch.py',
'torch/distributed/pipeline/sync/phony.py',
'torch/distributed/pipeline/sync/pipe.py',
'torch/distributed/pipeline/sync/pipeline.py',
'torch/distributed/pipeline/sync/skip/__init__.py',
'torch/distributed/pipeline/sync/skip/layout.py',
'torch/distributed/pipeline/sync/skip/namespace.py',
'torch/distributed/pipeline/sync/skip/portal.py',
'torch/distributed/pipeline/sync/skip/skippable.py',
'torch/distributed/pipeline/sync/skip/tracker.py',
'torch/distributed/pipeline/sync/stream.py',
'torch/distributed/pipeline/sync/utils.py',
'torch/distributed/pipeline/sync/worker.py',
'torch/distributed/remote_device.py',
'torch/distributed/rendezvous.py',
'torch/distributed/rpc/__init__.py',
@ -1609,7 +1556,6 @@ exclude_patterns = [
'torch/distributed/tensor/parallel/input_reshard.py',
'torch/distributed/tensor/parallel/multihead_attention_tp.py',
'torch/distributed/tensor/parallel/style.py',
'torch/distributed/utils.py',
'torch/fft/__init__.py',
'torch/func/__init__.py',
'torch/functional.py',
@ -1701,18 +1647,6 @@ exclude_patterns = [
'torch/hub.py',
'torch/library.py',
'torch/linalg/__init__.py',
# UFMT causes import cycle on masked
'torch/masked/__init__.py',
'torch/masked/_docs.py',
'torch/masked/_ops.py',
'torch/masked/maskedtensor/__init__.py',
'torch/masked/maskedtensor/_ops_refs.py',
'torch/masked/maskedtensor/binary.py',
'torch/masked/maskedtensor/core.py',
'torch/masked/maskedtensor/creation.py',
'torch/masked/maskedtensor/passthrough.py',
'torch/masked/maskedtensor/reductions.py',
'torch/masked/maskedtensor/unary.py',
'torch/monitor/__init__.py',
'torch/nested/__init__.py',
'torch/nn/__init__.py',
@ -1891,8 +1825,6 @@ exclude_patterns = [
'torch/testing/_internal/distributed/nn/__init__.py',
'torch/testing/_internal/distributed/nn/api/__init__.py',
'torch/testing/_internal/distributed/nn/api/remote_module_test.py',
'torch/testing/_internal/distributed/pipe_with_ddp_test.py',
'torch/testing/_internal/distributed/pipeline/__init__.py',
'torch/testing/_internal/distributed/rpc/__init__.py',
'torch/testing/_internal/distributed/rpc/dist_autograd_test.py',
'torch/testing/_internal/distributed/rpc/dist_optimizer_test.py',
@ -1935,8 +1867,6 @@ exclude_patterns = [
'torch/utils/_mode_utils.py',
'torch/utils/_python_dispatch.py',
'torch/utils/_stats.py',
'torch/utils/_sympy/__init__.py',
'torch/utils/_sympy/functions.py',
'torch/utils/_traceback.py',
'torch/utils/_zip.py',
'torch/utils/backcompat/__init__.py',
@ -2149,7 +2079,7 @@ init_command = [
'python3',
'tools/linter/adapters/pip_init.py',
'--dry-run={{DRYRUN}}',
'ruff==0.4.1',
'ruff==0.4.8',
]
is_formatter = true

@ -125,10 +125,6 @@ filegroup(
data = [":generate-code"],
)
exports_files(
srcs = ["aten/src/ATen/cpu/tbb/extra/version_string.ver.in"],
)
# ATen
filegroup(
name = "aten_base_cpp",
@ -275,7 +271,6 @@ header_template_rule(
"@AT_BUILD_WITH_LAPACK@": "1",
"@AT_PARALLEL_OPENMP@": "0",
"@AT_PARALLEL_NATIVE@": "1",
"@AT_PARALLEL_NATIVE_TBB@": "0",
"@AT_BLAS_F2C@": "0",
"@AT_BLAS_USE_CBLAS_DOT@": "1",
},
@ -359,6 +354,9 @@ cc_library(
":aten_src_ATen_config",
] + generated_cpu_cpp + aten_ufunc_generated_cpu_sources("aten/src/ATen/{}"),
copts = ATEN_COPTS,
linkopts = [
"-ldl",
],
data = if_cuda(
[":libcaffe2_nvrtc.so"],
[],
@ -456,65 +454,15 @@ CAFFE2_COPTS = COMMON_COPTS + [
filegroup(
name = "caffe2_core_srcs",
srcs = [
"caffe2/core/allocator.cc",
"caffe2/core/blob_serialization.cc",
"caffe2/core/blob_stats.cc",
"caffe2/core/common.cc",
"caffe2/core/context.cc",
"caffe2/core/context_base.cc",
"caffe2/core/db.cc",
"caffe2/core/event.cc",
"caffe2/core/export_c10_op_to_caffe2.cc",
"caffe2/core/graph.cc",
"caffe2/core/init.cc",
"caffe2/core/init_denormals.cc",
"caffe2/core/init_intrinsics_check.cc",
"caffe2/core/init_omp.cc",
"caffe2/core/int8_serialization.cc",
"caffe2/core/memonger.cc",
"caffe2/core/module.cc",
"caffe2/core/net.cc",
"caffe2/core/net_async_base.cc",
"caffe2/core/net_async_scheduling.cc",
"caffe2/core/net_async_task.cc",
"caffe2/core/net_async_task_future.cc",
"caffe2/core/net_async_task_graph.cc",
"caffe2/core/net_async_tracing.cc",
"caffe2/core/net_dag_utils.cc",
"caffe2/core/net_parallel.cc",
"caffe2/core/net_simple.cc",
"caffe2/core/net_simple_refcount.cc",
"caffe2/core/nomnigraph/Representations/NeuralNet.cc",
"caffe2/core/nomnigraph/tests/test_util.cc",
"caffe2/core/numa.cc",
"caffe2/core/operator.cc",
"caffe2/core/operator_schema.cc",
"caffe2/core/plan_executor.cc",
"caffe2/core/prof_dag_counters.cc",
"caffe2/core/qtensor.cc",
"caffe2/core/qtensor_serialization.cc",
"caffe2/core/stats.cc",
"caffe2/core/tensor.cc",
"caffe2/core/tensor_int8.cc",
"caffe2/core/test_utils.cc",
"caffe2/core/transform.cc",
"caffe2/core/types.cc",
"caffe2/core/workspace.cc",
],
)
filegroup(
name = "caffe2_perfkernels_srcs",
srcs = [
"caffe2/perfkernels/adagrad.cc",
"caffe2/perfkernels/embedding_lookup.cc",
"caffe2/perfkernels/embedding_lookup_idx.cc",
"caffe2/perfkernels/fused_8bit_rowwise_embedding_lookup.cc",
"caffe2/perfkernels/fused_8bit_rowwise_embedding_lookup_idx.cc",
"caffe2/perfkernels/fused_nbit_rowwise_conversion.cc",
"caffe2/perfkernels/lstm_unit_cpu_common.cc",
"caffe2/perfkernels/math_cpu_base.cc",
"caffe2/perfkernels/typed_axpy.cc",
],
)
@ -532,19 +480,7 @@ filegroup(
filegroup(
name = "caffe2_utils_srcs",
srcs = [
"caffe2/utils/bench_utils.cc",
"caffe2/utils/cpuid.cc",
"caffe2/utils/math/broadcast.cc",
"caffe2/utils/math/elementwise.cc",
"caffe2/utils/math/reduce.cc",
"caffe2/utils/math/transpose.cc",
"caffe2/utils/math/utils.cc",
"caffe2/utils/math_cpu.cc",
"caffe2/utils/murmur_hash3.cc",
"caffe2/utils/proto_utils.cc",
"caffe2/utils/proto_wrap.cc",
"caffe2/utils/signal_handler.cc",
"caffe2/utils/smart_tensor_printer.cc",
"caffe2/utils/string_utils.cc",
"caffe2/utils/threadpool/ThreadPool.cc",
"caffe2/utils/threadpool/pthreadpool.cc",
@ -562,12 +498,9 @@ cc_library(
name = "caffe2_for_aten_headers",
hdrs = [
"caffe2/core/common.h",
"caffe2/core/logging.h",
"caffe2/core/types.h",
"caffe2/perfkernels/common.h",
"caffe2/perfkernels/embedding_lookup.h",
"caffe2/perfkernels/embedding_lookup_idx.h",
"caffe2/utils/cpuid.h",
"caffe2/utils/fixed_divisor.h",
] + glob([
"caffe2/utils/threadpool/*.h",
@ -577,7 +510,6 @@ cc_library(
deps = [
":caffe2_core_macros",
"//c10",
"//caffe2/proto:caffe2_pb",
],
)
@ -585,18 +517,9 @@ cc_library(
name = "caffe2_headers",
hdrs = glob(
[
"caffe2/core/*.h",
"caffe2/core/nomnigraph/include/nomnigraph/Converters/*.h",
"caffe2/core/nomnigraph/include/nomnigraph/Generated/*.h",
"caffe2/core/nomnigraph/include/nomnigraph/Graph/*.h",
"caffe2/core/nomnigraph/include/nomnigraph/Representations/*.h",
"caffe2/core/nomnigraph/include/nomnigraph/Support/*.h",
"caffe2/core/nomnigraph/include/nomnigraph/Transformations/*.h",
"caffe2/core/nomnigraph/tests/*.h",
"caffe2/perfkernels/*.h",
"caffe2/serialize/*.h",
"caffe2/utils/*.h",
"caffe2/utils/math/*.h",
"caffe2/utils/threadpool/*.h",
"modules/**/*.h",
],
@ -605,18 +528,12 @@ cc_library(
],
) + if_cuda(glob([
"caffe2/**/*.cuh",
"caffe2/image/*.h",
])),
copts = CAFFE2_COPTS,
includes = [
"caffe2/core/nomnigraph/include",
],
visibility = ["//visibility:public"],
deps = [
":caffe2_core_macros",
":caffe2_for_aten_headers",
"//caffe2/proto:caffe2_pb",
"//caffe2/proto:cc_proto",
],
)
@ -637,8 +554,6 @@ cc_library(
":caffe2_perfkernels_avx",
":caffe2_perfkernels_avx2",
":caffe2_perfkernels_avx512",
"//caffe2/proto:caffe2_pb",
"//caffe2/proto:cc_proto",
"//third_party/miniz-2.1.0:miniz",
"@com_google_protobuf//:protobuf",
"@eigen",
@ -663,6 +578,7 @@ cu_library(
name = "torch_cuda",
srcs = [
"torch/csrc/distributed/c10d/intra_node_comm.cu",
"torch/csrc/distributed/c10d/Utils.cu",
"torch/csrc/distributed/c10d/quantization/quantization_gpu.cu",
],
copts = torch_cuda_half_options,
@ -771,7 +687,7 @@ cc_library(
[
"torch/*.h",
"torch/csrc/**/*.h",
"torch/csrc/distributed/c10d/*.hpp",
"torch/csrc/distributed/c10d/**/*.hpp",
"torch/lib/libshm/*.h",
],
exclude = [
@ -830,10 +746,14 @@ cc_library(
"torch/csrc/cuda/python_nccl.cpp",
"torch/csrc/cuda/nccl.cpp",
"torch/csrc/distributed/c10d/intra_node_comm.cu",
"torch/csrc/distributed/c10d/Utils.cu",
"torch/csrc/distributed/c10d/quantization/quantization_gpu.cu",
],
)) + torch_sources,
copts = TORCH_COPTS,
linkopts = [
"-lrt",
],
defines = [
"CAFFE2_NIGHTLY_VERSION=20200115",
],
@ -841,8 +761,8 @@ cc_library(
deps = [
":caffe2",
":torch_headers",
"//caffe2/proto:torch_cc_proto",
"@kineto",
"@cpp-httplib",
] + if_cuda([
"@cuda//:nvToolsExt",
"@cutlass",
@ -854,6 +774,9 @@ cc_library(
cc_library(
name = "shm",
srcs = glob(["torch/lib/libshm/*.cpp"]),
linkopts = [
"-lrt",
],
deps = [
":torch",
],

File diff suppressed because it is too large

@ -667,7 +667,6 @@ only interested in a specific component.
- Working on a test binary? Run `(cd build && ninja bin/test_binary_name)` to
rebuild only that test binary (without rerunning cmake). (Replace `ninja` with
`make` if you don't have ninja installed).
- Don't need Caffe2? Pass `BUILD_CAFFE2=0` to disable Caffe2 build.
On the initial build, you can also speed things up with the environment
variables `DEBUG`, `USE_DISTRIBUTED`, `USE_MKLDNN`, `USE_CUDA`, `USE_FLASH_ATTENTION`, `USE_MEM_EFF_ATTENTION`, `BUILD_TEST`, `USE_FBGEMM`, `USE_NNPACK` and `USE_QNNPACK`.
@ -790,7 +789,7 @@ USE_PRECOMPILED_HEADERS=1 python setup.py develop
```
This adds a build step where the compiler takes `<ATen/ATen.h>` and essentially
dumps it's internal AST to a file so the compiler can avoid repeating itself for
dumps its internal AST to a file so the compiler can avoid repeating itself for
every `.cpp` file.
One caveat is that when enabled, this header gets included in every file by default.
@ -1196,7 +1195,7 @@ build_with_asan()
LDFLAGS="-stdlib=libstdc++" \
CFLAGS="-fsanitize=address -fno-sanitize-recover=all -shared-libasan -pthread" \
CXX_FLAGS="-pthread" \
USE_CUDA=0 USE_OPENMP=0 BUILD_CAFFE2_OPS=0 USE_DISTRIBUTED=0 DEBUG=1 \
USE_CUDA=0 USE_OPENMP=0 USE_DISTRIBUTED=0 DEBUG=1 \
python setup.py develop
}
@ -1321,7 +1320,7 @@ There are two possible choices for which commit to use:
1. Checkout commit `B`, the head of the PR (manually committed by the PR
author).
2. Checkout commit `C`, the hypothetical result of what would happen if the PR
were merged into it's destination (usually `main`).
were merged into its destination (usually `main`).
For all practical purposes, most people can think of the commit being used as
commit `B` (choice **1**).

@ -1,4 +1,4 @@
![PyTorch Logo](https://github.com/pytorch/pytorch/blob/main/docs/source/_static/img/pytorch-logo-dark.png)
![PyTorch Logo](https://github.com/pytorch/pytorch/raw/main/docs/source/_static/img/pytorch-logo-dark.png)
--------------------------------------------------------------------------------
@ -24,6 +24,9 @@ Our trunk health (Continuous Integration signals) can be found at [hud.pytorch.o
- [NVIDIA Jetson Platforms](#nvidia-jetson-platforms)
- [From Source](#from-source)
- [Prerequisites](#prerequisites)
- [NVIDIA CUDA Support](#nvidia-cuda-support)
- [AMD ROCm Support](#amd-rocm-support)
- [Intel GPU Support](#intel-gpu-support)
- [Install Dependencies](#install-dependencies)
- [Get the PyTorch Source](#get-the-pytorch-source)
- [Install PyTorch](#install-pytorch)
@ -95,7 +98,7 @@ from several research papers on this topic, as well as current and past work suc
While this technique is not unique to PyTorch, it's one of the fastest implementations of it to date.
You get the best of speed and flexibility for your crazy research.
![Dynamic graph](https://github.com/pytorch/pytorch/blob/main/docs/source/_static/img/dynamic_graph.gif)
![Dynamic graph](https://github.com/pytorch/pytorch/raw/main/docs/source/_static/img/dynamic_graph.gif)
### Python First
@ -162,6 +165,7 @@ If you are installing from source, you will need:
We highly recommend installing an [Anaconda](https://www.anaconda.com/download) environment. You will get a high-quality BLAS library (MKL) and you get controlled dependency versions regardless of your Linux distro.
##### NVIDIA CUDA Support
If you want to compile with CUDA support, [select a supported version of CUDA from our support matrix](https://pytorch.org/get-started/locally/), then install the following:
- [NVIDIA CUDA](https://developer.nvidia.com/cuda-downloads)
- [NVIDIA cuDNN](https://developer.nvidia.com/cudnn) v8.5 or above
@ -174,6 +178,7 @@ Other potentially useful environment variables may be found in `setup.py`.
If you are building for NVIDIA's Jetson platforms (Jetson Nano, TX1, TX2, AGX Xavier), Instructions to install PyTorch for Jetson Nano are [available here](https://devtalk.nvidia.com/default/topic/1049071/jetson-nano/pytorch-for-jetson-nano/)
##### AMD ROCm Support
If you want to compile with ROCm support, install
- [AMD ROCm](https://rocm.docs.amd.com/en/latest/deploy/linux/quick_start.html) 4.0 and above installation
- ROCm is currently supported only for Linux systems.
@ -181,6 +186,14 @@ If you want to compile with ROCm support, install
If you want to disable ROCm support, export the environment variable `USE_ROCM=0`.
Other potentially useful environment variables may be found in `setup.py`.
##### Intel GPU Support
If you want to compile with Intel GPU support, follow these
- [PyTorch Prerequisites for Intel GPUs](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html) instructions.
- Intel GPU is supported for Linux and Windows.
If you want to disable Intel GPU support, export the environment variable `USE_XPU=0`.
Other potentially useful environment variables may be found in `setup.py`.
#### Install Dependencies
**Common**
@ -196,10 +209,11 @@ pip install -r requirements.txt
```bash
conda install intel::mkl-static intel::mkl-include
# CUDA only: Add LAPACK support for the GPU if needed
conda install -c pytorch magma-cuda110 # or the magma-cuda* that matches your CUDA version from https://anaconda.org/pytorch/repo
conda install -c pytorch magma-cuda121 # or the magma-cuda* that matches your CUDA version from https://anaconda.org/pytorch/repo
# (optional) If using torch.compile with inductor/triton, install the matching version of triton
# Run from the pytorch directory after cloning
# For Intel GPU support, please explicitly `export USE_XPU=1` before running command.
make triton
```
@ -379,7 +393,7 @@ You can also pass the `CMAKE_VARS="..."` environment variable to specify additio
See [setup.py](./setup.py) for the list of available variables.
```bash
CMAKE_VARS="BUILD_CAFFE2=ON BUILD_CAFFE2_OPS=ON" make -f docker.Makefile
make -f docker.Makefile
```
### Building the Documentation

@ -37,6 +37,7 @@
- [TL;DR](#tldr)
- [Accelerator Software](#accelerator-software)
- [Special support cases](#special-support-cases)
- [Operating Systems](#operating-systems)
- [Submitting Tutorials](#submitting-tutorials)
- [Special Topics](#special-topics)
- [Updating submodules for a release](#updating-submodules-for-a-release)
@ -48,14 +49,14 @@
Following is the Release Compatibility Matrix for PyTorch releases:
| PyTorch version | Python | Stable CUDA | Experimental CUDA |
| --- | --- | --- | --- |
| 2.3 | >=3.8, <=3.11, (3.12 experimental) | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
| 2.2 | >=3.8, <=3.11, (3.12 experimental) | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
| 2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
| 2.0 | >=3.8, <=3.11 | CUDA 11.7, CUDNN 8.5.0.96 | CUDA 11.8, CUDNN 8.7.0.84 |
| 1.13 | >=3.7, <=3.10 | CUDA 11.6, CUDNN 8.3.2.44 | CUDA 11.7, CUDNN 8.5.0.96 |
| 1.12 | >=3.7, <=3.10 | CUDA 11.3, CUDNN 8.3.2.44 | CUDA 11.6, CUDNN 8.3.2.44 |
| PyTorch version | Python | Stable CUDA | Experimental CUDA | Stable ROCm |
| --- | --- | --- | --- | --- |
| 2.3 | >=3.8, <=3.11, (3.12 experimental) | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 | ROCm 6.0 |
| 2.2 | >=3.8, <=3.11, (3.12 experimental) | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 | ROCm 5.7 |
| 2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 | ROCm 5.6 |
| 2.0 | >=3.8, <=3.11 | CUDA 11.7, CUDNN 8.5.0.96 | CUDA 11.8, CUDNN 8.7.0.84 | ROCm 5.4 |
| 1.13 | >=3.7, <=3.10 | CUDA 11.6, CUDNN 8.3.2.44 | CUDA 11.7, CUDNN 8.5.0.96 | ROCm 5.2 |
| 1.12 | >=3.7, <=3.10 | CUDA 11.3, CUDNN 8.3.2.44 | CUDA 11.6, CUDNN 8.3.2.44 | ROCm 5.0 |
## Release Cadence
@ -262,7 +263,7 @@ requires `pytorchbot`, so it's only available in PyTorch atm.
### Cherry Picking Reverts
If PR that has been cherry-picked into release branch has been reverted, it's cherry-pick must be reverted as well.
If PR that has been cherry-picked into release branch has been reverted, its cherry-pick must be reverted as well.
Reverts for changes that was committed into the main branch prior to the branch cut, must be propagated into release branch as well.
@ -426,6 +427,15 @@ the size restrictions for publishing on PyPI so the default version that is publ
These special support cases will be handled on a case by case basis and support may be continued if current PyTorch maintainers feel as though there may still be a
need to support these particular versions of software.
## Operating Systems
Supported OS flavors are summarized in the table below:
| Operating System family | Architecture | Notes |
| --- | --- | --- |
| Linux | aarch64, x86_64 | Wheels are manylinux2014 compatible, i.e. they should be runnable on any Linux system with glibc-2.17 or above. |
| MacOS | arm64 | Builds should be compatible with MacOS 11 (Big Sur) or newer, but are actively tested against MacOS 14 (Sonoma). |
| MacOS | x86_64 | Requires MacOS Catalina or above, not supported after 2.2, see https://github.com/pytorch/pytorch/issues/114602 |
| Windows | x86_64 | Builds are compatible with Windows 10 or newer. |
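
As a rough illustration of the glibc-2.17 floor quoted in the Linux row, the following sketch compares a glibc version string against that minimum. The helper names `parse_version` and `glibc_is_compatible` are hypothetical; only `platform.libc_ver()` comes from the Python standard library, and it may return empty strings on systems that do not use glibc.

```python
import platform

# manylinux2014 baseline quoted in the table above.
MIN_GLIBC = (2, 17)


def parse_version(text):
    """Turn a string such as '2.31' into the tuple (2, 31)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def glibc_is_compatible(version_text):
    """Return True if the given glibc version meets the manylinux2014 floor."""
    return parse_version(version_text) >= MIN_GLIBC


if __name__ == "__main__":
    libc_name, libc_version = platform.libc_ver()
    if libc_name == "glibc" and libc_version:
        print("glibc", libc_version, "compatible:", glibc_is_compatible(libc_version))
    else:
        # Non-glibc platforms (macOS, Windows, musl-based Linux) land here.
        print("glibc not detected; this check does not apply")
```

Comparing tuples rather than strings matters here: as text, "2.9" sorts after "2.17", whereas `(2, 9) < (2, 17)` gives the intended ordering.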
# Submitting Tutorials
Tutorials in support of a release feature must be submitted to the [pytorch/tutorials](https://github.com/pytorch/tutorials) repo at least two weeks before the release date to allow for editorial and technical review. There is no cherry-pick process for tutorials. All tutorials will be merged around the release day and published at [pytorch.org/tutorials](https://pytorch.org/tutorials/).


@ -5,6 +5,7 @@
- [Untrusted models](#untrusted-models)
- [Untrusted inputs](#untrusted-inputs)
- [Data privacy](#data-privacy)
- [Using distributed features](#using-distributed-features)
## Reporting Security Issues
@ -39,7 +40,7 @@ Important Note: The trustworthiness of a model is not binary. You must always de
### Untrusted inputs during training and prediction
If you plan to open your model to untrusted inputs, be aware that inputs can also be used as vectors by malicious agents. To minimize risks, make sure to give your model only the permisisons strictly required, and keep your libraries updated with the lates security patches.
If you plan to open your model to untrusted inputs, be aware that inputs can also be used as vectors by malicious agents. To minimize risks, make sure to give your model only the permissions strictly required, and keep your libraries updated with the latest security patches.
If applicable, prepare your model against bad inputs and prompt injections. Some recommendations:
- Pre-analysis: check how the model performs by default when exposed to prompt injection (e.g. using fuzzing for prompt injection).
@ -54,3 +55,9 @@ If applicable, prepare your model against bad inputs and prompt injections. Some
**Take special security measures if you train models with sensitive data**. Prioritize [sandboxing](https://developers.google.com/code-sandboxing) your models and:
- Do not feed sensitive data to an untrusted model (even if it runs in a sandboxed environment)
- If you consider publishing a model that was partially trained with sensitive data, be aware that data can potentially be recovered from the trained weights (especially if the model overfits).
### Using distributed features
PyTorch can be used for distributed computing, and as such there is a `torch.distributed` package. PyTorch Distributed features are intended for internal communication only. They are not built for use in untrusted environments or networks.
For performance reasons, none of the PyTorch Distributed primitives (including c10d, RPC, and TCPStore) include any authorization protocol and will send messages unencrypted. They accept connections from anywhere, and execute the workload sent without performing any checks. Therefore, if you run a PyTorch Distributed program on your network, anybody with access to the network can execute arbitrary code with the privileges of the user running PyTorch.
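
One way to catch an accidental exposure before a job starts is a pre-flight check on the rendezvous address. The sketch below is an assumption-laden illustration, not a PyTorch feature: `is_internal_address` and `check_master_addr` are hypothetical helpers, and the only PyTorch-specific detail is the `MASTER_ADDR` environment variable used by the `env://` initialization method. It adds no authentication or encryption; it only flags addresses that are clearly not loopback or private-range.

```python
import ipaddress
import os


def is_internal_address(addr):
    """Return True only for loopback or private-range literal IP addresses."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # Hostnames are not resolved here, so they are treated as unverified.
        return False
    if ip.is_unspecified:
        # 0.0.0.0 and :: mean "all interfaces", which is what we want to avoid.
        return False
    return ip.is_loopback or ip.is_private


def check_master_addr(environ=os.environ):
    """Raise if MASTER_ADDR is missing or is not a loopback/private literal IP."""
    addr = environ.get("MASTER_ADDR", "")
    if not is_internal_address(addr):
        raise RuntimeError(
            f"MASTER_ADDR={addr!r} is not a loopback or private address; "
            "PyTorch Distributed should only run on trusted networks"
        )
    return addr
```

A launcher script could call `check_master_addr()` before `torch.distributed.init_process_group`. A private address is still reachable by everyone on that network, so this complements network isolation rather than replacing it.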


@ -168,14 +168,10 @@ new_local_repository(
path = "third_party/opentelemetry-cpp",
)
new_patched_local_repository(
name = "tbb",
build_file = "//third_party:tbb.BUILD",
patch_strip = 1,
patches = [
"@//third_party:tbb.patch",
],
path = "third_party/tbb",
new_local_repository(
name = "cpp-httplib",
build_file = "//third_party:cpp-httplib.BUILD",
path = "third_party/cpp-httplib",
)
new_local_repository(
@ -355,9 +351,4 @@ local_repository(
path = "third_party/onnx/third_party/benchmark",
)
local_repository(
name = "unused_onnx_tensorrt_benchmark",
path = "third_party/onnx-tensorrt/third_party/onnx/third_party/benchmark",
)
### Unused repos end


@ -1,6 +1,7 @@
import torch
from torchvision import models
import torch
print(torch.version.__version__)
resnet18 = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

Some files were not shown because too many files have changed in this diff