By introducing `DynamicMetalShaderLibrary` and `MetalShaderFunction`
Add unittests that also serves as an example of how API works
Using this primitive, one can compile and dispatch any 1D or 2D shader over MPS tensor using the following pattern
```cpp
auto x = torch::empty({8, 16}, at::device(at::kMPS));
DynamicMetalShaderLibrary lib(R"MTL(
kernel void full(device float* t, constant ulong2& strides, uint2 idx [[thread_position_in_grid]]) {
t[idx.x*strides.x + idx.y*strides.y] = idx.x + 33.0 * idx.y;
}
)MTL");
auto func = lib.getKernelFunction("full");
func->runCommandBlock([&] {
func->startEncoding();
func->setArg(0, x);
func->setArg(1, x.strides());
func->dispatch({8, 16});
});
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141547
Approved by: https://github.com/Skylion007
Fixes https://github.com/pytorch/pytorch/issues/141332
`F.logsigmoid` will return two outputs: `output` and `buffer`.
For `F.logsigmoid` cpu path, it will use buffer to store some intermediate values and use them when computing gradients, so it returns a `buffer` tensor with nonzero size. For cuda and xpu paths, buffer is useless, so the `buffer ` tensor size of xpu `F.logsigmoid` will be zero, just like cuda. The root cause of the issue is that the codes in `decompositions.py` (ref:https://github.com/pytorch/pytorch/blob/main/torch/_decomp/decompositions.py#L2803) only handle the cuda cases, when the a fake tensor with device is xpu run to here, it will use the cpu path and return a `buffer` with nonzero size, which is conflict to the implementation of intel xpu concrete tensor. Therefore this pr add conditions to handle xpu cases. Make sure the two returned buffer sizes match each other.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141333
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/ezyang
This patch
1. removes `AutoDerefLocalSource` in favor of `LocalSource`, thereby
removing its special handling in guards.
2. introduces a `LocalCellSource` for cells from the root frame, with
only `reconstruct` implemented, to programmatically enforce that thse
cells should never be used by other components like guards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141629
Approved by: https://github.com/jansel
ghstack dependencies: #141628
This gets a bit messy, but this appears to be the best spot to make a
true / false decision.
Note that since we're looking at whether or not it's used, if the pool
doesn't warm up within the time it takes for a compile, we will mark the
feature use as false.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141074
Approved by: https://github.com/masnesral
ghstack dependencies: #141059
Fixes#139970, #139812.
Revise mkldnn pattern matcher UTs, to check the relevant specific matched patterns instead of the total matched number.
1) Add the missing specific counters in pattern matchers, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
2) In UTs, change the general `matcher_count`/`matcher_nodes` checks to the specific ones, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
3) In UTs, remove the option of `matcher_count`/`matcher_nodes` params in _test_common and make `matcher_check_fn` a necessary param.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141334
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
**Summary**
When addressing [issue #134998](https://github.com/pytorch/pytorch/issues/134998), we will verify if any node in the current graph shares the same storage as the node we intend to prune. In the implementation, we assumed that when creating the `GraphLowering` in post-grad phase, there would be no `submodules`, and all `get_attr` nodes would correspond to a `torch.Tensor`. However, this assumption proves incorrect when enabling `FlexAttention`. In this scenario, `submodules` are present as `get_attr` node in post-grad phase. For example:
```
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs] class sdpa_score30(torch.nn.Module):
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs] def forward(self, arg0_1: "bf16[][]cpu", arg1_1: "i32[][]cpu", arg2_1: "i32[][]cpu", arg3_1: "i32[][]cpu", arg4_1: "i32[][]cpu"):
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs] return arg0_1
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1] sdpa_score30 = self.sdpa_score30
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1] sdpa_mask30 = self.sdpa_mask30
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1] flex_attention_30 = torch.ops.higher_order.flex_attention(add_276, index_put_60, index_put_61, sdpa_score30, (_frozen_param293, _frozen_param295, _frozen_param296, _frozen_param297, _frozen_param298, _frozen_param299, _frozen_param300, _frozen_param301, 64, 64, sdpa_mask30), 0.08838834764831843, {'SKIP_MASK_SCORE': True, 'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'OUTPUT_LOGSUMEXP': False}, (), (_frozen_param294,)); add_276 = sdpa_score30 = sdpa_mask30 = None
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1] getitem_60: "bf16[1, 32, 1, 128]" = flex_attention_30[0]; flex_attention_30 = None
```
We added an extra check in the implementation to ensure only comparing the `get_attr` node with `torch.Tensor`. It is difficult to reproduce this issue using pure high-order operators. Adding a unit test after https://github.com/pytorch/pytorch/pull/141453 lands would be more straightforward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141798
Approved by: https://github.com/jgong5
When the input tensor to Conv3d is in the channels_last_3d memory format the Conv3d op will generate incorrect output (see example image in #141471). This PR checks if the op is 3d, and then attempts to convert the input tensor to contiguous.
Added a regression test that verifies the output by running the same op on the CPU.
I'm unsure if Conv3d supports the channels last memory format after #128393. If it does, we should consider updating the logic to utilize this as it would be more efficient. Perhaps @DenisVieriu97 knows or has more context?
Fixes#141471
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141780
Approved by: https://github.com/malfet
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. Previously, we would typically check for reductions by `tree.prefix == "r"`. This PR moves the check into a helper function. This makes it easier to generalize the code to multi-dimensional reductions, which could have multiple prefixes like `("r0_", "r1_")`.
Tested by the existing CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141738
Approved by: https://github.com/jansel