81624 Commits

Author SHA1 Message Date
c17ba69ba5 [submodule] Revert "Adds support for accelerated sorting with x86-simd-sort (#127936)" (#141901)
Looks like the original PR caused: https://github.com/pytorch/pytorch/issues/140590

Please see comment: https://github.com/pytorch/pytorch/issues/140590#issuecomment-2508704480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141901
Approved by: https://github.com/andrewor14, https://github.com/malfet
2024-12-03 00:16:35 +00:00
e41a0b33ec Allow Fakified subclass to have different device for inner and outer tensor (#141839)
Previously, if a wrapper tensor subclass was fakified, the inner tensors would end up having the same device as the outer tensor. This PR makes it so that inner and outer tensors can have different devices.

See OffloadTensor PR https://github.com/pytorch/pytorch/pull/141840/files#diff-3bc0cf540b694f4ec0a3749f78b047456657a53a5657e495ffb68e5970c5fdaaR1955 for an application. A simpler test has been added in this PR.

This is technically bc-breaking because now the callback passed to MetaConverter needs to accept an extra argument, but no one external should be using this anyway?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141839
Approved by: https://github.com/bdhirsh
ghstack dependencies: #141166
2024-12-03 00:09:41 +00:00
9830e7b1e4 Update OpenBLAS to 0.3.28 (#137263)
This includes a number of performance improvements, such as threading optimisations and forwarding GEMM calls to GEMV for calls where N=1 or M=1.
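
To illustrate why such forwarding is valid: when N=1 the right-hand matrix is a single column, so the matrix-matrix product reduces to a matrix-vector product. The pure-Python sketch below uses hypothetical helper names (`gemm`, `gemv`) for illustration only and is not OpenBLAS code.

```python
def gemm(a, b):
    """Naive matrix-matrix product: (M x K) times (K x N) gives (M x N)."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def gemv(a, x):
    """Naive matrix-vector product: (M x K) times a length-K vector."""
    return [sum(row[p] * x[p] for p in range(len(x))) for row in a]


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # M=3, K=2
x = [10.0, 20.0]                          # length K
b = [[v] for v in x]                      # the same data as a K x 1 matrix (N=1)

via_gemm = [row[0] for row in gemm(a, b)]  # flatten the single output column
via_gemv = gemv(a, x)
```

Both paths produce the same values, which is why a library is free to pick the cheaper routine for this shape.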

See: https://github.com/OpenMathLib/OpenBLAS/releases

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137263
Approved by: https://github.com/malfet
2024-12-03 00:05:34 +00:00
9f9105a67b [MPS] Write/Invoke Metal shaders from C++ (#141547)
By introducing `DynamicMetalShaderLibrary` and `MetalShaderFunction`.
Add unit tests that also serve as an example of how the API works.

Using these primitives, one can compile and dispatch any 1D or 2D shader over an MPS tensor using the following pattern:
```cpp
auto x = torch::empty({8, 16}, at::device(at::kMPS));
DynamicMetalShaderLibrary lib(R"MTL(
  kernel void full(device float* t, constant ulong2& strides, uint2 idx [[thread_position_in_grid]]) {
    t[idx.x*strides.x + idx.y*strides.y] = idx.x + 33.0 * idx.y;
  }
)MTL");
auto func = lib.getKernelFunction("full");
func->runCommandBlock([&] {
  func->startEncoding();
  func->setArg(0, x);
  func->setArg(1, x.strides());
  func->dispatch({8, 16});
});
```
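
As a rough CPU-side picture of what the `full` kernel above computes, the following pure-Python sketch replays its strided write for every grid position. It assumes that `dispatch({8, 16})` yields `idx.x` in `[0, 8)` and `idx.y` in `[0, 16)`; `emulate_full` is a hypothetical helper, not part of the MPS API.

```python
def emulate_full(shape, strides):
    """Replay the kernel body for every (ix, iy) position in the grid."""
    rows, cols = shape
    # Flat storage large enough for the largest offset the kernel touches.
    size = (rows - 1) * strides[0] + (cols - 1) * strides[1] + 1
    storage = [0.0] * size
    for ix in range(rows):
        for iy in range(cols):
            # Same expression as in the Metal source:
            # t[idx.x*strides.x + idx.y*strides.y] = idx.x + 33.0 * idx.y
            storage[ix * strides[0] + iy * strides[1]] = ix + 33.0 * iy
    return storage


# A freshly allocated {8, 16} tensor is contiguous, so its strides are (16, 1).
storage = emulate_full((8, 16), (16, 1))
```

Because the strides are passed explicitly, the same kernel body would also address a non-contiguous layout correctly.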

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141547
Approved by: https://github.com/Skylion007
2024-12-02 23:57:59 +00:00
5c2584a14c [ROCm] Enable inductor GEMM lowering for gfx11 (#141687)
This check doesn't make sense for some AMD GPUs: they have enough CUs, but on RDNA `multi_processor_count` returns the number of WGPs rather than CUs, so they fail the check while still performing adequately. Many tests fail on modern archs because this check makes them default to not using the GEMM backend.
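
The mismatch can be pictured with a small sketch. On RDNA one WGP (work-group processor) contains two CUs, so a count reported in WGPs understates the CU count by a factor of two. The function name, the threshold, and the device numbers below are illustrative assumptions, and this is not necessarily how the PR adjusts the check.

```python
CUS_PER_WGP = 2  # on RDNA, one work-group processor contains two compute units


def passes_size_check(reported_units, min_units, reports_wgps=False):
    """Hypothetical size gate: compare a unit count against a minimum."""
    effective_cus = reported_units * CUS_PER_WGP if reports_wgps else reported_units
    return effective_cus >= min_units


MIN_UNITS = 68       # illustrative threshold, assumed for this sketch
REPORTED_COUNT = 48  # illustrative device that reports 48 WGPs (96 CUs)

naive = passes_size_check(REPORTED_COUNT, MIN_UNITS)
adjusted = passes_size_check(REPORTED_COUNT, MIN_UNITS, reports_wgps=True)
```

Here `naive` rejects the device while `adjusted` accepts it, which matches the symptom described above.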

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141687
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-12-02 22:13:34 +00:00
1f3d8896bc Fix mismatched tensor metadata between FakeTensor and Intel XPU concrete tensor when running F.logsigmoid (#141333)
Fixes https://github.com/pytorch/pytorch/issues/141332
`F.logsigmoid` returns two outputs: `output` and `buffer`.
On the CPU path, `F.logsigmoid` uses the buffer to store intermediate values that are reused when computing gradients, so it returns a `buffer` tensor with nonzero size. On the CUDA and XPU paths the buffer is unused, so the `buffer` tensor returned by XPU `F.logsigmoid` has size zero, just like CUDA. The root cause of the issue is that the code in `decompositions.py` (ref: https://github.com/pytorch/pytorch/blob/main/torch/_decomp/decompositions.py#L2803) only handles the CUDA case: when a fake tensor whose device is XPU reaches this code, it takes the CPU path and returns a `buffer` with nonzero size, which conflicts with the implementation for the concrete Intel XPU tensor. This PR therefore adds a condition to handle the XPU case, so that the two returned buffer sizes match.
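
The device-dependent buffer can be sketched in plain Python as follows. This is a simplified stand-in working on lists of floats, with device types passed as strings; it is not the code in `decompositions.py`.

```python
import math


def log_sigmoid_forward(values, device_type):
    """Return (output, buffer) for a list of floats.

    log(sigmoid(v)) is computed in the numerically stable form
    min(0, v) - log1p(exp(-|v|)).
    """
    z = [math.exp(-abs(v)) for v in values]
    output = [min(0.0, v) - math.log1p(zi) for v, zi in zip(values, z)]
    # CUDA and XPU do not need the intermediate values, so the buffer is empty;
    # the CPU path keeps them for the backward pass.
    buffer = [] if device_type in ("cuda", "xpu") else z
    return output, buffer


xpu_output, xpu_buffer = log_sigmoid_forward([0.0, 2.0], "xpu")
cpu_output, cpu_buffer = log_sigmoid_forward([0.0, 2.0], "cpu")
```

The outputs agree across devices; only the size of the returned buffer differs.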

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141333
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/ezyang
2024-12-02 22:09:20 +00:00
74eb92ed6e fix deep copy of empty graph (#141660)
Differential Revision: D66532131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141660
Approved by: https://github.com/ezyang
2024-12-02 22:03:13 +00:00
41e59754b4 [CI] Remove inductor-perf-test-nightly-a10g.yml (#141895)
Summary: Deprecate the A10g nightly perf run. The workflow was introduced as an experiment and doesn't seem to be used by developers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141895
Approved by: https://github.com/huydhn
2024-12-02 21:55:20 +00:00
cyy
55250b324d [1/N] Apply py39 ruff fixes (#138578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138578
Approved by: https://github.com/Skylion007
2024-12-02 21:46:18 +00:00
b47bdb06d8 Revert "[inductor][pattern matcher] revise mkldnn pattern matcher UT (#141334)"
This reverts commit 942a2438e263a2632b8934dd245060c9b237f4be.

Reverted https://github.com/pytorch/pytorch/pull/141334 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/141334#issuecomment-2512891840))
2024-12-02 21:29:02 +00:00
6b05e31042 Revert "[REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)"
This reverts commit 61534391ba8204286f5c9ed15ab636e94bd3daf2.

Reverted https://github.com/pytorch/pytorch/pull/141877 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but a lot of failures shows up after this lands ([comment](https://github.com/pytorch/pytorch/pull/141877#issuecomment-2512890426))
2024-12-02 21:26:13 +00:00
64d44a39a1 remote_cache: Add a waitcounter for gets and sets (#141307)
This adds a basic waitcounter to help show if we're spending a lot of
time doing gets and sets to remote caches

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141307
Approved by: https://github.com/masnesral
2024-12-02 20:48:47 +00:00
daa77f3d9f Revert "[BE]: Update mypy to 1.13.0 (#140808)"
This reverts commit 00134d68af2ce50560fa5a74473665ea229e6c9d.

Reverted https://github.com/pytorch/pytorch/pull/140808 on behalf of https://github.com/huydhn due to This is failing a distributed test in trunk, target determination missed this test and did not run it on PR ([comment](https://github.com/pytorch/pytorch/pull/140808#issuecomment-2512788426))
2024-12-02 20:47:43 +00:00
54adbbf6b8 cpp_wrapper: Add support for MemoryFormat arguments (#141367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141367
Approved by: https://github.com/desertfire
2024-12-02 20:40:24 +00:00
30574380a3 [REFACTOR] Factor _fx_graph_cache_key and _time_taken_ns to common base class (#141878)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141878
Approved by: https://github.com/jamesjwu
ghstack dependencies: #141877
2024-12-02 20:07:12 +00:00
61534391ba [REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)
I am going to break apart the arguments passed to the constituents
to only pass exactly what is needed, so easy access to the insides
is helpful here.

This also moves two helper functions to output_code.py as well.

Also set _boxed_call at constructor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2024-12-02 19:48:05 +00:00
fe68f61c59 Migrate micro benchmark results to benchmark database schema v3 (#141745)
Similar to https://github.com/pytorch/pytorch/pull/141087, this uploads the micro benchmark results to benchmark database with its new schema v3. The data can then be queried.

~I'm testing with `inductor-micro-benchmark-x86` which should be sufficient because `inductor-micro-benchmark` is broken atm.  The CSV output stays for now until the dashboard is migrated to schema v3.~ https://github.com/pytorch/pytorch/issues/141747 has been resolved, so inductor-micro-benchmark should work now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141745
Approved by: https://github.com/yanboliang
2024-12-02 19:45:51 +00:00
cyy
ab5467897a Fix NOLINTNEXTLINE (#141794)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141794
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-12-02 19:22:00 +00:00
161a2340ee Switch to using Python nested int (#141166)
Doesn't seem to noticeably slow down eager - TestNestedTensorSubclass tests with and without the PR finished in similar amounts of time (around 57s, 58s)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141166
Approved by: https://github.com/ezyang
2024-12-02 19:17:30 +00:00
2d708752f0 [dynamo] Remove AutoDerefLocalSource and simplify cell handling (#141629)
This patch
1. removes `AutoDerefLocalSource` in favor of `LocalSource`, thereby
   removing its special handling in guards.
2. introduces a `LocalCellSource` for cells from the root frame, with
   only `reconstruct` implemented, to programmatically enforce that these
   cells should never be used by other components like guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141629
Approved by: https://github.com/jansel
ghstack dependencies: #141628
2024-12-02 19:09:30 +00:00
e14d8c980f [dynamo][NFC] Rename NewCellVariable to CellVariable (#141628)
It was named `NewCellVariable` because we originally used it to
represent cells created by the code Dynamo is tracing through. However, now we
use it to represent pre-existing cells as well, so this patch renames it
to avoid confusion.
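
For readers unfamiliar with cells: Python stores a variable that is shared between a function and a nested function in a cell object. The sketch below shows only that plain-Python mechanism and does not use any Dynamo API; `make_counter` is an illustrative name.

```python
def make_counter():
    count = 0  # captured by `increment`, so Python stores it in a cell

    def increment():
        nonlocal count
        count += 1
        return count

    return increment


counter = make_counter()       # calling make_counter creates a new cell
cell = counter.__closure__[0]  # for later code, this cell already exists
first = counter()
value_after_first_call = cell.cell_contents
```

Code that calls `make_counter` creates a new cell, whereas code that merely receives `counter` sees a pre-existing one.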

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141628
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-12-02 19:09:30 +00:00
0989871ac9 pytorch/feature: Record if parallel compile is enabled (#141074)
This gets a bit messy, but this appears to be the best spot to make a
true / false decision.

Note that since we're looking at whether or not it's used, if the pool
doesn't warm up within the time it takes for a compile, we will mark the
feature use as false.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141074
Approved by: https://github.com/masnesral
ghstack dependencies: #141059
2024-12-02 19:09:11 +00:00
00134d68af [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0. This should hopefully reduce linting time. It adds support for orjson cache serialization, which should improve mypy cache performance if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-02 18:47:54 +00:00
9012e7a62f Revert "[dynamo][pytree][1/N] make CXX pytree traceable: tree_iter / tree_leaves (#137397)"
This reverts commit 07850bb2c1771ba3f5578b0aa85792e5cd70de1c.

Reverted https://github.com/pytorch/pytorch/pull/137397 on behalf of https://github.com/atalman due to Failing internal test ([comment](https://github.com/pytorch/pytorch/pull/137397#issuecomment-2511934283))
2024-12-02 16:05:14 +00:00
eb7deb2db5 Revert "Fix NOLINTNEXTLINE (#141794)"
This reverts commit 7dd9b5fc4343d101294dbbab4b4172f2859460bc.

Reverted https://github.com/pytorch/pytorch/pull/141794 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/12087979418/job/33711943084) [HUD commit link](7dd9b5fc43) ([comment](https://github.com/pytorch/pytorch/pull/141794#issuecomment-2511789484))
2024-12-02 15:07:50 +00:00
a34a56f69f Revert "Ensure that BlockMask length must always exactly match the sequence length in flex_attention (#141625)"
This reverts commit 795f28ac552eb61d02ea02fd64637ba814133bd8.

Reverted https://github.com/pytorch/pytorch/pull/141625 on behalf of https://github.com/albanD due to Broken main ([comment](https://github.com/pytorch/pytorch/pull/141625#issuecomment-2511639687))
2024-12-02 14:10:38 +00:00
ec96597e47 Revert "ILP for auto FSDP wrapping (#140298)"
This reverts commit d4cdc098817a0af10b478256b524533ed67285a9.

Reverted https://github.com/pytorch/pytorch/pull/140298 on behalf of https://github.com/xuanzhang816 due to for other PR ([comment](https://github.com/pytorch/pytorch/pull/140298#issuecomment-2511638743))
2024-12-02 14:08:04 +00:00
942a2438e2 [inductor][pattern matcher] revise mkldnn pattern matcher UT (#141334)
Fixes #139970, #139812.

Revise mkldnn pattern matcher UTs, to check the relevant specific matched patterns instead of the total matched number.
1) Add the missing specific counters in pattern matchers, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
2) In UTs, change the general `matcher_count`/`matcher_nodes` checks to the specific ones, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
3) In UTs, remove the optional `matcher_count`/`matcher_nodes` params from `_test_common` and make `matcher_check_fn` a required param.
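
The motivation for checking specific counters can be sketched with a plain `Counter`. The `record_match` helper and the recorded values are hypothetical; only the counter-name pattern follows the names quoted above.

```python
from collections import Counter

counters = Counter()


def record_match(pattern_name, num_nodes):
    """Hypothetical bookkeeping: bump general and per-pattern counters."""
    counters["pattern_matcher_count"] += 1
    counters["pattern_matcher_nodes"] += num_nodes
    counters[f"{pattern_name}_matcher_count"] += 1
    counters[f"{pattern_name}_matcher_nodes"] += num_nodes


record_match("mkldnn_unary_fusion", 2)
record_match("mkldnn_conv_weight_pack", 1)
record_match("some_unrelated_pattern", 5)  # changes totals, not the mkldnn counters
```

A test that asserts on `mkldnn_unary_fusion_matcher_nodes` is unaffected when an unrelated pattern starts or stops matching, while a test on the totals would break.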

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141334
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-12-02 08:42:10 +00:00
96d2a511ce [Inductor][CPP] Fix issue in CPP GEMM Template Prune Tensor (#141798)
**Summary**
When addressing [issue #134998](https://github.com/pytorch/pytorch/issues/134998), we verify whether any node in the current graph shares the same storage as the node we intend to prune. The implementation assumed that when the `GraphLowering` is created in the post-grad phase there would be no `submodules`, and that every `get_attr` node would correspond to a `torch.Tensor`. However, this assumption proves incorrect when `FlexAttention` is enabled: in this scenario, `submodules` are present as `get_attr` nodes in the post-grad phase. For example:

```
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs]     class sdpa_score30(torch.nn.Module):
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs]         def forward(self, arg0_1: "bf16[][]cpu", arg1_1: "i32[][]cpu", arg2_1: "i32[][]cpu", arg3_1: "i32[][]cpu", arg4_1: "i32[][]cpu"):
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs]             return arg0_1

V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         sdpa_score30 = self.sdpa_score30
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         sdpa_mask30 = self.sdpa_mask30
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         flex_attention_30 = torch.ops.higher_order.flex_attention(add_276, index_put_60, index_put_61, sdpa_score30, (_frozen_param293, _frozen_param295, _frozen_param296, _frozen_param297, _frozen_param298, _frozen_param299, _frozen_param300, _frozen_param301, 64, 64, sdpa_mask30), 0.08838834764831843, {'SKIP_MASK_SCORE': True, 'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'OUTPUT_LOGSUMEXP': False}, (), (_frozen_param294,));  add_276 = sdpa_score30 = sdpa_mask30 = None
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         getitem_60: "bf16[1, 32, 1, 128]" = flex_attention_30[0];  flex_attention_30 = None
```
We added an extra check in the implementation so that only `get_attr` nodes that correspond to a `torch.Tensor` are compared. It is difficult to reproduce this issue using pure higher-order operators. Adding a unit test after https://github.com/pytorch/pytorch/pull/141453 lands would be more straightforward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141798
Approved by: https://github.com/jgong5
2024-12-02 07:38:57 +00:00
90f4d60672 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit daed864f7b3ca3b3e64ed13624369fd3007ad47d.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/xuhancn due to need to fix on XPU. ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2510737212))
2024-12-02 07:10:41 +00:00
cyy
8cada5cbe5 Use std::apply (#141834)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141834
Approved by: https://github.com/Skylion007
2024-12-02 05:49:10 +00:00
f16e08042c [user triton] Fix grid codegen for configs with empty kwargs (#141824)
Fixes #141823 by adding special handling of the codegen `if <config kwargs>: return <grid>` for the cases when there are no kwargs in the config.
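
The failure mode can be sketched with a toy code generator: joining zero guards yields an empty condition, and `if : return ...` is not valid Python. `emit_grid_fn` below is a hypothetical, simplified generator and not the Inductor implementation.

```python
def emit_grid_fn(configs):
    """Emit source for grid_fn(meta); configs is a list of (kwargs, grid) pairs."""
    lines = ["def grid_fn(meta):"]
    for kwargs, grid in configs:
        guards = " and ".join(
            f"meta[{name!r}] == {value!r}" for name, value in kwargs.items()
        )
        if guards:
            lines.append(f"    if {guards}: return {grid!r}")
        else:
            # Empty kwargs: there is nothing to guard on, so return unconditionally.
            lines.append(f"    return {grid!r}")
    return "\n".join(lines)


empty_source = emit_grid_fn([({}, (4, 1, 1))])
empty_namespace = {}
exec(empty_source, empty_namespace)
empty_grid = empty_namespace["grid_fn"]({})

guarded_source = emit_grid_fn([({"BLOCK": 64}, (2, 1, 1)), ({"BLOCK": 128}, (1, 1, 1))])
guarded_namespace = {}
exec(guarded_source, guarded_namespace)
guarded_grid = guarded_namespace["grid_fn"]({"BLOCK": 128})
```

Without the `else` branch, the first call would fail at `exec` time with a `SyntaxError`.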

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141824
Approved by: https://github.com/Chillee
2024-12-02 04:17:21 +00:00
daed864f7b export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-02 03:20:29 +00:00
81ab2cc757 Update torch-xpu-ops commit pin (#141201)
Update the torch-xpu-ops commit to [1e32bbc](1e32bbc3d9), which includes:

- Improve XPU aten operator coverage
- Support basic `SparseXPU` operators

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141201
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-02 01:49:07 +00:00
795f28ac55 Ensure that BlockMask length must always exactly match the sequence length in flex_attention (#141625)
Fixes https://github.com/pytorch/pytorch/issues/141435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141625
Approved by: https://github.com/drisspg
ghstack dependencies: #138788
2024-12-02 00:35:29 +00:00
8eb259fdc3 Added option to control number of kernel options displayed (#138788)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138788
Approved by: https://github.com/drisspg
2024-12-02 00:35:29 +00:00
fc74ec4989 [2/N] Avoid copy in std::get (#141826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141826
Approved by: https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-02 00:16:48 +00:00
b2fe1b9409 [inductor] Fix 3d tiling (#141709)
Fixes #141121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141709
Approved by: https://github.com/eellison
2024-12-01 19:47:41 +00:00
90f19fee8a [MPS] Convert channels_last_3d to contiguous for input tensor in nn.Conv3d (#141780)
When the input tensor to Conv3d is in the channels_last_3d memory format, the Conv3d op generates incorrect output (see the example image in #141471). This PR checks whether the op is 3D and, if so, converts the input tensor to contiguous.
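
For context, the two memory formats differ only in strides for the same `(N, C, D, H, W)` shape. The pure-Python sketch below computes both stride tuples; the helper names are illustrative and no PyTorch call is involved.

```python
def contiguous_strides(shape):
    """Row-major strides: the last dimension varies fastest."""
    strides, running = [], 1
    for dim in reversed(shape):
        strides.append(running)
        running *= dim
    return tuple(reversed(strides))


def channels_last_3d_strides(shape):
    """Strides for an (N, C, D, H, W) shape laid out physically as N, D, H, W, C."""
    n, c, d, h, w = shape
    return (d * h * w * c, 1, h * w * c, w * c, c)


shape = (2, 3, 4, 5, 6)
contiguous = contiguous_strides(shape)
channels_last = channels_last_3d_strides(shape)
```

Converting to contiguous copies the data into the first layout, which costs an extra pass over the input but sidesteps the incorrect output.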

Added a regression test that verifies the output by running the same op on the CPU.

I'm unsure if Conv3d supports the channels last memory format after #128393. If it does, we should consider updating the logic to utilize this as it would be more efficient. Perhaps @DenisVieriu97 knows or has more context?

Fixes #141471
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141780
Approved by: https://github.com/malfet
2024-12-01 18:36:53 +00:00
5deca07c0d [Inductor] Represent tiling as a dict (#141751)
# Summary

Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This makes it easier to generalize to multi-dimensional reductions.

This diff refactors `self.numels` from a tuple like `(8,16)` to a dict like `{"x": 8, "r": 16}`.
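
A minimal sketch of the representation change, using the values from the sentence above (the variable names are illustrative):

```python
old_numels = (8, 16)    # positional: callers must remember which slot is which
prefixes = ("x", "r")   # one prefix per tiling dimension

new_numels = dict(zip(prefixes, old_numels))
reduction_numel = new_numels["r"]  # lookup by name instead of by position
```

Keying by prefix means additional dimensions can be added without shifting the positions of existing entries.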

Note: this is based off of https://github.com/pytorch/pytorch/pull/141738, which enables `tree.is_reduction`. That PR should land first.

# Test plan
The existing CI provides good coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141751
Approved by: https://github.com/jansel
2024-12-01 09:54:34 +00:00
cyy
96be048f06 [1/N] Avoid copy in std::get (#141812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141812
Approved by: https://github.com/Skylion007
2024-12-01 03:53:35 +00:00
c2fa544472 [Inductor] move block pointer analysis to a new module (#141733)
# Summary

Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This refactors the ModularIndexing block pointer analysis into its own module. That way, we can call it from other places besides Triton codegen. In the parent PR, we will use this to find tiling splits that simplify the indexing.

# Test plan

Tested by the existing CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141733
Approved by: https://github.com/jansel
2024-11-30 23:21:24 +00:00
49fde426ba [Inductor] Use a helper function to tell if a tree or prefix is a reduction (#141738)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. Previously, we would typically check for reductions by `tree.prefix == "r"`. This PR moves the check into a helper function. This makes it easier to generalize the code to multi-dimensional reductions, which could have multiple prefixes like `("r0_", "r1_")`.
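
The idea can be sketched as follows. The helper name `is_reduction_prefix` and the starts-with-`r` rule are assumptions for illustration, not necessarily what the PR implements.

```python
def is_reduction_prefix(prefix):
    """Hypothetical helper: treat any prefix starting with "r" as a reduction."""
    return prefix.startswith("r")


single = is_reduction_prefix("r")  # the case covered by the old equality check
multi = [is_reduction_prefix(p) for p in ("r0_", "r1_")]
pointwise = is_reduction_prefix("x")
```

With the check behind one function, supporting prefixes such as `"r0_"` requires changing a single definition instead of every `== "r"` comparison.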

Tested by the existing CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141738
Approved by: https://github.com/jansel
2024-11-30 22:38:13 +00:00
394c339691 improve typings in unflatten (#141817)
A first follow-up to https://github.com/pytorch/pytorch/pull/115074 / https://github.com/pytorch/pytorch/pull/141240 following the strategy discussed there (https://github.com/pytorch/pytorch/pull/115074#issuecomment-2480992230).

This PR improves the type annotations around `unflatten.py` which had been inaccurate due to the previously suppressed type checking on `torch.nn.Module`.

CC @Skylion007 @ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141817
Approved by: https://github.com/Skylion007
2024-11-30 22:12:15 +00:00
8a81f7a4b6 Refactor functions in functorch for functional (#141808)
As the title states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141808
Approved by: https://github.com/Skylion007
2024-11-30 20:15:40 +00:00
0f3f801fc2 Add windows CUDA 12.6 nightly builds (#141805)
Windows AMI was published to prod. This PR adds CUDA 12.6 nightly builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141805
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2024-11-30 14:39:47 +00:00
eqy
9532589b53 [CUDA][64-bit indexing] Support 64-bit indexing in distribution_elementwise_grid_stride_kernel (#141613)
For #141544
Overhead doesn't seem to be noticeable even on small sizes (e.g., 2**10 elements)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141613
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2024-11-30 06:55:02 +00:00
7fafaa9c82 Introduce CompiledAOTI (#141695)
Stacked on https://github.com/pytorch/pytorch/pull/141691

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141695
Approved by: https://github.com/aorenste
ghstack dependencies: #141681, #141683, #141685, #141688, #141689, #141691
2024-11-30 00:05:41 +00:00
2f72635a5c automatic dynamic unspecialize float (#141647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141647
Approved by: https://github.com/ezyang
2024-11-29 22:36:53 +00:00
cyy
e29dabbd71 Fix performance-unnecessary-copy-initialization (#141792)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141792
Approved by: https://github.com/Skylion007
2024-11-29 22:10:06 +00:00