87 Commits

979e10f7d6 [Bugfix] Match eager stride semantics for cloned tensors with preserve_format in compile (#163017)
Fixes #161010 by making `clone_meta` match the semantics of strides for eager mode.

This is:
  * Case 1: the tensor is_non_overlapping_and_dense; in this case, the output strides should match the input tensor's strides
  * Case 2: otherwise, the output strides should be contiguous, computed from the input tensor using `compute_elementwise_output_strides`
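
A minimal sketch of the two cases, built on the `torch._prims_common` helpers the commit names (the wiring into `clone_meta` itself is elided):

```py
import torch
from torch._prims_common import (
    compute_elementwise_output_strides,
    is_non_overlapping_and_dense,
)

def cloned_strides(t: torch.Tensor):
    # Case 1: non-overlapping and dense -> preserve the input strides.
    if is_non_overlapping_and_dense(t):
        return t.stride()
    # Case 2: otherwise compute contiguous output strides the way eager
    # elementwise ops do.
    return compute_elementwise_output_strides(t)

x = torch.empty(4, 4).t()  # dense but non-contiguous (transposed)
print(cloned_strides(x))   # (1, 4): strides preserved, matching eager clone
```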

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163017
Approved by: https://github.com/williamwen42, https://github.com/xmfan

Co-authored-by: morrison-turnansky <mturnans@redhat.com>
2025-09-19 19:41:33 +00:00
e4174b1fd7 remove gso from collapse_view_helper (#162212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162212
Approved by: https://github.com/aorenste

Co-authored-by: Aaron Orenstein <aorenste@fb.com>
2025-09-10 00:17:15 +00:00
d8c8ba2440 Fix unused Python variables in test/[e-z]* (#136964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964
Approved by: https://github.com/justinchuby, https://github.com/albanD
2024-12-18 23:02:30 +00:00
8d2c272e5a properly register conjugate/neg fallthroughs to prim ops (#132699)
A few aten ops (like `clone` and `copy_`) get fallthrough registrations to the Conjugate/Negative keys. We haven't been giving the same treatment to their corresponding `prims` variants, which can cause infinite loops in some cases.
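
For illustration, such a fallthrough can be registered from Python roughly like this (a sketch only; after this PR the prims registrations already exist, so re-running it against a current build would conflict):

```py
import torch
from torch.library import Library, fallthrough_kernel

# Give a prims op the same Conjugate/Negative fallthroughs its aten
# counterpart already has, so dispatch skips these keys instead of looping.
lib = Library("prims", "IMPL")
lib.impl("clone", fallthrough_kernel, "Conjugate")
lib.impl("clone", fallthrough_kernel, "Negative")
```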

Fixes an infinite loop that showed up in tests from https://github.com/pytorch/pytorch/pull/132563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132699
Approved by: https://github.com/albanD
2024-08-06 17:57:04 +00:00
fc872e98f3 Infer prim tags from equivalent aten ones (#130367)
Take the intersection of the tags across the corresponding aten op overloads. Previously, some rng ops had no tags at all, which caused issues with constant folding (they should get decomposed, but that's a separate issue).
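
A sketch of the rule, computed here for an aten overload packet (the aten-to-prim correspondence itself is handled elsewhere):

```py
import torch

# Intersect the tags of all overloads of an aten op; the matching prim
# would inherit this common set.
packet = torch.ops.aten.rand
tag_sets = [set(getattr(packet, name).tags) for name in packet.overloads()]
common_tags = set.intersection(*tag_sets)
print(common_tags)  # e.g. torch.Tag.nondeterministic_seeded, if shared by all
```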

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130367
Approved by: https://github.com/ezyang
2024-07-11 20:53:52 +00:00
67ef2683d9 [BE] wrap deprecated function/class with typing_extensions.deprecated (#127689)
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.

Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.
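
A sketch of the two patterns (function names hypothetical):

```py
import warnings
from typing_extensions import deprecated

@deprecated("`old_fn` is deprecated, use `new_fn` instead", category=FutureWarning)
def old_fn() -> None:
    ...

def other_fn() -> None:
    # Where the decorator does not fit, pass the category explicitly:
    warnings.warn("`other_fn` is deprecated", category=FutureWarning, stacklevel=2)
```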

Resolves #126888

This PR is split from PR #126898.

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689
Approved by: https://github.com/Skylion007
2024-06-02 12:30:43 +00:00
033e733021 Revert "[BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)"
This reverts commit 749a132fb0a8325cbad4734a563aa459ca611991.

Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))
2024-05-31 19:47:24 +00:00
749a132fb0 [BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.

Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.

UPDATE: Use `FutureWarning` instead of `DeprecationWarning`.

Resolves #126888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898
Approved by: https://github.com/albanD
2024-05-29 12:09:27 +00:00
c006c8b50e Revert "markDynamoStrictTest some more (#115885)"
This reverts commit 55ce4693ff2c0b6e50b8af323f36ecc7ff929638.

Reverted https://github.com/pytorch/pytorch/pull/115885 on behalf of https://github.com/atalman due to OSSCI oncall, broke inductor ([comment](https://github.com/pytorch/pytorch/pull/115885#issuecomment-1858409669))
2023-12-15 19:51:24 +00:00
55ce4693ff markDynamoStrictTest some more (#115885)
Featuring
test_native_mha.py
test_nn.py
test_prims.py
test_schema_check.py
test_serialization.py
test_show_pickle.py
test_sort_and_select.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115885
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870, #115871, #115879
2023-12-15 13:19:52 +00:00
ba4285bd9e Deprecate primTorch module, replace it with decompositions in module Owners (#114754)
Context: pt2 oncall is revamping its labeling system. One of the guidelines is to remove duplicate labeling in our system. Both the primTorch and decomposition labels refer to the same thing. primTorch was the legacy name (and we no longer have a primTorch project), so using decomposition as the label name makes more sense.

Right now, the only open issues that use "module: primTorch" are the ones generated by the DISABLED bots. Once we replace the label in the bot, we can safely remove the primTorch label.

Here is an example of an issue with the primTorch label:
https://github.com/pytorch/pytorch/issues/112719

Torchbot uses the following logic to auto-extract module owners:
https://github.com/pytorch/test-infra/blob/main/torchci/pages/api/flaky-tests/disable.ts#L391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114754
Approved by: https://github.com/huydhn
2023-11-29 18:27:20 +00:00
7a21e960c6 fix infinite loop with primtorch and .to(meta) (#109632)
Fixes https://github.com/pytorch/pytorch/issues/103532, which I needed in order to more easily/properly test that python functionalization is at parity with C++ functionalization for conj/neg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109632
Approved by: https://github.com/ezyang
ghstack dependencies: #108654, #109662
2023-09-22 07:09:04 +00:00
c913f3857f Remove dynamo+nvfuser (#105789)
This PR removes unmaintained Dynamo+nvFuser.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105789
Approved by: https://github.com/jansel, https://github.com/jjsjann123, https://github.com/albanD
2023-08-08 22:29:32 +00:00
891bb259f8 Revert "Remove dynamo+nvfuser (#105789)"
This reverts commit 6030151d3758715097b89026e9b3b3f839fbd544.

Reverted https://github.com/pytorch/pytorch/pull/105789 on behalf of https://github.com/DanilBaibak due to Break a lot of tests on main. ([comment](https://github.com/pytorch/pytorch/pull/105789#issuecomment-1669710571))
2023-08-08 14:20:32 +00:00
6030151d37 Remove dynamo+nvfuser (#105789)
This PR removes unmaintained Dynamo+nvFuser.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105789
Approved by: https://github.com/jansel, https://github.com/jjsjann123, https://github.com/albanD
2023-08-08 13:29:31 +00:00
0ad93a3d56 Fix aten.logspace decomposition (#105201)
Fixes #104118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105201
Approved by: https://github.com/ezyang
2023-07-22 04:10:20 +00:00
3912b722f3 Upgrade LoggingTensor mode and add traceback collection (#103734)
Parts borrowed from: https://github.com/albanD/subclass_zoo/blob/main/logging_mode.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103734
Approved by: https://github.com/albanD
2023-06-21 22:04:30 +00:00
ee83c646bb Replace _prims_common.check with torch._check* (#103240)
This relands most of the changes from #102219, which were backed out by #103128. However, instead of removing `_prims_common.check`, it adds a warning and a comment mentioning that it will be removed in the future and that `torch._check*` should be used instead. As mentioned in https://github.com/pytorch/pytorch/pull/103128#pullrequestreview-1466414415, `_prims_common.check` cannot yet be removed because of some internal usage.
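
A usage sketch of the preferred API:

```py
import torch

# torch._check raises RuntimeError with a lazily-built message; siblings
# like torch._check_index and torch._check_value raise other error types.
def first_row(x: torch.Tensor) -> torch.Tensor:
    torch._check(x.dim() >= 2, lambda: f"expected at least 2 dims, got {x.dim()}")
    return x[0]

first_row(torch.randn(3, 4))   # ok
# first_row(torch.randn(3))    # raises RuntimeError
```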

Part of #72948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103240
Approved by: https://github.com/albanD
2023-06-21 00:46:17 +00:00
58d2c66a70 [activation checkpointing] Higher order functional rng op wrappers (#102934)
Introduces two higher order operators
* run_and_save_rng_state - Saves the current rng state and then runs the op.
* run_with_rng_state - Runs the op with the rng state supplied as an input

Ideally, we would like to use torch.compile for these operators, but currently the plan is to introduce them at the partitioner level, obviating the need to support them fully through the torch.compile stack. To ensure good enough debugging with minifiers, we have ensured that they work with make_fx. In the future, we can move them onto torch.compile.
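
A conceptual sketch of the two wrappers' eager semantics (not the real higher-order operator implementations, which live at the dispatcher level):

```py
import torch

def run_and_save_rng_state(op, *args, **kwargs):
    # Save the current rng state, then run the op.
    state = torch.get_rng_state()  # CPU state shown; CUDA is analogous
    return state, op(*args, **kwargs)

def run_with_rng_state(state, op, *args, **kwargs):
    # Run the op with the supplied rng state, restoring the current one after.
    current = torch.get_rng_state()
    torch.set_rng_state(state)
    try:
        return op(*args, **kwargs)
    finally:
        torch.set_rng_state(current)

state, a = run_and_save_rng_state(torch.rand, 3)
b = run_with_rng_state(state, torch.rand, 3)
assert torch.equal(a, b)  # replaying with the saved state reproduces the output
```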

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102934
Approved by: https://github.com/jansel, https://github.com/zou3519
2023-06-12 22:54:17 +00:00
4c6f7cbc86 Fix prims unbind if given dimension size is 0 (#100122)
Fixes #99832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100122
Approved by: https://github.com/ngimel
2023-04-26 23:40:21 +00:00
6bc4651193 [philox_rand] Dynamic shape support (#99290)
Extends the functionalization-of-RNG work to dynamic shapes. An example of the generated graph looks like this:

~~~

[2023-04-24 21:41:37,446] torch._functorch.aot_autograd.__aot_graphs: [INFO] TRACED GRAPH
 ===== Forward graph 1 =====
 <eval_with_key>.7 class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: i64[], arg1_1: i64[], arg2_1: Sym(s0), arg3_1: Sym(s1), arg4_1: f32[s0, s1]):
        # File: /scratch/anijain/work/pytorch/test/test_functionalization_of_rng_ops.py:46, code: a = torch.rand_like(x) * x
        add: i64[] = torch.ops.aten.add.Tensor(arg1_1, 0)
        philox_rand = torch.ops.rngprims.philox_rand.default([arg2_1, arg3_1], arg0_1, add, None, device(type='cuda', index=0), torch.float32);  add = None
        getitem: f32[s0, s1] = philox_rand[0]
        getitem_1: i64[] = philox_rand[1];  philox_rand = None
        add_1: i64[] = torch.ops.aten.add.Tensor(getitem_1, 0);  getitem_1 = None
        mul: f32[s0, s1] = torch.ops.aten.mul.Tensor(getitem, arg4_1);  getitem = arg4_1 = None

        # File: /scratch/anijain/work/pytorch/test/test_functionalization_of_rng_ops.py:47, code: a = torch.rand_like(x) * a
        add_2: i64[] = torch.ops.aten.add.Tensor(arg1_1, add_1)
        philox_rand_1 = torch.ops.rngprims.philox_rand.default([arg2_1, arg3_1], arg0_1, add_2, None, device(type='cuda', index=0), torch.float32);  arg2_1 = arg3_1 = arg0_1 = add_2 = None
        getitem_2: f32[s0, s1] = philox_rand_1[0]
        getitem_3: i64[] = philox_rand_1[1];  philox_rand_1 = None
        add_3: i64[] = torch.ops.aten.add.Tensor(add_1, getitem_3);  add_1 = getitem_3 = None
        mul_1: f32[s0, s1] = torch.ops.aten.mul.Tensor(getitem_2, mul);  getitem_2 = mul = None

        # No stacktrace found for following nodes
        add_4: i64[] = torch.ops.aten.add.Tensor(arg1_1, add_3);  arg1_1 = add_3 = None
        return (mul_1, add_4)

~~~

Each rand op is accompanied by its offset calculation op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99290
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-04-25 22:40:28 +00:00
fdbc8625a1 Functionalization of torch.rand/rand_like ops (#97377)
This PR introduces the functionalization of RNG ops. Key points are

* Introduces a new `philox_rand` prim operator that accepts seed, offset.
* Adds decompositions for random operators that use these philox_rand prims
* Adds a PhiloxStateTracker to track the offset for each occurrence of rand ops
* Changes calling convention of AOT Autograd and adds <fwd_seed, fwd_base_offset> and <bwd_seed, bwd_base_offset>
* Monkeypatches set_rng_state and get_rng_state during AOT Autograd tracing to record the rng state behavior
* Raises an assertion for CPU because CPU does not have Philox RNG.

Not dealt in this PR
* dropout op - offset calculation is different
* other distributions like normal, poisson etc
* Inductor support
* Cudagraph support
* Dynamic shape support

An example
~~~

class Custom(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        a = torch.rand_like(x) * x
        a = torch.rand_like(x) * a
        return a

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        return grad_out * torch.rand_like(grad_out) * torch.cos(x)

====== Forward graph 0 ======
def forward(self, fwd_seed_1: i64[], fwd_base_offset_1: i64[], primals_1: f32[16, 16]):
    # No stacktrace found for following nodes
    add: i64[] = torch.ops.aten.add.Tensor(fwd_base_offset_1, 0)
    philox_rand: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], fwd_seed_1, add, [16, 1], device(type='cuda', index=0), torch.float32);  add = None
    mul: f32[16, 16] = torch.ops.aten.mul.Tensor(philox_rand, primals_1);  philox_rand = None
    add_1: i64[] = torch.ops.aten.add.Tensor(fwd_base_offset_1, 4);  fwd_base_offset_1 = None
    philox_rand_1: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], fwd_seed_1, add_1, [16, 1], device(type='cuda', index=0), torch.float32);  fwd_seed_1 = add_1 = None
    mul_1: f32[16, 16] = torch.ops.aten.mul.Tensor(philox_rand_1, mul);  philox_rand_1 = mul = None
    return [mul_1, primals_1]

====== Backward graph 0 ======
def forward(self, bwd_seed_1: i64[], bwd_base_offset_1: i64[], primals_1: f32[16, 16], tangents_1: f32[16, 16]):
    # No stacktrace found for following nodes
    add_2: i64[] = torch.ops.aten.add.Tensor(bwd_base_offset_1, 0);  bwd_base_offset_1 = None
    philox_rand_2: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], bwd_seed_1, add_2, [16, 1], device(type='cuda', index=0), torch.float32);  bwd_seed_1 = add_2 = None
    mul_2: f32[16, 16] = torch.ops.aten.mul.Tensor(tangents_1, philox_rand_2);  tangents_1 = philox_rand_2 = None
    cos: f32[16, 16] = torch.ops.aten.cos.default(primals_1);  primals_1 = None
    mul_3: f32[16, 16] = torch.ops.aten.mul.Tensor(mul_2, cos);  mul_2 = cos = None
    return [mul_3]

~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97377
Approved by: https://github.com/ezyang
2023-04-16 09:55:56 +00:00
640b9c80f9 [primTorch] Redefine prim.collapse{,_view} end point to be inclusive (#92017)
This makes `prims.collapse(a, start, end)` match the behavior of
`torch.flatten(a, start, end)` more closely.
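
Illustrated via `torch.flatten`, whose semantics `prims.collapse` now mirrors:

```py
import torch

x = torch.randn(2, 3, 4, 5)
# start=1, end=2 collapses dims 1..2 inclusive, as flatten does:
assert torch.flatten(x, 1, 2).shape == (2, 12, 5)
```
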
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92017
Approved by: https://github.com/mruberry
2023-02-21 20:36:50 +00:00
2622adb980 [primTorch] Make prims.collapse a real prim (#91748)
`prims.collapse` is currently just a plain python function wrapping
`prims.reshape`. This turns it into a real prim, and also factors out some of
the code duplicated with `_collapse_view_aten`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91748
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-02-21 20:36:50 +00:00
21eb7f70f1 Nvfuser python API import fix (#94036)
1. Having nvfuser python API import working with both devel and upstream;
2. Add environment variable to allow custom nvfuser code base to be built with upstream pytorch core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94036
Approved by: https://github.com/malfet, https://github.com/davidberard98
2023-02-16 20:10:40 +00:00
c11b301bcd [NVFUSER] refactor nvfuser build (#89621)
This PR is the first step towards refactoring the build for nvfuser so that the codegen becomes a standalone library.

Contents inside this PR:
1. nvfuser code base has been moved to `./nvfuser`, from `./torch/csrc/jit/codegen/cuda/`, except for registration code for integration (interface.h/interface.cpp)
2. splits the build system so nvfuser is generating its own `.so` files. Currently there are:
    - `libnvfuser_codegen.so`, which contains the integration, codegen and runtime system of nvfuser
    - `nvfuser.so`, which is nvfuser's python API via pybind. Python frontend is now exposed via `nvfuser._C.XXX` instead of `torch._C._nvfuser`
3. nvfuser cpp tests are currently compiled into `nvfuser_tests`
4. cmake is refactored so that:
    - nvfuser now has its own `CMakeLists.txt`, which is under `torch/csrc/jit/codegen/cuda/`.
    - nvfuser backend code is not compiled inside `libtorch_cuda_xxx` any more
    - nvfuser is added as a subdirectory under `./CMakeLists.txt` at the very end after torch is built.
    - since nvfuser has dependency on torch, the registration of nvfuser at runtime is done via dlopen (`at::DynamicLibrary`). This avoids circular dependency in cmake, which will be a nightmare to handle. For details, look at `torch/csrc/jit/codegen/cuda/interface.cpp::LoadingNvfuserLibrary`

Future work that's scoped in following PR:
- Currently nvfuser codegen has a dependency on torch; we need to refactor that out so we can move nvfuser into a submodule and not rely on dlopen to load the library. @malfet
- Since we moved nvfuser into a cmake build, we effectively disabled bazel build for nvfuser. This could impact internal workloads at Meta, so we need to add that support back. cc'ing @vors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89621
Approved by: https://github.com/davidberard98
2023-01-26 02:50:44 +00:00
f0b592dae7 Make masked_fill reference traceable (#90972)
As the comment states, `item()` cannot be used since you can't trace through a
scalar.
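
A sketch of the traceable formulation: the reference keeps the fill value as a 0-dim tensor rather than calling `item()`, which would extract a Python scalar the trace cannot represent.

```py
import torch

def masked_fill_sketch(x: torch.Tensor, mask: torch.Tensor, value: torch.Tensor):
    # torch.where stays in tensor-land, so a tracer can follow it end to end.
    return torch.where(mask, value, x)

x = torch.randn(4)
print(masked_fill_sketch(x, x > 0, torch.tensor(0.0)))
```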

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90972
Approved by: https://github.com/ngimel
2023-01-18 10:54:42 +00:00
15949fc248 [ROCm] Enable few test_prim UTs for ROCm (#88983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88983
Approved by: https://github.com/IvanYashchuk, https://github.com/jeffdaily, https://github.com/malfet
2022-12-07 06:21:31 +00:00
3c9431f505 Add factory functions to python frontend (#89230)
- Add a `full` nvprim to support factory functions, because the `full` reference uses `empty` and `fill` while nvFuser has a native `full` factory function.
- Change `full_like` reference to call `full` to avoid defining another nvprim.
- Enable support for `new_zeros` to enable the `cudnn_batch_norm` decomposition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89230
Approved by: https://github.com/kevinstephano, https://github.com/mruberry
2022-12-06 07:16:21 +00:00
fe3a226d74 [minor] use set_default_dtype instead of try and finally (#88295)
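
The change in a nutshell (import path of the test helper assumed):

```py
import torch
from torch.testing._internal.common_utils import set_default_dtype

# Before: manual save/restore
saved = torch.get_default_dtype()
try:
    torch.set_default_dtype(torch.double)
    assert torch.empty(1).dtype is torch.double
finally:
    torch.set_default_dtype(saved)

# After: the context manager restores the dtype automatically
with set_default_dtype(torch.double):
    assert torch.empty(1).dtype is torch.double
```
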
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88295
Approved by: https://github.com/mruberry
2022-11-03 19:28:33 +00:00
4a8382b58e Update caching of tensor arguments for nvFuser's fusion creation (#87860)
Previously, nvFuser's fusion definition was cached based on the concrete shapes and strides of tensor inputs, for simplicity and correctness. This PR changes the Python-side cache to check the number of dimensions, size-1 dimensions, and contiguity information derived from the given strides and shapes.
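
A sketch of such a stride-agnostic cache key (names hypothetical):

```py
import torch

def fusion_cache_key(t: torch.Tensor):
    # Rank, positions of size-1 dimensions, and contiguity of the layout;
    # concrete sizes/strides are deliberately excluded.
    return (
        t.dim(),
        tuple(i for i, s in enumerate(t.shape) if s == 1),
        t.is_contiguous(),
    )

a = torch.randn(8, 1, 4)
b = torch.randn(9, 1, 5)  # different concrete sizes, same key
assert fusion_cache_key(a) == fusion_cache_key(b)
```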

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87860
Approved by: https://github.com/kevinstephano, https://github.com/jjsjann123, https://github.com/ngimel
2022-11-02 09:29:20 +00:00
c0761a835b Revert "[dynamo] Error when user nests FX with dynamo (#87797)"
This reverts commit 1da5aeb97b73664ff0fe2f4bb48379655cede969.

Reverted https://github.com/pytorch/pytorch/pull/87797 on behalf of https://github.com/ezyang due to breaks nvfuser stack, needs more investigation
2022-10-31 23:49:37 +00:00
1da5aeb97b [dynamo] Error when user nests FX with dynamo (#87797)
Today, this doesn't work and dynamo errors out in a very non-obvious way (see:
https://gist.github.com/suo/dde04830372ab51a4a34ea760f14200a).

Here, we detect the error early and exit with a nicer msg. Also add a
config option to just no-op dynamo (which need to unblock internal
enablement).

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87797
Approved by: https://github.com/yf225, https://github.com/soumith, https://github.com/jansel
2022-10-28 04:59:08 +00:00
ae4fbac819 Enable nvprims.transpose fusions for nvFuser (#86967)
This PR allows transposes to be fused with other operations. If a fusion group is formed only from operations that just manipulate metadata in PyTorch (transpose, view, etc.), then this group is not sent to nvFuser.
On top of that, if we have converted to `nvprims` but then decided not to form a fusion group, we modify the graph to use the `prim.impl_aten` attribute instead of calling `prim(*args, **kwargs)`, which has higher overhead.

cc @kevinstephano @jjsjann123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86967
Approved by: https://github.com/jjsjann123, https://github.com/SherlockNoMad
2022-10-26 17:00:07 +00:00
72f446b9bc Remove getitem special handling in the partitioner (#87073)
This special handling of getitem unnecessarily splits fusions at functions with tuple outputs.

Example script:
```py
import torch
from torch.fx.passes.infra.partitioner import CapabilityBasedPartitioner
from torch._prims.nvfuser_executor import NvfuserPrimOperatorSupport
from torch.fx.experimental.proxy_tensor import make_fx

def func(x):
    xx = torch.ops.nvprims.add(x, 1)
    var, mean = torch.ops.nvprims.var_mean(x, correction=0)
    var_cos = torch.ops.nvprims.cos(var)
    mean_sin = torch.ops.nvprims.sin(mean)
    return torch.ops.nvprims.add(var_cos, mean_sin)

a = torch.randn(5, 3, 3, device="cuda")
gm = make_fx(func)(a)
gm.graph.print_tabular()

supported_ops = NvfuserPrimOperatorSupport()
partitioner = CapabilityBasedPartitioner(
    gm, supported_ops, allows_single_node_partition=False
)
partitions = partitioner.propose_partitions()
print(partitions)
partitioned_graph = partitioner.fuse_partitions(partitions)
partitioned_graph.graph.print_tabular()
```
Output on master:
```py
opcode         name       target                       args              kwargs
-------------  ---------  ---------------------------  ----------------  -----------------
placeholder    x_1        x_1                          ()                {}
call_function  add        nvprims.add.default          (x_1, 1)          {}
call_function  var_mean   nvprims.var_mean.main        (x_1, [0, 1, 2])  {'correction': 0}
call_function  getitem    <built-in function getitem>  (var_mean, 0)     {}
call_function  getitem_1  <built-in function getitem>  (var_mean, 1)     {}
call_function  cos        nvprims.cos.default          (getitem,)        {}
call_function  sin        nvprims.sin.default          (getitem_1,)      {}
call_function  add_1      nvprims.add.default          (cos, sin)        {}
output         output     output                       (add_1,)          {}
[{cos, sin, add_1}, {var_mean, add, getitem, getitem_1}]
opcode         name       target                       args                    kwargs
-------------  ---------  ---------------------------  ----------------------  --------
placeholder    x_1        x_1                          ()                      {}
call_module    fused_1    fused_1                      (x_1,)                  {}
call_function  getitem_2  <built-in function getitem>  (fused_1, 0)            {}
call_function  getitem_3  <built-in function getitem>  (fused_1, 1)            {}
call_module    fused_0    fused_0                      (getitem_2, getitem_3)  {}
output         output     output                       (fused_0,)              {}
```
Output with this PR:
```
[{var_mean, add_1, cos, sin, add, getitem_1, getitem}]
opcode       name     target    args        kwargs
-----------  -------  --------  ----------  --------
placeholder  x_1      x_1       ()          {}
call_module  fused_0  fused_0   (x_1,)      {}
output       output   output    (fused_0,)  {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87073
Approved by: https://github.com/jjsjann123, https://github.com/SherlockNoMad
2022-10-26 14:18:46 +00:00
31931515bc Workarounds for cudnn_batch_norm with TorchRefsNvfuserCapabilityMode (#86796)
This PR adds workarounds to support AOT Autograd's graphs containing `aten.cudnn_batch_norm` and `aten.cudnn_batch_norm_backward` with `TorchRefsNvfuserCapabilityMode`.

The problem with the decomposition of `aten.cudnn_batch_norm` is that it uses a `new_empty` call that is not supported by nvFuser, and we are conservative about lowering functions to nvprims by default.

The problem with the decomposition of `aten.cudnn_batch_norm_backward` is described here https://github.com/pytorch/pytorch/pull/86115#issue-1394883782, but changing the decomposition directly in that PR makes many tests fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86796
Approved by: https://github.com/mruberry
2022-10-17 18:46:28 +00:00
fc3afc8407 Remove empty_like+fill from AOT Autograd graphs for nvFuser (#86908)
AOT Autograd records C++ code `1 - tensor` as a sequence of empty_like, fill, and sub (see https://github.com/pytorch/pytorch/issues/86612).

Neither empty_like nor fill is supported yet. This PR is a workaround to enable fusions of `silu_backward`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86908
Approved by: https://github.com/ngimel
2022-10-14 19:49:39 +00:00
fd80684784 Add nvFuser support for torch.Tensor.view (#84634)
This is an alternative to https://github.com/pytorch/pytorch/pull/83739. While PrimTorch has `view` as a reference, we would like to use nvFuser's implementation for `view` for now. Later we might transition to PrimTorch's `torch._refs.view`.

See `test_nvprims_view` for examples of things that are now sent to nvFuser. Note that nvFuser's `view` is a copy-like operation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84634
Approved by: https://github.com/kevinstephano, https://github.com/mruberry
2022-10-14 12:08:02 +00:00
68a6113248 Add nvFuser support for torch.native_batch_norm (#85562)
This PR adds nvFuser's implementation for batch_norm as there's no reference yet (https://github.com/pytorch/pytorch/pull/81191) and no in-place copy support (https://github.com/pytorch/pytorch/pull/84545).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85562
Approved by: https://github.com/kevinstephano, https://github.com/ngimel
2022-10-03 15:03:08 +00:00
fd553c46f4 nvprim op support runtime checks on dtype compatibility on prims.convert_element_type (#85566)
I'm seeing an issue where we lower `_to_copy` into `nvprims.convert_element_type`. In cases where we are casting to a dtype that's not supported by nvFuser, this raises a runtime error.

I added a quick check in the lowering logic where each op can peek at the fx.Node and make a runtime decision on whether the given op should be lowered to nvprims.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85566
Approved by: https://github.com/IvanYashchuk, https://github.com/ngimel
2022-09-30 23:19:25 +00:00
b00a5359f7 Add a way to skip lowering to nvprims (#85811)
This PR adds a `skip_ops` argument to `TorchRefsNvfuserCapabilityMode` and `NvfuserPrimsMode`: an iterable of function names to be skipped during translation to nvprims.
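
A hypothetical usage sketch (import path and name-string format assumed; this API was later removed along with the Dynamo+nvFuser integration):

```py
import torch
from torch._prims.context import TorchRefsNvfuserCapabilityMode

# Ops named in skip_ops are left untouched during translation to nvprims.
with TorchRefsNvfuserCapabilityMode(skip_ops=["torch.sigmoid"]):
    y = torch.sigmoid(torch.randn(4))  # stays as aten, not nvprims
```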

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85811
Approved by: https://github.com/mruberry, https://github.com/jjsjann123
2022-09-30 12:01:45 +00:00
cab6ffa0f7 catches failure on nvprim speculative lowering (#85580)
Fixes #85517

Added a try/except around tracing `get_isolated_graphmodule` inside `_is_func_unsupported_nvfuser`. This stops speculative lowering to nvprims when the query errors out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85580
Approved by: https://github.com/mruberry, https://github.com/IvanYashchuk
2022-09-29 15:22:45 +00:00
795028a3ce Make Python reference for permute accept varargs (#85460)
Fixes #85452
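
Both spellings now work, matching `torch.Tensor.permute`:

```py
import torch
import torch._refs

x = torch.randn(2, 3, 4)
y1 = torch._refs.permute(x, (2, 0, 1))  # sequence form
y2 = torch._refs.permute(x, 2, 0, 1)    # varargs form enabled by this PR
assert y1.shape == y2.shape == (4, 2, 3)
```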

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85460
Approved by: https://github.com/jjsjann123, https://github.com/mruberry, https://github.com/ngimel
2022-09-28 03:50:42 +00:00
18d8c548f4 [Modes] remove enable and rewrite mode stack (squashed) (#84774)
Based on @ezyang's suggestion, the mode stack now has "one true mode", which is the _only_ mode that can ever be active at the C++ level. That mode's torch dispatch just takes the top mode in the stack, reenables itself (if we aren't at the end of the mode stack), and runs the top mode's torch_{dispatch|function}.

This maintains that in the middle of a mode's torch dispatch, the mode itself will not be active. It changes the function the user has to call to see what the current mode is (it no longer queries the C++; it's python only), but allows the user to also see the entire mode stack easily.

Removes `enable_torch_dispatch_mode` and `.restore()` since neither makes sense in this new setup

### Background
Why do we want this? Well, a pretty common pattern that was coming up was that users had to do something like

```python
## PRE-PR UX
def f(mode):
  with mode.restore():  # user needs to understand this restore thing?
    ...

with Mode() as m:
  pass
f(m)
```

Many users were getting errors from forgetting to call `.restore` or from forgetting to add the (tbh weird) "mode instantiation" step where they use the mode as a context manager with an empty body. Really, they wanted to treat modes like context managers and just write
```python
## FROM FEEDBACK, USER DESIRED CODE. POSSIBLE POST-PR
def f(mode):
  with mode:
    ...
f(Mode())
```

### Technical Details
With the old mode stack, we basically had a linked list, so the mode itself could only be used once and had a fixed parent. In this new design, the mode stack is just a python list that we're pushing to and popping from. There's only one mode that's ever active at the C++ level, and it runs the next mode in the Python list. The modes don't have state on them anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84774
Approved by: https://github.com/ezyang, https://github.com/zou3519
2022-09-27 01:04:35 +00:00
c7b17d7eb1 Add nvprims rand_like support for Dropout (#85077)
NM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85077
Approved by: https://github.com/IvanYashchuk, https://github.com/mruberry
2022-09-23 18:03:35 +00:00
8ca1839d32 Python Dispatcher integration with C++ dispatcher (#85050)
#84826 but without ghstack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85050
Approved by: https://github.com/malfet
2022-09-15 00:43:36 +00:00
706b990306 Revert "Python Dispatcher integration with C++ dispatcher (#84826)"
This reverts commit 35f6a69191ef762cf22b6cbfe94b8d9406e16674.

Reverted https://github.com/pytorch/pytorch/pull/84826 on behalf of https://github.com/malfet due to Broke dynamo, see 35f6a69191
2022-09-14 14:07:58 +00:00
35f6a69191 Python Dispatcher integration with C++ dispatcher (#84826)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

From @ezyang's original PR:

There are a number of situations where we have non-backend kernels (e.g., CompositeImplicitAutograd, batching rules) which we would like to port to Python, but we have no way to integrate these ports with the overall system while using preexisting C++ registrations otherwise. This PR changes that by introducing a Python dispatcher (which can have its own kernels directly in Python), which can interpose over ordinary C++ dispatch. The ingredients:

We introduce a new PythonDispatcher dispatch key, that has the same tenor as FuncTorchDynamicLayerFrontMode: it works by getting triggered before every other dispatch key in the dispatch key set, and shunting to a Python implementation.
The Python dispatcher is a per-interpreter global object that is enabled/disabled via the guard EnablePythonDispatcher/DisablePythonDispatcher. We don't make it compositional as I have no idea what a compositional version of this feature would look like. Because it is global, we don't need to memory manage it and so I use a simpler SafePyHandle (newly added) to control access to this pointer from non-Python C++. Like __torch_dispatch__, we use PyInterpreter to get to the Python interpreter to handle the dispatch.
I need to reimplement dispatch table computation logic in Python. To do this, I expose a lot more helper functions for doing computations on alias dispatch keys and similar. I also improve the pybind11 handling for DispatchKey so that you can either accept the pybind11 bound enum or a string; this simplifies our binding code. See https://github.com/pybind/pybind11/issues/483#issuecomment-1237418106 for how this works; the technique is generally useful.

I need to be able to call backend fallbacks. I do this by permitting you to call at a dispatch key which doesn't have a kernel for the operator; if the kernel doesn't exist, we check the backend fallback table instead.
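
A minimal usage sketch of the guard from Python:

```py
import torch
from torch._dispatch.python import enable_python_dispatcher

# With the Python dispatcher enabled, Python-registered kernels (e.g.
# decompositions) can interpose before ordinary C++ dispatch.
with enable_python_dispatcher():
    out = torch.arange(4).add(1)
print(out)
```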

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84826
Approved by: https://github.com/ezyang
2022-09-14 06:57:19 +00:00
d802fcfcd8 Add config to PrimTorch's nvFuser executor (#84482)
This PR adds `executor_parameters` keyword argument to `torch._prims.executor.execute`.

For now there are two knobs:
* `use_python_fusion_cache: bool = True` whether to use lru_cache when constructing fusion object or not.
* `allow_single_op_fusion: bool = True` whether to allow fusions with single callable

Behavior can be controlled by passing a dict with custom values as the `executor_parameters` argument.
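
A usage sketch of the API described above (the exact `execute` signature is assumed, and the nvFuser executor has since been removed from PyTorch):

```py
import torch
from torch._prims.context import TorchRefsMode
from torch._prims.executor import execute
from torch.fx.experimental.proxy_tensor import make_fx

def fn(a, b):
    return torch.sigmoid(a) + b

a = torch.randn(4, device="cuda")
b = torch.randn(4, device="cuda")
with TorchRefsMode():
    gm = make_fx(fn)(a, b)  # trace to a graph of prims/refs

out = execute(
    gm, a, b,
    executor="nvfuser",
    executor_parameters={
        "use_python_fusion_cache": False,  # bypass the lru_cache on fusions
        "allow_single_op_fusion": True,
    },
)
```
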
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84482
Approved by: https://github.com/jjsjann123, https://github.com/ngimel
2022-09-09 07:58:21 +00:00
1a33e944b5 nvfuser torchbench patch (#84411)
1. Patching nvfuser_execute to take the aten fallback for nvprims when no CUDA tensors are provided as inputs;
2. Extending support of the nvfuser python API to cpu scalar tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84411
Approved by: https://github.com/ngimel, https://github.com/kevinstephano, https://github.com/IvanYashchuk
2022-09-07 05:22:37 +00:00