Commit Graph

37 Commits

60a68477a6 Bump black version to 23.1.0 (#96578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96578
Approved by: https://github.com/ezyang
2023-03-15 06:27:59 +00:00
78e04f8272 Update nvfuser_executor.py (#96218)
In https://github.com/csarofeen/pytorch/pull/2517 the return value of `compute_contiguity` was changed from a tuple to a list. This PR handles that change.
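
A minimal sketch of handling that change, with `compute_contiguity_fn` standing in for the nvfuser helper mentioned above:
```py
# hedged sketch: normalize the return value so both the older (tuple) and newer
# (list) behaviour of compute_contiguity are accepted
def contiguity_as_list(compute_contiguity_fn, sizes, strides):
    result = compute_contiguity_fn(sizes, strides)
    return list(result)  # works whether result is a tuple or already a list
```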

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96218
Approved by: https://github.com/jjsjann123, https://github.com/davidberard98
2023-03-08 22:07:58 +00:00
21eb7f70f1 Nvfuser python API import fix (#94036)
1. Makes the nvfuser Python API import work with both the devel and upstream repositories;
2. Adds an environment variable to allow a custom nvfuser code base to be built with upstream PyTorch core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94036
Approved by: https://github.com/malfet, https://github.com/davidberard98
2023-02-16 20:10:40 +00:00
c11b301bcd [NVFUSER] refactor nvfuser build (#89621)
This PR is the first step towards refactoring the build for nvfuser so that the codegen becomes a standalone library.

Contents inside this PR:
1. The nvfuser code base has been moved from `./torch/csrc/jit/codegen/cuda/` to `./nvfuser`, except for the registration code for integration (interface.h/interface.cpp)
2. The build system is split so nvfuser generates its own `.so` files. Currently these are:
    - `libnvfuser_codegen.so`, which contains the integration, codegen, and runtime system of nvfuser
    - `nvfuser.so`, which is nvfuser's Python API via pybind. The Python frontend is now exposed via `nvfuser._C.XXX` instead of `torch._C._nvfuser`
3. nvfuser C++ tests are currently compiled into `nvfuser_tests`
4. cmake is refactored so that:
    - nvfuser now has its own `CMakeLists.txt`, which is under `torch/csrc/jit/codegen/cuda/`.
    - nvfuser backend code is no longer compiled inside `libtorch_cuda_xxx`
    - nvfuser is added as a subdirectory under `./CMakeLists.txt` at the very end, after torch is built.
    - since nvfuser has a dependency on torch, the registration of nvfuser at runtime is done via dlopen (`at::DynamicLibrary`). This avoids a circular dependency in cmake, which would be a nightmare to handle. For details, look at `torch/csrc/jit/codegen/cuda/interface.cpp::LoadingNvfuserLibrary`

Future work that's scoped in following PR:
- Currently, since nvfuser codegen has a dependency on torch, we need to refactor that out so we can move nvfuser into a submodule and not rely on dlopen to load the library. @malfet
- Since we moved nvfuser into a cmake build, we effectively disabled the bazel build for nvfuser. This could impact internal workloads at Meta, so we need to add that support back. cc'ing @vors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89621
Approved by: https://github.com/davidberard98
2023-01-26 02:50:44 +00:00
c0d81aa70c Use fx.replace_pattern for removing empty_like+fill in nvFuser+PrimTorch execution (#89132)
I learned about `torch.fx.replace_pattern`, and it's a cleaner way of removing unnecessary tensor materialization from graphs produced by tracing the C++ code `1 - tensor`.
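
A minimal sketch of the pattern-replacement idea; the exact aten overloads below are illustrative assumptions, not necessarily the ones the PR matches:
```py
import torch
import torch.fx

# match the empty_like -> fill -> sub chain produced by tracing "1 - tensor"
def pattern(x):
    ones = torch.ops.aten.fill.Scalar(torch.ops.aten.empty_like.default(x), 1.0)
    return torch.ops.aten.sub.Tensor(ones, x)

# replace it with a direct reverse subtraction, avoiding the materialized tensor
def replacement(x):
    return torch.ops.aten.rsub.Scalar(x, 1.0)

def remove_filled_ones(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    torch.fx.replace_pattern(gm, pattern, replacement)
    gm.recompile()
    return gm
```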

Test:
```
python -m pytest test/test_prims.py -k "test_silu_backward_no_filled_tensor"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89132
Approved by: https://github.com/mruberry, https://github.com/jjsjann123
2022-11-24 09:37:10 +00:00
7551136b81 Add NVTX markers that dump additional information for nvprim_nvfuser Dynamo graphs (#88259)
Dumps information on graphs that NVFuser JIT compiles:
- the markers show the list of ops, args, and inputs that make up the graph (see the sketch below)

Also dumps information on FX nodes that are not touched by NVFuser:
- the markers show the op, name, and arg list of the node
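
A minimal sketch of the marker idea, assuming hypothetical `op_names`/`arg_names` strings collected from the partition:
```py
import torch

# push an NVTX range whose message carries the graph's ops and argument names,
# then pop it after the fused module has run
def run_with_nvtx_marker(fused_module, op_names, arg_names, *inputs):
    torch.cuda.nvtx.range_push(f"nvFuser fusion ops={op_names} args={arg_names}")
    try:
        return fused_module(*inputs)
    finally:
        torch.cuda.nvtx.range_pop()
```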

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88259
Approved by: https://github.com/IvanYashchuk, https://github.com/jjsjann123, https://github.com/mruberry
2022-11-18 22:36:08 +00:00
af09270e10 nvprims bookend non compute (#88457)
Cherry-picking: https://github.com/csarofeen/pytorch/pull/2099

1. Enables the bookend non-compute-ops pass on nvfuser
2. Fixes the bookend op check on intermediate tensors used as partition inputs
3. Adds Python tests for `getitem` special handling and bookend non-compute removal
4. Patches DFS by excluding DFS within a partition, to avoid exceeding the recursion limit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88457
Approved by: https://github.com/SherlockNoMad
2022-11-08 12:06:35 +00:00
79abea5683 nvprim python runtime dtype correctness patch (#88452)
Cherry-picking: https://github.com/csarofeen/pytorch/pull/2133

- [x] Casts the FusionDefinition output to the original dtype recorded in the GraphModule
- [x] Adds a Python repro with dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88452
Approved by: https://github.com/IvanYashchuk, https://github.com/mruberry
2022-11-04 19:17:07 +00:00
4a8382b58e Update caching of tensor arguments for nvFuser's fusion creation (#87860)
Previously nvFuser's fusion definition was cached based on the concrete shapes and strides of tensor inputs, for simplicity and correctness. This PR changes the Python-side cache to key on the number of dimensions, size-1 dimensions, and contiguity information derived from the given strides and shapes.
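
A minimal sketch of such a cache key, with the exact fields as an assumption based on the description above:
```py
import torch

# key on rank, which dimensions have size 1, and contiguity derived from the
# given sizes/strides, instead of the concrete shape and strides themselves
def tensor_cache_key(t: torch.Tensor):
    size_one_dims = tuple(i for i, s in enumerate(t.shape) if s == 1)
    return (t.ndim, size_one_dims, t.is_contiguous(), t.dtype)
```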

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87860
Approved by: https://github.com/kevinstephano, https://github.com/jjsjann123, https://github.com/ngimel
2022-11-02 09:29:20 +00:00
ae4fbac819 Enable nvprims.transpose fusions for nvFuser (#86967)
This PR allows transposes to be fused with other operations. If a fusion group is formed only from operations that just manipulate metadata in PyTorch (transpose, view, etc.) then this group is not sent to nvFuser.
On top of that, if we have converted to `nvprims` but then decided not to form a fusion group, we modify the graph to use the `prim.impl_aten` attribute instead of calling `prim(*args, **kwargs)`, which has a higher overhead.
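
A minimal sketch of the metadata-only check; the op set below is an illustrative assumption:
```py
# a proposed fusion group whose call_function nodes only rearrange metadata
# (transpose, view, ...) is not worth sending to nvFuser
_META_ONLY_OPS = {"nvprims.transpose.default", "nvprims.view.default"}

def is_metadata_only_group(nodes):
    return all(
        str(n.target) in _META_ONLY_OPS
        for n in nodes
        if n.op == "call_function"
    )
```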

cc @kevinstephano @jjsjann123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86967
Approved by: https://github.com/jjsjann123, https://github.com/SherlockNoMad
2022-10-26 17:00:07 +00:00
72f446b9bc Remove getitem special handling in the partitioner (#87073)
This special handling of getitem unnecessarily splits fusions at functions with tuple outputs.

Example script:
```py
import torch
from torch.fx.passes.infra.partitioner import CapabilityBasedPartitioner
from torch._prims.nvfuser_executor import NvfuserPrimOperatorSupport
from torch.fx.experimental.proxy_tensor import make_fx

def func(x):
    xx = torch.ops.nvprims.add(x, 1)
    var, mean = torch.ops.nvprims.var_mean(x, correction=0)
    var_cos = torch.ops.nvprims.cos(var)
    mean_sin = torch.ops.nvprims.sin(mean)
    return torch.ops.nvprims.add(var_cos, mean_sin)

a = torch.randn(5, 3, 3, device="cuda")
gm = make_fx(func)(a)
gm.graph.print_tabular()

supported_ops = NvfuserPrimOperatorSupport()
partitioner = CapabilityBasedPartitioner(
    gm, supported_ops, allows_single_node_partition=False
)
partitions = partitioner.propose_partitions()
print(partitions)
partitioned_graph = partitioner.fuse_partitions(partitions)
partitioned_graph.graph.print_tabular()
```
Output on master:
```py
opcode         name       target                       args              kwargs
-------------  ---------  ---------------------------  ----------------  -----------------
placeholder    x_1        x_1                          ()                {}
call_function  add        nvprims.add.default          (x_1, 1)          {}
call_function  var_mean   nvprims.var_mean.main        (x_1, [0, 1, 2])  {'correction': 0}
call_function  getitem    <built-in function getitem>  (var_mean, 0)     {}
call_function  getitem_1  <built-in function getitem>  (var_mean, 1)     {}
call_function  cos        nvprims.cos.default          (getitem,)        {}
call_function  sin        nvprims.sin.default          (getitem_1,)      {}
call_function  add_1      nvprims.add.default          (cos, sin)        {}
output         output     output                       (add_1,)          {}
[{cos, sin, add_1}, {var_mean, add, getitem, getitem_1}]
opcode         name       target                       args                    kwargs
-------------  ---------  ---------------------------  ----------------------  --------
placeholder    x_1        x_1                          ()                      {}
call_module    fused_1    fused_1                      (x_1,)                  {}
call_function  getitem_2  <built-in function getitem>  (fused_1, 0)            {}
call_function  getitem_3  <built-in function getitem>  (fused_1, 1)            {}
call_module    fused_0    fused_0                      (getitem_2, getitem_3)  {}
output         output     output                       (fused_0,)              {}
```
Output with this PR:
```
[{var_mean, add_1, cos, sin, add, getitem_1, getitem}]
opcode       name     target    args        kwargs
-----------  -------  --------  ----------  --------
placeholder  x_1      x_1       ()          {}
call_module  fused_0  fused_0   (x_1,)      {}
output       output   output    (fused_0,)  {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87073
Approved by: https://github.com/jjsjann123, https://github.com/SherlockNoMad
2022-10-26 14:18:46 +00:00
fc3afc8407 Remove empty_like+fill from AOT Autograd graphs for nvFuser (#86908)
AOT Autograd records C++ code `1 - tensor` as a sequence of empty_like, fill, and sub (see https://github.com/pytorch/pytorch/issues/86612).

Neither empty_like nor fill is supported yet. This PR is a workaround to enable fusions of `silu_backward`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86908
Approved by: https://github.com/ngimel
2022-10-14 19:49:39 +00:00
68a6113248 Add nvFuser support for torch.native_batch_norm (#85562)
This PR adds nvFuser's implementation for batch_norm as there's no reference yet (https://github.com/pytorch/pytorch/pull/81191) and no in-place copy support (https://github.com/pytorch/pytorch/pull/84545).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85562
Approved by: https://github.com/kevinstephano, https://github.com/ngimel
2022-10-03 15:03:08 +00:00
fd553c46f4 nvprim op support runtime checks on dtype compatibility on prims.convert_element_type (#85566)
I'm seeing an issue where we lower `_to_copy` into `nvprims.convert_element_type`. In cases where we are casting to a dtype that's not supported by nvfuser, this raises a runtime error.

I added a quick check in the lowering pass so each op can peek at the fx.Node and make a runtime decision on whether the given op should be lowered to an nvprim.
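
A minimal sketch of such a check; the supported dtype set below is an illustrative assumption:
```py
import torch

# consulted during lowering: peek at the fx.Node for _to_copy and skip the
# nvprim lowering when the requested dtype is not supported by nvfuser
_NVFUSER_DTYPES = {
    torch.float16, torch.float32, torch.float64,
    torch.int32, torch.int64, torch.bool,
}

def can_lower_to_copy(node):
    dtype = node.kwargs.get("dtype", None)
    return dtype is None or dtype in _NVFUSER_DTYPES
```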
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85566
Approved by: https://github.com/IvanYashchuk, https://github.com/ngimel
2022-09-30 23:19:25 +00:00
d802fcfcd8 Add config to PrimTorch's nvFuser executor (#84482)
This PR adds `executor_parameters` keyword argument to `torch._prims.executor.execute`.

For now there are two knobs:
* `use_python_fusion_cache: bool = True`: whether to use lru_cache when constructing the fusion object.
* `allow_single_op_fusion: bool = True`: whether to allow fusions with a single callable.

Behavior can be controlled by passing a dict with custom values as the `executor_parameters` argument, as in the sketch below.
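
A usage sketch, assuming a traced `gm` and input `a` as in the other examples in this log:
```py
from torch._prims.executor import execute

result = execute(
    gm, a,
    executor="nvfuser",
    executor_parameters={
        "use_python_fusion_cache": False,  # rebuild the fusion object on each call
        "allow_single_op_fusion": True,    # keep single-op fusions enabled
    },
)
```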
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84482
Approved by: https://github.com/jjsjann123, https://github.com/ngimel
2022-09-09 07:58:21 +00:00
1a33e944b5 nvfuser torchbench patch (#84411)
1. Patches nvfuser_execute to take the aten nvprim fallback when no CUDA tensors are provided as inputs (see the sketch below)
2. Extends support of the nvfuser Python API to CPU scalar tensors.
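
A minimal sketch of the fallback condition for item 1:
```py
import torch

# use nvFuser only when at least one input is a CUDA tensor; otherwise take the
# aten/eager fallback path
def should_use_nvfuser(flat_args):
    return any(isinstance(a, torch.Tensor) and a.is_cuda for a in flat_args)
```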
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84411
Approved by: https://github.com/ngimel, https://github.com/kevinstephano, https://github.com/IvanYashchuk
2022-09-07 05:22:37 +00:00
edab44f6dd Support a few corner cases for nvFuser executor (#84416)
This PR adds asserts to the `nvfuser_execute` function for the cases that do not work. Fallback to eager is used in those cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84416
Approved by: https://github.com/jjsjann123, https://github.com/ngimel
2022-09-05 08:49:01 +00:00
2a332afbf4 Add SymFloat, support SymInt to SymFloat conversion (#84284)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84284
Approved by: https://github.com/albanD
2022-09-03 01:30:32 +00:00
2d969dc2ca Revert "Support a few corner cases for nvFuser executor (#84416)"
This reverts commit 3db3845f5f20047d9a30f450d3936e4113975ae6.

Reverted https://github.com/pytorch/pytorch/pull/84416 on behalf of https://github.com/malfet due to Broke both trunk and pull, see 3db3845f5f
2022-09-02 17:40:17 +00:00
3db3845f5f Support a few corner cases for nvFuser executor (#84416)
This PR adds asserts to the `nvfuser_execute` function for the cases that do not work. Fallback to eager is used in those cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84416
Approved by: https://github.com/jjsjann123, https://github.com/ngimel
2022-09-02 14:57:05 +00:00
0fd173b097 Revert "Support a few corner cases for nvFuser executor (#84416)"
This reverts commit 3ac9f6683dc8f17e030699da4df6c767f22939b6.

Reverted https://github.com/pytorch/pytorch/pull/84416 on behalf of https://github.com/IvanYashchuk due to trunk CI failing because of a sneaked-in print_tabular() call
2022-09-02 10:45:41 +00:00
3ac9f6683d Support a few corner cases for nvFuser executor (#84416)
This PR adds asserts to the `nvfuser_execute` function for the cases that do not work. Fallback to eager is used in those cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84416
Approved by: https://github.com/jjsjann123, https://github.com/ngimel
2022-09-02 06:42:39 +00:00
90161c23cf Add nvfuser support for squeeze (#84117)
"_refs.squeeze" and "refs.unsqueeze" now work with nvfuser executor tests.

Similarly to `_refs.reshape` we need to explicitly save the concrete shape on the trace to pass that info to nvfuser, as it gets lost in translation (https://github.com/pytorch/pytorch/pull/83739#discussion_r950352124).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84117
Approved by: https://github.com/ngimel
2022-08-30 20:36:11 +00:00
3aae6ff1e1 Add nvprims.var_mean (#83508)
This PR adds an nvfuser-specific primitive, `var_mean`.
The interpretation `torch.var_mean` -> `torch.ops.nvprims.var_mean` is handled by the `TorchRefsNvfuserCapabilityMode` context manager.

I moved some helper code from `_prims/__init__.py` to `_prims_common`. Correctness is tested with OpInfo tests (see `PythonRefInfo("ops.nvprims.var_mean")`).

The layer norm reference now uses `torch.var_mean` instead of `torch._refs.var_mean` to allow interception. Here's a simple comparison of performance with this PR and master (on a 3080 Ti):
```py
import torch
from torch._prims.context import TorchRefsNvfuserCapabilityMode
from torch.fx.experimental.proxy_tensor import make_fx
from torch._prims.executor import execute

def func(a):
    return torch.native_layer_norm(a, (1024,), None, None, 1e-6)

a = torch.randn(10, 512, 1024, dtype=torch.float16, device="cuda")

with TorchRefsNvfuserCapabilityMode():
    gm = make_fx(func)(a)

for _ in range(10):
    execute(gm, a, executor="strictly_nvfuser");
```
run with `PYTORCH_NVFUSER_DUMP=dump_eff_bandwidth python script.py`
```py
# WITH THIS PR
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.033792 ms, achieved: 621.818 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.032608 ms, achieved: 644.396 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.03072 ms, achieved: 684 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s

# ON MASTER
# kernel1 run in 0.05632 ms, achieved: 373.091 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043808 ms, achieved: 479.649 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
```
So this PR gives about 35% improvement in performance using nvfuser executor with this specific normalized shape.

Also this PR fixes https://github.com/pytorch/pytorch/issues/83506 (see the change in `torch/csrc/jit/python/pybind_utils.cpp`).

Ref. https://github.com/pytorch/pytorch/issues/80187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83508
Approved by: https://github.com/ngimel
2022-08-28 18:45:25 +00:00
b159a5230f Revert "Add nvprims.var_mean (#83508)"
This reverts commit 7e7694b6615fbf46abfab234615fa891c2819eb7.

Reverted https://github.com/pytorch/pytorch/pull/83508 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-08-28 11:30:27 +00:00
7e7694b661 Add nvprims.var_mean (#83508)
This PR adds an nvfuser-specific primitive, `var_mean`.
The interpretation `torch.var_mean` -> `torch.ops.nvprims.var_mean` is handled by the `TorchRefsNvfuserCapabilityMode` context manager.

I moved some helper code from `_prims/__init__.py` to `_prims_common`. Correctness is tested with OpInfo tests (see `PythonRefInfo("ops.nvprims.var_mean")`).

The layer norm reference now uses `torch.var_mean` instead of `torch._refs.var_mean` to allow interception. Here's a simple comparison of performance with this PR and master (on a 3080 Ti):
```py
import torch
from torch._prims.context import TorchRefsNvfuserCapabilityMode
from torch.fx.experimental.proxy_tensor import make_fx
from torch._prims.executor import execute

def func(a):
    return torch.native_layer_norm(a, (1024,), None, None, 1e-6)

a = torch.randn(10, 512, 1024, dtype=torch.float16, device="cuda")

with TorchRefsNvfuserCapabilityMode():
    gm = make_fx(func)(a)

for _ in range(10):
    execute(gm, a, executor="strictly_nvfuser");
```
run with `PYTORCH_NVFUSER_DUMP=dump_eff_bandwidth python script.py`
```py
# WITH THIS PR
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.033792 ms, achieved: 621.818 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.032608 ms, achieved: 644.396 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.03072 ms, achieved: 684 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s

# ON MASTER
# kernel1 run in 0.05632 ms, achieved: 373.091 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043808 ms, achieved: 479.649 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
```
So this PR gives about 35% improvement in performance using nvfuser executor with this specific normalized shape.

Also this PR fixes https://github.com/pytorch/pytorch/issues/83506 (see the change in `torch/csrc/jit/python/pybind_utils.cpp`).

Ref. https://github.com/pytorch/pytorch/issues/80187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83508
Approved by: https://github.com/ngimel
2022-08-27 09:05:20 +00:00
108a1fb173 Avoid using fx.Interpreter in nvfuser executor function (#83607)
Using fx.Interpreter is a nice way of modifying the calls inside of FX graphs, but it introduces unnecessary overhead in this case.

Example:
```py
import torch
from torch.fx.experimental.proxy_tensor import make_fx
from torch._prims.context import TorchRefsNvfuserCapabilityMode
from torch._prims.executor import execute
a = torch.randn(3, 2, dtype=torch.float16, device="cuda")
s = torch.sigmoid
d = torch.digamma  # digamma is not supported in nvfuser and aten eager execution is used
def func(a):
    return s(d(s(d(s(d(s(a)))))))
with TorchRefsNvfuserCapabilityMode():
    gm = make_fx(func)(a)

%%timeit
execute(gm, a, executor="nvfuser"); torch.cuda.synchronize();
# On master: 350 µs
# With this PR: 130 µs
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83607
Approved by: https://github.com/ezyang
2022-08-19 16:20:35 +00:00
9f03444f70 Add torch.ops.aten -> torch._refs mapping to TorchRefsMode using decomposition_table (#82657)
### Description
This PR adds the ability to convert `torch.ops.aten` calls to `torch._refs` (and consequently prims) under TorchRefsMode.
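
A usage sketch following the `TorchRefsMode.push()` style used in the other examples in this log (the exact mode API may differ):
```py
import torch
from torch._prims.context import TorchRefsMode
from torch.fx.experimental.proxy_tensor import make_fx

def func(a):
    # an explicit aten overload call, which the mode remaps to _refs/prims
    return torch.ops.aten.sigmoid.default(a)

a = torch.randn(3, 3)
with TorchRefsMode.push():
    gm = make_fx(func)(a)

# expected to show refs/prims ops rather than the aten overload
gm.graph.print_tabular()
```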

### Testing
New test, `test_aten_overload_to_prims`, in `test/test_prims.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82657
Approved by: https://github.com/jjsjann123, https://github.com/ezyang
2022-08-17 14:46:06 +00:00
7a4a8f327a Add new NVFuser Python Frontend Record Keeping for Cache enablement. (#81578)
This PR does not include an NVFuser frontend cache, but it decouples the backend Fusion IR exposure and instead builds the Fusion IR as needed (as a cache would), by recording the requested definition for replay so the process of building a Fusion starts only if it doesn't already exist. Another PR will be put up to include the actual caching.

The main change in the Python frontend is that the NVFuser Fusion IR is no longer directly defined by the interface. Currently, there is a direct connection between the Python API and the creation of the Fusion IR and Object. This means the user defines TensorViews and Scalars, and calls arith functions (IR Expressions) on those IR Values. The goal is to disconnect the Python API from directly specifying the Fusion IR and enable caching of the IR, so a Fusion Object is not necessarily built every time a Fusion Definition is seen.

The FusionDefinition in Python will mostly look the same, except the definition is now recorded in a lightweight representation called a "Recording" of Records. If the definition is not already cached, the Records are executed to build the Fusion IR. Initially, there is no caching because I am trying to bring up the representation first and get it working correctly.

This is what the Records look like. The Records are functors that are called when it is necessary to build the Fusion IR; they are defined in:
torch/csrc/jit/codegen/cuda/python_frontend/fusion_record.h

**Tensor Definition Record**
_Note: The Tensor Definition will change for runtime contiguity caching; I am just matching what is already there for now._
```
InputTensorRecord(
      std::vector<size_t> _outputs,
      std::vector<int64_t> _symbolic_sizes,
      std::vector<bool> _contiguous_info,
      NvfDataType _dtype)
      : RecordFunctor({}, std::move(_outputs)),
        symbolic_sizes(std::move(_symbolic_sizes)),
        contiguous_info(std::move(_contiguous_info)),
        dtype(_dtype) {}
  void operator()(FusionDefinition& fd) final {
    auto tv = TensorViewBuilder()
                  .ndims(symbolic_sizes.size())
                  .contiguity(contiguous_info)
                  .shape(symbolic_sizes)
                  .dtype(dtype)
                  .build();

    fd.fusion_state.at(outputs.at(0)) = tv;
    fd.addInput(tv);
  }

  std::vector<int64_t> symbolic_sizes;
  std::vector<bool> contiguous_info;
  NvfDataType dtype;
};

```
**Generic Templatized Op Record Definition**
Op Records are notable because they record Fusion IR arith functions as the `fusion_op_`.
```
template <class OutType, class... ArgTypes>
struct OpRecord : RecordFunctor {
  OpRecord(
      std::vector<size_t> _args,
      std::vector<size_t> _outputs,
      std::function<OutType(ArgTypes...)> fusion_op)
      : RecordFunctor(std::move(_args), std::move(_outputs)),
        fusion_op_(fusion_op) {}

  template <class TupleType, std::size_t... Is>
  OutType opFunc(
      FusionDefinition& fd,
      TupleType& tp,
      std::index_sequence<Is...>) {
    return fusion_op_(
        dynamic_cast<typename std::tuple_element<Is, TupleType>::type>(
            fd.fusion_state.at(args.at(Is)))...);
  }

  void operator()(FusionDefinition& fd) final {
    using arg_tuple_t = std::tuple<ArgTypes...>;
    auto indices =
        std::make_index_sequence<std::tuple_size<arg_tuple_t>::value>();
    arg_tuple_t inputs;
    auto output = opFunc(fd, inputs, indices);
    fd.fusion_state.at(outputs.at(0)) = output;
  }

 private:
  std::function<OutType(ArgTypes...)> fusion_op_;
};
```

Perhaps the most confusing aspect of the Python frontend is the `FusionDefinition`. The C++ class that is bound to is purposely very lightweight. To make sure users don't have to touch more than one file when adding new ops (assuming an appropriate Record has already been defined), the Python bindings effectively create functions that act on the FusionDefinition and appear as part of the class in Python, but are not part of the class in C++.

Here is an example of a unary op macro. It creates the binding to a lambda function that effectively appears as a FusionDefinition operation in Python. The other way to do this would have been to create a class method directly in the C++ `FusionDefinition` and add a separate binding to that method.

```
#define NVFUSER_PYTHON_BINDING_UNARY_OP(op_str, op_name)              \
  nvf_ops.def(                                                        \
      op_str,                                                         \
      [](nvfuser::FusionDefinition::Operators& self,                  \
         nvfuser::Tensor* input) -> nvfuser::Tensor* {                \
        nvfuser::Tensor* output = new nvfuser::Tensor(                \
            self.fusion_definition->recording_state.size());          \
        self.fusion_definition->recording_state.emplace_back(output); \
        self.fusion_definition->recording.emplace_back(               \
            new nvfuser::OpRecord<NvfTensorView*, NvfTensorView*>(    \
                {input->index},                                       \
                {output->index},                                      \
                static_cast<NvfTensorView* (*)(NvfTensorView*)>(      \
                    torch::jit::fuser::cuda::op_name)));              \
        return output;                                                \
      },                                                              \
      py::return_value_policy::reference);                            \
```

Here is the `FusionDefinition` class, edited for brevity. The playing of the Records is found under the `exit()` method, where exit refers to exiting the Python context manager. A `FusionDefinition` is captured through a context manager like the following:

```
fusion = Fusion()
with FusionDefinition(fusion) as fd :
    t0 = fd.define_tensor(sizes=[5], strides=[1])
    t1 = fd.ops.abs(t0)
    fd.add_output(t1)
```

```
class FusionDefinition {
 public:
  FusionDefinition(FusionOwner* fusion_owner)
    : fusion_owner_(fusion_owner),
      prev_fusion_(nullptr),
      recording(),
      recording_state(),
      fusion_state(),
      ops(this) {}

  // Context Manager Methods
  FusionDefinition* enter() {
    prev_fusion_ = FusionGuard::getCurFusion();
    FusionGuard::setCurFusion(fusionPtr());
    return this;
  }

  void exit() {
    // Found in the Python Bindings, currently.
    //for (auto& record : recording) {
    //  auto functor = record.get();
    //  (*functor)(self);
    //}

    FusionGuard::setCurFusion(prev_fusion_);
    prev_fusion_ = nullptr;
  }

  void addInput(torch::jit::fuser::cuda::Val* input) {
    fusionPtr()->addInput(input);
  }
  void addOutput(torch::jit::fuser::cuda::Val* output) {
    fusionPtr()->addOutput(output);
  }

  Fusion* fusionPtr() {
    return fusion_owner_->fusionPtr();
  }

 private:
  FusionOwner* fusion_owner_;
  Fusion* prev_fusion_;

 public:
  std::vector<std::unique_ptr<RecordFunctor>> recording;
  std::vector<std::unique_ptr<State>> recording_state;
  std::vector<NvfVal*> fusion_state;

  struct Operators {
    Operators(FusionDefinition* fd) : fusion_definition(fd) {}

    // Python operations are effectively bound here.

    FusionDefinition* fusion_definition;
  };

  Operators ops;
};
```

The Fusion IR doesn't have `define_tensor` or `define_scalar` functions. I made up those names for the Python `FusionDefinition` as a more understandable/convenient way to define input tensors and scalars. `TensorView` objects and Fusion IR `Val` objects are not typically defined outside of a Fusion IR `Expr` output (typically arith function outputs), except for inputs to a graph. Mechanically speaking, there are two things you need to do to define an input in the Fusion IR: define the IR `TensorView`/`Val` object, and then record that object as an input in the `Fusion` object that encapsulates the Fusion IR. Since the `FusionDefinition` does not correspond one-to-one with the Fusion IR, and `define_tensor` and `define_scalar` are made-up functions, I decided to combine the `Val` object creation and the recording of the input in the `Fusion` object into one step, to reduce the amount of syntax required to define a Fusion in the Python interface.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81578
Approved by: https://github.com/jjsjann123, https://github.com/IvanYashchuk, https://github.com/SherlockNoMad
2022-07-26 21:34:20 +00:00
12cb26509a Apply ufmt to torch internal (#81643)
This is a big-bang PR; merge conflicts are expected and will be addressed at merge.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81643
Approved by: https://github.com/ezyang
2022-07-22 02:19:50 +00:00
a5fb41e3d3 Revert "Revert "Refactored prim utils into _prims_utils folder (#81746)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81746
Approved by: https://github.com/anijain2305, https://github.com/Krovatkin
2022-07-20 23:43:57 +00:00
a3d5d2ddf1 Add partitioned nvFuser executor with ATen fallbacks (#81043)
This PR introduces a new nvFuser executor for FX graphs containing different kinds of nodes, not just the `torch.ops.prims` supported by nvFuser. The FX graph is partitioned based on whether nodes are supported by nvFuser, and supported nodes are fused into subgraphs; this all uses Sherlock's work on the partitioner.

This new partition-based executor with fallbacks to ATen is used by default with `executor="nvfuser"`. The previous executor can be used with `executor="strictly_nvfuser"`; naming suggestions are welcome!
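
A usage sketch, assuming a traced `gm` and CUDA tensor `a` as in the other examples in this log:
```py
from torch._prims.executor import execute

# the default "nvfuser" executor partitions the graph and falls back to ATen
# for unsupported nodes; "strictly_nvfuser" keeps the previous behavior
out_partitioned = execute(gm, a, executor="nvfuser")
out_strict = execute(gm, a, executor="strictly_nvfuser")
```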
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81043
Approved by: https://github.com/jjsjann123, https://github.com/SherlockNoMad
2022-07-20 19:51:20 +00:00
e43a02c314 Revert "Refactored prim utils into _prims_utils folder (#81088)"
This reverts commit 80231d0a72453573728242daa0dbc9cc7b45669c.

Reverted https://github.com/pytorch/pytorch/pull/81088 on behalf of https://github.com/jeanschmidt due to breaking internal tests
2022-07-19 19:56:41 +00:00
80231d0a72 Refactored prim utils into _prims_utils folder (#81088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81088
Approved by: https://github.com/ngimel
2022-07-19 03:55:51 +00:00
2f55050fb5 Bump nvfuser executor lru cache max size (#81461)
The default cache size of 128 has been causing cache misses on some benchmarks with more than 128 partitions. Bumping it up to a more reasonable cache size.
Note that the simple LRU cache doesn't give us any reuse of repetitive patterns, but that shouldn't be much of an issue in the next iteration of the nvfuser Python API.

Script for running benchmarks:
https://github.com/SherlockNoMad/NvFuserSample

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81461
Approved by: https://github.com/SherlockNoMad
2022-07-14 22:15:29 +00:00
6b280e880a Update NvFuserOperatorSupport (#81311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81311
Approved by: https://github.com/davidberard98
2022-07-12 21:19:37 +00:00
9a12aa6cad Add cached nvFuser's fusion creation for torch._prims.executor (#80525)
In the current setup, for each call of the `execute` function a `Fusion` object was constructed from the `GraphModule` and args, which is expensive.

This PR makes use of `functools.lru_cache` to pay the `Fusion` creation cost once per `GraphModule` and set of args. Currently the shape, strides, and dtype of tensors are static; this can be changed later to make better use of nvFuser's internal caching mechanism (by specifying only ndim, contiguity, and dtype).
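
A minimal sketch of the caching pattern, with `build_fusion` as a hypothetical stand-in for the real nvFuser construction step:
```py
import functools
import torch

def arg_spec(t: torch.Tensor):
    # concrete shape/strides/dtype are part of the key for now; this could be
    # relaxed to (ndim, contiguity, dtype) later
    return (tuple(t.shape), tuple(t.stride()), t.dtype)

@functools.lru_cache(maxsize=None)
def cached_fusion(graph_code: str, specs: tuple):
    return build_fusion(graph_code, specs)

def build_fusion(graph_code, specs):
    # placeholder standing in for the expensive Fusion construction
    return (graph_code, specs)
```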

On master:
```py
In [2]: a = torch.randn(3, 3, device='cuda')

In [3]: with TorchRefsMode.push():
   ...:     gm = make_fx(lambda x: torch.sigmoid(x))(a)
   ...:

In [4]: %%timeit
   ...: execute(gm, a, executor="nvfuser")
   ...: torch.cuda.synchronize()
175 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
This PR:
```py
In [2]: a = torch.randn(3, 3, device='cuda')

In [3]: with TorchRefsMode.push():
   ...:     gm = make_fx(lambda x: torch.sigmoid(x))(a)
   ...:

In [4]: %%timeit
   ...: execute(gm, a, executor="nvfuser")
   ...: torch.cuda.synchronize()
62.6 µs ± 9.99 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

In addition, this PR adds support for pytree inputs and extends the test for this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80525
Approved by: https://github.com/kevinstephano, https://github.com/jjsjann123, https://github.com/SherlockNoMad
2022-07-05 17:00:45 +00:00