Commit Graph

224 Commits

501d0729cb move build_variables.bzl and ufunc_defs.bzl from pytorch-root/tools/ to the root
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78542

This makes importing easier in different build systems that have
different absolute names for the pytorch-root.

Differential Revision: [D36782582](https://our.internmc.facebook.com/intern/diff/D36782582/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36782582/)!

Approved by: https://github.com/malfet
2022-06-02 19:39:27 +00:00
844368c032 remove Bazel globs that don't match any files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78007

These are vestigial.

Differential Revision: [D36558465](https://our.internmc.facebook.com/intern/diff/D36558465/)

Approved by: https://github.com/kit1980
2022-06-02 18:39:47 +00:00
2aad28a539 [quant][core][gpu][feature] Implemented quantized cuda gelu
Summary:
Support for quantized cuda gelu has been provided by using
`dequantize -> fp32 cuda gelu kernel -> quantize`. Mathematically, this
is not equivalent to doing int8 gelu, so we have opted for this approach
for now. It might be possible to write a variant of the int8 gelu that's
equivalent to `dequantize -> fp32 cuda gelu kernel -> quantize`, which
can be a topic for future work.
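
For illustration, a minimal sketch of the dequantize -> fp32 gelu -> requantize composition described above, written against the public torch API rather than the new kernel itself (the scale/zero_point values are arbitrary placeholders, and quantized CUDA tensor support depends on the build):

```python
import torch
import torch.nn.functional as F

# Placeholder quantization parameters; whether quantized CUDA tensors are
# supported depends on your PyTorch build/version.
x = torch.randn(4, 8, device="cuda")
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)

# The composition this commit implements internally for the quantized CUDA path:
y_fp32 = F.gelu(qx.dequantize())                              # fp32 CUDA gelu kernel
qy = torch.quantize_per_tensor(y_fp32, 0.1, 0, torch.quint8)  # re-quantize
```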

Test function `test_qgelu` was amended to test gelu for quantized cuda
backends.

Test Plan:
```
python test/test_quantization.py -k test_qgelu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77212

Approved by: https://github.com/jerryzh168
2022-05-24 22:31:45 +00:00
02c4d877b4 Codegen Non-Native IR Nodes (#76535)
Add codegen infrastructure to generate IR nodes for non-native ops.

The proposed change is to add a `non_native` key to the `{backend}_native_functions.yaml` file that contains schema definitions similar to what is found in `native_functions.yaml`. e.g.
```
non_native:
    ...
    - func: expand(Tensor input, int[] size, bool is_scalar_expand) -> Tensor
    ...
```
these definitions are parsed into a `LazyIrSchema` that can be used for generating IR nodes using `GenLazyIR`.
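
As a toy illustration (not the actual GenLazyIR/LazyIrSchema code), reading the new non_native key and splitting each schema string might look like:

```python
import yaml

example = """
non_native:
  - func: expand(Tensor input, int[] size, bool is_scalar_expand) -> Tensor
"""

for entry in yaml.safe_load(example)["non_native"]:
    schema = entry["func"]
    name, rest = schema.split("(", 1)          # operator name
    args, returns = rest.rsplit(") -> ", 1)    # argument list and return type
    print(name, [a.strip() for a in args.split(",")], returns)
    # expand ['Tensor input', 'int[] size', 'bool is_scalar_expand'] Tensor
```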

Fixes #74628

CC: @wconstab @desertfire @henrytwo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76535
Approved by: https://github.com/wconstab
2022-05-24 19:29:23 +00:00
e517fc8b28 eliminate Bazel's libtorch_cpp_generated_sources
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76179

This list is redundant with the shared build structure.

Differential Revision: [D35818500](https://our.internmc.facebook.com/intern/diff/D35818500/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D35818500/)!

Approved by: https://github.com/dreiss
2022-05-17 03:46:49 +00:00
a013d83bf9 eliminate Bazel's libtorch_python_generated_sources
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76178

These contents are already identified in the shared build structure.

Differential Revision: [D35817999](https://our.internmc.facebook.com/intern/diff/D35817999/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D35817999/)!

Approved by: https://github.com/dreiss
2022-05-17 03:43:02 +00:00
7eaf4780ba Revert "[LT] Store OpKind for each IR subclass in a static field"
This reverts commit ac37ddc79557d7a5ec184a7d2924241ccfc8a333.

Reverted https://github.com/pytorch/pytorch/pull/76711 on behalf of https://github.com/malfet
2022-05-09 20:50:09 +00:00
ac37ddc795 [LT] Store OpKind for each IR subclass in a static field
Summary: Currently OpKind is stored as an object field called op_ for each IR
node, and one usage of op_ is to avoid dynamic_cast in NodeCast when we
need to downcast a base-node pointer into a concrete sub-node pointer.
As a result, we need to construct and pass in an op when downcasting
nodes, and this becomes quite annoying when we start to implement the
trie-based IR node reuse. More importantly, the op for each subclass
should be unique for that subclass and thus making it a const static field
is a more logical design.

In this PR, we still keep the object-level op_ for easier XLA adoption. As
future work, we can come back to remove op_, make the op() method
virtual, and get rid of OpKind in all the node constructors.
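
A hypothetical Python analogy of the design change (the real code is C++ in the lazy tensor core): the op kind becomes a per-class constant rather than a per-instance field, so a downcast helper no longer needs an op passed in:

```python
class Node:
    op_kind = None  # overridden by each subclass

    def op(self):
        return type(self).op_kind  # one constant per subclass, not per object


class Expand(Node):
    op_kind = "aten::expand"  # static/const for every Expand node


def node_cast(node, node_type):
    # The expected kind is implied by the target type; callers no longer
    # construct and pass an op just to downcast.
    assert node.op() == node_type.op_kind
    return node


print(node_cast(Expand(), Expand).op())  # aten::expand
```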

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76711

Approved by: https://github.com/wconstab, https://github.com/JackCaoG
2022-05-06 19:14:46 +00:00
37fb636b7f fix package violation caused by D35587412 (#76808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76808

This reached into aten/TARGETS in fbcode.
ghstack-source-id: 155484095

Test Plan: Verified manually.

Reviewed By: dreiss, malfet

Differential Revision: D36128458

fbshipit-source-id: c7447b3a40fe905993e799d211241e72183f8acb
(cherry picked from commit b68eb7a45d8973fadab2dfcafcbb0f63801abd40)
2022-05-05 23:39:03 +00:00
ac45fb9b93 switch Bazel to the shared generate-code genrule (#75790)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75790

We were building it before, but now we use it in downstream
rules. This enables us to eliminate the handwritten genrule.
ghstack-source-id: 155300051

Test Plan: Verified locally and in CI.

Reviewed By: dreiss

Differential Revision: D35645390

fbshipit-source-id: 478bb37a6ec295c232f66383babf46606e83ed5e
(cherry picked from commit 2822d4c5b48c6d9282149b2d43cf72d645237196)
2022-05-04 15:26:25 +00:00
eb27c85160 move generate-code into shared build structure (#75699)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75699

ghstack-source-id: 155255334

Test Plan: Rely on CI.

Reviewed By: dreiss

Differential Revision: D35587412

fbshipit-source-id: 5ab79c07029de279a1fae36519654a73bb61d430
(cherry picked from commit 4896b72a6c0cc087e36889d21d2d885009d94a6d)
2022-05-03 09:53:37 +00:00
b204ad863f Revert "Revert "Allow specifying tags for aten operators in native_functions.yaml""
This reverts commit ea44645c9a682a4e212e64b94a86383c3388ed6b.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76456

Approved by: https://github.com/osalpekar
2022-04-28 02:04:57 +00:00
6e292f1a21 [quant][core][gpu][improvement] Integrated quantized cudnn max pool2d with existing quantized_max_pool2d (#76129)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76129

Previously, quantized_max_pool2d_cudnn was made available to the
frontend through torch.ops.quantized.max_pool2d.
We improve the integration by also making it available through
torch.max_pool2d, which is made possible by registering
quantized_max_pool2d_cudnn in native_functions.yaml under
quantized_max_pool2d, which is called in max_pool2d.

Ideally and ultimately, we will get rid of the quantized_max_pool2d
registration in native_functions.yaml, and directly register
quantized_max_pool2d and quantized_max_pool2d_cudnn under max_pool2d,
but current support for quantized dispatch keys blocks us from doing so.
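
As a rough usage sketch of what the integration enables (placeholder quantization parameters; assumes a build with quantized CUDA/cudnn support):

```python
import torch

# Placeholder quantization parameters; requires quantized CUDA (cudnn) support.
x = torch.randn(1, 3, 8, 8, device="cuda")
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.qint8)

# Previously this kernel was only reachable via torch.ops.quantized.max_pool2d;
# after this change torch.max_pool2d dispatches to it for quantized CUDA inputs.
out = torch.max_pool2d(qx, kernel_size=2)
```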

Test Plan:
```
python test/run_tests.py
```

Differential Revision: D35789078

Reviewed By: jerryzh168

Pulled By: dzdang

fbshipit-source-id: 5d8220255bfab663b4779b5d3c66dea9f79d8ee7
(cherry picked from commit c27164da29043f7dc9a4c27d24a93cd37162c23e)
2022-04-27 01:52:45 +00:00
e816e17655 [PyTorch] Add native fast path for transformer encoder inference (#76333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76333

The current PyTorch multi-head attention and transformer
implementations are slow. This should speed them up for inference.
ghstack-source-id: 154737857

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: cpuhrsch

Differential Revision: D35239925

fbshipit-source-id: 5a7eb8ff79bc6afb4b7d45075ddb2a24a6e2df28
2022-04-26 12:58:03 -04:00
f4200600e4 move Bazel version header generation to shared build structure (#75332)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75332

ghstack-source-id: 154678044

Test Plan: Rely on OSS CI.

Reviewed By: malfet

Differential Revision: D35434229

fbshipit-source-id: 7cdd33fa32d0c485f44477e414c24c9bc4b74963
(cherry picked from commit 60285c613e8703c52f36f0bf1178e35c04574ffa)
2022-04-25 17:51:30 +00:00
d78dd825ba define the caffe2_serialize target in Bazel (#75942)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75942

This also requires changes to the target definition and the xplat translator to get it working.
ghstack-source-id: 154678046

Test Plan: Verify locally and rely on CI.

Reviewed By: malfet

Differential Revision: D35704597

fbshipit-source-id: 6b0d9f5a044609b24dda656f80233ba6186c097f
(cherry picked from commit 6de43c5ca7a973c9f8b71f4d60d4d5e85cc2ba21)
2022-04-25 16:14:05 +00:00
2387efd356 Revert "[PyTorch] Add native fast path for transformer encoder inference"
This reverts commit b369b89f235f54bc9de85d768fb62ac4579681dc.

This has internal changes and should not have been landed via mergebot.

Ref: https://github.com/pytorch/pytorch/pull/75809#issuecomment-1108717166
2022-04-25 11:40:02 -04:00
b369b89f23 [PyTorch] Add native fast path for transformer encoder inference
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75809

The current PyTorch multi-head attention and transformer
implementations are slow. This should speed them up for inference.

Differential Revision: [D35239925](https://our.internmc.facebook.com/intern/diff/D35239925/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D35239925/)!

Approved by: https://github.com/ezyang
2022-04-25 06:11:36 +00:00
36420b5e8c Rename tools/codegen to torchgen (#76275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76275

In preparation for addressing
https://github.com/pytorch/pytorch/issues/73212

Diff was generated with:

```
git mv tools/codegen torchgen
git grep -l 'tools.codegen' | xargs sed -i 's/tools.codegen/torchgen/g'
sed -i "s/\${TOOLS_PATH}\/codegen/\${TORCH_ROOT}\/torchgen/g" caffe2/CMakeLists.txt
```

and manual edits to:

* tools/test/test_gen_backend_stubs.py
* torchgen/build.bzl
* torchgen/gen_backend_stubs.py

aka this diff:

```
 diff --git a/tools/test/test_gen_backend_stubs.py b/tools/test/test_gen_backend_stubs.py
index 3dc26c6d2d..104054575e 100644
 --- a/tools/test/test_gen_backend_stubs.py
+++ b/tools/test/test_gen_backend_stubs.py
@@ -9,7 +9,7 @@ from torchgen.gen_backend_stubs import run
 from torchgen.gen import _GLOBAL_PARSE_NATIVE_YAML_CACHE  # noqa: F401

 path = os.path.dirname(os.path.realpath(__file__))
-gen_backend_stubs_path = os.path.join(path, '../torchgen/gen_backend_stubs.py')
+gen_backend_stubs_path = os.path.join(path, '../../torchgen/gen_backend_stubs.py')

 # gen_backend_stubs.py is an integration point that is called directly by external backends.
 # The tests here are to confirm that badly formed inputs result in reasonable error messages.
 diff --git a/torchgen/build.bzl b/torchgen/build.bzl
index ed04e35a43..d00078a3cf 100644
 --- a/torchgen/build.bzl
+++ b/torchgen/build.bzl
@@ -1,6 +1,6 @@
 def define_targets(rules):
     rules.py_library(
-        name = "codegen",
+        name = "torchgen",
         srcs = rules.glob(["**/*.py"]),
         deps = [
             rules.requirement("PyYAML"),
@@ -11,6 +11,6 @@ def define_targets(rules):

     rules.py_binary(
         name = "gen",
-        srcs = [":codegen"],
+        srcs = [":torchgen"],
         visibility = ["//visibility:public"],
     )
 diff --git a/torchgen/gen_backend_stubs.py b/torchgen/gen_backend_stubs.py
index c1a672a655..beee7a15e0 100644
 --- a/torchgen/gen_backend_stubs.py
+++ b/torchgen/gen_backend_stubs.py
@@ -474,7 +474,7 @@ def run(
 ) -> None:

     # Assumes that this file lives at PYTORCH_ROOT/torchgen/gen_backend_stubs.py
-    pytorch_root = pathlib.Path(__file__).parent.parent.parent.absolute()
+    pytorch_root = pathlib.Path(__file__).parent.parent.absolute()
     template_dir = os.path.join(pytorch_root, "aten/src/ATen/templates")

     def make_file_manager(install_dir: str) -> FileManager:
```

run_all_fbandroid_tests

Test Plan: sandcastle

Reviewed By: albanD, ngimel

Differential Revision: D35770317

fbshipit-source-id: 153ac4a7fef15b1e750812a90bfafdbc8f1ebcdf
(cherry picked from commit c6d485d1d4648fa1c8a4c14c5bf3d8e899b9b4dd)
2022-04-25 01:38:06 +00:00
0a5e788ab2 [PyTorch] Add NestedTensorCPU and NestedTensorCUDA dispatch keys (#75808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75808

Just as it is often difficult to write a single kernel that can handle both CPU and CUDA, so can it be difficult to do the same for NestedTensor.
ghstack-source-id: 154171542

(Note: this ignores all push blocking failures!)

Test Plan: CI?

Reviewed By: bdhirsh

Differential Revision: D35603836

fbshipit-source-id: fb0ebb19d34531ed96ce176aca325f8e2b5f90e6
(cherry picked from commit 0bcd753f93c04256c1b745f84a74ecccf0dceef5)
2022-04-19 18:12:12 +00:00
ef50186a7d remove unused //:tools_jit target and dependency from generate-code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74837

This is no longer used.

Differential Revision: [D35187901](https://our.internmc.facebook.com/intern/diff/D35187901/)

Approved by: https://github.com/malfet
2022-04-17 20:54:46 +00:00
97c993ca7a [PyTorch] Add NestedTensor support functions for transformers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75491

Here are the NestedTensor kernels we'll need for the improved transformer implementation.

Differential Revision: [D35409275](https://our.internmc.facebook.com/intern/diff/D35409275/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D35409275/)!

Approved by: https://github.com/cpuhrsch
2022-04-14 16:30:23 +00:00
23b8414391 code-generate non-aliasing {view}_copy kernels (#73442)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73442

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D35016025

Pulled By: bdhirsh

fbshipit-source-id: 2a7f303ec76f5913b744c7822a531d55a57589c9
(cherry picked from commit 3abe13c2a787bcbe9c41b0a335c96e5a3d3642fb)
2022-04-11 19:48:55 +00:00
d92213f741 move setup_helpers programs to their own package (#74838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74838

This matches the structure of internal builds.
ghstack-source-id: 153272856

Test Plan: Should be a no-op, rely on CI to validate.

Reviewed By: ezyang

Differential Revision: D35187899

fbshipit-source-id: 44b51145df54c41836149704d0d84d4a882f158e
(cherry picked from commit af48ae5e7dd6ea7d6dc3acfb367d76508f7d6b0c)
2022-04-08 11:17:48 +00:00
0d7aad822e move Bazel //:tools_autograd to the //tools/autograd package (#74745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74745

This is to be consistent with internal builds.
ghstack-source-id: 152722873

Test Plan: Rely on CI.

Reviewed By: ezyang

Differential Revision: D35144475

fbshipit-source-id: f4cea54ceb05cd70de76c7a679466c92d063d15e
(cherry picked from commit a52691384ad6773cf6e7ebbb5709a4d2696875e8)
2022-04-06 16:18:37 +00:00
60729d02f1 remove unused nn_path from generate_code (#74563)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74563

This is used inconsistently in all the generate_code program
invocations. Nevertheless, nothing consumes this flag, so we can
safely remove it.

The code that consumed it was removed in #25353.
ghstack-source-id: 152249818

Test Plan: Should be a no-op, rely on CI.

Reviewed By: malfet

Differential Revision: D35053096

fbshipit-source-id: 3ad19e83ca14649b514dc163c3caff6cbd118e14
(cherry picked from commit a43f05bb43553249caac3c3479986cbc45d286ae)
2022-03-31 18:35:30 +00:00
40bf3cfeb7 use the shared //tools/codegen:gen in OSS Bazel (#74471)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74471

ghstack-source-id: 152438662

Test Plan: Rely on CI.

Reviewed By: malfet

Differential Revision: D35011683

fbshipit-source-id: 3729c6c7c1b323e253d9832bb51752af8ae2faa6
(cherry picked from commit 1727d8693fa07e308f7390e6f645516dc9dee6a9)
2022-03-31 16:01:42 +00:00
79307fbde0 use the //tools/codegen target in Bazel (#74465)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74465

This requires adding py_library and its PyPI dependency provider
"requirement".
ghstack-source-id: 152438643

Test Plan: Rely on CI to validate.

Reviewed By: malfet

Differential Revision: D35009795

fbshipit-source-id: 424c4968474b3c2fb37d2c7dba932b37605a63f7
(cherry picked from commit 91e442c3bf0e204b0fb6c98405aaaa7308011511)
2022-03-31 12:54:14 +00:00
ea44645c9a Revert "Allow specifying tags for aten operators in native_functions.yaml"
This reverts commit 1dab71ab258df5168bb635530a820625f9d4b522.

Reverted https://github.com/pytorch/pytorch/pull/72549 on behalf of https://github.com/malfet
2022-03-28 18:04:38 +00:00
1dab71ab25 Allow specifying tags for aten operators in native_functions.yaml
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72549

Approved by: https://github.com/ezyang
2022-03-25 21:17:52 +00:00
3547f20872 Land remaining parts of Torchscript Lazy Tensor backend (#74111)
Summary:
Also enables bazel build to run lazy codegen.  Bazel (oss) build feeds off the same filelists as cmake/buck (build_variables.bzl), so enabling it is easier than keeping it disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74111

Test Plan: Run CI and verify test_lazy_ops is running via OSS cmake builds

Reviewed By: bdhirsh

Differential Revision: D34772403

fbshipit-source-id: 8a63f58b9536e6ac1be530667932176ef2549496
(cherry picked from commit e807ffb1918853d10b924fdc24f85ee5b1a39021)
2022-03-22 23:14:03 +00:00
44a8d4d998 Add lazy tensor unit tests, disabled (#74309)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74309

Since the test file is large, it can be landed on its own and then switched on
in the diff that actually builds lazy tensor code.

Test Plan: verify CI passes

Reviewed By: desertfire

Differential Revision: D34928619

fbshipit-source-id: cd556155326f7fb55b3f29031f80bc36c936d565
(cherry picked from commit 60945adbefb6a8d19f89e330f8b344d076b13bfc)
2022-03-17 15:31:26 +00:00
72b1194464 Run lazy tensor codegen in generate_code.py (#73996)
Summary:
Hooks into existing autograd codegen script (generate_code.py) to take advantage of its integrations into buck/cmake/bazel.

Adds a new option (--gen_lazy_ts_backend) to generate_code.py, calling it from the CMake OSS build and the fbcode build, but not from other internal xplat/ovrsource builds (these could be opted in later).
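
A minimal sketch of how such an opt-in flag can be wired up (illustrative argparse code, not the actual generate_code.py):

```python
import argparse

parser = argparse.ArgumentParser(description="autograd/lazy codegen driver (sketch)")
parser.add_argument(
    "--gen_lazy_ts_backend",
    action="store_true",
    help="also generate the TorchScript lazy-tensor backend (e.g. LazyIr.h)",
)
args = parser.parse_args()

if args.gen_lazy_ts_backend:
    # The real script would invoke the lazy codegen entry point here; builds
    # that don't pass the flag (xplat/ovrsource) simply skip this step.
    print("running lazy TS backend codegen")
```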

Bazel support is added in a later diff.

Includes one generated file (torch/csrc/lazy/generated/LazyIr.h) in a unit test (test/cpp/lazy/test_ir.cpp) to partially verify the generator is working, but does not compile the remaining output sources from the generator yet, as they depend on other files not yet landed from the lazy_tensor_staging branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73996

Test Plan: OSS/internal CI - verify all builds are working and test_ir.cpp compiles LazyIr.h

Reviewed By: ezyang

Differential Revision: D34408536

fbshipit-source-id: 8af0aea3b95d81eccafc17d64390d70ddd176515
(cherry picked from commit f930612f2bad61c76eb02d85cfbec9f33a1459dc)
2022-03-17 15:31:26 +00:00
484c0de670 Minimal NestedTensor (#72881)
Summary:
This PR adds a minimal version of a NestedTensor. It introduces the general harness future development can be built around.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72881

Reviewed By: albanD

Differential Revision: D34259177

Pulled By: cpuhrsch

fbshipit-source-id: 0245c36f603424e20f3b09651043c207f526d760
(cherry picked from commit 10764e8d427f29b364567e4cbc86ed73c3933158)
2022-03-02 16:31:51 +00:00
ce7910ba81 ufunc codegen (#65851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65851

Design doc: https://docs.google.com/document/d/12rtlHnPUpaJ-I52Iob3L0WA3rKRr_OY7fXqeCvn2MVY/edit

First read the design doc to understand the user syntax. In this PR, we have converted add to use ufunc codegen; most of the cpp changes are deleting the preexisting implementations of add, and ufunc/add.h are the new implementations in the ufunc format.

The bulk of this PR is in the new codegen machinery. Here's the order to read the files:

* `tools/codegen/model.py`
  * Some self-explanatory utility classes: `ScalarType`, `DTYPE_CLASSES`
  * New classes for representing ufunc entries in `native_functions.yaml`: `UfuncKey` and `UfuncInnerLoop`, as well as parsing logic for these entries. UfuncKey has some unusual entries (e.g., CPUScalar) that don't show up in the documentation (more on these below).
  * A predicate `is_ufunc_dispatch_key` for testing which dispatch keys should get automatically generated when an operator opts into ufuncs (CPU and CUDA, for now!)
* `tools/codegen/api/types.py`
  * More self-explanatory utility stuff: ScalarTypeToCppMapping mapping ScalarType to CppTypes; Binding.rename for changing the name of a binding (used when we assign constructor variables to member variables inside CUDA functors)
  * New VectorizedCType, representing `at::vec::Vectorized<T>`. This is used inside vectorized CPU codegen.
  * New `scalar_t` and `opmath_t` BaseCppTypes, representing template parameters that we work with when doing codegen inside ufunc kernel loops (e.g., where you previously had Tensor, now you have `scalar_t`)
  * `StructuredImplSignature` represents a `TORCH_IMPL_FUNC` definition, and straightforwardly follows from preexisting `tools.codegen.api.structured`
* `tools/codegen/translate.py` - Yes, we use translate a LOT in this PR. I improved some of the documentation, the only substantive changes are adding two new conversions: given a `scalar_t` or a `const Scalar&`, make it convertible to an `opmath_t`
* `tools/codegen/api/ufunc.py`
  * OK, now we're at the meaty stuff. This file represents the calling conventions of three important concepts in ufunc codegen, which we'll describe shortly. All of these APIs are relatively simple, since there aren't any complicated types by the time you get to kernels.
  * stubs are the DispatchStub trampolines that CPU kernels use to get to their vectorized versions. They drop all Tensor arguments (as they are in TensorIterator) but otherwise match the structured calling convention
  * ufuncs are the inner loop template functions that you wrote in ufunc/add.h which do the actual computation in question. Here, all the Tensors and Scalars have been converted into the computation type (`opmath_t` in CUDA, `scalar_t` in CPU)
  * ufunctors are a CUDA-only concept representing functors that take some of their arguments on a host-side constructor, and the rest in the device-side apply. Once again, Tensors and Scalars are converted into the computation type, `opmath_t`, but for clarity all the functions take `scalar_t` as argument (as this is the type that is most salient at the call site). Because the constructor and apply are code generated separately, `ufunctor_arguments` returns a teeny struct `UfunctorBindings`
* `tools/codegen/dest/ufunc.py` - the workhorse. This gets its own section below.
* `tools/codegen/gen.py` - just calling out to the new dest.ufunc implementation to generate UfuncCPU_add.cpp, UFuncCPUKernel_add.cpp and UfuncCUDA_add.cu files per ufunc operator. Each of these files does what you expect (small file that registers kernel and calls stub; CPU implementation; CUDA implementation). There is a new file manager for UFuncCPUKernel files as these need to get replicated by cmake for vectorization. One little trick to avoid recompilation is we directly replicate code generated forward declarations in these files, to reduce the number of headers we depend on (this is codegen, we're just doing the preprocessor's job!)
* I'll talk about build system adjustments below.

OK, let's talk about tools/codegen/dest/ufunc.py. This file can be roughly understood in two halves: one for CPU code generation, and the other for CUDA code generation.

**CPU codegen.** Here's roughly what we want to generate:

```
// in UfuncCPU_add.cpp
using add_fn = void (*)(TensorIteratorBase&, const at::Scalar&);
DECLARE_DISPATCH(add_fn, add_stub);
DEFINE_DISPATCH(add_stub);

TORCH_IMPL_FUNC(ufunc_add_CPU)
(const at::Tensor& self, const at::Tensor& other, const at::Scalar& alpha, const at::Tensor& out) {
  add_stub(device_type(), *this, alpha);
}

// in UfuncCPUKernel_add.cpp
void add_kernel(TensorIteratorBase& iter, const at::Scalar& alpha) {
  at::ScalarType st = iter.common_dtype();
  RECORD_KERNEL_FUNCTION_DTYPE("add_stub", st);
  switch (st) {
    AT_PRIVATE_CASE_TYPE("add_stub", at::ScalarType::Bool, bool, [&]() {
      auto _s_alpha = alpha.to<scalar_t>();
      cpu_kernel(iter, [=](scalar_t self, scalar_t other) {
        return ufunc::add(self, other, _s_alpha);
      });
    })

    AT_PRIVATE_CASE_TYPE(
        "add_stub", at::ScalarType::ComplexFloat, c10::complex<float>, [&]() {
          auto _s_alpha = alpha.to<scalar_t>();
          auto _v_alpha = at::vec::Vectorized<scalar_t>(_s_alpha);
          cpu_kernel_vec(
              iter,
              [=](scalar_t self, scalar_t other) {
                return ufunc::add(self, other, _s_alpha);
              },
              [=](at::vec::Vectorized<scalar_t> self,
                  at::vec::Vectorized<scalar_t> other) {
                return ufunc::add(self, other, _v_alpha);
              });
        })

   ...
```

The most interesting change in the generated code is that what was previously an `AT_DISPATCH` macro invocation is now an unrolled loop. This makes it easier to vary behavior per-dtype (you can see in this example that the entries for bool and complex float differ) without having to add extra conditionals on top.

Otherwise, to generate this code, we have to hop through several successive API changes:

* In TORCH_IMPL_FUNC(ufunc_add_CPU), go from StructuredImplSignature to StubSignature (call the stub). This is normal argument massaging in the classic translate style.
* In add_kernel, go from StubSignature to UfuncSignature. This is nontrivial, because we must do various conversions outside of the inner kernel loop. These conversions are done by hand, setting up the context appropriately, and then the final ufunc call is done using translate. (BTW, I introduce a new convention here, call on a Signature, for code generating a C++ call, and I think we should try to use this convention elsewhere)

The other piece of nontrivial logic is the reindexing by dtype. This reindexing exists because the native_functions.yaml format is indexed by UfuncKey:

```
   Generic: add (AllAndComplex, BFloat16, Half)
   ScalarOnly: add (Bool)
```

but when we do code generation, we case on dtype first, and then we generate a `cpu_kernel` or `cpu_kernel_vec` call. We also don't care about CUDA code generation here (which Generic hits). To do this, we lower these keys into two low-level keys, CPUScalar and CPUVector, which represent the CPU scalar and CPU vectorized ufuncs, respectively (Generic maps to CPUScalar and CPUVector, while ScalarOnly maps to CPUScalar only). Reindexing then gives us:

```
    AllAndComplex:
        CPUScalar: add
        CPUVector: add
    Bool:
        CPUScalar: add
    ...
```

which is a good format for code generation, but too wordy to force native_functions.yaml authors to write. Note that when reindexing, it is possible for there to be a conflicting definition for the same dtype; we just define a precedence order and have one override the other, so that it is easy to specialize on a particular dtype if necessary. Also note that because CPUScalar/CPUVector are part of UfuncKey, technically you can manually specify them in native_functions.yaml, although I don't expect this functionality to be used.
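
A simplified Python sketch of the reindexing described above (toy data structures, not the actual tools/codegen/dest/ufunc.py implementation):

```python
# UfuncKey entries roughly as they appear in native_functions.yaml, keyed by
# the "high level" keys; each maps to (ufunc_name, dtypes).
yaml_entries = {
    "Generic": ("add", {"Float", "ComplexFloat", "BFloat16", "Half"}),
    "ScalarOnly": ("add", {"Bool"}),
}

# High-level keys lower to low-level CPU keys; later entries take precedence
# if the same dtype/key pair is defined twice.
LOWERING = {"Generic": ["CPUScalar", "CPUVector"], "ScalarOnly": ["CPUScalar"]}

reindexed = {}  # dtype -> {low-level key -> ufunc name}
for high_key, (name, dtypes) in yaml_entries.items():
    for dtype in dtypes:
        for low_key in LOWERING[high_key]:
            reindexed.setdefault(dtype, {})[low_key] = name

print(reindexed["Bool"])   # {'CPUScalar': 'add'}
print(reindexed["Float"])  # {'CPUScalar': 'add', 'CPUVector': 'add'}
```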

**CUDA codegen.** CUDA code generation has many of the same ideas as CPU codegen, but it needs to know about functors, and stubs are handled slightly differently.  Here is what we want to generate:

```
template <typename scalar_t>
struct CUDAFunctorOnSelf_add {
  using opmath_t = at::opmath_type<scalar_t>;
  opmath_t other_;
  opmath_t alpha_;
  CUDAFunctorOnSelf_add(opmath_t other, opmath_t alpha)
      : other_(other), alpha_(alpha) {}
  __device__ scalar_t operator()(scalar_t self) {
    return ufunc::add(static_cast<opmath_t>(self), other_, alpha_);
  }
};

... two more functors ...

void add_kernel(TensorIteratorBase& iter, const at::Scalar & alpha) {
  at::ScalarType st = iter.common_dtype();
  RECORD_KERNEL_FUNCTION_DTYPE("ufunc_add_CUDA", st);
  switch (st) {
    AT_PRIVATE_CASE_TYPE("ufunc_add_CUDA", at::ScalarType::Bool, bool, [&]() {
      using opmath_t = at::opmath_type<scalar_t>;
      if (false) {
      } else if (iter.is_cpu_scalar(1)) {
        CUDAFunctorOnOther_add<scalar_t> ufunctor(
            iter.scalar_value<opmath_t>(1), (alpha).to<opmath_t>());
        iter.remove_operand(1);
        gpu_kernel(iter, ufunctor);
      } else if (iter.is_cpu_scalar(2)) {
        CUDAFunctorOnSelf_add<scalar_t> ufunctor(
            iter.scalar_value<opmath_t>(2), (alpha).to<opmath_t>());
        iter.remove_operand(2);
        gpu_kernel(iter, ufunctor);
      } else {
        gpu_kernel(iter, CUDAFunctor_add<scalar_t>((alpha).to<opmath_t>()));
      }
    })

   ...

REGISTER_DISPATCH(add_stub, &add_kernel);

TORCH_IMPL_FUNC(ufunc_add_CUDA)
(const at::Tensor& self,
 const at::Tensor& other,
 const at::Scalar& alpha,
 const at::Tensor& out) {
  add_kernel(*this, alpha);
}

```

The functor business is the bulk of the complexity. Like CPU, we decompose the CUDA implementation into three low-level keys: CUDAFunctor (normal; all CUDA kernels will have this), and CUDAFunctorOnSelf/CUDAFunctorOnOther (these support Tensor-Scalar specializations when the Scalar lives on CPU). Both Generic and ScalarOnly provide ufuncs for CUDAFunctor, but for us to also lift these into Tensor-Scalar specializations, the operator itself must be eligible for Tensor-Scalar specialization. At the moment, this is hardcoded to be all binary operators, but in the future we can use tags in native_functions.yaml to disambiguate (or perhaps expand codegen to handle n-ary operators).

The reindexing process not only reassociates ufuncs by dtype, but it also works out if Tensor-Scalar specializations are needed and codegens the ufunctors necessary for the level of specialization here (`compute_ufunc_cuda_functors`). Generating the actual kernel (`compute_ufunc_cuda_dtype_body`) just consists of, for each specialization, constructing the functor and then passing it off to `gpu_kernel`. Most of the hard work is in functor generation, where we take care to make sure `operator()` has the correct input and output types (which `gpu_kernel` uses to arrange for memory accesses to the actual CUDA tensor; if you get these types wrong, your kernel will still work, it will just run very slowly!)

There is one big subtlety with CUDA codegen: this won't work:

```
   Generic: add (AllAndComplex, BFloat16, Half)
   ScalarOnly: add_bool (Bool)
```

This is because, even though there are separate Generic/ScalarOnly entries, we only generate a single functor to cover ALL dtypes in this case, and the functor has the ufunc name hardcoded into it. You'll get an error if you try to do this; to fix it, just make sure the ufunc is named the same consistently throughout. In the code, you see this because after testing for the short circuit case (when a user provided the functor themselves), we squash all the generic entries together and assert their ufunc names are the same. Hypothetically, if we generated a separate functor per dtype, we could support differently named ufuncs but... why would you do that to yourself. (One piece of nastiness is that the native_functions.yaml syntax doesn't stop you from shooting yourself in the foot.)

A brief word about CUDA stubs: technically, they are not necessary, as there is no CPU/CPUKernel style split for CUDA kernels (so, if you look, the structured impl actually calls add_kernel directly). However, there is some code that still makes use of CUDA stubs (in particular, I use the stub to conveniently reimplement sub in terms of add), so we still register it. This might be worth revisiting at a later point in time.

**Build system changes.** If you are at FB, you should review these changes in fbcode, as there are several changes in files that are not exported to ShipIt.

The build system changes in this patch are substantively complicated by the fact that I have to implement these changes five times:

* OSS cmake build
* OSS Bazel build
* FB fbcode Buck build
* FB xplat Buck build (selective build)
* FB ovrsource Buck build

Due to technical limitations in the xplat Buck build related to selective build, it is required that you list every ufunc header manually (this is done in tools/build_variables.bzl)

The OSS cmake changes are entirely in cmake/Codegen.cmake: there is a new set of files, cpu_vec_generated (corresponding to UfuncCPUKernel files), which is wired up in the same way as the other files. These files are different because they need to get compiled multiple times under different vectorization settings. I adjust the codegen, slightly refactoring the inner loop into its own function so I can use a different base path calculation depending on whether the file is traditional (in the native/cpu folder) or generated (new stuff from this diff).

The Bazel/Buck changes are organized around tools/build_variables.bzl, which contains the canonical list of ufunc headers (aten_ufunc_headers), and tools/ufunc_defs.bzl (added to the ShipIt export list in D34465699), which defines a number of functions that compute the generated cpu, cpu kernel and cuda files based on the headers list. For convenience, these functions take a genpattern (a string with a {} for interpolation) which can be used to easily reformat the list of files into target form, which is commonly needed in the build systems.
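
Roughly, the ufunc_defs.bzl helpers described here look like the following sketch (simplified, with approximate names; it happens to be valid Python as well as Starlark):

```python
aten_ufunc_headers = [
    "aten/src/ATen/native/ufunc/add.h",
]

def _ufunc_names(headers):
    # "aten/src/ATen/native/ufunc/add.h" -> "add"
    return [h.split("/")[-1][:-len(".h")] for h in headers]

def aten_ufunc_generated_cpu_sources(genpattern):
    # genpattern is a string with a {} placeholder, e.g. ":gen_aten[{}]"
    return [genpattern.format("UfuncCPU_{}.cpp".format(n)) for n in _ufunc_names(aten_ufunc_headers)]

def aten_ufunc_generated_cuda_sources(genpattern):
    return [genpattern.format("UfuncCUDA_{}.cu".format(n)) for n in _ufunc_names(aten_ufunc_headers)]

print(aten_ufunc_generated_cpu_sources("aten/src/ATen/{}"))
# ['aten/src/ATen/UfuncCPU_add.cpp']
```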

The split between build_variables.bzl and ufunc_defs.bzl is required because build_variables.bzl is executed by a conventional Python interpreter as part of the OSS cmake, but we require Skylark features to implement the functions in ufunc_defs.bzl (I did some quick Googling but didn't find a lightweight way to run the Skylark interpreter in open source.)

With these new file lists, the rest of the build changes are mostly inserting references to these files wherever necessary; in particular, cpu kernel files have to be worked into the multiple vectorization build flow (intern_build_aten_ops in OSS Bazel). Most of the subtlety relates to selective build. Selective build requires operator files to be copied per overall selective build; as dhruvbird explains to me, glob expansion happens during the action graph phase, but the selective build handling of TEMPLATE_SOURCE_LIST is referencing the target graph. In other words, we can't use a glob to generate deps for another rule, because we need to copy files from wherever (including generated files) to a staging folder so the rules can pick them up.

It can be somewhat confusing to understand which bzl files are associated with which build. Here are the relevant mappings for files I edited:

* Used by everyone - tools/build_variables.bzl, tools/ufunc_defs.bzl
* OSS Bazel - aten.bzl, BUILD.bazel
* FB fbcode Buck - TARGETS
* FB xplat Buck - BUCK, pt_defs.bzl, pt_template_srcs.bzl
* FB ovrsource Buck - ovrsource_defs.bzl, pt_defs.bzl

Note that pt_defs.bzl is used by both xplat and ovrsource. This leads to the "tiresome" handling for enabled backends, as selective build is CPU only, but ovrsource is CPU and CUDA.

BTW, while I was at it, I beefed up fb/build_arvr.sh to also do a CUDA ovrsource build, which was not triggered previously.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31306586

Pulled By: ezyang

fbshipit-source-id: 210258ce83f578f79cf91b77bfaeac34945a00c6
(cherry picked from commit d65157b0b894b6701ee062f05a5f57790a06c91c)
2022-03-01 00:33:40 +00:00
763ad1bf25 (2/2) Make TorchScript Preserve Fully Qualified Class Name for Python Exceptions: frontend change (#72899)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72899

Reland D33282878 (911d527b87). This is the frontend change.
ghstack-source-id: 149204031

Test Plan: Refer to D33282878 (911d527b87). Also check CI

Reviewed By: gmagogsfm

Differential Revision: D34252127

fbshipit-source-id: 27b17ddd4d05d904eb91fd9ee094d9121f00e388
(cherry picked from commit 1d276baca308110ac40111ccd622400b3bbdc864)
2022-02-16 03:45:15 +00:00
133461e5d6 Move CUDA linalg code to its own subfolder (#72304)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72304

This is a no-op change that simply moves files around in preparation for moving linear algebra into its own dynamically bindable module.
This also simplifies the torch_cuda_cu build rules, as all the linalg files it needs are now in their own folder.
Bazel CUDA rules are in some weird disarray (a wildcard needed to be added there, as they ignore files mentioned in build_variables.bzl), and a similar wildcard needs to be added to the internal build system.

Test Plan: Imported from OSS

Reviewed By: dagitses, ngimel

Differential Revision: D33992796

Pulled By: malfet

fbshipit-source-id: 3f4fa1c224016d03e1a982a7ae5ac7807bc772e2
(cherry picked from commit 6a5a1b0c3f306a8915a45bcdf2c51f15b02d8a14)
2022-02-07 17:55:50 +00:00
286f5a51f9 move //c10:tests target to the shared //c10/test package (#70928)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70928

ghstack-source-id: 148159366

Test Plan: Ensured that the same number of tests are found and run.

Reviewed By: malfet

Differential Revision: D33455272

fbshipit-source-id: fba1e3409b14794be3e6fe4445c56dd5361cfe9d
(cherry picked from commit b45fce500aa9c3f69915bf0857144ba6d268e649)
2022-02-03 20:14:57 +00:00
847dbb8684 CMake: Clean up unused definitions (#69216)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69216

This cleans up 4 pre-processor defines not used by any code:
- HAVE_GCC_GET_CPUID
- USE_GCC_GET_CPUID
- USE_AVX
- USE_AVX2

`cpuid` isn't used in PyTorch any more, we only use `cpuinfo`.
`USE_AVX*` is also not used, instead `HAVE_*_CPU_DEFINITIONS` tells
you which `CPU_CAPABILITY` flags are being compiled.

There is also `fbgemm`'s code path adding `third_party` as an include
path, despite `fbgemm` having a dedicated include directory and a
CMake setup that properly includes it.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33794424

Pulled By: malfet

fbshipit-source-id: 99d504af088818d4a26c2f6ce67ec0d59a5eb703
(cherry picked from commit 2e099d41f0e2f7d96c6013ac83223a75f4e4f862)
2022-01-31 22:49:11 +00:00
2bb6a4f437 Generate aten_interned_strings.h automatically (#69407)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69407

This generates aten_interned_strings.h from `native_functions.yaml`
which is more like how it was originally done. The items deleted from
`interned_strings.h` are duplicates that need to be removed in order
for the code to compile; some of the remaining items may still be out
of date but it is fairly benign even if that's the case.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32923636

Pulled By: albanD

fbshipit-source-id: a0fd6b3714e70454c5f4ea9b19da5e047d2a4687
2022-01-18 08:29:54 -08:00
1bc3571078 [pytorch][PR] Add ability for a mobile::Module to save as flatbuffer (#70201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70201

Included functions:
save_mobile_module -> saves a mobile::Module to flatbuffer
load_mobile_module_from_file -> loads a flatbuffer into mobile::Module
parse_mobile_module -> parses from bytes or deserialized flatbuffer module object

Compared to previous attempts, this diff only adds flatbuffer to cmake target and leaves fbcode/xplat ones unchanged.

Test Plan: unittest

Reviewed By: malfet, gmagogsfm

Differential Revision: D33239362

fbshipit-source-id: b9ca36b83d6af2d78cc50b9eb9e2a6fa7fce0763
2022-01-12 16:30:39 -08:00
e1aa5db108 Bazel: Only run ATen codegen once (#70147)
Summary:
Due to a merge conflict, the new bazel cuda build does something
rather obnoxious. It runs ATen codegen with `--per-operator-headers`
enabled and extracts a subset of the generated files; then calls it
again without the flag to extract the CUDA files.

This PR instead calls the codegen once but keeps track of what is
CPU and what is CUDA in separate lists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70147

Reviewed By: VitalyFedyunin

Differential Revision: D33413020

Pulled By: malfet

fbshipit-source-id: 4b502c38a209d1aa63d715e2336df6fc5aac2212
2022-01-05 06:56:52 -08:00
c34aa715fa AT_MKL_SEQUENTIAL and build changes (#70259)
Summary:
Re-land of  https://github.com/pytorch/pytorch/pull/69419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70259

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D33246757

Pulled By: ngimel

fbshipit-source-id: 738f8558d4cad6752be14108f9931ec3514f6682
2021-12-22 13:52:23 -08:00
e35bf56461 [Bazel] Add CUDA build to CI (#66241)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35316
On master, bazel cuda build is disabled due to lack of a proper `cu_library` rule. This PR:
- Add `rules_cuda` to the WORKSPACE and forward `cu_library` to `rules_cuda`.
- Use simple local cuda and cudnn repositories (adopted from TRTorch) for cuda 11.3.
- Fix the currently broken cuda build.
- Enable the cuda build in CI, not just for the `:torch` target but for all the test binaries, to catch undefined symbols.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66241

Reviewed By: ejguan

Differential Revision: D31544091

Pulled By: malfet

fbshipit-source-id: fd3c34d0e8f80fee06f015694a4c13a8e9e12206
2021-12-17 13:44:29 -08:00
02c63c3006 extract out c10 targets to the c10 package (#69992)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69992

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D33141013

fbshipit-source-id: e5edd6bd5b5834ac27390ba940ebed9148512c8d
2021-12-16 13:11:49 -08:00
4829dcea09 Codegen: Generate separate headers per operator (#68247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68247

This splits `Functions.h`, `Operators.h`, `NativeFunctions.h` and
`NativeMetaFunctions.h` into separate headers per operator base name.
With `at::sum` as an example, we can include:
```cpp
#include <ATen/ops/sum.h>         // Like Functions.h
#include <ATen/ops/sum_ops.h>     // Like Operators.h
#include <ATen/ops/sum_native.h>  // Like NativeFunctions.h
#include <ATen/ops/sum_meta.h>    // Like NativeMetaFunctions.h
```

The umbrella headers are still being generated, but all they do is
include from the `ATen/ops` folder.

Further, `TensorBody.h` now only includes the operators that have
method variants. Which means files that only include `Tensor.h` don't
need to be rebuilt when you modify function-only operators. Currently
there are about 680 operators that don't have method variants, so this
is potentially a significant win for incremental builds.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32596272

Pulled By: albanD

fbshipit-source-id: 447671b2b6adc1364f66ed9717c896dae25fa272
2021-12-14 06:40:08 -08:00
b08d64202a Remove THGeneral (#69041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69041

`TH_CONCAT_{N}` is still being used by THP so I've moved that into
its own header, but all the compiled code is gone.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872477

Pulled By: ngimel

fbshipit-source-id: 06c82d8f96dbcee0715be407c61dfc7d7e8be47a
2021-12-13 16:14:28 -08:00
17f3179d60 Back out "[pytorch][PR] Add ability for a mobile::Module to save as flatbuffer" (#69796)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69796

(Note: this ignores all push blocking failures!)

Test Plan: External CI + Sandcastle

Reviewed By: zhxchen17

Differential Revision: D33032671

fbshipit-source-id: dbf6690e960e25d6a5f19043cbe792add2acd7ef
2021-12-10 21:29:53 -08:00
9962bfb3c9 Remove THTensor (#69040)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69040

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872478

Pulled By: ngimel

fbshipit-source-id: f93e16509d64308d91e374744410a6a811e7f4e3
2021-12-10 02:29:11 -08:00
b2e79ed5ec Remove WindowsTorchApiMacro.h in favor of Export.h (#69585)
Summary:
Follow up to https://github.com/pytorch/pytorch/issues/68095

This also changes the files from the ATen folder to include c10's `Export.h` instead since they can't ever be exporting `TORCH_PYTHON_API`.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69585

Reviewed By: mrshenli

Differential Revision: D32958594

Pulled By: albanD

fbshipit-source-id: 1ec7ef63764573fa2b486928955e3a1172150061
2021-12-09 17:30:09 -08:00