88 Commits

Author SHA1 Message Date
8a67daf283 [BE][Easy] enable postponed annotations in tools (#129375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375
Approved by: https://github.com/malfet
2024-06-29 09:23:35 +00:00
a32ce5ce34 Revert "[BE][Easy] enable postponed annotations in tools (#129375)"
This reverts commit 59eb2897f1745f513edb6c63065ffad481c4c8d0.

Reverted https://github.com/pytorch/pytorch/pull/129375 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))
2024-06-29 00:44:25 +00:00
59eb2897f1 [BE][Easy] enable postponed annotations in tools (#129375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375
Approved by: https://github.com/malfet
2024-06-28 15:37:54 +00:00
35ea5c6b22 [3/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torchgen (#127124)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in the PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127124
Approved by: https://github.com/Skylion007
ghstack dependencies: #127122, #127123
2024-05-25 19:20:03 +00:00
c5fafe9f48 [BE]: TRY002 - Ban raising vanilla exceptions (#124570)
Adds a ruff lint rule to ban raising raw exceptions. Most of these should at the very least be runtime errors, value errors, type errors, or some other more specific error. There are hundreds of instances of these bad exception types already in the codebase, so I have noqa'd most of them. Hopefully this error code will get committers to rethink what exception type they should raise when they submit a PR.

I also encourage people to gradually go and fix all the existing noqas that have been added so they can be removed over time and our exception typing can be improved.
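
As a minimal sketch of what the rule nudges toward, a bare `raise Exception(...)` can usually be replaced by a more specific built-in type. The function `parse_port` and its messages are hypothetical and are not taken from the PyTorch codebase:

```python
def parse_port(value: str) -> int:
    """Parse a TCP port number, raising specific exception types."""
    if not isinstance(value, str):
        # Wrong argument type: TypeError instead of a bare Exception.
        raise TypeError(f"expected str, got {type(value).__name__}")
    if not value.isdigit():
        # Right type, unusable content: ValueError.
        raise ValueError(f"not a decimal integer: {value!r}")
    port = int(value)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port
```

Callers can then catch `ValueError` or `TypeError` separately instead of a blanket `except Exception`.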

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124570
Approved by: https://github.com/ezyang
2024-04-21 22:26:40 +00:00
ee5d981249 [BE]: Enable RUFF PERF402 and apply fixes (#115505)
* Enable PERF402. Makes code more efficient and succinct by removing useless list copies that could be accomplished either via a list constructor or extend call. All test cases have noqa added since performance is not as sensitive in that folder.
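
A small illustration of the pattern this rule targets, assuming plain Python lists. The function names are made up for the example:

```python
def copy_with_loop(items):
    # Element-by-element copy: the kind of loop PERF402 flags.
    result = []
    for item in items:
        result.append(item)
    return result


def copy_with_constructor(items):
    # Same result through the list constructor.
    return list(items)


def append_all(target, items):
    # When the target list already exists, extend() replaces the loop.
    target.extend(items)
    return target
```

All three produce the same elements in the same order; the last two avoid the explicit Python-level loop.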

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115505
Approved by: https://github.com/malfet
2023-12-20 18:01:24 +00:00
376217cc0b [BE]: Apply FURB145 to make code more readable and idiomatic. (#112990)
Testing out some new rules that are in beta; I think I will apply this one codebase-wide once it's out of preview. Replaces the hack of using `[:]` to copy a list with the proper `copy` method. More efficient and more readable.
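
For illustration only (the variable names are arbitrary), the two spellings below both make a shallow copy of a list; the second is the form the rule prefers:

```python
original = [1, 2, 3]

# Slice-based copy: works, but the intent is easy to miss.
sliced = original[:]

# Explicit copy method: same shallow copy, clearer intent.
copied = original.copy()

# Mutating the copy leaves the original untouched.
copied.append(4)
```
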
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112990
Approved by: https://github.com/ezyang
2023-11-06 13:15:04 +00:00
01b662bafe [gen_operators_yaml] add arguments to control include_all_overloads (#108396)
Summary:
In SelectiveBuildOperator, we can specify the argument `include_all_overloads`. If True, all overloaded operators (for example, `aten::to.dtype_layout` and `aten::to.prim_Device` are considered overloaded operators of `aten::to`) will be built and linked into the final binary. This can significantly increase the final binary size, which could be a deal breaker for on-device deployment.

In this diff, we make backward-compatible changes to add new arguments `--not-include-all-overloads-static-root-ops` and `--not-include-all-overloads-closure-ops`. When they are set, we set the `include_all_overloads` flag to False for static root ops and closure ops, and rely on the code analyzer to decide which overloaded operators are actually used.
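
The effect of the flag can be sketched in a few lines. This is a simplified model written for this note, not the `SelectiveBuildOperator` implementation; `base_operator_name` and `is_selected` are hypothetical helpers:

```python
def base_operator_name(qualified_name: str) -> str:
    """Strip the overload suffix: 'aten::to.dtype_layout' -> 'aten::to'."""
    return qualified_name.split(".", 1)[0]


def is_selected(candidate: str, root_op: str, include_all_overloads: bool) -> bool:
    """Decide whether `candidate` is kept when `root_op` is requested."""
    if include_all_overloads:
        # Any overload sharing the base name is kept.
        return base_operator_name(candidate) == base_operator_name(root_op)
    # Otherwise only the exact operator (including overload name) is kept.
    return candidate == root_op
```

With `include_all_overloads=True`, requesting `aten::to` keeps `aten::to.prim_Device` as well; with `False`, only exact matches survive, which is where the binary-size saving would come from.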

Test Plan:
- unit test
```
buck test //xplat/caffe2/tools:gen_operators_yaml_test
```
- See test plan in D48771544 where we reduce the shared lib file `libmrengine.lib` from 16653072 bytes to 13686032 bytes.
- See detailed document: https://fburl.com/gdoc/mc93h6kb

Reviewed By: larryliu0820

Differential Revision: D48772302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108396
Approved by: https://github.com/larryliu0820
2023-09-02 17:37:36 +00:00
6d43c89f37 [BE]: Update Ruff to 0.0.280 (#105724)
Removes unused loop values in Python dictionary iteration. Automated fix from Ruff master.
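
A hedged example of the kind of rewrite described, using made-up function names: when only the keys are needed, iterating over `.items()` and discarding the value can be replaced by iterating over the dictionary directly.

```python
def names_via_items(table):
    # Before: the value is unpacked but never used.
    names = []
    for name, _unused in table.items():
        names.append(name)
    return names


def names_via_keys(table):
    # After: iterate over the keys only.
    names = []
    for name in table:
        names.append(name)
    return names
```
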

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105724
Approved by: https://github.com/ezyang, https://github.com/janeyx99
2023-07-22 23:03:34 +00:00
14d87bb5ff [BE] Enable ruff's UP rules and autoformat tools and scripts (#105428)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105428
Approved by: https://github.com/albanD, https://github.com/soulitzer, https://github.com/malfet
2023-07-19 01:24:44 +00:00
24f882369a [EdgeML] Remove dependency on all_mobile_model_configs.yaml from pt_operator_library BUCK rule (#99122)
Summary: Removes the dependency on the unified YAML file

Test Plan:
Smoke test via some caffe2 tests.

```
buck2 run xplat/caffe2:supported_mobile_models_test
```

Build a major FoA app that uses model tracing and confirm it still works.

```
buck2 build fb4a
```

CI/CD for the rest. If operator tracing / bundling were broken, I'd hope the 1000+ tests spawned by this change would catch it.

Differential Revision: D44946368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99122
Approved by: https://github.com/dhruvbird
2023-04-18 17:19:55 +00:00
47dca20d80 [BE] Enable flake8-comprehension rule C417 (#97880)
Enables flake8-comprehension rule C417. Ruff autogenerated these fixes to the codebase.
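
For readers unfamiliar with C417 (unnecessary `map` usage), a small sketch of the before and after forms; the data is arbitrary:

```python
values = [1, 2, 3]

# Flagged form: map() with a lambda, wrapped in list().
doubled_with_map = list(map(lambda x: x * 2, values))

# Preferred form: a list comprehension with the same result.
doubled_with_comprehension = [x * 2 for x in values]
```
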

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97880
Approved by: https://github.com/ezyang, https://github.com/kit1980, https://github.com/albanD
2023-03-30 14:34:24 +00:00
7554c10899 Fix typos under tools directory (#97779)
Fix typos under tools directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97779
Approved by: https://github.com/clee2000, https://github.com/kit1980
2023-03-30 08:21:35 +00:00
60a68477a6 Bump black version to 23.1.0 (#96578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96578
Approved by: https://github.com/ezyang
2023-03-15 06:27:59 +00:00
a229b4526f [BE] Prefer dash over underscore in command-line options (#94505)
Preferring dash over underscore in command-line options. Add `--command-arg-name` to the argument parser. The old arguments with underscores `--command_arg_name` are kept for backward compatibility.

Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library:

`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)

```python
class BooleanOptionalAction(Action):
    def __init__(...):
            if option_string.startswith('--'):
                option_string = '--no-' + option_string[2:]
                _option_strings.append(option_string)
```

It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the shift or caps-lock key, whereas `-` does not.
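
One way to keep both spellings working, sketched with the placeholder option name from the description above, is to register the dashed and the underscored option string on the same argument. `build_parser` is a hypothetical helper, and the actual changes in the PR may differ in detail:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--command-arg-name",   # preferred spelling
        "--command_arg_name",   # legacy spelling, kept for compatibility
        dest="command_arg_name",
        default=None,
    )
    return parser
```

Both `--command-arg-name value` and `--command_arg_name value` then populate `args.command_arg_name`.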

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-09 20:16:49 +00:00
047e542a1a [tools] expose selective build library (#89351)
Change the base module and visibility of `tools:gen_oplist_lib` so that it can be reused.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89351
Approved by: https://github.com/cccclai
2022-11-21 21:08:13 +00:00
347b036350 Apply ufmt linter to all py files under tools (#81285)
With ufmt in place https://github.com/pytorch/pytorch/pull/81157, we can now use it to gradually format all files. I'm breaking this down into multiple smaller batches to avoid too many merge conflicts later on.

This batch (as copied from the current BLACK linter config):
* `tools/**/*.py`

Upcoming batches:
* `torchgen/**/*.py`
* `torch/package/**/*.py`
* `torch/onnx/**/*.py`
* `torch/_refs/**/*.py`
* `torch/_prims/**/*.py`
* `torch/_meta_registrations.py`
* `torch/_decomp/**/*.py`
* `test/onnx/**/*.py`

Once they are all formatted, BLACK linter will be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81285
Approved by: https://github.com/suo
2022-07-13 07:59:22 +00:00
1f8049566f Re-land BUCK build for pytorch mobile (#77612)
see https://github.com/pytorch/pytorch/pull/76480
fixed most lint errors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77612
Approved by: https://github.com/kit1980
2022-05-17 00:30:13 +00:00
530481ed69 Revert "[mobile] add buck build for mobile targets (#76480)"
This reverts commit 168dc70faf9764417a7e41a14bf2f4e15a7f3e4a.

Reverted https://github.com/pytorch/pytorch/pull/76480 on behalf of https://github.com/atalman
2022-05-16 16:14:17 +00:00
168dc70faf [mobile] add buck build for mobile targets (#76480)
Create buck targets to replicate internal BUCK build, including
- XNNPACK
- QNNPACK
- C10
- aten_cpu
- torch_mobile_core
- torch_mobile_all_ops
- ptmobile_benchmark

And able to run mobilenet v2 using ptmobile_benchmark (with all ops).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76480
Approved by: https://github.com/seemethere, https://github.com/dreiss
2022-05-15 18:42:41 +00:00
36420b5e8c Rename tools/codegen to torchgen (#76275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76275

In preparation for addressing
https://github.com/pytorch/pytorch/issues/73212

Diff was generated with:

```
git mv tools/codegen torchgen
git grep -l 'tools.codegen' | xargs sed -i 's/tools.codegen/torchgen/g'
sed -i "s/\${TOOLS_PATH}\/codegen/\${TORCH_ROOT}\/torchgen/g" caffe2/CMakeLists.txt
```

and manual edits to:

* tools/test/test_gen_backend_stubs.py
* torchgen/build.bzl
* torchgen/gen_backend_stubs.py

aka this diff:

```
 diff --git a/tools/test/test_gen_backend_stubs.py b/tools/test/test_gen_backend_stubs.py
index 3dc26c6d2d..104054575e 100644
 --- a/tools/test/test_gen_backend_stubs.py
+++ b/tools/test/test_gen_backend_stubs.py
@@ -9,7 +9,7 @@ from torchgen.gen_backend_stubs import run
 from torchgen.gen import _GLOBAL_PARSE_NATIVE_YAML_CACHE  # noqa: F401

 path = os.path.dirname(os.path.realpath(__file__))
-gen_backend_stubs_path = os.path.join(path, '../torchgen/gen_backend_stubs.py')
+gen_backend_stubs_path = os.path.join(path, '../../torchgen/gen_backend_stubs.py')

 # gen_backend_stubs.py is an integration point that is called directly by external backends.
 # The tests here are to confirm that badly formed inputs result in reasonable error messages.
 diff --git a/torchgen/build.bzl b/torchgen/build.bzl
index ed04e35a43..d00078a3cf 100644
 --- a/torchgen/build.bzl
+++ b/torchgen/build.bzl
@@ -1,6 +1,6 @@
 def define_targets(rules):
     rules.py_library(
-        name = "codegen",
+        name = "torchgen",
         srcs = rules.glob(["**/*.py"]),
         deps = [
             rules.requirement("PyYAML"),
@@ -11,6 +11,6 @@ def define_targets(rules):

     rules.py_binary(
         name = "gen",
-        srcs = [":codegen"],
+        srcs = [":torchgen"],
         visibility = ["//visibility:public"],
     )
 diff --git a/torchgen/gen_backend_stubs.py b/torchgen/gen_backend_stubs.py
index c1a672a655..beee7a15e0 100644
 --- a/torchgen/gen_backend_stubs.py
+++ b/torchgen/gen_backend_stubs.py
@@ -474,7 +474,7 @@ def run(
 ) -> None:

     # Assumes that this file lives at PYTORCH_ROOT/torchgen/gen_backend_stubs.py
-    pytorch_root = pathlib.Path(__file__).parent.parent.parent.absolute()
+    pytorch_root = pathlib.Path(__file__).parent.parent.absolute()
     template_dir = os.path.join(pytorch_root, "aten/src/ATen/templates")

     def make_file_manager(install_dir: str) -> FileManager:
```

run_all_fbandroid_tests

Test Plan: sandcastle

Reviewed By: albanD, ngimel

Differential Revision: D35770317

fbshipit-source-id: 153ac4a7fef15b1e750812a90bfafdbc8f1ebcdf
(cherry picked from commit c6d485d1d4648fa1c8a4c14c5bf3d8e899b9b4dd)
2022-04-25 01:38:06 +00:00
a11c1bbdd0 Run Black on all of tools/
Signed-off-by: Edward Z. Yang <ezyangfb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76089

Approved by: https://github.com/albanD
2022-04-20 17:29:41 +00:00
3dc0754c53 [pytorch][mobile] deprecate the LLVM-based static analyzer (#68180)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68180

Since we've open sourced the tracing-based selective build, we can deprecate the
op-dependency-graph-based selective build and the static analyzer tool that
produces the dependency graph.
ghstack-source-id: 143108377

Test Plan: CIs

Reviewed By: seemethere

Differential Revision: D32358467

fbshipit-source-id: c61523706b85a49361416da2230ec1b035b8b99c
2021-11-11 16:37:08 -08:00
355acfdebc [PyTorch Edge][tracing-based] use operator.yaml to build libtorch library (#66237)
Summary:
https://pxl.cl/1QK3N
Enable using the yaml file from tracer to build libtorch library for ios and android.

1. Android:
```
SELECTED_OP_LIST=/Users/chenlai/Documents/pytorch/tracing/deeplabv3_scripted_tracing_update.yaml TRACING_BASED=1  ./scripts/build_pytorch_android.sh x86
```
libtorch_lite.so x86: 3 MB (larger than H1, static is ~3.2 MB)

2. iOS
```
SELECTED_OP_LIST=/Users/chenlai/Documents/pytorch/tracing/deeplabv3_scripted_tracing_update.yaml TRACING_BASED=1 BUILD_PYTORCH_MOBILE=1 IOS_PLATFORM=SIMULATOR  ./scripts/build_ios.sh
```
Binary size: 7.6 MB
Size:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66237

ghstack-source-id: 140197164

Reviewed By: dhruvbird

Differential Revision: D31463119

fbshipit-source-id: c3f4eb71bdef1969eab6cb60999fec8547641cbd
2021-10-10 14:07:01 -07:00
93e0f3a330 Shard Operators.cpp (#62185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62185

This file can take 5 minutes on its own to compile, and is the single limiting
factor for compile time of `libtorch_cpu` on a 32-core threadripper. Instead,
sharding into 5 files that take around 1 minute each cuts a full minute off the
overall build time.
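
The general idea of sharding generated code can be sketched as follows. This is an illustrative assumption about how names might be distributed across files, not a description of the code generator's actual scheme; `shard_index` and `shard_names` are hypothetical:

```python
import zlib


def shard_index(name: str, num_shards: int) -> int:
    # crc32 is stable across runs, unlike the built-in hash() for strings.
    return zlib.crc32(name.encode("utf-8")) % num_shards


def shard_names(names, num_shards):
    """Distribute names into num_shards lists, deterministically."""
    shards = [[] for _ in range(num_shards)]
    for name in sorted(names):
        shards[shard_index(name, num_shards)].append(name)
    return shards
```

Because each name always lands in the same shard, editing one operator only invalidates one of the generated files, and the shards can compile in parallel.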

This also factors out the `.findSchemaOrThrow(...).typed` step so the code can
be shared between `call` and `redispatch`.

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D29962049

Pulled By: albanD

fbshipit-source-id: be5df05fbea09ada0d825855f1618c25a11abbd8
2021-08-09 16:19:49 -07:00
69b2bf70f9 [pytorch] fix tools/code_analyzer for llvm 11 (#60322)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60322

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D29250420

Pulled By: ljk53

fbshipit-source-id: ff7f9cbacd1d9518ed81c06fc843a90d6948f760
2021-06-20 00:39:11 -07:00
501320ed81 [pytorch] deprecate default_op_deps.yaml (#59573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59573

To do mobile selective build, we have several options:
1. static dispatch;
2. dynamic dispatch + static analysis (to create the dependency graph);
3. dynamic dispatch + tracing;

We are developing 3. For open source, we used to only support 1, and
currently we support both 1 and 2.

This file is only used for 2. It was introduced when we deprecated
the static dispatch (1). The motivation was to make sure we have a
low-friction selective build workflow for dynamic dispatch (2).
As the name indicates, it is the *default* dependency graph that users
can try if they don't bother to run the static analyzer themselves.
We have a CI to run the full workflow of 2 on every PR, which creates
the dependency graph on-the-fly instead of using the committed file.

Since the workflow to automatically update the file has been broken
for a while, it started to confuse other pytorch developers as people
are already manually editing it, and it might be broken for some models
already.

We reintroduced the static dispatch recently, so we decide to deprecate
this file now and automatically turn on static dispatch if users run
selective build without providing the static analysis graph.

The tracing-based selective build will be the ultimate solution we'd
like to provide for OSS, but it will take some more effort to polish
and release.

Differential Revision:
D28941020

Test Plan: Imported from OSS

Reviewed By: dhruvbird

Pulled By: ljk53

fbshipit-source-id: 9977ab8568e2cc1bdcdecd3d22e29547ef63889e
2021-06-07 19:37:37 -07:00
737d920b21 Strictly type everything in .github and tools (#59117)
Summary:
This PR greatly simplifies `mypy-strict.ini` by strictly typing everything in `.github` and `tools`, rather than picking and choosing only specific files in those two dirs. It also removes `warn_unused_ignores` from `mypy-strict.ini`, for reasons described in https://github.com/pytorch/pytorch/pull/56402#issuecomment-822743795: basically, that setting makes life more difficult depending on what libraries you have installed locally vs in CI (e.g. `ruamel`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59117

Test Plan:
```
flake8
mypy --config mypy-strict.ini
```

Reviewed By: malfet

Differential Revision: D28765386

Pulled By: samestep

fbshipit-source-id: 3e744e301c7a464f8a2a2428fcdbad534e231f2e
2021-06-07 14:49:36 -07:00
09a8f22bf9 Add mish activation function (#58648)
Summary:
See issue: https://github.com/pytorch/pytorch/issues/58375

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58648

Reviewed By: gchanan

Differential Revision: D28625390

Pulled By: jbschlosser

fbshipit-source-id: 23ea2eb7d5b3dc89c6809ff6581b90ee742149f4
2021-05-25 10:36:21 -07:00
bbc3cc6718 [CUDA graphs] [BC-breaking] Makes torch.cuda.amp.GradScaler scale updates in-place for better composability with graph capture (#55562)
Summary:
I'd like the following pattern (a natural composition of Amp with full fwd+bwd capture) to work:
```python
# Create "static_input" with dummy data, run warmup iterations,
# call optimizer.zero_grad(set_to_none=True), then
g = torch.cuda._Graph()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    optimizer.zero_grad(set_to_none=True)
    g.capture_begin()
    with autocast():
        out = model(static_input)
        loss = loss_fn(out)
    scaler.scale(loss).backward()
    g.capture_end()
torch.cuda.current_stream().wait_stream(s)

# Training loop:
for b in data:
    # optimizer.zero_grad() deliberately omitted, replay()'s baked-in backward will refill statically held .grads
    static_input.copy_(b)
    g.replay()
    scaler.step(optimizer)
    scaler.update()
```

Right now `GradScaler` can't work with this pattern because `update()` creates the scale tensor for the next iteration out of place. This PR changes `update()` to act in place on a long-lived scale tensor that stays static across iterations.

I'm not sure how this change affects XLA (see https://github.com/pytorch/pytorch/pull/48570), so we shouldn't merge without approval from ailzhang yaochengji.

Tagged bc-breaking because it's a change to the amp update utility function in native_functions.yaml. The function was never meant to be user-facing though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55562

Reviewed By: zou3519

Differential Revision: D28046159

Pulled By: ngimel

fbshipit-source-id: 02018c221609974546c562f691e20ab6ac611910
2021-04-30 13:03:05 -07:00
fd02fc5d71 Port put_ and take from TH to ATen (#53356)
Summary:
The two ports were done together, as they can be implemented with the same kernel. In TH, they were already implemented with the same kernel.

Resolves https://github.com/pytorch/pytorch/issues/24751
Resolves https://github.com/pytorch/pytorch/issues/24614
Resolves https://github.com/pytorch/pytorch/issues/24640
Resolves https://github.com/pytorch/pytorch/issues/24772

This port makes sure that it interacts correctly with the "deterministic algorithms" flag, as done in https://github.com/pytorch/pytorch/pull/51388

This PR also makes these two functions correct in the following aspects (all of them added to the tests as well):
- Support for complex numbers
- Correct handling of scalar inputs and zero-dimensional inputs
- Implementation that does not do any copies nor sorting of any of the input tensors
- Faster and more correct implementation of the backwards (now it works as it should when `source.shape() != index.shape()`)
- Now `put_(..., accumulate=True)` is implemented correctly with atomic operations on GPU / CPU (when possible) and is deterministic (modulo the loss of precision that might happen due to the reordering of a sum of floats)
- Adds the `torch.put` function that was missing, (`index_put` exists, for example)
- Corrected docs
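
For readers who have not used these operations, the indexing semantics of `put_` can be modelled in pure Python on a flat list. This is only a reference model written for this note: it returns a new list instead of mutating in place, and it says nothing about how the ATen kernel is implemented. `put_flat` is a hypothetical name:

```python
def put_flat(values, index, source, accumulate=False):
    """Model of put_ on a tensor viewed as 1-D.

    values: flat list standing in for the flattened tensor
    index:  positions into that flat list
    source: values to write (or add) at those positions
    """
    if len(index) != len(source):
        raise ValueError("index and source must have the same length")
    result = list(values)
    for i, s in zip(index, source):
        if accumulate:
            result[i] += s   # repeated indices add up
        else:
            result[i] = s    # plain overwrite
    return result
```

With `accumulate=True`, repeated indices sum their contributions, which is why the GPU implementation needs atomic additions and why the result can vary in the last bits for floats.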

It also adds much more thorough testing of the operations and their gradients.

There is a BC-breaking change, and that is that now we check that the inputs do not overlap in the `put_` operation. This was handled (some of the cases, other cases were wrong) in the TH implementation by making contiguous copies of the inputs. How should we handle this one?

**Edit.** Benchmarks:
<details>
<summary>Script</summary>

```python
from IPython import get_ipython
import torch
from itertools import product

torch.manual_seed(13)
torch.set_num_threads(1)

ipython = get_ipython()

cpu = torch.device('cpu')
cuda = torch.device('cuda')

def run_test(ndims, size, index_len, device, cmd):
    print(f"cmd: {cmd}, ndims: {ndims}, tensor_size: {size}, index_len: {index_len}, device: {device}")

    large_tensor = torch.rand(*([size] * ndims), device=device)
    small_tensor = torch.rand((index_len,), device=device)
    index = torch.randint(size * ndims, (index_len,), dtype=torch.long, device=device)
    if cmd == "put":
        command = "large_tensor.put_(index, small_tensor, accumulate=False)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    elif cmd == "accumulate":
        command = "large_tensor.put_(index, small_tensor, accumulate=True)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    elif cmd == "take":
        command = "torch.take(large_tensor, index)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    ipython.magic(f"timeit {command}")
    print()

for method, device in product(["accumulate", "put", "take"], [cpu, cuda]):
    run_test(3, 1000, 10, device, method)
    run_test(3, 1000, 1000, device, method)
    run_test(3, 1000, 10000, device, method)
    run_test(2, 10000, 100000, device, method)
```
</details>

```python
put_(accumulate=False)
```

<details>
<summary>ATen CPU (1.5x - 2x speedup)</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.05 µs ± 2.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
3.15 µs ± 5.13 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
21.6 µs ± 13.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
238 µs ± 781 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
722 ns ± 2.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
4.89 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
42.5 µs ± 96.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
428 µs ± 774 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>
<details>
<summary>ATen GPU (same speed)</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
8.99 µs ± 16 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.4 µs ± 24.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.4 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
15.6 µs ± 1.12 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
8.44 µs ± 31.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
9.09 µs ± 4.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
9.77 µs ± 0.998 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
15.8 µs ± 5.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

```python
put_(accumulate=True)
```

<details>
<summary>ATen CPU (x2 speedup)</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.12 µs ± 2.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
3.14 µs ± 2.05 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
20.8 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
264 µs ± 263 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
814 ns ± 1.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
5.11 µs ± 6.02 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
43.9 µs ± 49.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
442 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>
<details>
<summary>ATen GPU (3x - 11x speedup)</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.01 µs ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.4 µs ± 15.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.3 µs ± 44.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
12.6 µs ± 19 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
34.7 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
38.2 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
61.2 µs ± 50.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
140 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>

```python
take()
```

<details>
<summary>ATen CPU (1.1x speedup)</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.18 µs ± 2.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
2.79 µs ± 2.96 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
16.6 µs ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
161 µs ± 984 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.1 µs ± 3.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
2.93 µs ± 7.31 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
18.6 µs ± 14.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
178 µs ± 139 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>
<details>
<summary>ATen GPU (same speed)</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.38 µs ± 23.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.7 µs ± 9.77 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.6 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
11.5 µs ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.31 µs ± 7.57 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
9.52 µs ± 5.78 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
9.73 µs ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
11.7 µs ± 5.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53356

Reviewed By: mruberry

Differential Revision: D27520243

Pulled By: ngimel

fbshipit-source-id: e3979349c2c62d2949e09fb05e5fd4883fbc9093
2021-04-05 18:05:38 -07:00
b2d8f0a431 [pytorch][bot] update mobile op deps (#52110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52110

LLVM_DIR=/usr ANALYZE_TORCH=1 tools/code_analyzer/build.sh
cp build_code_analyzer/work/torch_result.yaml tools/code_analyzer/default_op_deps.yaml

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D26419138

Pulled By: ljk53

fbshipit-source-id: 26bf00036b19ad18a9cf06111df4d9fe32e5feab
2021-02-12 14:50:29 -08:00
c458558334 kill multinomial_alias_setup/draw (#50489)
Summary:
As per title. Partially Fixes https://github.com/pytorch/pytorch/issues/49421.
These functions appear to be dead code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50489

Reviewed By: mruberry

Differential Revision: D25948912

Pulled By: ngimel

fbshipit-source-id: 108723bd4c76cbc3535eba902d6f74597bfdfa58
2021-01-19 00:23:58 -08:00
5252e9857a [pytorch] clean up unused util srcs under tools/autograd (#50611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50611

Removed the unused old-style code to prevent it from being used.
Added all autograd/gen_pyi sources to mypy-strict.ini config.

Confirmed byte-for-byte compatible with the old codegen:
```
Run it before and after this PR:
  .jenkins/pytorch/codegen-test.sh <baseline_output_dir>
  .jenkins/pytorch/codegen-test.sh <test_output_dir>

Then run diff to compare the generated files:
  diff -Naur <baseline_output_dir> <test_output_dir>
```

Confirmed clean mypy-strict run:
```
mypy --config mypy-strict.ini
```

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25929730

Pulled By: ljk53

fbshipit-source-id: 1fc94436fd4a6b9b368ee0736e99bfb3c01d38ef
2021-01-18 23:54:02 -08:00
4a14020c0d Remove .impl_UNBOXED() and functionalities associated with it (#49220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49220

Since all ops are c10-full, we can remove .impl_UNBOXED now.
This also removes the ability of KernelFunction or CppFunction to store unboxedOnly kernels.
ghstack-source-id: 119450489

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D25490225

fbshipit-source-id: 32de9d591e6a842fe18abc82541580647e9cfdad
2021-01-06 14:22:46 -08:00
b5149513ec migrate export_caffe2_op_to_c10.h macros to the new dispatcher registration API, update code_analyzer regex (#48308)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48308

The original regex that I added didn't correctly match namespaces that started with an underscore (e.g. `_test`), which caused a master-only test to fail.

The only change from the previous commit is that I updated the regex like so:

before: `^.*TORCH_LIBRARY_IMPL_init_([^_]+)_([^_]+)_[0-9]+(\(.*)?$`
after: `^.*TORCH_LIBRARY_IMPL_init_([_]*[^_]+)_([^_]+)_[0-9]+(\(.*)?$`

I added in a `[_]*` to the beginning of the namespace capture. I did the same for the `_FRAGMENT` regex.
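
As a quick illustration of the difference between the two patterns, here is a minimal sketch using Python's `re` module. The two symbol names are hypothetical and were made up for this example, and Python's regex engine is used only for convenience; the analyzer's own engine may differ in details.

```python
import re

# The two patterns quoted above, copied verbatim.
BEFORE = re.compile(r"^.*TORCH_LIBRARY_IMPL_init_([^_]+)_([^_]+)_[0-9]+(\(.*)?$")
AFTER = re.compile(r"^.*TORCH_LIBRARY_IMPL_init_([_]*[^_]+)_([^_]+)_[0-9]+(\(.*)?$")

# Hypothetical symbol names (assumptions, not taken from a real build).
PLAIN = "TORCH_LIBRARY_IMPL_init_aten_CPU_0"
UNDERSCORED = "TORCH_LIBRARY_IMPL_init__test_CPU_2"


def parse(pattern, symbol):
    """Return (namespace, dispatch_key) if the symbol matches, else None."""
    m = pattern.match(symbol)
    return (m.group(1), m.group(2)) if m else None


results = {
    "before/plain": parse(BEFORE, PLAIN),
    "after/plain": parse(AFTER, PLAIN),
    "before/underscored": parse(BEFORE, UNDERSCORED),
    "after/underscored": parse(AFTER, UNDERSCORED),
}

for label, value in results.items():
    print(label, value)
```

Under these assumptions, only the old pattern applied to the `_test` namespace fails to match; the new pattern captures the leading underscore as part of the namespace.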

Verified that running `ANALYZE_TEST=1 tools/code_analyzer/build.sh` (as the master-only test does) produces no diff in the output.

Fixing regex pattern to allow for underscores at the beginning of the
namespace

This reverts commit 3c936ecd3c68f395dad01f42935f20ed8068da02.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25123295

Pulled By: bdhirsh

fbshipit-source-id: 54bd1e3f0c8e28145e736142ad62a18806bb9672
2020-11-30 13:05:33 -08:00
3c936ecd3c Revert D25056091: migrate export_caffe2_op_to_c10.h macros to the new dispatcher registration API
Test Plan: revert-hammer

Differential Revision:
D25056091 (0ea4982cf3)

Original commit changeset: 0f647ab9bc5e

fbshipit-source-id: e54047b91d82df25460ee00482373c4580f94d50
2020-11-19 19:10:14 -08:00
0ea4982cf3 migrate export_caffe2_op_to_c10.h macros to the new dispatcher registration API (#48097)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48097

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25056091

Pulled By: bdhirsh

fbshipit-source-id: 0f647ab9bc5e5aee497dac058df492f6e742cfe9
2020-11-19 17:56:56 -08:00
4f538a2ba4 [pytorch][bot] update mobile op deps (#47825)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47825

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D24913587

Pulled By: ljk53

fbshipit-source-id: b6219573c3238fb453d88019197a00c9f9dbabb8
2020-11-12 19:19:25 -08:00
3d962430a9 Make gen_op_registration flake8 compliant (#47604)
Summary:
Fixes regression introduced by D24686838 (8182558c22)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47604

Reviewed By: walterddr

Differential Revision: D24832687

Pulled By: malfet

fbshipit-source-id: e9f7a35561c2b1705e11fd11abe402e3c83cf5cc
2020-11-09 08:31:07 -08:00
8182558c22 [PyTorch Mobile] Don't use __ROOT__ for inference only ops
Summary:
`__ROOT__` ops are only used in full-jit. To keep the binary size small, stop using them in inference builds. Since FL is still on full-jit, keep them for training only.

It saves 17 KB for fbios.

TODO: when FL is migrated to lite_trainer, remove `__ROOT__` to save size in training too.

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D24686838

fbshipit-source-id: 15214cebb9d8defa3fdac3aa0d73884b352aa753
2020-11-08 15:27:47 -08:00
27e2ea4cea Make add_relu an internal function (#46676)
Summary:
Cleanup for 1.7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46676

Reviewed By: gchanan

Differential Revision: D24458565

Pulled By: albanD

fbshipit-source-id: b1e4b4630233d3f1a4bac20e3077411d1ae17f7b
2020-10-22 18:08:15 -07:00
75322dbeb4 [PyTorch] [BUCK] Replace pt_deps.bzl with a YAML operator dependency file which is generated by the code analyser (#46057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46057

The code analyser (which uses LLVM and runs in the OSS PyTorch git repo) already produces a YAML file that contains base operator names and the operators they depend on. Currently, this operator dependency graph is converted into a Python dictionary that is imported and used in BUCK. However, it is mostly fed into other executables by serializing it to JSON, and the consumer pieces this JSON back together by concatenating the arguments. This seems unnecessary. Instead, this diff retains the original YAML file and makes all consumers consume that same YAML file.
ghstack-source-id: 114641582
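
To make the shape of such a dependency graph concrete, here is a minimal sketch of how a consumer might expand a set of root operators into everything they transitively depend on. The operator names, the in-memory dictionary (standing in for the parsed YAML file), and the `transitive_closure` helper are all assumptions for illustration, not the actual file contents or tooling.

```python
from collections import deque

# Hypothetical graph: each base operator maps to the operators it depends on.
# A real consumer would load this from the YAML file instead.
dep_graph = {
    "aten::add": {"aten::empty", "aten::to"},
    "aten::to": {"aten::empty_like"},
    "aten::empty_like": {"aten::empty"},
    "aten::empty": set(),
    "aten::mul": {"aten::empty"},
}


def transitive_closure(graph, roots):
    """Breadth-first walk collecting the roots and all reachable dependencies."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        op = queue.popleft()
        for dep in graph.get(op, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


closure = transitive_closure(dep_graph, ["aten::add"])
print(sorted(closure))
```

In this toy graph, `aten::mul` is left out because nothing reachable from `aten::add` depends on it.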

Test Plan: Build Lite Predictor + sandcastle.

Reviewed By: iseeyuan

Differential Revision: D24186303

fbshipit-source-id: eecf41bf673d90b960c3efe7a1271249f0a4867f
2020-10-20 02:00:36 -07:00
0c5cd8c2b9 [RFC] Switch PyTorch Selective Build (Custom Build) to use the SelectiveBuilder abstraction (#45722)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45722

This diff does a bunch of things:

1. Introduces some abstractions as detailed in https://fb.quip.com/2oEzAR5MKqbD to help with selective build related codegen in multiple files.
2. Adds helper methods to combine operators, debug info, operator lists, etc...
3. Currently, the selective build machinery queries `op_registration_whitelist` directly at various places in the code. `op_registration_whitelist` is a list of allowed operator names (without overload name). We want to move to a world where the overload names are also included so that we can be more selective about which operators we include. To that effect, it makes sense to hide the checking logic in a separate abstraction and have the build use that abstraction instead of putting all this selective-build-specific logic in the code generator itself. This change attempts to do just that.
4. Updates generate_code, unboxing-wrapper codegen, and autograd codegen to accept the operator selector paradigm as opposed to a selected operator list.
5. Updates `tools/code_analyzer/gen_op_registration_allowlist.py` to expose an actual structured operator dependency graph in addition to a serialized string.
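
The idea in item 3, hiding the allowlist check behind a small abstraction so that callers never inspect the list directly, can be sketched roughly as follows. `ToySelector` and its methods are hypothetical names invented for this example; they are not the interface introduced by this diff.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class ToySelector:
    """Hypothetical stand-in for a selective-build operator selector."""

    # None means selective build is off, so every operator is included.
    allowed: Optional[FrozenSet[str]] = None

    @staticmethod
    def from_names(names: Iterable[str]) -> "ToySelector":
        return ToySelector(frozenset(names))

    def is_selected(self, name: str) -> bool:
        if self.allowed is None:
            return True
        # Accept an exact match (including overload name), or an entry
        # that lists only the base name and so covers all overloads.
        base = name.split(".", 1)[0]
        return name in self.allowed or base in self.allowed


selector = ToySelector.from_names(["aten::add.Tensor", "aten::mul"])
for op in ["aten::add.Tensor", "aten::add.Scalar", "aten::mul.Tensor"]:
    print(op, selector.is_selected(op))
```

With a design like this, code generation only asks the selector a yes/no question, so moving from base names to overload-qualified names changes the selector and not its callers.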

There are a bunch of structural changes as well:

1. `root_op_list.yaml` and `combined_op_list.yaml` are now actual YAML files (not a space-separated list of operator names)
2. `generate_code.py` accepts only paths to operator list YAML files (both old style and new style), not a list of operator names as command-line arguments
3. `gen.py` optionally also accepts a custom build related operators YAML path (this file has information about which operators to register in the generated library).

ghstack-source-id: 114578753

(Note: this ignores all push blocking failures!)

Test Plan:
`buck test caffe2/test:selective_build`

Generated YAML files after the change:

{P143981979}

{P143982025}

{P143982056}

Ensure that the generated files are the same before and after the change:

```
[dhruvbird@devvm2490 /tmp/TypeDefault.cpp] find -name "*.cpp" | xargs md5sum
d72c3d125baa7b77e4c5581bbc7110d2  ./after_change/gen_aten/TypeDefault.cpp
42353036c83ebc7620a7159235b9647f  ./after_change/lite_predictor_lib_aten/TypeDefault.cpp
d72c3d125baa7b77e4c5581bbc7110d2  ./before_change/gen_aten/TypeDefault.cpp
42353036c83ebc7620a7159235b9647f  ./before_change/lite_predictor_lib_aten/TypeDefault.cpp
```

The `VariableType_N.cpp` files are identical before and after the change:

```
[dhruvbird@devvm2490 /tmp/VariableType] find -name "*.cpp" | xargs -n 1 md5sum | sort
3be89f63fd098291f01935077a60b677  ./after/VariableType_2.cpp
3be89f63fd098291f01935077a60b677  ./before/VariableType_2.cpp
40a3e59d64e9dbe86024cf314f127fd6  ./after/VariableType_4.cpp
40a3e59d64e9dbe86024cf314f127fd6  ./before/VariableType_4.cpp
a4911699ceda3c3a430f08c64e8243fd  ./after/VariableType_1.cpp
a4911699ceda3c3a430f08c64e8243fd  ./before/VariableType_1.cpp
ca9aa611fcb2a573a8cba4e269468c99  ./after/VariableType_0.cpp
ca9aa611fcb2a573a8cba4e269468c99  ./before/VariableType_0.cpp
e18f639ed23d802dc4a31cdba40df570  ./after/VariableType_3.cpp
e18f639ed23d802dc4a31cdba40df570  ./before/VariableType_3.cpp
```
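
The same before/after check can be scripted. Below is a minimal sketch that hashes every `.cpp` file under two directories and reports the names whose checksums differ. The helper names and the temporary directory layout are assumptions for illustration; only the idea of comparing MD5 sums comes from the test plan above.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_by_name(root: Path) -> dict:
    """Map each .cpp file's path (relative to root) to its MD5 hex digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.md5(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*.cpp"))
    }


def changed_files(before: Path, after: Path) -> list:
    """Return names that are missing on one side or differ in content."""
    b, a = md5_by_name(before), md5_by_name(after)
    return sorted(name for name in b.keys() | a.keys() if b.get(name) != a.get(name))


# Demonstration on throwaway directories with a hypothetical generated file.
with tempfile.TemporaryDirectory() as tmp:
    before, after = Path(tmp, "before"), Path(tmp, "after")
    for d in (before, after):
        d.mkdir()
        (d / "VariableType_0.cpp").write_text("// generated\n")
    identical = changed_files(before, after)

    # Simulate a codegen change on the "after" side.
    (after / "VariableType_0.cpp").write_text("// generated differently\n")
    differing = changed_files(before, after)

print(identical, differing)
```

An empty list corresponds to the situation shown in the listings above, where every file has the same checksum on both sides.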

Reviewed By: ljk53

Differential Revision: D23837010

fbshipit-source-id: ad06b1756af5be25baa39fd801dfdf09bc565442
2020-10-18 15:10:42 -07:00
d2623da52c replaced whitelist with allowlist (#45260)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41754

**(1)**
Initially the file was named **gen_op_registration_whitelist.py**; I renamed it to **gen_op_registration_allowlist.py**.

**(2)**
There were some occurrences of **whitelist** in comments inside the file; I changed them to **allowlist**.
![update1](https://user-images.githubusercontent.com/62737243/94106752-b296e780-fe59-11ea-8541-632a1dbf90d6.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45260

Reviewed By: dhruvbird

Differential Revision: D23947182

Pulled By: ljk53

fbshipit-source-id: 31b486592451dbb0605d7950e07747cbb72ab80f
2020-09-29 00:27:46 -07:00
4a9c80e82e [pytorch][bot] update mobile op deps (#44854)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44854

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23751925

Pulled By: ljk53

fbshipit-source-id: 8e1905091bf3abaac20d97182eb88f96e905ffc2
2020-09-17 18:33:13 -07:00
3fa7f515a5 [pytorch][bot] update mobile op deps (#44700)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44700

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23719486

Pulled By: ljk53

fbshipit-source-id: 39219ceeee51861f90b228fdfe2ab59ac8a9704d
2020-09-16 17:20:15 -07:00
0e3cf6b8d2 [pytorch] remove code analyzer build folder between builds (#44148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44148

Automatically remove the build_code_analyzer folder each time build.sh is run
ghstack-source-id: 111458413

Test Plan:
Run build.sh with different options and compare the outputs (should be different).
Ex:
`ANALYZE_TORCH=1 DEPLOY=1 BASE_OPS_FILE=/path/to/baseops MOBILE_BUILD_FLAGS='-DBUILD_MOBILE_AUTOGRAD=OFF' tools/code_analyzer/build.sh `

should produce a shorter file than
`ANALYZE_TORCH=1 DEPLOY=1 BASE_OPS_FILE=/path/to/baseops MOBILE_BUILD_FLAGS='-DBUILD_MOBILE_AUTOGRAD=ON' tools/code_analyzer/build.sh`

Reviewed By: iseeyuan

Differential Revision: D23503886

fbshipit-source-id: 9b95d4365540da0bd2d27760e1315caed5f44eec
2020-09-04 10:38:12 -07:00
b10c527a1f [pytorch][bot] update mobile op deps (#44100)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44100

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23496532

Pulled By: ljk53

fbshipit-source-id: 1e5b9059482e423960349d1361a7a98718c2d9ed
2020-09-03 11:24:26 -07:00
402e9953df [pytorch][bot] update mobile op deps (#44018)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44018

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23470528

Pulled By: ljk53

fbshipit-source-id: b677e1c5677fc8929713ee108df69098502c50ea
2020-09-02 14:34:33 -07:00