Commit Graph

37 Commits

Author SHA1 Message Date
cc7a28d727 Refactor Unary Ops tests (#49712)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49712

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D25673712

Pulled By: izdeby

fbshipit-source-id: 4420d5d129026195097d914e410b75b144bea795
2021-03-19 09:28:00 -07:00
b5cdb53af1 Add division logic to a slow/fast path (#49250)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49250

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D25502938

Pulled By: izdeby

fbshipit-source-id: bdd583464eb15d7cb30fd0c22d119cc4b31cbf8d
2021-03-15 12:17:39 -07:00
4bb34c2a75 Update Binary Ops with scalar lists (#49249)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49249

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D25502939

Pulled By: izdeby

fbshipit-source-id: b16e23063b37521be549e83cb17676e3afc4ddb3
2021-03-15 12:16:04 -07:00
84af0c7acd Refactor ForeachUtils.h (#51131)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51131

--------
- Refactored `can_use_fast_route` logic in ForeachUtils.h.
- Fixed related bugs in test_foreach.py

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D26103904

Pulled By: izdeby

fbshipit-source-id: b3859b39adaab55c87dab6f7709d227adc0f6342
2021-03-13 13:39:25 -08:00
8c798e0622 Forbid trailing whitespace (#53406)
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857

These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
  - `GLOSSARY.md`
  - `aten/src/ATen/core/op_registration/README.md`
  - `scripts/README.md`
  - `torch/csrc/jit/codegen/fuser/README.md`

The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```
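The same check can be sketched in Python (hypothetical helper names; the actual CI lint is the grep shown above):

```python
def find_trailing_whitespace(text):
    """Return (line_number, line) pairs for lines that end in spaces or tabs."""
    return [(i, line)
            for i, line in enumerate(text.splitlines(), start=1)
            if line != line.rstrip(" \t")]

def strip_trailing_spaces(text):
    """Python analogue of `sed 's/ *$//'` applied line by line."""
    return "\n".join(line.rstrip(" ") for line in text.splitlines())
```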

I looked over the auto-generated changes and didn't see anything that looked problematic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406

Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377

This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348

Reviewed By: walterddr, seemethere

Differential Revision: D26856620

Pulled By: samestep

fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
2021-03-05 17:22:55 -08:00
c697e48023 Refactor ForeachUnaryOp.cu (#51894)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51894

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D26323605

Pulled By: izdeby

fbshipit-source-id: eb65269ab3e14160d7cb5e6e84e85ef4037d3b0d
2021-03-05 10:26:58 -08:00
110a17a4d9 Update foreach APIs to use scalar lists (#51893)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51893

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26323606

Pulled By: izdeby

fbshipit-source-id: 53791087c924d04526fe7adb8f4ab5676d383b04
2021-03-04 18:20:53 -08:00
e36576d153 Probable fix for out of place BinaryOpScalar bad values and/or IMAs on 11.2 (ci-all edition) (#52634)
Summary:
Should close https://github.com/pytorch/pytorch/issues/51992.

ci-all resubmit of https://github.com/pytorch/pytorch/pull/52591. The plot also thickened considerably since then. Every foreach functor, it turns out, has bad `r_args` accesses for certain code paths and instantiations.

Also, I noticed the [`n % kILP == 0`](2680ff7759/aten/src/ATen/native/cuda/ForeachFunctors.cuh (L87)) condition for vectorization in all functors is way too restrictive: it'll refuse to vectorize anything on any tensor whose overall numel is not a multiple of ILP. That's out of scope though.
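The restriction can be illustrated with a small sketch (the kILP value and helper names are illustrative, not the actual CUDA code):

```python
kILP = 4  # elements processed per thread per iteration (illustrative value)

def is_aligned(ptr: int, elem_size: int) -> bool:
    # Vectorized loads need the address aligned to kILP * elem_size bytes.
    return ptr % (kILP * elem_size) == 0

def can_vectorize_current(numel: int, ptr: int, elem_size: int = 4) -> bool:
    # The check described above: the *whole tensor's* numel must be a
    # multiple of kILP, so most "odd" sizes never vectorize.
    return numel % kILP == 0 and is_aligned(ptr, elem_size)

def can_vectorize_with_tail(numel: int, ptr: int, elem_size: int = 4) -> bool:
    # A less restrictive scheme: vectorize the bulk and handle the last
    # numel % kILP elements in a scalar tail loop.
    return numel >= kILP and is_aligned(ptr, elem_size)
```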

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52634

Reviewed By: H-Huang

Differential Revision: D26725991

Pulled By: izdeby

fbshipit-source-id: 4bade0ac186bf85527baddc1c44b2c2b8e3c9777
2021-03-01 12:41:24 -08:00
443a431ac3 Revert D25074763: [WIP] Update foreach APIs to use scalar lists
Test Plan: revert-hammer

Differential Revision:
D25074763 (cce84b5ca5)

Original commit changeset: 155e3d2073a2

fbshipit-source-id: ef0d153e2740b50bd4a95f7a57c370bb5da46355
2021-02-03 17:06:40 -08:00
cce84b5ca5 [WIP] Update foreach APIs to use scalar lists (#48223)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48223

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D25074763

Pulled By: izdeby

fbshipit-source-id: 155e3d2073a20d16bdbe358820170bf53f93c7a5
2021-02-02 14:54:28 -08:00
dad74e58fc [WIP] Added foreach_trunc, foreach_reciprocal, foreach_sigmoid APIs (#47385)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47385

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D24737051

Pulled By: izdeby

fbshipit-source-id: ed259d9184b2b784d8cc1983a8b85cc6cbf930ba
2020-12-07 10:47:23 -08:00
94cd048bda Added foreach_frac API (#47384)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47384

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24737052

Pulled By: izdeby

fbshipit-source-id: 8c94cc42bf22bfbb8f78bfeb2017a5756045763a
2020-11-17 16:56:30 -08:00
134bce7cd0 Adding bunch of unary foreach APIs (#47875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47875

Implementing several unary operators for _foreach_ APIs.
### Planned list of ops
- [x]  abs
- [x]  acos
- [x]  asin
- [x]  atan
- [x]  ceil
- [x]  cos
- [x]  cosh
- [x]  erf
- [x]  erfc
- [x]  exp
- [x]  expm1
- [x]  floor
- [x]  log
- [x]  log10
- [x]  log1p
- [x]  log2
- [ ]  frac
- [x]  neg
- [ ]  reciprocal
- [x]  round
- [ ]  rsqrt
- [ ]  sigmoid
- [x]  sin
- [x]  sinh
- [x]  sqrt
- [x]  tan
- [x]  tanh
- [ ]  trunc
- [x]  lgamma
- [ ]  digamma
- [ ]  erfinv
- [ ]  sign
- [ ]  mvlgamma
- [ ]  clamp
- [ ]  clamp_min
- [ ]  clamp_max

### Perf results
```
----------------- OP:  sin  -----------------
  Median: 998.79 us
  300.84 us

----------------- OP:  abs  -----------------
  Median: 1.19 ms
  294.97 us

----------------- OP:  acos  -----------------
  Median: 982.30 us
  299.40 us

----------------- OP:  asin  -----------------
  Median: 1.16 ms
  298.09 us

----------------- OP:  atan  -----------------
  Median: 986.92 us
  295.64 us

----------------- OP:  ceil  -----------------
  Median: 1.17 ms
  297.25 us

----------------- OP:  cos  -----------------
  Median: 972.72 us
  294.41 us

----------------- OP:  cosh  -----------------
  Median: 1.17 ms
  294.97 us

----------------- OP:  erf  -----------------
  Median: 1.17 ms
  297.02 us

----------------- OP:  erfc  -----------------
  Median: 1.14 ms
  299.23 us

----------------- OP:  exp  -----------------
  Median: 1.15 ms
  298.79 us

----------------- OP:  expm1  -----------------
  Median: 1.17 ms
  291.79 us

----------------- OP:  floor  -----------------
  Median: 1.17 ms
  293.51 us

----------------- OP:  log  -----------------
  Median: 1.13 ms
  318.01 us

----------------- OP:  log10  -----------------
  Median: 987.17 us
  295.57 us

----------------- OP:  log1p  -----------------
  Median: 1.13 ms
  297.15 us

----------------- OP:  log2  -----------------
  Median: 974.21 us
  295.01 us

----------------- OP:  frac  -----------------
  Median: 1.15 ms
  296.01 us

----------------- OP:  neg  -----------------
  Median: 1.13 ms
  294.98 us

----------------- OP:  reciprocal  -----------------
  Median: 1.16 ms
  293.69 us

----------------- OP:  round  -----------------
  Median: 1.12 ms
  297.48 us

----------------- OP:  sigmoid  -----------------
  Median: 1.13 ms
  296.53 us

----------------- OP:  sin  -----------------
  Median: 991.02 us
  295.78 us

----------------- OP:  sinh  -----------------
  Median: 1.15 ms
  295.70 us

----------------- OP:  sqrt  -----------------
  Median: 1.17 ms
  297.75 us

----------------- OP:  tan  -----------------
  Median: 978.20 us
  297.99 us

----------------- OP:  tanh  -----------------
  Median: 967.84 us
  297.29 us

----------------- OP:  trunc  -----------------
  Median: 1.14 ms
  298.72 us

----------------- OP:  lgamma  -----------------
  Median: 1.14 ms
  317.53 us
```

### Script

```
import torch
import torch.utils.benchmark as benchmark_utils

inputs = [torch.rand(3, 200, 200, device="cuda") for _ in range(100)]

def main():
    for op in [
            "sin", "abs", "acos", "asin", "atan", "ceil",
            "cos", "cosh", "erf", "erfc",
            "exp", "expm1", "floor", "log",
            "log10", "log1p", "log2", "frac",
            "neg", "reciprocal", "round",
            "sigmoid", "sin", "sinh", "sqrt",
            "tan", "tanh", "trunc", "lgamma"
        ]:
        print("\n\n----------------- OP: ", op, " -----------------")
        # Baseline: one kernel launch per tensor.
        timer = benchmark_utils.Timer(
            stmt=f"[torch.{op}(t) for t in inputs]",
            globals=globals(),
            label=f"torch.{op} (per-tensor loop)",
        )
        print(f"autorange:\n{timer.blocked_autorange()}\n\n")

        # Fused foreach variant: the whole list is handled by a few launches.
        timer_mta = benchmark_utils.Timer(
            stmt=f"torch._foreach_{op}(inputs)",
            globals=globals(),
            label=f"torch._foreach_{op}",
        )
        print(f"autorange:\n{timer_mta.blocked_autorange()}\n\n")

if __name__ == "__main__":
    main()
```

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D24948801

Pulled By: izdeby

fbshipit-source-id: defec3c0394d6816d9a8b05a42a057348f1b4d96
2020-11-17 16:51:54 -08:00
1c45631f10 Revert D24737050: [WIP] Adding bunch of unary foreach APIs
Test Plan: revert-hammer

Differential Revision:
D24737050 (b6a2444eff)

Original commit changeset: deb59b41ad1c

fbshipit-source-id: 76cd85028114cfc8fc5b7bb49cd27efc2e315aa5
2020-11-10 09:41:41 -08:00
b6a2444eff [WIP] Adding bunch of unary foreach APIs (#47383)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47383

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D24737050

Pulled By: izdeby

fbshipit-source-id: deb59b41ad1c79b66cafbd9a9d3d6b069794e743
2020-11-09 14:14:28 -08:00
2c55426610 Renamed a TensorListMetaData property. Cleaned up a test (#46662)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46662

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24453346

Pulled By: izdeby

fbshipit-source-id: f88ac21708befa2e8f3edeffe5805b69a4634d12
2020-11-04 12:01:28 -08:00
2652f2e334 Optimize arguments checks (#46661)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46661

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24453342

Pulled By: izdeby

fbshipit-source-id: 26866fdbc9dc2b5410b3b728b175a171cc6a4521
2020-11-03 17:43:10 -08:00
3ea26b1424 [WIP] Push rocm to slow path for foreach APIs (#46733)
Summary:
Move ROCM to a slow path for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46733

Reviewed By: ngimel

Differential Revision: D24485012

Pulled By: izdeby

fbshipit-source-id: f0f4227cc594d8a87d44008cd5e27ebe100b6b22
2020-10-23 10:33:41 -07:00
c57c560744 Revert "Push rocm to slow path (#46216)" (#46728)
Summary:
This reverts commit bc1ce584512a860c15cb991460d8c98debd62b26.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46728

Reviewed By: cpuhrsch

Differential Revision: D24482783

Pulled By: izdeby

fbshipit-source-id: 619b710a8e790b9878e7317f672b4947e7b88145
2020-10-22 12:04:29 -07:00
bc1ce58451 Push rocm to slow path (#46216)
Summary:
Push rocm to slow path

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46216

Reviewed By: bwasti

Differential Revision: D24263731

Pulled By: izdeby

fbshipit-source-id: 98ede2478b8f075ceed44a9e4f2aa292f523b8e2
2020-10-22 09:31:01 -07:00
e7564b076c Refactor scalar list APIs to use overloads (#45673)
Summary:
Refactor foreach APIs to use overloads for scalar list inputs.
Tested via unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45673

Reviewed By: heitorschueroff

Differential Revision: D24053424

Pulled By: izdeby

fbshipit-source-id: 35976cc50b4acfe228a32ed26cede579d5621cde
2020-10-19 09:28:49 -07:00
8a074af929 Added scalar lists APIs for addcdiv and addcmul (#45932)
Summary:
1) Added new APIs:
 _foreach_addcdiv(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, float[] scalars)
 _foreach_addcdiv_(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, float[] scalars)
 _foreach_addcmul(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, float[] scalars)
 _foreach_addcmul_(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, float[] scalars)

2) Updated optimizers to use new APIs

Tested via unit tests
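Semantically, each new overload applies the pointwise op per list slot with its own scalar. A sketch with Python floats standing in for tensors (hypothetical helper names):

```python
def foreach_addcdiv(self_list, tensor1, tensor2, scalars):
    # out[i] = self[i] + scalars[i] * tensor1[i] / tensor2[i]
    return [s + c * a / b
            for s, a, b, c in zip(self_list, tensor1, tensor2, scalars)]

def foreach_addcmul(self_list, tensor1, tensor2, scalars):
    # out[i] = self[i] + scalars[i] * tensor1[i] * tensor2[i]
    return [s + c * a * b
            for s, a, b, c in zip(self_list, tensor1, tensor2, scalars)]
```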

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45932

Reviewed By: navahgar

Differential Revision: D24150306

Pulled By: izdeby

fbshipit-source-id: c2e65dedc95d9d81a2fdd116e41df0accb0b6f26
2020-10-14 08:12:37 -07:00
1a57b390e8 Add torch._foreach_maximum(TensorList, TensorList) & torch._foreach_minimum(TensorList, TensorList) APIs (#45692)
Summary:
- Adding torch._foreach_maximum(TensorList, TensorList) API
- Adding torch._foreach_minimum(TensorList, TensorList) API
- Updated Adam/AdamW optimizers

Tested via unit tests
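Both ops are elementwise over paired lists; a sketch with Python floats standing in for tensors (presumably this is what the amsgrad path of Adam needs for its running-maximum update):

```python
def foreach_maximum(tl1, tl2):
    # out[i] = max(tl1[i], tl2[i]), pairing the two lists slot by slot
    return [max(a, b) for a, b in zip(tl1, tl2)]

def foreach_minimum(tl1, tl2):
    # out[i] = min(tl1[i], tl2[i])
    return [min(a, b) for a, b in zip(tl1, tl2)]
```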

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45692

Reviewed By: anjali411

Differential Revision: D24142464

Pulled By: izdeby

fbshipit-source-id: 6a4fc343a1613cb1e26c8398450ac9cea0a2eb51
2020-10-13 09:22:30 -07:00
a69a78daa2 Use smaller N to speed up TestForeach (#45785)
Summary:
Between September 25 and September 27, approximately half an hour was added to the running time of `pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test`. Judging from the CircleCI data, it looks like the majority of the new time was added by the following PRs:

- https://github.com/pytorch/pytorch/issues/44550
- https://github.com/pytorch/pytorch/issues/45298

I'm not sure what to do about https://github.com/pytorch/pytorch/issues/44550, but it looks like https://github.com/pytorch/pytorch/issues/45298 increased the `N` for `TestForeach` from just 20 to include both 30 and 300. This PR would remove the 300, decreasing the test time by a couple orders of magnitude (at least when running it on my devserver), from over ten minutes to just a few seconds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45785

Reviewed By: malfet

Differential Revision: D24094782

Pulled By: samestep

fbshipit-source-id: 2476cee9d513b2b07bc384de751e08d0e5d8b5e7
2020-10-06 13:29:04 -07:00
72bc3d9de4 Use MTA for amp grad unscaling, enforce op math type in MTA functors, and allow op lambdas (#44778)
Summary:
Amp gradient unscaling is a great use case for multi-tensor apply (in fact it's the first case I wrote it for). This PR adds an MTA unscale+infcheck functor. Really excited to have it for `torch.cuda.amp`. izdeby, your interface was clean and straightforward to use, great work!

Labeled as bc-breaking because the native_functions.yaml exposure of unscale+infcheck changes from [`_amp_non_finite_check_and_unscale_` to `_amp_foreach_non_finite_check_and_unscale_`]( https://github.com/pytorch/pytorch/pull/44778/files#diff-f1e4b2c15de770d978d0eb77b53a4077L6289-L6293).

The PR also modifies Unary/Binary/Pointwise Functors to
- do ops' internal math in FP32 for FP16 or bfloat16 inputs, which improves precision ([and throughput, on some architectures!](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions)) and has no downside for the ops we care about.
- accept an instantiated op functor rather than an op functor template (`template<class> class Op`).  This allows calling code to pass lambdas.

Open question:  As written now, the PR has MTA Functors take care of pre- and post-casting FP16/bfloat16 inputs to FP32 before running the ops.  However, alternatively, the pre- and post-math casting could be deferred/written into the ops themselves, which gives them a bit more control.  I can easily rewrite it that way if you prefer.
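The core of the unscale+infcheck step can be sketched with plain Python floats standing in for tensor elements (the real version is a fused multi-tensor-apply functor that does its math in FP32 and writes the flag into a device tensor):

```python
import math

def unscale_with_inf_check(grads, inv_scale):
    """Scale every gradient by inv_scale and flag non-finite results."""
    found_inf = 0.0  # stands in for the found_inf flag tensor
    out = []
    for g in grads:
        v = g * inv_scale
        if math.isinf(v) or math.isnan(v):
            found_inf = 1.0
        out.append(v)
    return out, found_inf
```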

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44778

Reviewed By: gchanan

Differential Revision: D23944102

Pulled By: izdeby

fbshipit-source-id: 22b25ccad5f69b413c77afe8733fa9cacc8e766d
2020-10-01 07:51:16 -07:00
d5748d9a1a Enable binary ops with Scalar Lists for foreach APIs (#45298)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45298

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23931986

Pulled By: izdeby

fbshipit-source-id: 281267cd6f90d57a169af89f9f10b0f4fcab47e3
2020-09-25 12:58:34 -07:00
26001a2334 Revert D23753711: [pytorch][PR] Add foreach APIs for binary ops with ScalarList
Test Plan: revert-hammer

Differential Revision:
D23753711 (71d1b5b0e2)

Original commit changeset: bf3e8c54bc07

fbshipit-source-id: 192692e0d3fff4cade9983db0a1760fedfc9674c
2020-09-24 11:55:49 -07:00
71d1b5b0e2 Add foreach APIs for binary ops with ScalarList (#44743)
Summary:
In this PR:
1) Added binary operations with ScalarLists.
2) Fixed _foreach_div(...) bug in native_functions
3) Covered all possible cases with scalars and scalar lists in tests
4) [minor] fixed bug in native_functions by adding "use_c10_dispatcher: full" to all _foreach functions

Tested via unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44743

Reviewed By: bwasti, malfet

Differential Revision: D23753711

Pulled By: izdeby

fbshipit-source-id: bf3e8c54bc07867e8f6e82b5d3d35ff8e99b5a0a
2020-09-24 08:30:42 -07:00
686e281bcf Updates div to perform true division (#42907)
Summary:
This PR:

- updates div to perform true division
- makes torch.true_divide an alias of torch.div

This follows on work in previous PyTorch releases that first deprecated div performing "integer" or "floor" division, then prevented it by throwing a runtime error.
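The same distinction exists at the Python level, where `/` is always true division and `//` is floor division; after this change `torch.div` matches the `/` behavior (the sketch below uses plain Python numbers, not tensors):

```python
# True division always yields a float, even for integer operands.
assert 5 / 2 == 2.5
assert type(10 / 5) is float  # a float, not the int 2

# The old "integer"/"floor" behavior corresponds to a different operator.
assert 5 // 2 == 2
assert -5 // 2 == -3  # floors toward negative infinity, not toward zero
```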

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42907

Reviewed By: ngimel

Differential Revision: D23622114

Pulled By: mruberry

fbshipit-source-id: 414c7e3c1a662a6c3c731ad99cc942507d843927
2020-09-14 15:50:38 -07:00
40d138f7c1 Added alpha overloads for add/sub ops with lists (#43413)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43413
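Presumably these overloads mirror `Tensor.add`'s `alpha` argument; a sketch of the intended semantics with Python floats standing in for tensors (hypothetical helper names):

```python
def foreach_add(tl1, tl2, alpha=1.0):
    # out[i] = tl1[i] + alpha * tl2[i]
    return [a + alpha * b for a, b in zip(tl1, tl2)]

def foreach_sub(tl1, tl2, alpha=1.0):
    # out[i] = tl1[i] - alpha * tl2[i]
    return [a - alpha * b for a, b in zip(tl1, tl2)]
```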

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D23331896

Pulled By: izdeby

fbshipit-source-id: 2e7484339fec533e21224f18979fddbeca649d2c
2020-09-08 17:02:08 -07:00
63d62d3e44 Skips test_addcmul_cuda if using ROCm (#44304)
Summary:
This test is failing consistently on linux-bionic-rocm3.7-py3.6-test2. Relevant log snippet:

```
03:43:11 FAIL: test_addcmul_cuda_float16 (__main__.TestForeachCUDA)
03:43:11 ----------------------------------------------------------------------
03:43:11 Traceback (most recent call last):
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 818, in wrapper
03:43:11     method(*args, **kwargs)
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 258, in instantiated_test
03:43:11     result = test(self, *args)
03:43:11   File "test_foreach.py", line 83, in test_addcmul
03:43:11     self._test_pointwise_op(device, dtype, torch._foreach_addcmul, torch._foreach_addcmul_, torch.addcmul)
03:43:11   File "test_foreach.py", line 58, in _test_pointwise_op
03:43:11     self.assertEqual(tensors, expected)
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1153, in assertEqual
03:43:11     exact_dtype=exact_dtype, exact_device=exact_device)
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1127, in assertEqual
03:43:11     self.assertTrue(result, msg=msg)
03:43:11 AssertionError: False is not true : Tensors failed to compare as equal! With rtol=0.001 and atol=1e-05, found 10 element(s) (out of 400) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.00048828125 (-0.46484375 vs. -0.46533203125), which occurred at index (11, 18).
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44304

Reviewed By: malfet, izdeby

Differential Revision: D23578316

Pulled By: mruberry

fbshipit-source-id: 558eecf42677383e7deaa4961e12ef990ffbe28c
2020-09-08 13:14:25 -07:00
cce5982c4c Add unary ops: exp and sqrt (#42537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42537

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with many small feature tensors: launching a large number of kernels slows down the whole process, so we need to reduce the number of kernels we launch.
As a reference point, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
To track progress, we will take PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- Lists can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device, and size.

**Broadcasting**
At this point we don't support broadcasting.

**What are the 'fast' and 'slow' routes?**
In some cases we can't process an op with the fast list CUDA kernel, but we can still fall back to a regular for-loop in which the op is applied to each tensor individually through the usual dispatch mechanism. A few checks decide whether the op takes the 'fast' or the 'slow' path.
To take the fast route:
- All tensors must have strided layout.
- All tensors must be dense and must not have overlapping memory.
- The resulting tensor type must be the same.
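
The checks above can be sketched in Python over plain metadata dicts (illustrative only; the real logic lives in C++ and inspects actual tensors):

```python
def can_use_fast_route(metas):
    """Decide fast vs. slow path from per-tensor metadata.

    Each entry is a dict such as
    {"layout": "strided", "dense": True, "overlapping": False, "dtype": "float32"}.
    """
    if not metas:
        return False  # empty lists never take the fast path
    result_dtype = metas[0]["dtype"]
    for m in metas:
        if m["layout"] != "strided":
            return False
        if not m["dense"] or m["overlapping"]:
            return False
        if m["dtype"] != result_dtype:
            return False
    return True
```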

----------------
**In this PR**
Adding APIs:
```
torch._foreach_exp(TensorList tl1)
torch._foreach_exp_(TensorList tl1)
torch._foreach_sqrt(TensorList tl1)
torch._foreach_sqrt_(TensorList tl1)
```

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors

**Plan for the next PRs**
1. APIs
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D23331889

Pulled By: izdeby

fbshipit-source-id: 8b04673b8412957472ed56361954ca3884eb9376
2020-09-07 19:57:34 -07:00
10dd25dcd1 Add binary ops for _foreach APIs (#42536)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42536

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with many small feature tensors: launching a large number of kernels slows down the whole process, so we need to reduce the number of kernels we launch.
As a reference point, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
To track progress, we will take PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- Lists can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device, and size.

**Broadcasting**
At this point we don't support broadcasting.

**What are the 'fast' and 'slow' routes?**
In some cases we can't process an op with the fast list CUDA kernel, but we can still fall back to a regular for-loop in which the op is applied to each tensor individually through the usual dispatch mechanism. A few checks decide whether the op takes the 'fast' or the 'slow' path.
To take the fast route:
- All tensors must have strided layout.
- All tensors must be dense and must not have overlapping memory.
- The resulting tensor type must be the same.

----------------
**In this PR**
Adding APIs:
```
torch._foreach_sub(TensorList tl1, TensorList tl2)
torch._foreach_sub_(TensorList self, TensorList tl2)
torch._foreach_mul(TensorList tl1, TensorList tl2)
torch._foreach_mul_(TensorList self, TensorList tl2)
torch._foreach_div(TensorList tl1, TensorList tl2)
torch._foreach_div_(TensorList self, TensorList tl2)

torch._foreach_sub(TensorList tl1, Scalar scalar)
torch._foreach_sub_(TensorList self, Scalar scalar)
torch._foreach_mul(TensorList tl1, Scalar scalar)
torch._foreach_mul_(TensorList self, Scalar scalar)
torch._foreach_div(TensorList tl1, Scalar scalar)
torch._foreach_div(TensorList self, Scalar scalar)
```
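
The list-with-list and list-with-scalar overloads differ only in where the right-hand operand comes from; a slow-path sketch over Python floats standing in for tensors:

```python
def foreach_mul(tensors, other):
    if isinstance(other, list):
        # TensorList overload: pair tensors up elementwise.
        return [a * b for a, b in zip(tensors, other)]
    # Scalar overload: the same scalar is applied to every tensor.
    return [a * other for a in tensors]

def foreach_div(tensors, other):
    if isinstance(other, list):
        return [a / b for a, b in zip(tensors, other)]
    return [a / other for a in tensors]
```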

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors

**Plan for the next PRs**
1. APIs
- Unary Ops for list
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D23331891

Pulled By: izdeby

fbshipit-source-id: 18c5937287e33e825b2e391e41864dd64e226f19
2020-09-07 10:29:32 -07:00
2f044d4ee5 Fix CI build (#44068)
Summary:
Some of our machines have only 1 device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44068

Reviewed By: wanchaol

Differential Revision: D23485730

Pulled By: izdeby

fbshipit-source-id: df6bc0aba18feefc50c56a8f376103352fa2a2ea
2020-09-02 17:09:30 -07:00
297c938729 Add _foreach_add(TensorList tl1, TensorList tl2) and _foreach_add_(TensorList tl1, TensorList tl2) APIs (#42533)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42533

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with many small feature tensors: launching a large number of kernels slows down the whole process, so we need to reduce the number of kernels we launch.
As a reference point, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
To track progress, we will take PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- Lists can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device, and size.

**Broadcasting**
At this point we don't support broadcasting.

**What are the 'fast' and 'slow' routes?**
In some cases we can't process an op with the fast list CUDA kernel, but we can still fall back to a regular for-loop in which the op is applied to each tensor individually through the usual dispatch mechanism. A few checks decide whether the op takes the 'fast' or the 'slow' path.
To take the fast route:
- All tensors must have strided layout.
- All tensors must be dense and must not have overlapping memory.
- The resulting tensor type must be the same.

----------------
**In this PR**
- Adding a `_foreach_add(TensorList tl1, TensorList tl2)` API
- Adding a `_foreach_add_(TensorList tl1, TensorList tl2)` API

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists

**Plan for the next PRs**
1. APIs
- Binary Ops for list with Scalar
- Binary Ops for list with list
- Unary Ops for list
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23331894

Pulled By: izdeby

fbshipit-source-id: 876dd1bc82750f609b9e3ba23c8cad94d8d6041c
2020-09-02 12:18:28 -07:00
4cb8d306e6 Add _foreach_add_(TensorList tensors, Scalar scalar) API (#42531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42531

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with many small feature tensors: launching a large number of kernels slows down the whole process, so we need to reduce the number of kernels we launch.
As a reference point, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
To track progress, we will take PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- Lists can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device, and size.

**Broadcasting**
At this point we don't support broadcasting.

**What are the 'fast' and 'slow' routes?**
In some cases we can't process an op with the fast list CUDA kernel, but we can still fall back to a regular for-loop in which the op is applied to each tensor individually through the usual dispatch mechanism. A few checks decide whether the op takes the 'fast' or the 'slow' path.
To take the fast route:
- All tensors must have strided layout.
- All tensors must be dense and must not have overlapping memory.
- The resulting tensor type must be the same.

---------------
**In this PR**
- Adding a `std::vector<Tensor> _foreach_add_(TensorList tensors, Scalar scalar)` API
- Resolving some additional comments from previous [PR](https://github.com/pytorch/pytorch/pull/41554).

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists

**Plan for the next PRs**
1. APIs
- Binary Ops for list with Scalar
- Binary Ops for list with list
- Unary Ops for list
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23331892

Pulled By: izdeby

fbshipit-source-id: c585b72e1e87f6f273f904f75445618915665c4c
2020-08-28 14:34:46 -07:00
e995c3d21e Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar) (#41554)
Summary:
Initial PR for the Tensor List functionality.

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with many small feature tensors: launching a large number of kernels slows down the whole process, so we need to reduce the number of kernels we launch.
As a reference point, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
To track progress, we will take PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**In this PR**
- Adding `multi_tensor_apply` mechanism which will help to efficiently apply passed functor on a given list of tensors on CUDA.
- Adding a first private API - `std::vector<Tensor> _foreach_add(TensorList tensors, Scalar scalar)`
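
The idea behind `multi_tensor_apply` is to split every tensor in the list into fixed-size chunks and process all the chunks with as few kernel launches as possible. A CPU stand-in over Python float lists (chunk size and helper names are illustrative):

```python
CHUNK_SIZE = 4  # the CUDA kernel uses a much larger chunk size

def multi_tensor_apply(tensors, functor):
    """Apply `functor` in place to every chunk of every 'tensor' (a float list)."""
    for t in tensors:
        for start in range(0, len(t), CHUNK_SIZE):
            # On GPU, many (tensor, chunk) pairs are packed into the kernel's
            # arguments and handled by one launch; here we simply iterate.
            t[start:start + CHUNK_SIZE] = [
                functor(x) for x in t[start:start + CHUNK_SIZE]
            ]

def foreach_add_scalar(tensors, scalar):
    multi_tensor_apply(tensors, lambda x: x + scalar)
    return tensors
```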

**Tests**
Tested via unit tests

**Plan for the next PRs**

1. Cover these ops with `multi_tensor_apply` support
- exponent
- division
- mul_
- add_
- addcmul_
- addcdiv_
- sqrt

2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41554

Reviewed By: cpuhrsch

Differential Revision: D22829724

Pulled By: izdeby

fbshipit-source-id: 47febdbf7845cf931958a638567b7428a24782b1
2020-08-04 15:01:09 -07:00