Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47810
`bindSymbolicShapes` wasn't checking device or dtype at all, so it wasn't correct. It also isn't being used anywhere (num_profiles is always 1 and we don't use symbolic shapes). We shouldn't have it on until we are actually using symoblic shapes.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D25286214
Pulled By: eellison
fbshipit-source-id: 10fb175d0c75bd0159fb63aafc3b59cc5fd6c5af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47550
I saw over 5% time spent in RecordFunction's ctor during one
of our framework overhead benchmarks in `perf`. Inspecting assembly,
it looks like we just create a lot of RecordFunctions and the
constructor has to initialize a relatively large number of member
variables.
This diff takes advantage of the observation that RecordFunction does
nothing most of the time by moving its state onto the heap and only
allocating it if needed. It does add the requirement that profiling is
actually active to use RecordFunction accessors, which I hope won't be
a problem.
ghstack-source-id: 117498489
Test Plan: Run framework overhead benchmarks. Savings ranging from 3% (InPlace_ndim_1) to 7.5% (empty_ndim_3) wall time.
Reviewed By: ilia-cher
Differential Revision: D24812213
fbshipit-source-id: 823a1e2ca573d9a8d7c5b7bb3972987faaacd11a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47549
In preparation for moving state onto the heap.
ghstack-source-id: 117027862
Test Plan: CI
Reviewed By: ilia-cher
Differential Revision: D24812214
fbshipit-source-id: 1455c2782b66f6a59c4d45ba58e1c4c92402a323
Summary:
By default, TorchScript execution is single threaded and uses the caller's thread pool. For the use case of distributed inference, we hope there is a way to customize the behavior where the interpreter in torch script can be executed in other places. This diff allows an explicit taskLauncher for torchscript interpreter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46865
Test Plan:
unit test is passed.
fbshipit-source-id: 1d7b003926c0d1f8facc53206efb960cff8897ac
Fixes #{issue number}
Reviewed By: houseroad
Differential Revision: D24616102
Pulled By: garroud
fbshipit-source-id: 79202b62f92d0b0baf72e4bf7aa3f05e0da91d59
Summary:
This diff restores previous behavior of silently allow overflowing when inserting instructions. The behavior was changed recently in https://github.com/pytorch/pytorch/issues/45382. But it started to break some existing use cases that haver overflow problems.
Restoring original behavior but throw a warning to to unblock existing use cases where overflowing happens.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46369
Reviewed By: kwanmacher, wanchaol, fbhuba
Differential Revision: D24324345
Pulled By: gmagogsfm
fbshipit-source-id: 1c0fac421d4de38f070e21059bbdc1b788575bdf
Summary:
* Add a pass at end of runCleanupPasses to annotate `aten::warn` so that each has its unique id
* Enhanced interpreter so that it tracks which `aten::warn` has been executed before and skip them
* Improved insertInstruction so that it correctly checks for overflow
Fixes https://github.com/pytorch/pytorch/issues/45108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45382
Reviewed By: mrshenli
Differential Revision: D24060677
Pulled By: gmagogsfm
fbshipit-source-id: 9221bc55b9ce36b374bdf614da3fe47496b481c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43684
This PR attempts to address #42560 by capturing the appropriate
exception_ptr in the autograd engine and passing it over to the Future.
As part of this change, there is a significant change the Future API where we
now only accept an exception_ptr as part of setError.
For the example in #42560, the exception trace would now look like:
```
> Traceback (most recent call last):
> File "test_autograd.py", line 6914, in test_preserve_backtrace
> Foo.apply(t).sum().backward()
> File "torch/tensor.py", line 214, in backward
> torch.autograd.backward(self, gradient, retain_graph, create_graph)
> File "torch/autograd/__init__.py", line 127, in backward
> allow_unreachable=True) # allow_unreachable flag
> File "torch/autograd/function.py", line 87, in apply
> return self._forward_cls.backward(self, *args)
> File "test_autograd.py", line 6910, in backward
> raise ValueError("something")
> ValueError: something
```
ghstack-source-id: 111109637
Test Plan: waitforbuildbot
Reviewed By: albanD
Differential Revision: D23365408
fbshipit-source-id: 1470c4776ec8053ea92a6ee1663460a3bae6edc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43635
Intern the symbol, no functional changes. Aliasing need to be looked at but this should be done in a separate PR; this PR is just changing the symbol.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23358806
Pulled By: eellison
fbshipit-source-id: f18bcd142a0daf514136f019ae607e4c3f45d9f8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43633
In the backward graph, _grad_sum_to_size is inserted whenever a possibly broadcasting op is called:"
`"aten::_grad_sum_to_size(Tensor(a) self, int[]? size) -> Tensor(a)"`
If a broadcast occurred, a sum is called, otherwise the second input is None and it is a no-op. Most of the time, it's a no-op (in the fast RNNs benchmark > 90% of the time).
We can get rid of this op by profiling the optionality of the second input. I added `prim::profile_optional` to do this, which counts the number of times it saw a None value and the number of times it saw a value present. When specializing the backward graph, we insert checks for values we profiled as None, and in the optimized block can remove the grad_sum_to_size calls that use those values.
In the future we may revisit this when NNC supports reductions and we want to replace grad_sum_to_size with sums as well, but I think this is worth landing now.
Test Plan: Imported from OSS
Reviewed By: bwasti, ZolotukhinM
Differential Revision: D23358809
Pulled By: eellison
fbshipit-source-id: a30a148ca581370789d57ba082d23cbf7ef2cd4d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37034
c10 takes a Stack* in boxed functions while JIT took Stack&.
c10 doesn't return anything while JIT returns an int which is always zero.
This changes JIT to follow the c10 behavior.
ghstack-source-id: 106834069
Test Plan: unit tests
Differential Revision: D20567950
fbshipit-source-id: 1a7aea291023afc52ae706957e9a5ca576fbb53b
Summary:
**Summary**
This commit adds support for with statements to PyTorch JIT. Each
of the with items in a with statement is represented in the JIT IR
as a pair of `prim::Enter` and `prim::Exit` nodes that call the
`__enter__` and `__exit__` methods defined on the context manager objects
returned by the expressions in the with item.
**Testing**
This commit adds unit tests for with statements with named with items,
nameless with items, and with statements that encounter exceptions.
```
$ python test/test_jit.py TestWith.test_with_as
Fail to import hypothesis in common_utils, tests are not derandomized
.
----------------------------------------------------------------------
Ran 1 test in 0.430s
OK
```
```
$ python test/test_jit.py TestWith.test_with_no_as
Fail to import hypothesis in common_utils, tests are not derandomized
.
----------------------------------------------------------------------
Ran 1 test in 0.264s
OK
```
```
$ python test/test_jit.py TestWith.test_with_exceptions
Fail to import hypothesis in common_utils, tests are not derandomized
Couldn't download test skip set, leaving all tests enabled...
.
----------------------------------------------------------------------
Ran 1 test in 1.053s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34705
Differential Revision: D22095945
Pulled By: SplitInfinity
fbshipit-source-id: f661565a834786725259b8ea014b4d7532f9419d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38453
Two fixes:
- RecordFunction in JIT interpreter should exist during the execution
of the frame, and not just when we enter the frame
- When creating a JIT continuation in wait instruction, we'd want to
preserve the original thread local context, right now when we resume
execution in continuation we preserve the thread local state of the
thread that set future value (i.e. executed a forked task)
Test Plan: unittest, CI
Reviewed By: ngimel
Differential Revision: D21565959
Pulled By: ilia-cher
fbshipit-source-id: 206b98e3bfb0052fc8e4031da778e372cc71afc1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37548
Moving RecordFunction from torch::autograd::profiler into at namespace
Test Plan:
CI
Imported from OSS
Differential Revision: D21315852
fbshipit-source-id: 4a4dbabf116c162f9aef0da8606590ec3f3847aa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37472
Our convention is for `findX` to return an optional version and `getX`
to assert that the X is there. Fix up `getMethod` to be consistent with
this convention.
Test Plan: Imported from OSS
Differential Revision: D21297543
Pulled By: suo
fbshipit-source-id: b40f56231cc8183e61bbb01fe5c0c113bcb6464d
Summary:
Today in PyTorch, warnings triggered in C++ are printed to Python users like this:
`../aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.`
This may be unhelpful to Python users, who have complained it's difficult to relate these messages back to their programs. After this PR, warnings that go through the PyWarningHandler and allow it to add context print like this:
```
test/test_torch.py:16463: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead. (Triggered internally at ../aten/src/ATen/native/BinaryOps.cpp:81.)
cpu_result = getattr(cpu_tensor, op_str)(*cpu_args)
```
This relates the warning back to the user's program. The information about the cpp file and line number is preserved in the body of the warning message.
Some warnings, like those generated in the JIT, already account for a user's Python context, and so they specify that they should be printed verbatim and are unaffected by this change. Warnings originating in Python and warnings that go through c10's warning handler, which prints to cerr, are also unaffected.
A test is added to test_torch.py for this behavior. The test relies on uint8 indexing being deprecated and its warning originating from its current header file, which is an unfortunate dependency. We could implement a `torch.warn` function, instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36052
Differential Revision: D20887740
Pulled By: mruberry
fbshipit-source-id: d3515c6658a387acb7fccaf83f23dbb452f02847
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36179
This ensures normal optimization passes run for forked functions.
Test Plan: Imported from OSS
Differential Revision: D20907253
Pulled By: zdevito
fbshipit-source-id: 72cfa9f82643214b1ef3de24697d163a9a24b29c
Summary:
This PR introduces frame ids that will allow us to associate profiling information with its corresponding run.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33788
Differential Revision: D20164897
Pulled By: Krovatkin
fbshipit-source-id: 8172ff9f4d188b339e2ff98a80bbe4a2b306a8aa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35849
This change harmonizes some aspects of the api.
- torch::utils::Future callback should have no args, like ivalue::future.
Many of the lines of this change are related to fixing that up downstream.
No args makes the api simpler to use, particularly since many/most of the
downstream use cases ignore the passed-in args. It's simple enough to
appropriately capture the future in the lambda if necessary.
- Add error/hasError methods to ivalue::Future.
- Use c10::optional underneath for error to ivalue::Future.
- Change markCompleted(error) to setError(error) to ivalue::Future.
- Add setValue(FutureError) version to torch::utils::Future
ghstack-source-id: 101684435
Test Plan: buck test mode/dev-nosan caffe2/test/...
Differential Revision: D20803251
fbshipit-source-id: e3d925287bd9a80d649843eef5f270163f448269
Summary:
**Summary:** This PR contains the infrastructure of a new CUDA fuser. This CUDA fuser is based on many of the same principles of TensorExpressions and Halide, however the implementation is ground up. The fusion pass itself is similar to the default CUDA fuser, however, it has undergone some refactoring and is using the new code generation infrastructure. For those who are interested in how the code generation in this PR works, I would recommend reviewing _test/cpp/jit/test_gpu_fusion.cpp_ as well as the long comment section at the beginning of _torch/csrc/jit/codegen/cuda/transform_replay.h_ One of the largest differences between our approach and that of TVM/Halide, is the concept of "TensorView". TensorView from a high level should be thought of similarly to how we think of working with Tensors in PyTorch. It's an N-D object which can undergo transformations that change its dimensionality. Dimensionality changes are done through the operations split/merge/reorder/computeAt. These transformations are similar to split/fuse/reorder/compute_at of TVM, they modify how a tensor is iterated over to generate GPU code. Interestingly, in our scheme these transformations are applied to tensors and only impact how that tensor is generated.
**Warning:** This PR is purposefully not feature complete with the current fuser. We wanted to separate out the infrastructure from the fusion capabilities. Once in, smaller incremental PRs will be submitted to expand capabilities of the fuser.
**Short term goals:**
Parity with current CUDA fuser (including performance):
- Dynamic shapes (no recompilation)
- Implicit handling of braodcast (broadcasted tensors are treated as tensors of the braodcasted size in the generated code)
- Dropout
**Mid-term goals:**
- Transposes fused with pointwise operations where transpose involves only 2 axes (across the fused operation).
- 1-D reductions fused with pointwise operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34785
Reviewed By: ZolotukhinM
Differential Revision: D20650977
Pulled By: soumith
fbshipit-source-id: ee39c95a880e1b9822e874ed4cc180971572bf63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35523
In this PR we extend ThreadLocalState to cover dispatch keys and
ThreadLocalDebugInfo and move it from JIT interpreter down to
thread management (at::launch) and autograd (backward threads) code
Test Plan: unit tests (CI)
Reviewed By: dzhulgakov
Differential Revision: D20615714
fbshipit-source-id: 16a9fc96a25cb6c2629230b1187fbf78786ac565
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34710
Extending RecordFunction API to support new recording scopes (such as TorchScript functions), as well as giving more flexibility to set sampling rate.
Test Plan: unit test (test_misc.cpp/testRecordFunction)
Reviewed By: gdankel, dzhulgakov
Differential Revision: D20158523
fbshipit-source-id: a9e0819d21cc06f4952d92d43246587c36137582
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35115
This commit runs the newly added tools/clang_format.py on the JIT
codebase and includes all of the formatting changes thus produced.
Testing:
Ran the script, CI.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D20568523
Pulled By: SplitInfinity
fbshipit-source-id: e09bdb982ccf090eecfb7c7b461b8d0681eef82b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34360
The distributed autograd context sets up a thread local context id
which is used to perform appropriate book keeping and autograd recording of RPC
functions in the forward pass.
However, if we use torch.jit._fork within the distributed autograd context, the
code executed within torch.jit._fork will lose this context since it is run in
a separate JIT thread and the thread local is not set in that thread.
To fix this problem, we pass in the distributed autograd context to
torch.jit._fork similar to what we did in
https://github.com/pytorch/pytorch/pull/16101.
ghstack-source-id: 100445465
Test Plan: waitforbuildbot
Differential Revision: D20301352
fbshipit-source-id: aa3fffe69c2b40722c66213351a4e0d77484a621
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34623
The bandaid of "AT_WARN" keeps introducing new warnings. Let's get rid
of it entirely.
Close#34502
Test Plan: Imported from OSS
Differential Revision: D20420112
Pulled By: albanD
fbshipit-source-id: 7160c113cb4deb2d2f50a375356f423fe5e86f50
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33921
**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.intern.facebook.com/intern/diff/D20153092/)!
Test Plan: Imported from OSS
Differential Revision: D20177227
Pulled By: jamesr66a
fbshipit-source-id: 87f3e484c4f873d60f76f50f6789c1b4a73bdfde
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33834
This changes how we report Tracebacks to make them more clear when
there are both serialized and non-serialized ranges. It now looks like:
```
Traceback (most recent call last):
File "foo.py", line 25, in <module>
s2(a, b)
File "/scratch/zdevito/pytorch/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__.py", line 7, in forward
x: Tensor,
y: Tensor) -> Tensor:
return (self).bar(x, y, )
~~~~~~~~~ <--- HERE
def bar(self: __torch__.Moo,
x: Tensor,
File "code/__torch__.py", line 11, in bar
x: Tensor,
y: Tensor) -> Tensor:
_0 = (self).baz(x, y, )
~~~~~~~~~ <--- HERE
_1 = torch.ones([3], dtype=None, layout=None, device=None, pin_memory=None)
return torch.add(_0, _1, alpha=1)
File "code/__torch__.py", line 17, in baz
x: Tensor,
y: Tensor) -> Tensor:
return torch.add(x, y, alpha=1)
~~~~~~~~~ <--- HERE
Traceback of TorchScript, original code (most recent call last):
File "foo.py", line 11, in forward
def forward(self, x, y):
return self.bar(x, y)
~~~~~~~~ <--- HERE
File "foo.py", line 9, in bar
def bar(self, x, y):
return self.baz(x, y) + torch.ones(3)
~~~~~~~~ <--- HERE
File "foo.py", line 7, in baz
def baz(self, x, y):
return x + y
~~~~~ <--- HERE
RuntimeError: The size of tensor a (4) must match the size of tensor b (5) at non-singleton dimension 1
```
It follows Python convension of having the most important information last
and reading from the bottom up.
Changes:
* Moved the error message to the end, to copy Python
* Report original traceback separate from serialized traceback
* Make sure root functions have names in the interpreter trace.
Test Plan: Imported from OSS
Differential Revision: D20126136
Pulled By: zdevito
fbshipit-source-id: fd01f9985e5d74e04c4d064c02e8bc320f4fac13