Commit Graph

57810 Commits

Author SHA1 Message Date
dba9487324 Add helpful pretty-printing summaries to torch for lldb debugging (#97101)
# Summary
Add support for pretty printing of tensors when using lldb, similar to what is currently available for gdb.

<img width="772" alt="Screenshot 2023-03-18 at 6 20 34 PM" src="https://user-images.githubusercontent.com/32754868/226148687-b4e6cfe1-8be1-4657-9ebc-d134f697dd37.png">

<img width="254" alt="Screenshot 2023-03-18 at 6 20 43 PM" src="https://user-images.githubusercontent.com/32754868/226148690-caca6f76-d873-419e-b5e4-6bb403b3d179.png">

I changed it to override the variable formatting, so instead of having to call a separate command you can just do `print <tensor>`.

I also added one for sizes:
<img width="309" alt="Screenshot 2023-03-19 at 1 05 49 PM" src="https://user-images.githubusercontent.com/32754868/226206458-e3f0111b-6a97-4d75-8125-48455aa2cf43.png">

Last one:
<img width="815" alt="Screenshot 2023-03-19 at 1 39 23 PM" src="https://user-images.githubusercontent.com/32754868/226207687-20bd014f-9e0e-4c01-b2c8-190b7365aa70.png">

If you use the CodeLLDB extension, be sure to add
    `"lldb.launch.initCommands": ["command source ${env:HOME}/.lldbinit"]`
    to your settings.json.
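
For reference, a minimal sketch of how an LLDB summary provider of this kind can be wired up from a Python module sourced in `.lldbinit`. The module and function names are hypothetical, and `torch::gdb::tensor_repr` (the helper the gdb printer uses) is assumed to be reachable from lldb; this is not necessarily what the PR installs:

```python
# torch_lldb.py -- a sketch of an LLDB summary provider for at::Tensor.
import lldb


def tensor_summary(valobj, internal_dict):
    # Evaluate a repr helper inside the debuggee to get a readable summary.
    frame = valobj.GetFrame()
    result = frame.EvaluateExpression(
        "torch::gdb::tensor_repr(%s)" % valobj.GetName()
    )
    return result.GetSummary() or result.GetValue() or "<unavailable>"


def __lldb_init_module(debugger, internal_dict):
    # Override the default variable formatting so a plain `print <tensor>`
    # uses the summary, instead of requiring a separate command.
    debugger.HandleCommand(
        'type summary add -F torch_lldb.tensor_summary -x "^at::Tensor$"'
    )
```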

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97101
Approved by: https://github.com/ngimel
2023-03-20 01:27:44 +00:00
5471621497 [BE] Remove unnecessary dict comprehensions (#97116)
Removes unnecessary dict comprehensions, optimizing the creation of dicts from iterables.
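
A representative illustration of the kind of rewrite this covers (not a specific hunk from the PR):

```python
keys, values = ["a", "b"], [1, 2]

# Before: an unnecessary dict comprehension over an iterable of pairs
mapping = {k: v for k, v in zip(keys, values)}

# After: dict() consumes the iterable of pairs directly
mapping = dict(zip(keys, values))
assert mapping == {"a": 1, "b": 2}
```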

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97116
Approved by: https://github.com/kit1980
2023-03-20 00:56:57 +00:00
be0b415a5a [ONNX] Set shape/type into torchscript (#96349)
Fixes https://github.com/pytorch/pytorch/pull/95676#issuecomment-1460588229

PS: The exported ONNX proto doesn't seem to have type information yet. I wonder if there is an ONNX pass doing this for us (converting torch dtype to ONNX dtype during export).

A type promotion issue is raised as an error if we try to set the type:
```python
onnxscript_value.dtype = expected_value.dtype
```
```
onnx.onnx_cpp2py_export.shape_inference.InferenceError: [ShapeInferenceError] Shape inference error(s): (op_type:aten_add, node name: aten_add_1): [ShapeInferenceError] (op_type:Add, node name: n3): B has inconsistent type tensor(int64)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96349
Approved by: https://github.com/justinchuby, https://github.com/wschin
2023-03-19 21:58:10 +00:00
722c4e59a4 Replace source check with assert (#95640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95640
Approved by: https://github.com/ezyang
2023-03-19 21:51:59 +00:00
c8030b5406 Revert "Update mkl_verbose return value check due to API change in mkl (#96283)"
This reverts commit c1214ce5c26fce541a920bdf9917c9ca9f63ecb0.

Reverted https://github.com/pytorch/pytorch/pull/96283 on behalf of https://github.com/kit1980 due to Looks like this broke inductor tests on macos-12-py3-arm64 https://github.com/pytorch/pytorch/actions/runs/4458194071/jobs/7830194137
2023-03-19 21:48:01 +00:00
e74c5e5637 rexnet_100 is disabled for static, does not need dynamic listing (#97100)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97100
Approved by: https://github.com/Skylion007
2023-03-19 20:57:49 +00:00
5d33f9cddb Revert "Fix standalone compile for op with multiple outputs (#96936)"
This reverts commit 37cde56658e20afae6d94b70d53e4131043e09e8.

Reverted https://github.com/pytorch/pytorch/pull/96936 on behalf of https://github.com/kit1980 due to Broke inductor tests on macos-12-py3-arm64 https://github.com/pytorch/pytorch/actions/runs/4458548491/jobs/7830566793
2023-03-19 20:32:13 +00:00
90537a779c Update FlashAttention to work with sm90 Gpus (#97051)
# Summary
FlashAttention was confirmed to work on H100 / sm90 hardware, so we update the checks to account for this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97051
Approved by: https://github.com/cpuhrsch
2023-03-19 19:33:57 +00:00
37cde56658 Fix standalone compile for op with multiple outputs (#96936)
Op-benchmark directly uses fx.Graph to create nodes without dynamo and then compiles the graph with inductor. Currently, operators with multiple outputs, e.g. native_layer_norm, fail to run through the standalone torch._inductor.compile() API (#95594). The graph's result is a single node with several outputs rather than a tuple of several nodes, but the standalone API forces a non-tuple result to be a tuple, i.e., a tuple with one node-type element that has several outputs. This PR treats a return node with several outputs as a tuple to avoid these errors.
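
A minimal sketch of that pattern, assuming the standalone API from #95594; the op, shapes, and final call convention are illustrative rather than taken from the op-benchmark code:

```python
import torch
import torch._inductor
import torch.fx as fx

# Build an fx.Graph by hand (no dynamo) around an op with multiple outputs.
g = fx.Graph()
x = g.placeholder("x")
ln = g.call_function(
    torch.ops.aten.native_layer_norm.default,
    args=(x, [8], None, None, 1e-5),
)
g.output(ln)  # a single node carrying several outputs, not a tuple of nodes
gm = fx.GraphModule(torch.nn.Module(), g)

example_inputs = [torch.randn(4, 8)]
compiled = torch._inductor.compile(gm, example_inputs)
out, mean, rstd = compiled(*example_inputs)  # native_layer_norm returns three tensors
```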

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96936
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-19 02:44:03 +00:00
c1214ce5c2 Update mkl_verbose return value check due to API change in mkl (#96283)
As title.
Originally, the `mkl_verbose()` function returned `0` or `1`, indicating failure or success respectively. However, the MKL version that PyTorch now uses changed the output of `mkl_verbose()` to reflect its input level, so the check logic needs to compare the output of `mkl_verbose()` with -1.
https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/support-functions/miscellaneous/mkl-verbose.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96283
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-03-18 20:30:07 +00:00
5ee5a164ff [aot] disable inference view tracking (#96478)
For inference, we should disable unnecessary view tracking to save time. Most operators get a performance improvement (inductor vs. eager). This PR fixes the general regression of these operators for inductor.

Example of operator speedups in torchbench (inductor vs. eager):

operator | current | new
-- | -- | --
aten.hardsigmoid.default | [0.6426090814905988, 0.6791992931354925, 0.7046010955095103] | [0.7921782106271767, 0.8919522525991529, 0.9128089963571694]
aten.tanh.default | [0.6135534976747065, 0.7588851221588919, 0.898274076411234] | [0.857534066531159, 1.0524121834821605, 1.2535141671420165]
aten.floor.default | [0.6115868728087821, 0.6115868728087821, 0.6115868728087821] | [0.9472870784346195, 0.9472870784346195, 0.9472870784346195]
aten.exp.default | [0.7784016216625718, 0.9279358274876591, 1.1201178548406794] | [0.5777145055206203, 0.8610140436473923, 1.1850714193498957]
aten.mul_.Tensor | [0.14381872531802153, 0.14638969818507447,   0.14947766446663138] | [0.37695307573466363, 0.3832122689450142, 0.38963470437456904]
aten.hardtanh_.default | [0.49502896822398157, 0.5897512505705527, 0.8052969399847189] | [0.4915338157706071, 0.6098169585316151, 0.8587605051115021]
aten.relu_.default | [0.47776870021339685, 0.54452322796367, 0.6516167164223963] | [0.4764791289773786, 0.5608095328163419, 0.6753350976452626]


Pull Request resolved: https://github.com/pytorch/pytorch/pull/96478
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/jgong5, https://github.com/bdhirsh
2023-03-18 13:58:24 +00:00
4805441b4a [dtensor] remove unused tests and fix ci (#97064)
fix ci
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97064
Approved by: https://github.com/huydhn
2023-03-18 06:01:37 +00:00
a5923ab3f3 Revert "[inductor] do benchmark in sub processes for max autotuning (#96410)" (#97075)
This reverts commit 34256bc73080d7898138c821273b9f31fab777f8.

@kit1980: I'm not sure how best to revert a co-dev PR like https://github.com/pytorch/pytorch/pull/96410#issuecomment-1474704337.  IIRC, Ivan and Eli did a revert PR like this before, so I created one here just in case we need it.  If that's the case, please feel free to merge this to fix trunk.  Otherwise, this can be closed.

@shunting314 If you can do a forward fix faster than this, please help do so.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97075
Approved by: https://github.com/kit1980
2023-03-18 05:07:18 +00:00
a1c46e5f8f component-level configurable logging for dynamo, inductor, aot (#94858)
Summary:

Adds NNC-like logging that is configured through the env var `TORCH_LOGS`
Examples:
`TORCH_LOGS="dynamo,guards" python script.py` - prints dynamo logs at level INFO with guards of all functions that are compiled

`TORCH_LOGS="+dynamo,guards,graph" python script.py` - prints dynamo logs at level DEBUG with guards and graphs (in tabular) format of all graphs that are compiled

[More examples with full output](https://gist.github.com/mlazos/b17f474457308ce15e88c91721ac1cce)

Implementation:
The implementation parses the log settings from the environment, finds any components (aot, dynamo, inductor) or other loggable objects (guards, graph, etc.) and generates a log_state object. This object contains all of the enabled artifacts, and a qualified log name -> level mapping. _init_logs then adds handlers to the highest level logs (the registered logs), and sets any artifact loggers to level DEBUG if the artifact is enabled.

Note: set_logs is an alternative for manipulating the log_state, but if the environment contains TORCH_LOGS, the environment settings will be prioritized.

Adding a new log:
To add a new log, a dev should add their log name to torch._logging._registrations (there are examples there already).

Adding a new artifact:
To add a new artifact, a dev should add their artifact name to torch._logging._registrations as well.
Additionally, wherever the artifact is logged, `torch._logging.getArtifactLogger(__name__, <artifact_name>)` should be used instead of the standard logging implementation.

[design doc](https://docs.google.com/document/d/1ZRfTWKa8eaPq1AxaiHrq4ASTPouzzlPiuquSBEJYwS8/edit#)
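
As a sketch of the Python-side counterpart to the env var, the keyword names below follow the registered log/artifact names described above; this is illustrative and not necessarily the exact signature at the time of this PR:

```python
import logging

import torch._logging

# Roughly equivalent to TORCH_LOGS="+dynamo,guards"; note that if TORCH_LOGS
# is present in the environment, the environment settings take priority.
torch._logging.set_logs(dynamo=logging.DEBUG, guards=True)

# Inside PyTorch, an enabled artifact is emitted through an artifact logger
# rather than a standard logging.getLogger() logger:
graph_log = torch._logging.getArtifactLogger(__name__, "graph")
graph_log.debug("tabular graph dump would be emitted here")
```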

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94858
Approved by: https://github.com/ezyang
2023-03-18 04:17:31 +00:00
086ce765a5 Add new parameter materialize_grads to torch.autograd.grad() (#97015)
Fixes #44189
Adds a new parameter, zero_grad_unused, to the torch.autograd.grad() function. This parameter allows the gradient to be set to 0 instead of None when a variable is unused, which can be helpful for higher-order partial derivatives.

Here is an example of using this new parameter to solve d^3y/dx^3 given y = a * x:

```python
x = torch.tensor(0.5, dtype=torch.float32, requires_grad=True)
a = torch.tensor(1, dtype=torch.float32, requires_grad=True)
y = x * a
dydx = torch.autograd.grad(y, x, create_graph=True, allow_unused=True)
d2ydx2 = torch.autograd.grad(dydx, x, allow_unused=True, zero_grad_unused=True)
try:
    d3ydx3 = torch.autograd.grad(d2ydx2, x, allow_unused=True, zero_grad_unused=True)
except RuntimeError as e:
    assert False, "Should not raise error"
```

With `zero_grad_unused`, d2ydx2 can be 0 instead of None, enabling d3ydx3 to be computed as mathematically defined without raising an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97015
Approved by: https://github.com/soulitzer
2023-03-18 03:11:12 +00:00
34256bc730 [inductor] do benchmark in sub processes for max autotuning (#96410)
This PR implements support for benchmarking max-autotune choices in subprocesses. This way, crashes like https://github.com/openai/triton/issues/1298 only abort the autotuning child process while the parent process continues.

There are a few things to note:
- The CUDA runtime does not work with fork, so we have to use spawn to create child processes. Check the best practices from the PyTorch multiprocessing docs: https://pytorch.org/docs/stable/notes/multiprocessing.html
- To run a job in a child process, the multiprocessing module needs to pickle both the target function and its arguments and pass them to the child process. This is the major complexity of this prototype, since there are quite a lot of corner cases that make pickle fail.

Here I list the pickle related issues I encountered:
- Pickling a StorageBox causes infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Worked around by pickling the inner buffer.
- IRNode stores fx.Nodes in its origins field. However, we cannot pickle an fx.Node; it fails with the following error when pickling fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Worked around by skipping origins when pickling an IRNode.
- The jinja Template in TritonTemplateKernel cannot be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Worked around by pickling the source rather than the jinja Template, and rebuilding the jinja Template during unpickling.
- Due to how select_algorithm.template_kernels is populated, it is empty in the child process. Worked around by passing select_algorithm.template_kernels from the parent process to the child process directly.
  - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickleable. A TritonTemplate is referred to in the closure for a TritonTemplateKernel object.
- We cannot pass a choice to the child process directly because pickle fails on the lambda/local function being used. However, cloudpickle can handle lambdas. Worked around by passing the cloudpickle'd choice object to the child process; the child process needs to unpickle it explicitly.

Test:
```
python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm
```
This is basically the repro I get from Bert Maher.

Benchmarking in a subprocess is about 4x slower than benchmarking in the same process. Without doing any profiling, I suspect the time is spent starting new processes and doing initialization. Some ~thread~ process pool may help.

```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_5 0.0317s 87.1%
  triton_mm_plus_mm_1 0.0328s 84.4%
  ref_mm_plus_mm 0.0379s 73.0%
  triton_mm_plus_mm_7 0.0379s 73.0%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 12.001659393310547 seconds

AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_1 0.0317s 87.1%
  triton_mm_plus_mm_5 0.0317s 87.1%
  ref_mm_plus_mm 0.0379s 73.0%
  triton_mm_plus_mm_7 0.0389s 71.1%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 51.39659810066223 seconds
```

The feature is disabled by default and can be enabled by setting the following config or envvar:
```
autotune_in_subproc = os.environ.get("TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC") == "1"
```
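
For illustration, a minimal sketch of the spawn-plus-cloudpickle pattern described above; the helper names are hypothetical and this is not the PR's actual implementation:

```python
import multiprocessing as mp

import cloudpickle


def _bench_entry(pickled_fn, result_queue):
    # The child unpickles the callable explicitly, since plain pickle cannot
    # handle lambdas/local functions.
    fn = cloudpickle.loads(pickled_fn)
    result_queue.put(fn())


def benchmark_in_subprocess(fn):
    # The CUDA runtime does not survive fork, so use the spawn start method.
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    proc = ctx.Process(target=_bench_entry, args=(cloudpickle.dumps(fn), queue))
    proc.start()
    proc.join()
    if proc.exitcode != 0:
        # A crashed choice only kills the child; treat it as infinitely slow.
        return float("inf")
    return queue.get()
```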

Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96410
Approved by: https://github.com/jansel
2023-03-18 02:43:28 +00:00
b132220309 Update MHA doc string (#97046)
Summary: Update MHA doc string

Test Plan: sandcastle & github

Differential Revision: D44179519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97046
Approved by: https://github.com/voznesenskym
2023-03-18 02:14:59 +00:00
915cbf8208 [Inductor] Eliminate redundant to_dtype node (#96650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96650
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-18 01:51:38 +00:00
679dec847e Use is_available instead of device_count to check for CUDA availability (#97043)
There are some tests that incorrectly use the number of GPU devices (`torch.cuda.device_count() > 0`) to check for CUDA availability instead of the default `torch.cuda.is_available()` call.  This makes these tests more brittle when encountering infra flakiness on the G5 runners using A10G, for example [test_pytorch_np](https://hud.pytorch.org/failure/FAILED%20test_tensorboard.py%3A%3ATestTensorBoardPyTorchNumpy%3A%3Atest_pytorch_np%20-%20RuntimeError%3A%20No%20CUDA%20GPUs%20are%20available).

The underlying problem is that GPU devices could crash on these runners.  While the root cause is unclear and we will try upgrading to a new NVIDIA driver (https://github.com/pytorch/pytorch/pull/96904) to see if it helps, we can also make these tests more resilient by using the correct check so they are skipped correctly when the GPU crashes.
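
A representative sketch of the change (the test class name comes from the linked failure; the body is a placeholder, not the exact diff):

```python
import unittest

import torch


class TestTensorBoardPyTorchNumpy(unittest.TestCase):
    # Before: gated on torch.cuda.device_count() > 0, which can pass even
    # when CUDA is effectively unusable on a flaky runner.
    @unittest.skipIf(not torch.cuda.is_available(), "CUDA is not available")
    def test_pytorch_np(self):
        t = torch.rand(2, 2, device="cuda")
        self.assertEqual(t.device.type, "cuda")
```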
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97043
Approved by: https://github.com/clee2000
2023-03-18 00:39:42 +00:00
c62fc81cc5 Increase the timeout value for linter calculate-docker-image (#96993)
I should have known that this step rebuilds the linter Docker image if it doesn't exist.  When it does so, it takes close to 15 minutes to finish, i.e. https://github.com/pytorch/pytorch/actions/runs/4443046530/attempts/1, instead of the regular 2-minute run, i.e. https://github.com/pytorch/pytorch/actions/runs/4442455480/jobs/7798700609.

This doubles the timeout of this step to 30 minutes to avoid flaky timeouts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96993
Approved by: https://github.com/clee2000
2023-03-18 00:06:39 +00:00
b390e7037e [docs] passing LogSoftmax into NLLLoss (#97001)
Fixes https://github.com/pytorch/pytorch/issues/96795

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97001
Approved by: https://github.com/soulitzer
2023-03-17 23:22:13 +00:00
410210b351 Remove obsolete "merge -g" flag from update_commit_hashes.py (#97033)
The flag is deprecated and is being removed in https://github.com/pytorch/test-infra/pull/3882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97033
Approved by: https://github.com/huydhn
2023-03-17 22:51:58 +00:00
db2c1ea8c8 Re-enable test_ops_jit on Windows (#96859) (#96931)
Fixes https://github.com/pytorch/pytorch/issues/96858
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96931
Approved by: https://github.com/kit1980
2023-03-17 22:42:22 +00:00
a4c706bcbc [dynamo][dashboard] fix triton clone step in dashboard (#96623)
Previously this would clone triton and then try to check out without being in the git repo directory. This wasn't usually a problem because the environment already had a triton repo downloaded, but I ran into it while trying to construct a new environment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96623
Approved by: https://github.com/anijain2305
2023-03-17 22:36:26 +00:00
4a90aca60d Make keep-going work for more than linux (#96974)
cc. asked by @zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96974
Approved by: https://github.com/huydhn
2023-03-17 22:08:37 +00:00
b59a60ddff Fix CPU bitwise shifts for out-of-limit shift values (#96659)
Negative shift values and positive shift values greater than the bit size of the dtype (limit `0..bits`) now yield expected results which are consistent with numpy.

A left shift with an out-of-limit shift value results in a value of `0`. A right shift with an out-of-limit shift value results in a value of `-1` for negative inputs and `0` for non-negative inputs (sign preserving).
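
A small illustration of the rules above; the expected outputs follow directly from the description (int8 has 8 bits, so shift amounts outside the limit, including negatives, are out of limit):

```python
import torch

x = torch.tensor([-5, 5], dtype=torch.int8)

print(x << 100)  # -> tensor([0, 0], dtype=torch.int8)    out-of-limit left shift gives 0
print(x >> 100)  # -> tensor([-1, 0], dtype=torch.int8)   right shift gives -1 / 0 (sign preserving)
```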

Fixes https://github.com/pytorch/pytorch/issues/70904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96659
Approved by: https://github.com/ngimel, https://github.com/albanD, https://github.com/zou3519, https://github.com/jgong5, https://github.com/malfet
2023-03-17 21:35:34 +00:00
dd9ade6377 Remove unnecessary items() call in zero_grad (#97040)
Micro-optimization to zero_grad(), which is performance critical.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97040
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-03-17 21:34:14 +00:00
98a5cf090d [SDPA] Remove the chunk_grad from mem-eff attention (#96880)
# Summary

There exists an optimization within the scaled_dot_product_efficient_attention backward path to, under the right conditions, output grad_q, grad_k, and grad_v as aliases of the same storage. This was done to optimize the hot path where MHA does a packed linear projection -> chunk -> (view stuff) -> sdpa. The thought was that chunk.backward() would then be able to "trivially" cat its inputs. However, upon closer inspection, chunk.backward() calls `cat` regardless of its inputs, so this optimization is not being utilized.

I validated this by profiling on main and then on this branch: the traces produced were the same, with `split.backward()` calling into cat in both cases.
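
For context, a sketch of that hot path; the shapes and names are illustrative, not the PR's benchmark:

```python
import torch
import torch.nn.functional as F

B, S, E, H = 2, 16, 64, 4
x = torch.randn(B, S, E, requires_grad=True)
w_qkv = torch.randn(3 * E, E, requires_grad=True)

qkv = F.linear(x, w_qkv)              # packed linear projection
q, k, v = qkv.chunk(3, dim=-1)        # chunk -> three views of one buffer
q, k, v = (t.view(B, S, H, E // H).transpose(1, 2) for t in (q, k, v))
out = F.scaled_dot_product_attention(q, k, v)
# chunk's backward calls cat on grad_q/grad_k/grad_v regardless of whether
# they alias the same storage, which is why the aliasing trick bought nothing.
out.sum().backward()
```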

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96880
Approved by: https://github.com/cpuhrsch
2023-03-17 21:28:25 +00:00
d4b8ed2b11 Fail fast when dynamo attempts to add unspecialized int/float as additional graph inputs (#96786)
Summary:
Verified the changes to catch unspecialized ints/floats being added as additional graph inputs in D44037548, prior to PR (https://github.com/pytorch/pytorch/pull/95621).

However, with #95621 the issue originally being solved is no longer valid, because ints & floats in `forward` will always be specialized in export. This PR adds the assertion anyway *(though it should not be hit unless there is a regression)* to immediately catch any attempt to add unspecialized ints/floats to additional graphargs.

Test Plan:
Example of the error message would look like:
```
Dynamo attempts to add additional input: value=9.999999747378752e-06, source=NNModuleSource(inner=AttrSource(base=NNModuleSource(inner=AttrSource(base=LocalInputSource(local_name='self', pos=0), member='torch_module')), member='eps'))
```
Passed all export tests
```
Buck UI: https://www.internalfb.com/buck2/fea72653-5549-47e7-a9bf-740eb86a8e26
Test UI: https://www.internalfb.com/intern/testinfra/testrun/8725724422167257
RE: reSessionID-7b3470b1-c293-4c4a-9671-dd0b7a2839b8  Up: 6.0 KiB  Down: 0 B
Jobs completed: 101. Time elapsed: 115.7s.
Tests finished: Pass 98. Fail 0. Fatal 0. Skip 0. 0 builds failed
```

Differential Revision: D44075910

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96786
Approved by: https://github.com/tugsbayasgalan, https://github.com/ezyang
2023-03-17 21:15:18 +00:00
cea13ad9fa Improve size mismatch error messaging referencing mat/vec sizes (#96863)
Fixes #94841

This fixes the error messages in the following files, the same as those referenced in the linked issue. I was not able to find any additional examples, but am happy to add commits for any that I may have missed!

```
aten/src/ATen/native/Blas.cpp:     "size mismatch, got ", self.size(0), ", ", mat.size(0), "x", mat.size(1), ",", vec.size(0));
torch/_decomp/decompositions.py:        lambda: f"size mismatch, got {self.size(0)}x{self.size(1)},{vec.size(0)}",
```

Example output for `Blas.cpp` before:
```
size mismatch, got 3, 3x4,1
```

The new error messages have the following format:

```
aten/src/ATen/native/Blas.cpp:     "size mismatch, got bias (", self.size(0), "), matrix (", mat.size(0), "x", mat.size(1), "), vector (", vec.size(0), ")");
torch/_decomp/decompositions.py:        lambda: f"size mismatch, got matrix ({self.size(0)}x{self.size(1)}), vector ({vec.size(0)})",
```

Example output for `Blas.cpp` after:
```
size mismatch, got bias (3), matrix (3x4), vector (1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96863
Approved by: https://github.com/albanD
2023-03-17 21:07:48 +00:00
985fc66b30 Bind increment_version to python (#96852)
Should be convenient when writing python-only kernels (with triton) that don't have access to the C++ APIs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96852
Approved by: https://github.com/soulitzer
2023-03-17 20:36:33 +00:00
1983b31711 Fixed print tensor.type() issue. (#96381)
Fixes #95954
Updating the cpp printing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96381
Approved by: https://github.com/albanD
2023-03-17 20:26:43 +00:00
57bb5b159d [static-runtime] one more attempt to improve crash log readability (#96903)
Summary:
* add human readable type and ivalue printout
* fix internal linter warnings

Test Plan:
error message now looks like e.g.
```
E0315 16:27:32.409082 422313 ExceptionTracer.cpp:222] exception stack complete
terminate called after throwing an instance of 'c10::Error'
  what():  List[int] is not a subtype of List[int]; schema arg name: 'split_sizes', ivalue: [1, 1]
```

Differential Revision: D44112297

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96903
Approved by: https://github.com/davidberard98
2023-03-17 17:56:26 +00:00
44d7bbfe22 [cpp extension] Allow setting PYTORCH_NVCC to a customized nvcc in torch cpp extension build (#96987)
per title

I can write a script named `nvcc` like this
```bash
#!/bin/bash
/opt/cache/bin/sccache /usr/local/cuda/bin/nvcc $@
```
and set its path to `PYTORCH_NVCC` (added in this PR), along with another `sccache-g++` script to env var `CXX`.
cfa6b52e02/torch/utils/cpp_extension.py (L2106-L2109)

With ninja, I can fully enable cached builds of my CUDA extensions.
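
As a sketch of how those variables would be consumed when building an extension from Python; the paths and extension names are example placeholders:

```python
import os

from torch.utils import cpp_extension

# Hypothetical wrapper scripts as described above; adjust paths to your setup.
os.environ["PYTORCH_NVCC"] = "/opt/cache/bin/nvcc"   # sccache-wrapping nvcc script
os.environ["CXX"] = "/opt/cache/bin/sccache-g++"     # sccache-wrapping g++ script

ext = cpp_extension.load(
    name="my_cuda_ext",
    sources=["my_kernel.cu"],
    verbose=True,
)
```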
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96987
Approved by: https://github.com/ezyang
2023-03-17 17:05:17 +00:00
8ce296ae2c [ez][inductor] show kernel category in kernel benchmark result (#96991)
I feel it's useful to show whether a kernel is pointwise/reduction/persistent_reduction in the benchmark output. Only the upper case of the first 3 letters is printed to avoid wrapping the line:
- POI for pointwise
- RED for reduction
- PER for persistent_reduction

<img width="1091" alt="Screenshot 2023-03-16 at 5 10 21 PM" src="https://user-images.githubusercontent.com/52589240/225780546-07b8d345-2bbe-40bd-9e65-185e9294743e.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96991
Approved by: https://github.com/Chillee
2023-03-17 17:02:43 +00:00
46eaf4be7d Fix Typo in pytorch/torch/autograd/__init__.py (#97024)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97024
Approved by: https://github.com/Skylion007, https://github.com/soulitzer
2023-03-17 16:24:18 +00:00
95575f0a5f [DTensor] Fix _get_or_create_default_group() (#96961)
Summary:
This PR fixes `_get_or_create_default_group()` of `DeviceMesh`. When `mesh` of the first created `DeviceMesh` is not `[0, 1, 2, ... WORLD_SIZE - 1]` and `is_initialized() == False`, it wrongly asserts. This PR fixes this issue by removing these assertions.

 ---

More specifically, `_get_or_create_default_group()` has 4 checks:

1. `DeviceMesh must include every process in WORLD`
2. `DeviceMesh cannot have duplicate values`
3. `DeviceMesh ranks must start from 0`
4. `DeviceMesh should have all ranks of WORLD`

1, 3, and 4 are not satisfied when `self.mesh` is not `[0, 1, 2, ... WORLD_SIZE - 1]`.

2 is a valid check, but it is also checked in `__init__()`, so we don't need to check it again in this function.

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D44098849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96961
Approved by: https://github.com/wanchaol
2023-03-17 15:52:19 +00:00
ffddb2219a Change THPStorage::cdata to be a MaybeOwned<Storage>, add unpack func (#96801)
Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96801
Approved by: https://github.com/ezyang
2023-03-17 14:58:21 +00:00
7f94ea8492 test/test_torch.py: fix TestTorch::test_from_buffer test (#96952)
Use opposite encoding on big endian systems
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96952
Approved by: https://github.com/ezyang
2023-03-17 14:36:33 +00:00
18cf30fb2a [Inductor] preserve AliasedLayout on View (#96948)
Fix https://github.com/pytorch/pytorch/issues/96728

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96948
Approved by: https://github.com/Chillee
2023-03-17 14:29:13 +00:00
92eb9d363a Decoder native functions join the dead code society (#96025)
Summary: Decoder native joins the dead code society

With the recent introduction of PT2, we no longer need native decoder operators:
1 - full-function SDPA kernels can be used to implement cross-attention efficiently without the (slower) decoder MHA blob.
2 - torch.compile() generates more efficient code across many platforms from the Python implementation of decoders than the decoder layer blob, by tailoring code to the target.

Test Plan: github & sandcastle

Differential Revision: D43811808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96025
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-03-17 09:45:55 +00:00
b5ecf727be Revert "[aot autograd] refactor to make functionalization self-contained (#96341)"
This reverts commit 3cd9c7a16d8b19c28d12bf5b56a8a7c20405476a.

Reverted https://github.com/pytorch/pytorch/pull/96341 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-03-17 09:24:05 +00:00
238b06086f inductor: fix cpp wrapper ExternKernel check (#96799)
Fix cpp_wrapper functionality for ExternKernel. The changes in https://github.com/pytorch/pytorch/pull/91575 disabled cpp_wrapper for ExternKernel cases.

1. Need to set the `cpp_wrapper` attr before `V.graph.register_buffer(self)`.
`register_buffer` will invoke the below check:
c6a82e4339/torch/_inductor/graph.py (L220-L223)
The current code which sets the `cpp_wrapper` after the `V.graph.register_buffer(self)` will always disable the cpp wrapper.

2. Fix the missing `ordered_kwargs_for_cpp_kernel` attr for `at::addmm_out`

3. Enhance the UT to check that cpp_wrapper has been turned on for the supported cases to prevent being unintentionally disabled by future changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96799
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-03-17 08:58:35 +00:00
13538c88b3 [1/n] Consolidate replicate and DDP: setup ufmt for distributed.py (#96597)
As we already enabled ufmt for the composable APIs in https://github.com/pytorch/pytorch/pull/90873, it seems like a good idea to enable ufmt for other distributed APIs as well. This change sets up ufmt for DDP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96597
Approved by: https://github.com/rohan-varma
2023-03-17 06:25:11 +00:00
24ce3a7c34 Move hasPrimaryContext to c10::cuda (#96800)
This method has to be accessible from `c10` to enable CUDA-12 integration.
Implemented by providing a private `c10::cuda::_internal::setHasPrimaryContext` that passes the pointer to the implementation (in `torch_cuda`) back to c10.
Use global class constructor/destructor to guarantee RAII.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96800
Approved by: https://github.com/ngimel
2023-03-17 04:50:35 +00:00
cbd3df93c4 [vision hash update] update the pinned vision hash (#96990)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96990
Approved by: https://github.com/pytorchbot
2023-03-17 03:13:22 +00:00
4de1bc16e3 [PyTorch][XNNPACK] Update wrappers for internal only x86 SSE2 kernels (#96896)
Summary:
Same as D43747173 (https://github.com/pytorch/pytorch/pull/95911) except for the newly added x86 SSE2 kernels.

For future reference, wrappers can be generated by

```
cd ~/fbsource/xplat/third-party/XNNPACK
# Update the list of internal only kernels in generate-wrappers.py
python3 generate-wrappers.py
```

Test Plan: CI

Reviewed By: digantdesai

Differential Revision: D44072764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96896
Approved by: https://github.com/digantdesai
2023-03-17 03:07:39 +00:00
f865e23abc [MPS] Introduce MPSUnaryGradCachedGraph & MPSBinaryGradCachedGraph (#95289)
This PR introduces `MPSUnaryGradCachedGraph` & `MPSBinaryGradCachedGraph` to replace duplicate CachedGraph creation in backward functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95289
Approved by: https://github.com/kulinseth
2023-03-17 02:50:51 +00:00
571f96bf59 cudagraph trees (#89146)
CUDA Graph Trees

Design doc: https://docs.google.com/document/d/1ZrxLGWz7T45MSX6gPsL6Ln4t0eZCSfWewtJ_qLd_D0E/edit

Not currently implemented :

- Right now, we are using weak tensor refs from outputs to check if a tensor has died. This doesn't work because of a) aliasing, and b) aot_autograd detaching tensors (see note [Detaching saved tensors in AOTAutograd]). We would need either https://github.com/pytorch/pytorch/issues/91395 to land so we can use storage weak refs, or to manually add a deleter fn that does what I want. This is doable, but there are some interactions with the caching allocator checkpointing, so saving it for a stacked PR.

- Reclaiming memory from the inputs during model recording. This isn't terribly difficult but deferring to another PR. You would need to write over the input memory during warmup, and therefore copy the inputs to cpu. Saving for a stacked pr.

- Warning on overwriting previous generation outputs, and handling nested torch.compile() calls in generation tracking.

Differential Revision: [D43999887](https://our.internmc.facebook.com/intern/diff/D43999887)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89146
Approved by: https://github.com/ezyang
2023-03-17 02:47:03 +00:00
cf732053e4 nn.EmbeddingBag bound check (#96022)
Summary: Today, if we access out-of-bounds embedding rows, it either goes through silently or throws an IMA (illegal memory access). This is not ideal, so this adds bounds checks. This will probably slow things down - need to benchmark it.

Test Plan:
TODO: add some tests

Tried a simple example and it's showing this:
```
aten/src/ATen/native/cuda/EmbeddingBag.cu:143: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [0,1,0] Assertion `input[emb] < numRows` failed.
```
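
For context, a hypothetical repro of the kind of out-of-bounds access the check guards against (requires a CUDA device; not the PR's test):

```python
import torch
import torch.nn as nn

# num_embeddings=10 means valid indices are 0..9; index 100 is out of bounds.
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4).cuda()
inp = torch.tensor([1, 2, 100], device="cuda")
offsets = torch.tensor([0], device="cuda")
out = bag(inp, offsets)  # previously a silent read or an IMA; now a device-side assert fires
```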

Differential Revision: D43810777

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96022
Approved by: https://github.com/cpuhrsch, https://github.com/ngimel
2023-03-17 02:01:43 +00:00