Fixes https://github.com/pytorch/pytorch/pull/95676#issuecomment-1460588229
PS: The exported ONNX proto doesn't seem to carry type information now. I wonder if there was an ONNX pass doing this for us (converting torch dtype to ONNX dtype during export).
A type promotion issue is raised as an error if we try to set the type:
```python
onnxscript_value.dtype = expected_value.dtype
```
```
onnx.onnx_cpp2py_export.shape_inference.InferenceError: [ShapeInferenceError] Shape inference error(s): (op_type:aten_add, node name: aten_add_1): [ShapeInferenceError] (op_type:Add, node name: n3): B has inconsistent type tensor(int64)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96349
Approved by: https://github.com/justinchuby, https://github.com/wschin
Op-benchmark uses fx.Graph directly to create nodes without dynamo and then compiles the graph with inductor. Currently, operators with multiple outputs, e.g. native_layer_norm, fail to run through the standalone torch._inductor.compile() API (#95594). The graph's result is actually a single node with several outputs rather than a tuple of several nodes. However, the standalone API forces a non-tuple result to be a tuple, i.e., a tuple with one node-type element that has several outputs. This PR treats a return node with several outputs as a tuple to avoid the error.
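A minimal sketch of the failing setup, assuming the standalone torch._inductor.compile API and a hand-built fx.Graph (shapes and eps are made up for illustration):
```python
import torch
import torch._inductor
import torch.fx as fx

# Build an fx.Graph by hand (no dynamo) whose output is a single node with
# several outputs: native_layer_norm returns (out, mean, rstd).
g = fx.Graph()
x = g.placeholder("x")
w = g.placeholder("weight")
b = g.placeholder("bias")
ln = g.call_function(torch.ops.aten.native_layer_norm.default, (x, [64], w, b, 1e-5))
g.output(ln)  # one multi-output node, not a tuple of nodes
gm = fx.GraphModule(torch.nn.Module(), g)

compiled = torch._inductor.compile(
    gm, [torch.randn(8, 64), torch.randn(64), torch.randn(64)]
)
```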
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96936
Approved by: https://github.com/jgong5, https://github.com/jansel
This reverts commit 34256bc73080d7898138c821273b9f31fab777f8.
@kit1980: I'm not sure how best to revert a co-dev PR like https://github.com/pytorch/pytorch/pull/96410#issuecomment-1474704337. IIRC, Ivan and Eli did a revert PR like this before, so I created one here just in case we need it. If we do, please feel free to merge this to fix trunk. Otherwise, this can be closed.
@shunting314 If you can do a forward fix faster than this, please help do so.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97075
Approved by: https://github.com/kit1980
Summary:
Adds NNC-like logging that is configured through an env var `TORCH_LOGS`
Examples:
`TORCH_LOGS="dynamo,guards" python script.py` - prints dynamo logs at level INFO with guards of all functions that are compiled
`TORCH_LOGS="+dynamo,guards,graph" python script.py` - prints dynamo logs at level DEBUG with guards and graphs (in tabular) format of all graphs that are compiled
[More examples with full output](https://gist.github.com/mlazos/b17f474457308ce15e88c91721ac1cce)
Implementation:
The implementation parses the log settings from the environment, finds any components (aot, dynamo, inductor) or other loggable objects (guards, graph, etc.) and generates a log_state object. This object contains all of the enabled artifacts, and a qualified log name -> level mapping. _init_logs then adds handlers to the highest level logs (the registered logs), and sets any artifact loggers to level DEBUG if the artifact is enabled.
Note: set_logs is an alternative for manipulating the log_state, but if the environment contains TORCH_LOGS, the environment settings will be prioritized.
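A minimal sketch of the programmatic path, assuming the set_logs API described above accepts per-component log levels and boolean artifact toggles (roughly equivalent to `TORCH_LOGS="+dynamo,guards"`):
```python
import logging

import torch._logging

# Enable dynamo logs at DEBUG plus the guards artifact. If the TORCH_LOGS
# environment variable is set, the environment settings take priority.
torch._logging.set_logs(dynamo=logging.DEBUG, guards=True)
```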
Adding a new log:
To add a new log, a dev should add their log name to torch._logging._registrations (there are examples there already).
Adding a new artifact:
To add a new artifact, a dev should add their artifact name to torch._logging._registrations as well.
Additionally, wherever the artifact is logged, `torch._logging.getArtifactLogger(__name__, <artifact_name>)` should be used instead of the standard logging implementation.
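A hedged sketch of how a registered artifact might be logged; "my_artifact" is a hypothetical name that would first need to be registered in torch._logging._registrations:
```python
import torch._logging

# Use the artifact logger instead of logging.getLogger(__name__) so the
# output is gated by TORCH_LOGS / set_logs; "my_artifact" is hypothetical.
artifact_log = torch._logging.getArtifactLogger(__name__, "my_artifact")

def some_compiler_step():
    artifact_log.debug("artifact-specific debug output")
```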
[design doc](https://docs.google.com/document/d/1ZRfTWKa8eaPq1AxaiHrq4ASTPouzzlPiuquSBEJYwS8/edit#)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94858
Approved by: https://github.com/ezyang
Fixes #44189
Adds a new parameter, zero_grad_unused, to the torch.autograd.grad() function. This parameter allows the gradient to be set to 0 instead of None when a variable is unused, which can be helpful for higher-order partial derivatives.
Here is an example of using this new parameter to solve d^3y/dx^3 given y = a * x:
```python
x = torch.tensor(0.5, dtype=torch.float32, requires_grad=True)
a = torch.tensor(1, dtype=torch.float32, requires_grad=True)
y = x * a
dydx = torch.autograd.grad(y, x, create_graph=True, allow_unused=True)
d2ydx2 = torch.autograd.grad(dydx, x, allow_unused=True, zero_grad_unused=True)
try:
    d3ydx3 = torch.autograd.grad(d2ydx2, x, allow_unused=True, zero_grad_unused=True)
except RuntimeError as e:
    assert False, "Should not raise error"
```
With `zero_grad_unused`, d2ydx2 could be 0 instead of None, enabling d3ydx3 to be calculated as defined in math without throwing an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97015
Approved by: https://github.com/soulitzer
This PR implements support for benchmarking max-autotune choices in subprocesses. This way, a crash like https://github.com/openai/triton/issues/1298 only aborts the autotuning child process, while the parent process can continue.
There are a few things to note:
- The CUDA runtime does not work with fork, so we have to use spawn to create child processes. See the best practices from the PyTorch multiprocessing notes: https://pytorch.org/docs/stable/notes/multiprocessing.html
- To run a job in a child process, the multiprocessing module needs to pickle both the target function and its arguments and pass them to the child process. This is the major complexity of this prototype, since there are quite a lot of corner cases that make pickling fail.
Here I list the pickle-related issues I encountered:
- Pickling a StorageBox causes infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Worked around by pickling the inner buffer.
- IRNode stores fx.Nodes in its origins field. However, we cannot pickle an fx.Node; it fails with the following error when pickling fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70 . Worked around by skipping origins when pickling an IRNode.
- The jinja Template in TritonTemplateKernel cannot be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source'`. Worked around by pickling the source rather than the jinja Template and rebuilding the jinja Template during unpickling.
- Due to how select_algorithm.template_kernels is populated, it is empty in the child process. Worked around by passing select_algorithm.template_kernels from the parent process to the child process directly.
- There are some changes in TritonTemplate.generate to make a TritonTemplateKernel picklable. A TritonTemplate is referred to in the closure of a TritonTemplateKernel object.
- We cannot pass a choice to the child process directly because pickling fails for the lambdas/local functions being used. However, cloudpickle can handle lambdas. Worked around by passing the cloudpickled choice object to the child process; the child process needs to unpickle it explicitly (see the sketch after this list).
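A minimal standalone sketch of the spawn + cloudpickle workaround (not Inductor's actual code; the function and names here are illustrative only):
```python
import multiprocessing as mp

import cloudpickle


def _benchmark_in_child(pickled_choice):
    # The child process unpickles the cloudpickle payload explicitly,
    # since the lambda cannot be passed through the default pickler.
    choice_fn = cloudpickle.loads(pickled_choice)
    return choice_fn()


if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # the CUDA runtime does not survive fork
    choice = lambda: 42  # stand-in for a benchmarking closure
    with ctx.Pool(1) as pool:
        result = pool.apply(_benchmark_in_child, (cloudpickle.dumps(choice),))
    print(result)
```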
Test:
```
python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm
```
This is basically the repro I got from Bert Maher.
Benchmarking in a subprocess is about 4x slower than benchmarking in the same process. Without doing any profiling, I suspect the time is spent starting a new process and doing initialization. Some ~~thread~~ process pool may help.
```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
triton_mm_plus_mm_0 0.0276s 100.0%
triton_mm_plus_mm_6 0.0287s 96.4%
triton_mm_plus_mm_5 0.0317s 87.1%
triton_mm_plus_mm_1 0.0328s 84.4%
ref_mm_plus_mm 0.0379s 73.0%
triton_mm_plus_mm_7 0.0379s 73.0%
triton_mm_plus_mm_2 0.0399s 69.2%
triton_mm_plus_mm_3 0.0410s 67.5%
triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 12.001659393310547 seconds
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
triton_mm_plus_mm_0 0.0276s 100.0%
triton_mm_plus_mm_6 0.0287s 96.4%
triton_mm_plus_mm_1 0.0317s 87.1%
triton_mm_plus_mm_5 0.0317s 87.1%
ref_mm_plus_mm 0.0379s 73.0%
triton_mm_plus_mm_7 0.0389s 71.1%
triton_mm_plus_mm_2 0.0399s 69.2%
triton_mm_plus_mm_3 0.0410s 67.5%
triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 51.39659810066223 seconds
```
The feature is disabled by default and can be enabled by setting the following config or env var:
```
autotune_in_subproc = os.environ.get("TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC") == "1"
```
Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96410
Approved by: https://github.com/jansel
Previously this would clone triton and then try to check out a commit without being in the git repo directory. This wasn't usually a problem because the environment already had a triton repo downloaded, but I ran into it while trying to construct a new environment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96623
Approved by: https://github.com/anijain2305
# Summary
There is an optimization within the scaled_dot_product_efficient backward attention path to, under the right conditions, output grad_q, grad_k, and grad_v all as aliases of the same storage. This was done to optimize for the hot path where MHA does packed linear projection -> chunk -> (view stuff) -> sdpa. The thought was that chunk.backward() would then be able to "trivially" cat its inputs. However, upon closer inspection, chunk.backward calls `cat` regardless of the inputs, so the aliasing is not being utilized.
I validated this by profiling on main and then on this branch; the traces were the same in both cases, with `split.backward()` calling into `cat`.
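For reference, a minimal sketch of that hot path (shapes and layout are illustrative, not the exact nn.MultiheadAttention internals):
```python
import torch
import torch.nn.functional as F

# Packed qkv projection output: (batch, seq, 3 * embed_dim), as produced by a
# single fused linear, then chunked into q, k, v.
qkv = torch.randn(2, 16, 3 * 32, requires_grad=True)
q, k, v = qkv.chunk(3, dim=-1)
# (view stuff): reshape to (batch, num_heads, seq, head_dim)
q, k, v = (t.view(2, 16, 4, 8).transpose(1, 2) for t in (q, k, v))
out = F.scaled_dot_product_attention(q, k, v)
# chunk's backward calls `cat` on the incoming grads regardless of whether
# grad_q/grad_k/grad_v alias the same storage.
out.sum().backward()
```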
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96880
Approved by: https://github.com/cpuhrsch
Summary:
Verified that the changes catch unspecialized ints/floats being added as additional graphargs in D44037548 prior to the PR (https://github.com/pytorch/pytorch/pull/95621).
However, with #95621 the issue originally being solved is no longer valid, because ints & floats in `forward` will always be specialized in export. This PR adds the assertion anyway *(though it won't be hit unless there is a regression)* to immediately catch any attempt to add an unspecialized int/float to the additional graphargs.
Test Plan:
Example of the error message would look like:
```
Dynamo attempts to add additional input: value=9.999999747378752e-06, source=NNModuleSource(inner=AttrSource(base=NNModuleSource(inner=AttrSource(base=LocalInputSource(local_name='self', pos=0), member='torch_module')), member='eps'))
```
Passed all export tests
```
Buck UI: https://www.internalfb.com/buck2/fea72653-5549-47e7-a9bf-740eb86a8e26
Test UI: https://www.internalfb.com/intern/testinfra/testrun/8725724422167257
RE: reSessionID-7b3470b1-c293-4c4a-9671-dd0b7a2839b8 Up: 6.0 KiB Down: 0 B
Jobs completed: 101. Time elapsed: 115.7s.
Tests finished: Pass 98. Fail 0. Fatal 0. Skip 0. 0 builds failed
```
Differential Revision: D44075910
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96786
Approved by: https://github.com/tugsbayasgalan, https://github.com/ezyang
Fixes #94841
This fixes the error messages in the following files, the same as those referenced in the linked issue. I was not able to find any additional examples, but am happy to add commits for any that I may have missed!
```
aten/src/ATen/native/Blas.cpp: "size mismatch, got ", self.size(0), ", ", mat.size(0), "x", mat.size(1), ",", vec.size(0));
torch/_decomp/decompositions.py: lambda: f"size mismatch, got {self.size(0)}x{self.size(1)},{vec.size(0)}",
```
Example output for `Blas.cpp` before:
```
size mismatch, got 3, 3x4,1
```
The new error messages have the following format:
```
aten/src/ATen/native/Blas.cpp: "size mismatch, got bias (", self.size(0), "), matrix (", mat.size(0), "x", mat.size(1), "), vector (", vec.size(0), ")");
torch/_decomp/decompositions.py: lambda: f"size mismatch, got matrix ({self.size(0)}x{self.size(1)}), vector ({vec.size(0)})",
```
Example output for `Blas.cpp` after:
```
size mismatch, got bias (3), matrix (3x4), vector (1)
```
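A hedged repro sketch for the `Blas.cpp` case: torch.addmv with a mismatched vector, using the same sizes as the example output above.
```python
import torch

bias = torch.randn(3)
mat = torch.randn(3, 4)
vec = torch.randn(1)  # wrong length; should match mat.size(1) == 4

# Raises: RuntimeError: size mismatch, got bias (3), matrix (3x4), vector (1)
torch.addmv(bias, mat, vec)
```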
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96863
Approved by: https://github.com/albanD
Summary:
* add human readable type and ivalue printout
* fix internal linter warnings
Test Plan:
error message now looks like e.g.
```
E0315 16:27:32.409082 422313 ExceptionTracer.cpp:222] exception stack complete
terminate called after throwing an instance of 'c10::Error'
what(): List[int] is not a subtype of List[int]; schema arg name: 'split_sizes', ivalue: [1, 1]
```
Differential Revision: D44112297
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96903
Approved by: https://github.com/davidberard98
Summary:
This PR fixes `_get_or_create_default_group()` of `DeviceMesh`. When `mesh` of the first created `DeviceMesh` is not `[0, 1, 2, ... WORLD_SIZE - 1]` and `is_initialized() == False`, it wrongly asserts. This PR fixes this issue by removing these assertions.
---
More specifically, `_get_or_create_default_group()` has 4 checks:
1. `DeviceMesh must include every process in WORLD`
2. `DeviceMesh cannot have duplicate values`
3. `DeviceMesh ranks must start from 0`
4. `DeviceMesh should have all ranks of WORLD`
1, 3, and 4 are not satisfied when `self.mesh` is not `[0, 1, 2, ... WORLD_SIZE - 1]`.
2 is a valid check, but it is also checked in `__init__()`, so we don't need to check it again in this function.
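A hedged illustration of the previously-failing case, assuming the experimental DTensor DeviceMesh API and a 4-rank launch (the sub-mesh ranks are illustrative):
```python
from torch.distributed._tensor import DeviceMesh

# Run under e.g. torchrun with WORLD_SIZE=4, without calling
# init_process_group() first. The mesh below is not [0, 1, 2, 3], which
# used to trip the removed assertions when no default group existed yet.
mesh = DeviceMesh("cuda", [2, 3])
```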
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D44098849
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96961
Approved by: https://github.com/wanchaol
Summary: Decoder native joins the dead code society
With the recent introduction of PT2, we no longer need native decoder operators:
1 - full-function SDPA kernels can be used to implement cross-attention efficiently without the (slower) decoder MHA blob.
2 - torch.compile() generates more efficient code across many platforms from the Python implementation of decoders than the decoder layer blob, by tailoring the generated code to the target.
Test Plan: github & sandcastle
Differential Revision: D43811808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96025
Approved by: https://github.com/ezyang, https://github.com/albanD
This method has to be accessible from `c10` to enable CUDA-12 integration.
Implemented by providing a private `c10::cuda::_internal::setHasPrimaryContext` that passes a pointer to the implementation (in `torch_cuda`) back to c10.
A global class constructor/destructor is used to guarantee RAII.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96800
Approved by: https://github.com/ngimel
Summary:
Same as D43747173 (https://github.com/pytorch/pytorch/pull/95911) except for the newly added x86 SSE2 kernels.
For future reference, wrappers can be generated by
```
cd ~/fbsource/xplat/third-party/XNNPACK
# Update the list of internal only kernels in generate-wrappers.py
python3 generate-wrappers.py
```
Test Plan: CI
Reviewed By: digantdesai
Differential Revision: D44072764
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96896
Approved by: https://github.com/digantdesai
CUDA Graph Trees
Design doc: https://docs.google.com/document/d/1ZrxLGWz7T45MSX6gPsL6Ln4t0eZCSfWewtJ_qLd_D0E/edit
Not currently implemented:
- Right now, we are using weak tensor refs from outputs to check if a tensor has died. This doesn't work because of a) aliasing and b) aot_autograd detaching tensors (see note [Detaching saved tensors in AOTAutograd]). We would need either https://github.com/pytorch/pytorch/issues/91395 to land so we can use storage weak refs, or to manually add a deleter fn that does what I want. This is doable, but there are some interactions with caching allocator checkpointing, so saving it for a stacked PR. (See the sketch after this list for why tensor-level weak refs are insufficient.)
- Reclaiming memory from the inputs during model recording. This isn't terribly difficult but is deferred to another PR. You would need to write over the input memory during warmup, and therefore copy the inputs to CPU. Saving for a stacked PR.
- Warning on overwriting previous-generation outputs, and handling nested torch.compile() calls in generation tracking.
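A minimal standalone sketch of the aliasing problem with tensor-level weak refs (illustration only, not the CUDA graph trees code itself):
```python
import weakref

import torch

out = torch.randn(4)        # stands in for a graph output
alias = out.view(2, 2)      # an alias sharing the same storage
ref = weakref.ref(out)

del out
print(ref() is None)        # True: the weak tensor ref says the output died...
print(alias[0, 0].item())   # ...but the underlying storage is still alive via the alias
```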
Differential Revision: [D43999887](https://our.internmc.facebook.com/intern/diff/D43999887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89146
Approved by: https://github.com/ezyang
Summary: Today, if we access out-of-bounds embedding rows, it'll either go through or throw an IMA (illegal memory access). This is not ideal - adding bound checks. This will probably slow things down - need to benchmark it.
Test Plan:
TODO: add some tests
Tried a simple example and it's showing this:
```
aten/src/ATen/native/cuda/EmbeddingBag.cu:143: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [0,1,0] Assertion `input[emb] < numRows` failed.
```
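A hedged repro sketch of the kind of out-of-bounds access the new check catches (the index value and shapes are illustrative):
```python
import torch
import torch.nn.functional as F

weight = torch.randn(10, 3, device="cuda")         # numRows = 10
indices = torch.tensor([0, 5, 12], device="cuda")  # 12 is out of bounds
offsets = torch.tensor([0, 2], device="cuda")

out = F.embedding_bag(indices, weight, offsets, mode="sum")
torch.cuda.synchronize()  # surfaces the device-side `input[emb] < numRows` assertion
```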
Differential Revision: D43810777
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96022
Approved by: https://github.com/cpuhrsch, https://github.com/ngimel