pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-11-11 22:34:53 +08:00

Author	SHA1	Message	Date
Louis Feng	5847cb55e4	[PyPer][ET] Refactor EG to ET (#99694 ) Summary: Change execution graph to execution trace. See post: https://fb.workplace.com/groups/873291503156329/permalink/1529496217535851/ Test Plan: Run a job. Reviewed By: chaekit Differential Revision: D44121392 Pull Request resolved: https://github.com/pytorch/pytorch/pull/99694 Approved by: https://github.com/chaekit	2023-06-22 19:41:54 +00:00
leslie-fang-intel	9832cfbbfe	Quantization oneDNN backend only support VNNI CPU (#103653 ) Summary - Update the quantization document that default qconfig with oneDNN backend is recommended to be used on CPUs with Vector Neural Network Instruction support. - Add the warning message when user uses default qconfig with oneDNN backend on CPU without Vector Neural Network Instruction support. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103653 Approved by: https://github.com/jgong5, https://github.com/malfet	2023-06-19 09:50:07 +00:00
xuanqi	b27c3558a4	[RFC]: Create aten native op for constrain_range (#103346) At a high level, the current implementation of the constrain functions (constrain_as_) raises an exception for the following code snippet: ``` def f(x): a = x.item() constrain_as_size(a, 4, 7) return torch.empty((a, 4)) inp = torch.tensor([5]) ep = torch._export.export(f, (inp,)) ``` The reason is that the current constrain logic: 1) Is purely Python, so it won't survive AOT export (the full node is gone after AOT export, since AOT export only maintains aten-level ops). 2) Uses a side effect to add range constraints to the traced symbol's shape env ([code](`9591e52880/torch/fx/experimental/symbolic_shapes.py (L370-L372)`)). 3) If runtime assertion is turned on (the default), [`_AddRuntimeAssertionsForConstraintsPass`](`9591e52880/torch/_export/passes/add_runtime_assertions_for_constraints_pass.py (L98-L100)`) will try to append assertion nodes based on the range constraints extracted from the symbol's shape env during another interpretation round. 4) However, because of 1), in the AOT export round the range constraint logic won't run for symbols generated during that round, so no range constraint information is available later for the assertion round, which causes the issue. 5) As a result, it fails at `torch.empty((a, 4))` (there is no constraint on `a` saying that it must be positive). The fix here is to implement the range constraint logic as a native aten op (with the CPU implementation as a no-op) so that it survives AOT export. NOTE: [Logic](`2d745b95d7/torch/fx/experimental/symbolic_shapes.py (L350-L365C15)`) within [`constrain_range`](`2d745b95d7/torch/fx/experimental/symbolic_shapes.py (LL313C74-L313C74)`) is split out as `constrain_range_int` to capture the case when a non-`SymInt` is passed in, and is reused in the new `_constrain_range`. The reason is that when a non-`SymInt` is provided: * If it directly calls `sym_constrain_range`, the C++ version will be called, which is a no-op. 
* So in this case it calls `constrain_range_int` instead, to catch issues such as the user providing an input whose tensor's shape could be out of range during export, as in the following variation of the code example above: ``` ... inp = torch.tensor([10]) ep = torch._export.export(f, (inp,)) # immediately raise error ``` Differential Revision: [D46734204](https://our.internmc.facebook.com/intern/diff/D46734204) Pull Request resolved: https://github.com/pytorch/pytorch/pull/103346 Approved by: https://github.com/tugsbayasgalan	2023-06-16 14:55:40 +00:00
mingfeima	69b09eca5a	optimize reflection padding performance on CPU (#102254 ) This patch improves reflection padding performance on CPU. Original kernel has nested paralleled loops, e.g. first on dim of batch and then on dim of channels, this is not optimal practice when N * C is small. This patch did dimension collapse on NC and adjacent spatial dims to maximize the parallelism scope. The following benchmark result gathered on Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket. ### single core inference ``` (before) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.281 ms; ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 55.675 ms; (after) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.049 ms; ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 17.252 ms; ``` ### single socket inference ``` (before) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.118 ms; ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 4.023 ms; (after) ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.010 ms; ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.149 ms; ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/102254 Approved by: https://github.com/cpuhrsch	2023-06-14 17:18:51 +00:00
PyTorch MergeBot	2c313e7b99	Revert "Record view stacks if running anomaly mode (#103185 )" This reverts commit a02c573a8996d5d47585410ceaf81c87104cfd43. Reverted https://github.com/pytorch/pytorch/pull/103185 on behalf of https://github.com/izaitsevfb due to Breaks internal builds, see D46629734 ([comment](https://github.com/pytorch/pytorch/pull/103185#issuecomment-1588258206))	2023-06-12 23:52:10 +00:00
Edward Z. Yang	a02c573a89	Record view stacks if running anomaly mode (#103185 ) Now, when you do an inplace mutation and the view is naughty, you get this message: ``` RuntimeError: A view was created in no_grad mode and is being modified inplace with grad mode enabled. Given that this use case is ambiguous and error-prone, it is forbidden. You can clarify your code by moving both the view and the inplace either both inside the no_grad block (if you don't want the inplace to be tracked) or both outside (if you want the inplace to be tracked). To find out where this view was allocated, run your entire forward region under anomaly mode (torch.autograd.detect_anomaly(check_nan=False)). ``` When you run under anomaly mode, you get: ``` RuntimeError: A view was created in no_grad mode and is being modified inplace with grad mode enabled. Given that this use case is ambiguous and error-prone, it is forbidden. You can clarify your code by moving both the view and the inplace either both inside the no_grad block (if you don't want the inplace to be tracked) or both outside (if you want the inplace to be tracked). 
This view was allocated at: File "/data/users/ezyang/c/pytorch/test/test_autograd.py", line 4299, in arglebargle File "/data/users/ezyang/c/pytorch/test/test_autograd.py", line 4306, in test_anomaly_gives_view_stack File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/case.py", line 549, in _callTestMethod File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/case.py", line 591, in run File "/data/users/ezyang/c/pytorch/torch/testing/_internal/common_utils.py", line 2266, in _run_with_retry File "/data/users/ezyang/c/pytorch/torch/testing/_internal/common_utils.py", line 2337, in run File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/case.py", line 650, in __call__ File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 122, in run File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 84, in __call__ File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 122, in run File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 84, in __call__ File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/runner.py", line 184, in run File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/main.py", line 271, in runTests File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/main.py", line 101, in __init__ File "/data/users/ezyang/c/pytorch/torch/testing/_internal/common_utils.py", line 894, in run_tests File "/data/users/ezyang/c/pytorch/test/test_autograd.py", line 11209, in <module> ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/103185 Approved by: https://github.com/zdevito	2023-06-09 16:56:28 +00:00
Shiyan Deng	685505353a	Back out "Add PyObject preservation for UntypedStorage (#97470 )" (#102553 ) Summary: Original commit changeset: c24708d18ccb Original Phabricator Diff: D46159983 Test Plan: SL tests and CI Differential Revision: D46284986 Pull Request resolved: https://github.com/pytorch/pytorch/pull/102553 Approved by: https://github.com/DanilBaibak	2023-06-01 17:23:43 +00:00
Kurt Mohler	5fe629e314	Add PyObject preservation for UntypedStorage (#97470 ) Part of #91395 Pull Request resolved: https://github.com/pytorch/pytorch/pull/97470 Approved by: https://github.com/ezyang	2023-05-23 01:27:30 +00:00
Richard Li	c523d7d899	Add a new hook (#99854 ) Differential Revision: D45220984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/99854 Approved by: https://github.com/albanD	2023-04-26 23:00:38 +00:00
William Wen	785676ccb0	[dynamo 3.11] refactor cpython function defs out of eval_frame.c (#99947 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/99947 Approved by: https://github.com/voznesenskym, https://github.com/albanD	2023-04-26 00:18:12 +00:00
Elias Ellison	d881b2978c	Make autocast cache and buffer stealing aware of cudagraph static output tensors (#99368) In this stack of PRs we are adding caching of output tensors for cudagraph trees after the initial recording. On initial recording we do not cache tensor outputs because this prevents memory from being reclaimed. On subsequent executions we do cache them to avoid overhead. However, because there is an extra reference around, this caused divergent recording & execution behavior in both autocast caching and autograd gradient stealing. Divergent recording & execution would keep on re-recording and eventually stabilize, but it's not what you want to see happen. This PR makes the autocast cache and buffer stealing aware of the cudagraph static output tensors. I will add this to the other cudagraph impl in another PR. Not sure if this should be in autograd or in autocast since it affects both, or somewhere else. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99368 Approved by: https://github.com/albanD, https://github.com/ezyang	2023-04-24 20:23:12 +00:00
Rodrigo Kumpera	38e964056b	Reland python ops (#99170 ) Waiting for the revert to land. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99170 Approved by: https://github.com/albanD	2023-04-18 15:15:46 +00:00
PyTorch MergeBot	1c042a2137	Revert "Reland python ops (#99170 )" This reverts commit d4de64ae8d5587ed4a4a9d6ce9555a9a7976866d. Reverted https://github.com/pytorch/pytorch/pull/99170 on behalf of https://github.com/DanilBaibak due to Break internal build	2023-04-18 11:37:43 +00:00
Rodrigo Kumpera	d4de64ae8d	Reland python ops (#99170 ) Waiting for the revert to land. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99170 Approved by: https://github.com/albanD	2023-04-17 21:53:41 +00:00
Rodrigo Kumpera	a910045add	[PATCH] Back out "Move functional collectives implementation to python. (#98595 ) (#99168 ) Summary: Original commit changeset: ba36f8751adc Original Phabricator Diff: D44788697 Test Plan: model loading is fine after reverting the diff Reviewed By: zyan0, sayitmemory Differential Revision: D44921259 --- Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/99168 Approved by: https://github.com/izaitsevfb	2023-04-14 23:48:19 +00:00
Rodrigo Kumpera	24d9001527	Move functional collectives implementation to python. (#98595 ) This simplifies a lot the work we need to add new ops. This relands the previous PR, not sure why it was reverted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/98595 Approved by: https://github.com/wconstab	2023-04-07 21:48:05 +00:00
PyTorch MergeBot	67d1a77086	Revert "Move functional collectives implementation to python. (#98315 )" This reverts commit 8b0374f83c605c47b7c1ba9274011c4b961666ce. Reverted https://github.com/pytorch/pytorch/pull/98315 on behalf of https://github.com/huydhn due to Sorry for reverting for PR. This is failing in trunk probably due to a landrace	2023-04-06 16:49:40 +00:00
Rodrigo Kumpera	8b0374f83c	Move functional collectives implementation to python. (#98315 ) This simplifies a lot the work we need to add new ops. Pull Request resolved: https://github.com/pytorch/pytorch/pull/98315 Approved by: https://github.com/albanD, https://github.com/wconstab, https://github.com/Neilblaze	2023-04-06 14:06:16 +00:00
PyTorch MergeBot	f4f1a5b5b3	Revert "Move functional collectives to the right namespace (#97793 )" This reverts commit 184bfbc3d7b37e8f202f4938f6ea9ba557c93b1e. Reverted https://github.com/pytorch/pytorch/pull/97793 on behalf of https://github.com/atalman due to breaks internal builds	2023-03-31 16:02:07 +00:00
Rodrigo Kumpera	184bfbc3d7	Move functional collectives to the right namespace (#97793 ) This moves them from `torch._C._nn` to `torch._C._dist` Pull Request resolved: https://github.com/pytorch/pytorch/pull/97793 Approved by: https://github.com/albanD	2023-03-30 22:18:13 +00:00
Shuming Hu	b45880c537	Optionally ignore utf-8 decoding error when converting std::string to python str. (#97282) Summary: When language models use a C++ tokenizer, the outputs are C++ strings that are not necessarily valid UTF-8. The default pybind11 casting uses strict UTF-8 decoding. We relax the decoding using the 'ignore' argument. Test Plan: https://www.internalfb.com/intern/testinfra/testrun/6473924609918070 Reviewed By: Nayef211 Differential Revision: D43970697 Pull Request resolved: https://github.com/pytorch/pytorch/pull/97282 Approved by: https://github.com/davidberard98	2023-03-23 01:19:08 +00:00
Zachary DeVito	e74f70d212	Revert "Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541 )"" (#96878 ) This reverts commit e1ea584b1caf9c50de25ce69396dfeb523a452c0. Adds __has_include check to fix fbcode build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/96878 Approved by: https://github.com/ezyang	2023-03-16 04:12:54 +00:00
PyTorch MergeBot	e1ea584b1c	Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541 )" This reverts commit 4e1060c609c094fd5f58041ebed803f74410ee36. Reverted https://github.com/pytorch/pytorch/pull/95541 on behalf of https://github.com/DanilBaibak due to breaking internal builds	2023-03-15 13:28:41 +00:00
Ramin Azarmehr	234df29901	[MPS] Add C++ API support for MPS backend (#96668 ) - This enables the APIs `torch::mps::is_available()/synchronize()/manual_seed()` for use in PyTorch C++. - Added test case for C++ APIs to `mps_test_allocator.cpp` Fixes #96425 Pull Request resolved: https://github.com/pytorch/pytorch/pull/96668 Approved by: https://github.com/kulinseth, https://github.com/albanD, https://github.com/malfet	2023-03-14 20:27:40 +00:00
Zachary DeVito	4e1060c609	[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541) This refactors the stack trace facility specific to memory profiling in python+cuda to make a generic facility to generate combined stack traces. The generic facility (combined_traceback.h) does not require python to be around to work, but will return python stacks if it is present. This facility is then used to add support for stack trace gathering in memory profiling that happens directly from C++. It is also used to expose a python API for gathering and symbolizing combined stacks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/95541 Approved by: https://github.com/ezyang	2023-03-14 18:26:05 +00:00
PyTorch MergeBot	a07817ad8f	Revert "[MPS] Add C++ API support for MPS backend (#96668 )" This reverts commit 069ace131c7889c7aaf2ea64fe8eb44a8ff1e983. Reverted https://github.com/pytorch/pytorch/pull/96668 on behalf of https://github.com/DanilBaibak due to breaking internal builds	2023-03-14 12:43:04 +00:00
Nikita Shulga	82daf98151	[Sparse] Move `SparseTensorUtils.` to `native/` (#96696 ) Fixes internal linking problem after `DECLARE_DISPATCH` was introduced in SparseTensorUtils.cpp, but implemented inside the native library. Also, fix `sign-unsigned` compare in `_flatten_indices_impl` Followups: Move code declared/implemented in `SparseTensorUtils.` to `at::native` namespace Pull Request resolved: https://github.com/pytorch/pytorch/pull/96696 Approved by: https://github.com/albanD	2023-03-14 02:56:52 +00:00
Ramin Azarmehr	069ace131c	[MPS] Add C++ API support for MPS backend (#96668 ) - This enables the APIs `torch::mps::is_available()/synchronize()/manual_seed()` for use in PyTorch C++. - Added test case for C++ APIs to `mps_test_allocator.cpp` Fixes #96425 Pull Request resolved: https://github.com/pytorch/pytorch/pull/96668 Approved by: https://github.com/kulinseth, https://github.com/albanD	2023-03-13 23:15:37 +00:00
Zachary DeVito	4b372e3958	[memory profiling] C++ tracing support (#95357) Adds the ability to quickly generate stack traces for C++, and combine Python, TorchScript, and C++ frames into a single trace. This makes it possible for the memory tracer to record allocations inside C++ code (e.g. convolution temporaries, backward operators). The unwinder code is ~10x faster than execinfo.h's backtrace because it caches fast unwinder routines for instruction pointers that have already been seen. It is also only 1.2--2x slower than copying the entire stack (the approach perf takes), while using 2 orders of magnitude less space per stack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/95357 Approved by: https://github.com/bertmaher	2023-03-12 07:24:14 +00:00
Xunsong, Huang	b053a0f2ba	[XPU][Profiler] Add API support for XPU profiler to Kineto path (#94502 ) This patch is aimed to add support to XPU profiler which will co-work with Kineto. After this PR, kineto will follow these API to fit itself. Also, the development of interface in python is near done. Signed-off-by: Huang, Xunsong <xunsong.huang@intel.com> Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/94502 Approved by: https://github.com/ezyang	2023-03-10 12:17:14 +00:00
Nikita Vedeneev	d0f4d62961	flatten_indices: remove syncs (#94401 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/94401 Approved by: https://github.com/cpuhrsch, https://github.com/pearu	2023-03-10 12:03:26 +00:00
David Berard	53b4f6c0f6	Revert "[jit] Add c++ stacktraces for jit::ErrorReport (#94842 )" (#95886 ) This reverts commit 70029214f300f611e7dd816b5f64426224f6ab96. It broke some internal tests. Differential Revision: [D43735833](https://our.internmc.facebook.com/intern/diff/D43735833) Pull Request resolved: https://github.com/pytorch/pytorch/pull/95886 Approved by: https://github.com/malfet, https://github.com/qihqi	2023-03-03 05:49:40 +00:00
David Berard	70029214f3	[jit] Add c++ stacktraces for jit::ErrorReport (#94842 ) Summary: This PR adds C++ stacktraces to jit::ErrorReports. After this PR, if you run with `TORCH_SHOW_CPP_STACKTRACES=1` environment variable and a jit::ErrorReport is thrown, then the C++ stacktrace should be displayed. More background: This behavior already occurs for c10::Error; but not for jit::ErrorReport. jit::ErrorReport _does_ usually have a python stacktrace for the python source, but it is sometimes still helpful to know where in the C++ codebase the error came from. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94842 Approved by: https://github.com/qihqi	2023-02-28 22:37:51 +00:00
Rodrigo Kumpera	e22d791287	[PTD] Introduce tracing friendly collectives. (#93990 ) This change adds torch.distributed.traceable_collectives. This experimental API enables collectives to be fully traced by dynamo and FX. See #93173 for the RFC Pull Request resolved: https://github.com/pytorch/pytorch/pull/93990 Approved by: https://github.com/wconstab, https://github.com/wanchaol, https://github.com/H-Huang	2023-02-16 15:35:01 +00:00
Ramin Azarmehr	bdd8f518d7	[MPS] Add Python Module Bindings for the MPS backend (#94417 ) - This PR is a prerequisite for the upcoming Memory Leak Detection PR. - Enable global manual seeding via `torch.manual_seed()` + test case - Add `torch.mps.synchronize()` to wait for MPS stream to finish + test case - Enable the following python interfaces for MPS: `torch.mps.[get_rng_state(), set_rng_state(), synchronize(), manual_seed(), seed()]` - Added some test cases in test_mps.py - Added `mps.rst` to document the `torch.mps` module. - Fixed the failure with `test_public_bindings.py` Description of new files added: - `torch/csrc/mps/Module.cpp`: implements `torch._C` module functions for `torch.mps` and `torch.backends.mps`. - `torch/mps/__init__.py`: implements Python bindings for `torch.mps` module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94417 Approved by: https://github.com/albanD	2023-02-12 21:22:30 +00:00
PyTorch MergeBot	4fe365774a	Revert "[MPS] Add Python Module Bindings for the MPS backend (#94417 )" This reverts commit beb4f5bf396ec2d53defa73c81aac48c38360544. Reverted https://github.com/pytorch/pytorch/pull/94417 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it seems to break MacOS test in trunk `bae397ec63`	2023-02-11 05:24:45 +00:00
Ramin Azarmehr	beb4f5bf39	[MPS] Add Python Module Bindings for the MPS backend (#94417 ) - This PR is a prerequisite for the upcoming Memory Leak Detection PR. - Enable global manual seeding via `torch.manual_seed()` + test case - Add `torch.mps.synchronize()` to wait for MPS stream to finish + test case - Enable the following python interfaces for MPS: `torch.mps.[get_rng_state(), set_rng_state(), synchronize(), manual_seed(), seed()]` - Added some test cases in test_mps.py - Added `mps.rst` to document the `torch.mps` module. - Fixed the failure with `test_public_bindings.py` Description of new files added: - `torch/csrc/mps/Module.cpp`: implements `torch._C` module functions for `torch.mps` and `torch.backends.mps`. - `torch/mps/__init__.py`: implements Python bindings for `torch.mps` module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94417 Approved by: https://github.com/albanD	2023-02-10 23:18:41 +00:00
Maxwell Nuyens	0d0ebcdfe5	feature: adding the ability to restore shapes after loading a traced model (#90744 ) Adds the ability to store inputs used in tracing models when calling torch.jit.save and restore the input shapes using torch.jit.load if the appropriate variables are set. Fixes [89185](https://github.com/pytorch/pytorch/issues/89185) Pull Request resolved: https://github.com/pytorch/pytorch/pull/90744 Approved by: https://github.com/davidberard98	2023-02-10 17:12:52 +00:00
mingfeima	c620ece726	port sparse_mm.reduce to pytorch and optimize it on CPU (#83727) ### Motivation of this PR This patch migrates `spmm_reduce` from `torch-sparse` (a 3rd party dependency for PyG) to `torch`, in response to the initial proposal for fusion of Gather, Apply, Scatter (GAS) in Message Passing of GNN inference/training. https://github.com/pytorch/pytorch/issues/71300 GAS is the major step for Message Passing. The behavior of GAS can be classified into 2 kinds depending on the storage type of `EdgeIndex`, which records the connections of nodes: * COO: the hotspot is `scatter_reduce` * CSR: the hotspot is `spmm_reduce` The reduce type can be chosen from: "sum", "mean", "max", "min". This PR extends `torch.sparse.mm` with a `reduce` argument, which maps to `torch.sparse_mm.reduce` internally. `sparse_mm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_sparse_mm_reduce_impl` which has dual outputs: * `out` - the actual output * `arg_out` - records output indices in the non zero elements if the reduce type is "max" or "min". This is only useful for training, so for inference it will not be calculated. ### Performance Benchmark on GCN for ogbn-products on Xeon single socket: the workload is improved by `4.3x` with this patch. The performance benefit for training will be bigger: the original backward impl for `sum|mean` is sequential; the original backward impl for `max|min` is not fused. 
#### before: ``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ torch_sparse::spmm_sum 97.09% 56.086s 97.09% 56.088s 6.232s 9 aten::linear 0.00% 85.000us 1.38% 795.485ms 88.387ms 9 aten::matmul 0.00% 57.000us 1.38% 795.260ms 88.362ms 9 aten::mm 1.38% 795.201ms 1.38% 795.203ms 88.356ms 9 aten::relu 0.00% 50.000us 0.76% 440.434ms 73.406ms 6 aten::clamp_min 0.76% 440.384ms 0.76% 440.384ms 73.397ms 6 aten::add_ 0.57% 327.801ms 0.57% 327.801ms 36.422ms 9 aten::log_softmax 0.00% 23.000us 0.10% 55.503ms 18.501ms 3 ``` #### after ``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ aten::spmm_sum 87.35% 11.826s 87.36% 11.827s 1.314s 9 aten::linear 0.00% 92.000us 5.87% 794.451ms 88.272ms 9 aten::matmul 0.00% 62.000us 5.87% 794.208ms 88.245ms 9 aten::mm 5.87% 794.143ms 5.87% 794.146ms 88.238ms 9 aten::relu 0.00% 53.000us 3.35% 452.977ms 75.496ms 6 aten::clamp_min 3.35% 452.924ms 3.35% 452.924ms 75.487ms 6 aten::add_ 2.58% 348.663ms 2.58% 348.663ms 38.740ms 9 aten::argmax 0.42% 57.473ms 0.42% 57.475ms 14.369ms 4 aten::log_softmax 0.00% 22.000us 0.39% 52.605ms 17.535ms 3 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83727 Approved by: https://github.com/jgong5, https://github.com/cpuhrsch, https://github.com/rusty1s, https://github.com/pearu	2023-02-10 15:56:40 +00:00
jjsjann123	c11b301bcd	[NVFUSER] refactor nvfuser build (#89621) This PR is the first step towards refactoring the build for nvfuser in order to have the codegen be a standalone library. Contents inside this PR: 1. nvfuser code base has been moved to `./nvfuser`, from `./torch/csrc/jit/codegen/cuda/`, except for registration code for integration (interface.h/interface.cpp) 2. splits the build system so nvfuser is generating its own `.so` files. Currently there are: - `libnvfuser_codegen.so`, which contains the integration, codegen and runtime system of nvfuser - `nvfuser.so`, which is nvfuser's python API via pybind. Python frontend is now exposed via `nvfuser._C.XXX` instead of `torch._C._nvfuser` 3. nvfuser cpp tests are currently being compiled into `nvfuser_tests` 4. cmake is refactored so that: - nvfuser now has its own `CMakeLists.txt`, which is under `torch/csrc/jit/codegen/cuda/`. - nvfuser backend code is not compiled inside `libtorch_cuda_xxx` any more - nvfuser is added as a subdirectory under `./CMakeLists.txt` at the very end after torch is built. - since nvfuser has a dependency on torch, the registration of nvfuser at runtime is done via dlopen (`at::DynamicLibrary`). This avoids a circular dependency in cmake, which would be a nightmare to handle. For details, look at `torch/csrc/jit/codegen/cuda/interface.cpp::LoadingNvfuserLibrary` Future work that's scoped in following PRs: - Currently, since nvfuser codegen has a dependency on torch, we need to refactor that out so we can move nvfuser into a submodule and not rely on dlopen to load the library. @malfet - Since we moved nvfuser into a cmake build, we effectively disabled the bazel build for nvfuser. This could impact internal workloads at Meta, so we need to put support back. cc'ing @vors Pull Request resolved: https://github.com/pytorch/pytorch/pull/89621 Approved by: https://github.com/davidberard98	2023-01-26 02:50:44 +00:00
Kurt Mohler	4d9920fa9c	Move PyInterpreter code in `python_variable.cpp` to its own files (#92647 ) Part of #91395 Pull Request resolved: https://github.com/pytorch/pytorch/pull/92647 Approved by: https://github.com/ezyang, https://github.com/albanD	2023-01-24 23:08:23 +00:00
Elias Ellison	70f4b3551c	Add Hook to store arbitrary python objects that are copied over in tls (#89169 ) For the cudagraphs implementation, we would like to reuse objects that are defined in python across the forward and backward. The backward is run in a different thread, so to handle this we add an api for copying over arbitrary python objects in pytorch's thread local state, in the same way that C++ objects are copied over currently. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89169 Approved by: https://github.com/albanD	2023-01-24 05:24:57 +00:00
soulitzer	1bc60c6b31	[reland] Improve hooks ordering behavior (#92559 ) This reverts commit e525f433e15de1f16966901604a8c4c662828a8a. Original PR: #85849 Fixes #ISSUE_NUMBER In addition to reverting the revert, this PR: - defines the virtual destructor of FunctionPreHook in the header. Why? Presumably the internal build imports the header from somewhere, but does not have function_hooks.cpp (where the virtual destructor was previously defined) in the same compilation unit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/92559 Approved by: https://github.com/albanD	2023-01-19 08:17:32 +00:00
yanbing-j	94a7c01159	Enable oneDNN implementation in LSTM op (#91158 ) ### Description This PR is to enable oneDNN implementation in LSTM op to improve the performance of it. Both FP32 and BF16 are supported. ### Performance improvement In CPX 28C, with setting iomp and jemalloc. We choose 8 LSTM input options (including input_size, hidden_size, num_layers, bidirectional, bias, batch_first, dropout, batch_size, seq_len), and the final option is a real input from train-clean-100 in LibriSpeech dataset. The performance improvements are shown in the following figures. We can see that LSTM with oneDNN implementation can perform better than the original. In single socket: ![image](https://user-images.githubusercontent.com/61222868/211182994-833debec-518a-4b35-8504-6b0fadb17930.png) ![image](https://user-images.githubusercontent.com/61222868/211183012-31e1253f-2c60-4c92-a656-c239a971b453.png) In single core: ![image](https://user-images.githubusercontent.com/61222868/211183017-186e5d47-cb9a-4c1e-914f-fa718e769f1c.png) ![image](https://user-images.githubusercontent.com/61222868/211183022-53266857-5a9e-4a95-b300-33fa34811d08.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/91158 Approved by: https://github.com/jgong5, https://github.com/malfet	2023-01-18 04:41:18 +00:00
mingfeima	3ab58fd5ed	optimize sampled_addmm performance on CPU (SparseCSR) (#90978) ### Target and Background This PR improves the performance of `sampled_addmm` on CPU device. This is part of the effort to improve PyG performance on CPU for GNN training/inference. The current implementation is a reference design which converts the `SparseCSR` tensor back to a dense tensor, does the addmm, and then converts back to `SparseCSR` again: this is going to be very slow and won't be able to run most of the datasets under https://github.com/snap-stanford/ogb (converting to dense would trigger `OOM`). ### Benchmarks Right now we don't have any hands-on benchmark or workload to test this, since this operator is not used in PyG yet. I fetched the dataset from `ogb-products` where: * number of nodes: 2.4 * 10^6 * number of edges: 1.26 * 10^8 * number of features: 128 So if we store the adjacency matrix as dense, it is going to be 2.4 * 2.4 * 4 * 10^12 bytes (about 23 TB), which will be OOM on the current code. I extracted the first 1k rows to compare, 1100x speedup: CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket. ``` ### before: run 1000 rows from the whole dataset sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms! ### after: run 1000 rows from the whole dataset sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms! ### after: run the whole dataset sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms! ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/90978 Approved by: https://github.com/pearu, https://github.com/cpuhrsch	2023-01-12 12:04:07 +00:00
Nikita Shulga	6cef59487a	[BE] Move internal only non-globbed lists to OSS (#91513 ) Summary: Should prevent internal only fixes that were required for https://github.com/pytorch/pytorch/pull/91104 Just moves the list to `build_variables.bzl` and makes it a sublist of aten_cpu_source_non_codegen_list Test Plan: CI Differential Revision: D42281502 Pull Request resolved: https://github.com/pytorch/pytorch/pull/91513 Approved by: https://github.com/kit1980, https://github.com/atalman	2022-12-31 00:02:43 +00:00
Ramin Azarmehr	eeb9154b27	[MPS] Add MPSHooks interface to enable accessing MPS functions globally (#91104 ) This PR is a prerequisite to the upcoming MPSGenerator changes required for Random Ops. Add `MPSHooksInterface.cpp` to `aten_cpu_source_non_codegen_list` Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/91104 Approved by: https://github.com/kulinseth, https://github.com/malfet	2022-12-21 17:37:09 +00:00
min-jean-cho	6d2b0cbb40	[Re-landing 86706] [JIT] Frozen Graph Linear-BatchNormNd Folding (#91020 ) Re-landing #86706 This PR adds linear-batchnormNd folding for JIT frozen graphs. Performance benchmark A preliminary benchmark with a simple model of linear+bn1d tested on first socket, physical cores of skylake machine. FP32, JIT without linear-bn folding ![Screenshot (1368)](https://user-images.githubusercontent.com/93151422/195168944-cfc5b920-bc82-4be1-a221-d194c8fa6c18.png) with linear-bn folding ![Screenshot (1367)](https://user-images.githubusercontent.com/93151422/195168926-267b0515-45a1-4f08-922d-c150845199ae.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/91020 Approved by: https://github.com/davidberard98	2022-12-21 08:00:32 +00:00
Salil Desai	8c80a4684b	[Vulkan + Profiler] Report Vulkan Events to Profiler in QueryPool (#90670 ) @bypass-github-export-checks With this change, we see Vulkan events reported on the generated chrometrace with proper names and durations. However, their start/end times are not yet synced with the cpu event timeline, and their parent/child relationships are not established properly. These concerns will be addressed in future diffs Differential Revision: [D39834807](https://our.internmc.facebook.com/intern/diff/D39834807/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39834807/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/90670 Approved by: https://github.com/kimishpatel	2022-12-19 19:56:28 +00:00
PyTorch MergeBot	31b8dc7542	Revert "[JIT] Frozen Graph Linear-BatchNormNd Folding (#86706 )" This reverts commit e585156c59767ff13306a31d8c31ffe7a33439dc. Reverted https://github.com/pytorch/pytorch/pull/86706 on behalf of https://github.com/davidberard98 due to possibly causing internal build failures, will revert and investigate later	2022-12-16 00:49:54 +00:00
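
Note on the Linear-BatchNormNd folding commits (6d2b0cbb40, 31b8dc7542): the idea behind folding can be illustrated with a small sketch. This is plain Python with no PyTorch dependency and is not the JIT pass itself. The function names, the list-based tensors, and the `eps` default are assumptions chosen for illustration. It relies only on the standard identity that an inference-mode batch norm applied after a linear layer is itself an affine map, so it can be absorbed into the linear layer's weight and bias.

```python
import math

def linear(weight, bias, x):
    """y = W x + b for one input vector x (weight is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def batchnorm_eval(y, mean, var, gamma, beta, eps=1e-5):
    """Inference-mode batch norm using fixed running statistics."""
    return [g * (yi - m) / math.sqrt(v + eps) + bt
            for yi, m, v, g, bt in zip(y, mean, var, gamma, beta)]

def fold_linear_bn(weight, bias, mean, var, gamma, beta, eps=1e-5):
    """Return (W', b') such that linear(W', b', x) == bn(linear(W, b, x)).

    Hypothetical helper: each output row is scaled by gamma / sqrt(var + eps),
    and the bias absorbs the mean shift and beta.
    """
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    folded_w = [[s * w for w in row] for s, row in zip(scale, weight)]
    folded_b = [s * (b - m) + bt
                for s, b, m, bt in zip(scale, bias, mean, beta)]
    return folded_w, folded_b

# Made-up parameters for a 2-input, 2-output layer.
W = [[1.0, 2.0], [0.5, -1.0]]
b = [0.1, -0.2]
mean, var = [0.3, -0.1], [1.5, 0.7]
gamma, beta = [1.2, 0.8], [0.05, -0.05]
x = [0.4, -0.6]

unfolded = batchnorm_eval(linear(W, b, x), mean, var, gamma, beta)
Wf, bf = fold_linear_bn(W, b, mean, var, gamma, beta)
folded = linear(Wf, bf, x)
print(all(math.isclose(u, f) for u, f in zip(unfolded, folded)))
```

Because the folded layer is a single affine map, the batch norm costs nothing extra at inference time. That is presumably why the commit applies it only to frozen graphs, where the running statistics no longer change.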


148 Commits
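
Note on commit b45880c537 (optionally ignoring UTF-8 decoding errors): the difference between strict and 'ignore' decoding can be seen with Python's built-in `bytes.decode`. This is only an analogy for the behavior the commit message describes, not the pybind11 caster it changes. The helper name `decode_tokens` and the sample bytes are assumptions made up for this sketch.

```python
def decode_tokens(raw: bytes, strict: bool = True) -> str:
    """Decode tokenizer output bytes, optionally dropping invalid sequences.

    Hypothetical helper; it only wraps the standard bytes.decode error modes.
    """
    return raw.decode("utf-8", errors="strict" if strict else "ignore")

# 0xC3 starts a two-byte UTF-8 sequence, but 'l' is not a continuation byte,
# so this buffer is not valid UTF-8 (e.g. a token split mid-character).
broken = b"h\xc3llo"

try:
    decode_tokens(broken, strict=True)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

lenient = decode_tokens(broken, strict=False)
print(strict_failed, lenient)  # True hllo
```

The trade-off is that 'ignore' silently drops the offending bytes, so the resulting string may be shorter than the input. That makes it reasonable as an opt-in, as the commit title ("Optionally") suggests.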