pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 13:44:15 +08:00

Author	SHA1	Message	Date
Nikita Shulga	4cb534f92e	Make PyTorch code-base clang-tidy compliant (#56892) Summary: This is an automatic change generated by the following script: ``` #!/usr/bin/env python3 from subprocess import check_output, check_call import os def get_compiled_files_list(): import json with open("build/compile_commands.json") as f: data = json.load(f) files = [os.path.relpath(node['file']) for node in data] for idx, fname in enumerate(files): if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'): files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')] return files def run_clang_tidy(fname): check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"]) changes = check_output(["git", "ls-files", "-m"]) if len(changes) == 0: return check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"]) def main(): git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n") compiled_files = get_compiled_files_list() for idx, fname in enumerate(git_files): if fname not in compiled_files: continue if fname.startswith("caffe2/contrib/aten/"): continue print(f"[{idx}/{len(git_files)}] Processing {fname}") run_clang_tidy(fname) if __name__ == "__main__": main() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892 Reviewed By: H-Huang Differential Revision: D27991944 Pulled By: malfet fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179	2021-04-28 14:10:25 -07:00
Robin Cheng	5d940e2fbc	[TSAN] Fix PythonEngine data-race-on-vptr. (#56808) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56808 For information about data-race-on-vptr in general, see https://www.internalfb.com/intern/wiki/TSAN/Common_Concurrency_Mistakes/Stopping_a_Thread_in_Destructor/ Engine::~Engine() was previously responsible for stopping the threads. This caused a data race on the object's vptr while PythonEngine was being destroyed. This commit fixes the data race by making ~PythonEngine stop the threads before control passes to the base class's destructor. Test Plan: Many tests are affected, but here's one example: buck test mode/dev-tsan -c fbcode.tsan_strict_mode=true //oculus/research/orcoptics/deep_learning/srg_nn/tests:test_grating_net -- 'test_train (oculus.research.orcoptics.deep_learning.srg_nn.tests.test_grating_net.TestGratingNet)' --run-disabled Reviewed By: walterddr, albanD Differential Revision: D27972384 fbshipit-source-id: 8b70fec8d9326497c591a2777b355ea590a85082	2021-04-23 17:39:27 -07:00
Edward Yang	6ec71ed4f9	Replace all direct cdata access with THPVariable_Unpack (#55799) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55799 I'm going to change the implementation of cdata soon, so I need to abstract over cdata access with a function. Additionally, many users are manually casting to THPVariable to access the member, so I can remove these unsafe casts in the client code (the implementation, of course, is still doing an unsafe cast). Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: albanD Differential Revision: D27712130 Pulled By: ezyang fbshipit-source-id: 95fcc013bf3913d67f2c634068eb5b3aab144cb3	2021-04-15 08:57:04 -07:00
Jeffrey Wan	aa2fede201	Fix autograd when `inputs` contains tensors without materialized grad_fn (#51940) Summary: Fixes https://github.com/pytorch/pytorch/issues/39784 At the time the issue was filed, there was only issue (1) below. There are actually now two issues here: 1. We always set all inputs passed in through the `inputs` arg as `needed = True` in exec_info. So if we pass in an input that has a grad_fn that is not materialized, we create an entry of exec_info with nullptr as key with `needed = True`. Coincidentally, when we perform simple arithmetic operations, such as "2 * x", one of the next edges of mul is an invalid edge, meaning that its grad_fn is also nullptr. This causes the discovery algorithm to set all grad_fns that have a path to this invalid_edge as `needed = True`. 2. Before the commit that made the engine skip the dummy node, we knew that the root node is always needed, i.e., we hardcoded `exec_info[&graph_root]=true`. The issue was that this logic wasn't updated after the code was changed to skip the graph root. To address (1), instead of passing in an invalid edge if an input in `inputs` has no grad_fn, we create a dummy grad_fn. This is done in both the Python and C++ entry points. The alternative is to add logic for both backward() and grad() cases to check whether the grad_fn is nullptr and set needed=false in that case (the .grad() case would be slightly more complicated than the .backward() case here). For (2), we perform one final iteration of the discovery algorithm so that we really know whether we need to execute the graph root. Pull Request resolved: https://github.com/pytorch/pytorch/pull/51940 Reviewed By: VitalyFedyunin Differential Revision: D26369529 Pulled By: soulitzer fbshipit-source-id: 14a01ae7988a8de621b967a31564ce1d7a00084e	2021-02-11 09:22:15 -08:00
Nikita Shulga	6f3aa58d80	Fix autograd thread crash with python-3.9 (#50998) Summary: Update pybind repo to include `gil_scoped_acquire::disarm()` methods. In python_engine, allocate scoped_acquire as unique_ptr and leak it if the engine is finalizing for Python-3.9+. Fixes https://github.com/pytorch/pytorch/issues/50014 and https://github.com/pytorch/pytorch/issues/50893 Pull Request resolved: https://github.com/pytorch/pytorch/pull/50998 Reviewed By: ezyang Differential Revision: D26038314 Pulled By: malfet fbshipit-source-id: 035411e22825e8fdcf1348fed36da0bc33e16f60	2021-01-26 13:29:47 -08:00
Qifan Lu	cfc3db0ca9	Remove THPWrapper (#49871) Summary: Remove `THPWrapper` from PyTorch C code since it is not used anymore and, because we have dropped Python 2 compatibility, its usage can be replaced by capsule objects (`PyCapsule_New`, `PyCapsule_CheckExact`, `PyCapsule_GetPointer` and `PyCapsule_GetDestructor`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/49871 Reviewed By: mruberry Differential Revision: D25715038 Pulled By: albanD fbshipit-source-id: cc3b6f967bbe0dc42c692adf76dff4e4b667fdd5	2020-12-30 03:01:52 -08:00
Jeffrey Wan	d20483a999	Skip dummy node creation for autograd engine when there is a single input and place on correct queue (#47592) Summary: Fixes https://github.com/pytorch/pytorch/issues/42890 - Removes dummy node - Places graph root on the correct queue based on input buffer's device instead of cpu queue by default. cpu - no significant change in speed (too noisy to measure), but we see up to 7% reduction in instruction count for small graphs. cuda - small reduction in speed (still very noisy) and up to ~20% reduction in instruction count for small graphs. CPU Code: ``` import torch from torch.utils.benchmark import Timer setup=""" a = torch.rand((2, 2), requires_grad=True) b = torch.rand((2, 2), requires_grad=True) gradient = torch.ones(2, 2) """ stmt=""" torch.autograd.grad(a*b, [a, b], gradient) """ timer = Timer(stmt, setup) print(timer.timeit(10000)) print(timer.collect_callgrind(100)) ``` Before (when dummy node is not skipped): ``` torch.autograd.grad(a*b, [a, b], gradient) setup: a = torch.rand((2, 2), requires_grad=True) b = torch.rand((2, 2), requires_grad=True) gradient = torch.ones(2, 2) 26.62 us 1 measurement, 10000 runs , 1 thread <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7efee44ad8e0> torch.autograd.grad(a*b, [a, b], gradient) setup: a = torch.rand((2, 2), requires_grad=True) b = torch.rand((2, 2), requires_grad=True) gradient = torch.ones(2, 2) All Noisy symbols removed Instructions: 9755488 9659378 Baseline: 4300 3784 100 runs per measurement, 1 thread ``` After ``` <torch.utils.benchmark.utils.common.Measurement object at 0x7f56961a7730> torch.autograd.grad(a*b, [a, b], gradient) setup: a = torch.rand((2, 2), requires_grad=True) b = torch.rand((2, 2), requires_grad=True) gradient = torch.ones(2, 2) 26.78 us 1 measurement, 10000 runs , 1 thread <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f56961a78e0> torch.autograd.grad(a*b, [a, b], gradient) setup: a = torch.rand((2, 2), requires_grad=True) b = torch.rand((2, 2), requires_grad=True) gradient = torch.ones(2, 2) All Noisy symbols removed Instructions: 9045508 8939872 Baseline: 4280 3784 100 runs per measurement, 1 thread ``` CUDA Before ``` <torch.utils.benchmark.utils.common.Measurement object at 0x7f84cbaa1ee0> torch.autograd.grad(out, [x, y], gradient) setup: x = torch.rand((2,2), requires_grad=True, device="cuda") y = torch.rand((2,2), requires_grad=True, device="cuda") out = x + y gradient = torch.ones(2, 2).cuda() 70.49 us 1 measurement, 10000 runs , 1 thread <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f84cbaa1e50> torch.autograd.grad(out, [x, y], gradient) setup: x = torch.rand((2,2), requires_grad=True, device="cuda") y = torch.rand((2,2), requires_grad=True, device="cuda") out = x + y gradient = torch.ones(2, 2).cuda() All Noisy symbols removed Instructions: 5054581 4951911 Baseline: 4105 3735 100 runs per measurement, 1 thread ``` Remove dummy node only ``` <torch.utils.benchmark.utils.common.Measurement object at 0x7fbf29c67eb0> torch.autograd.grad(out, [x, y], gradient) setup: x = torch.rand((2,2), requires_grad=True, device="cuda") y = torch.rand((2,2), requires_grad=True, device="cuda") out = x + y gradient = torch.ones(2, 2).cuda() 55.65 us 1 measurement, 10000 runs , 1 thread <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fbf29c67e20> torch.autograd.grad(out, [x, y], gradient) setup: x = torch.rand((2,2), requires_grad=True, device="cuda") y = torch.rand((2,2), requires_grad=True, device="cuda") out = x + y gradient = torch.ones(2, 2).cuda() All Noisy symbols removed Instructions: 5002105 4900841 Baseline: 4177 3731 100 runs per measurement, 1 thread ``` Remove dummy node and put in correct queue ``` <torch.utils.benchmark.utils.common.Measurement object at 0x7fb64438ce80> torch.autograd.grad(out, [x, y], gradient) setup: x = torch.rand((2,2), requires_grad=True, device="cuda") y = torch.rand((2,2), requires_grad=True, device="cuda") out = x + y gradient = torch.ones(2, 2).cuda() 27.56 us 1 measurement, 10000 runs , 1 thread <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fb64438cdf0> torch.autograd.grad(out, [x, y], gradient) setup: x = torch.rand((2,2), requires_grad=True, device="cuda") y = torch.rand((2,2), requires_grad=True, device="cuda") out = x + y gradient = torch.ones(2, 2).cuda() All Noisy symbols removed Instructions: 4104433 4007555 Baseline: 4159 3735 100 runs per measurement, 1 thread ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/47592 Reviewed By: ailzhang Differential Revision: D24890761 Pulled By: soulitzer fbshipit-source-id: f457376e4a882f8a59476e8c1e708391b1a031a2	2020-11-16 11:33:35 -08:00
Jeffrey Wan	ea93bdc212	Add comment explaining purpose of the accumulate_grad argument (#47266) Summary: Addressing a comment from a PR that has already been merged https://github.com/pytorch/pytorch/issues/46855 https://github.com/pytorch/pytorch/pull/46855#discussion_r515161953 Pull Request resolved: https://github.com/pytorch/pytorch/pull/47266 Reviewed By: agolynski Differential Revision: D24709017 Pulled By: soulitzer fbshipit-source-id: 3c104c2fef90ffd75951ecef4ae9e938d4b12d8c	2020-11-03 13:18:23 -08:00
Jeffrey Wan	f5073b0c5a	Add `inputs` argument to `autograd.backward()` (#46855) Summary: Fixes https://github.com/pytorch/pytorch/issues/46373 As noted in https://github.com/pytorch/pytorch/issues/46373, there needs to be a flag passed into the engine that indicates whether it was executed through the backward api or grad api. Tentatively named the flag `accumulate_grad` since, functionally, the backward api accumulates grad into .grad while the grad api captures the grad and returns it. Moving changes not necessary to the python api (cpp, torchscript) to a new PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/46855 Reviewed By: ngimel Differential Revision: D24649054 Pulled By: soulitzer fbshipit-source-id: 6925d5a67d583eeb781fc7cfaec807c410e1fc65	2020-11-02 14:32:38 -08:00
Pritam Damania	2b221a9599	Remove PyCFunction casts as much as possible. (#46227) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46227 Follow up from https://github.com/pytorch/pytorch/issues/45419, in this PR I've removed as many PyCFunction casts as I could from the codebase. The only ones I didn't remove were the ones with `METH_VARARGS | METH_KEYWORDS`, which have 3 parameters instead of 2 and had to be cast. Example: ` {"copy_", (PyCFunction)(void(*)(void))THPStorage_(copy_), METH_VARARGS | METH_KEYWORDS, nullptr},` ghstack-source-id: 114632704 Test Plan: waitforbuildbot Reviewed By: albanD Differential Revision: D24269435 fbshipit-source-id: 025cfd43a9a2a3e59f6b2951c1a78749193d77cf	2020-10-20 15:01:51 -07:00
anjali411	415ed434aa	Add whitelist for complex backward (#45461) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45461 This PR disables autograd for all C -> C, R -> C functions which are not included in the whitelist `GRADIENT_IMPLEMENTED_FOR_COMPLEX`. In practice, there will be a RuntimeError during forward computation when the outputs are differentiable: ``` >>> x=torch.randn(4, 4, requires_grad=True, dtype=torch.cdouble) >>> x.pow(3) Traceback (most recent call last): File "<stdin>", line 1, in <module> RuntimeError: pow does not support automatic differentiation for outputs with complex dtype. ``` The implicit assumption here is that all the C -> R functions have correct backward definitions. So before merging this PR, the following functions must be tested and verified to have correct backward definitions: `torch.abs` (updated in #39955), `torch.angle`, `torch.norm`, `torch.irfft`, `torch.istft`. Test Plan: Imported from OSS Reviewed By: malfet Differential Revision: D23998156 Pulled By: anjali411 fbshipit-source-id: 370eb07fe56ac84dd8e2233ef7bf3a3eb8aeb179	2020-09-30 08:45:55 -07:00
Pritam Damania	931b8b4ac8	Use ivalue::Future in autograd engine and DistEngine. (#43676) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43676 This is one part of https://github.com/pytorch/pytorch/issues/41574 to ensure we consolidate everything around ivalue::Future. I've removed the use of torch/csrc/utils/future.h from the autograd engines and used ivalue::Future instead. ghstack-source-id: 110895545 Test Plan: waitforbuildbot. Reviewed By: albanD Differential Revision: D23362415 fbshipit-source-id: aa109b3f8acf0814d59fc5264a85a8c27ef4bdb6	2020-08-29 02:15:26 -07:00
Richard Zou	bda0007620	Improve calling backward() and grad() inside vmap error messages (#42876) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42876 Previously, the error messages were pretty bad. This PR adds nice error messages for the following cases: - user attempts to call .backward() inside vmap for any reason whatsoever - user attempts to call autograd.grad(outputs, inputs, grad_outputs), where outputs or inputs are being vmapped over (so they are BatchedTensors). The case we do support is calling autograd.grad(outputs, inputs, grad_outputs) where `grad_outputs` is being vmapped over. This is the case for batched gradient support (e.g., user passes in a batched grad_output). Test Plan: - new tests: `pytest test/test_vmap.py -v` Reviewed By: ezyang Differential Revision: D23059836 Pulled By: zou3519 fbshipit-source-id: 2fd4e3fd93f558e67e2f0941b18f0d00d8ab439f	2020-08-12 10:05:31 -07:00
Pritam Damania	54c05fa34e	Add basic GPU support to distributed autograd. (#40312) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40312 As part of https://github.com/pytorch/pytorch/issues/40255, we realized that GPU support for distributed autograd was broken as part of our multithreaded autograd change. To fix this in the short term for 1.6, this PR includes the following changes: 1) Long lived CPU thread in DistEngine to execute GPU->CPU continuations in the autograd graph. 2) The long lived CPU thread has its own ready_queue and this queue is used for all GraphTasks created by DistEngine. 3) In thread_main(), the CPU thread cannot exit once the GraphTask is done processing because of the new CPU thread added in 1). 4) To resolve this, thread_main() now has a parameter `device_thread` instead of `reentrant_thread`. When device_thread is True, we expect this to be a long lived device thread that does not exit. 5) When device_thread is False, thread_main is expected to run a GraphTask and return once done. ghstack-source-id: 106391329 Test Plan: waitforbuildbot Differential Revision: D22146183 fbshipit-source-id: dd146b7a95f55db75f6767889b7255e9d62d5825	2020-06-23 07:49:00 -07:00
Nikita Shulga	c3d3782c80	Fix init-shutdown race condition in autograd engine (#39194) Summary: If Engine is created shortly before the application exits, then the non-reentrant thread might not have a chance to spawn, which would result in an infinite wait in `Engine::~Engine()`. Prevent this by actually waiting for threads to spawn before returning from `Engine::start_device_threads()`. Make sure that the thread count is incremented before the GIL is acquired in PythonThread. Pull Request resolved: https://github.com/pytorch/pytorch/pull/39194 Differential Revision: D21789219 Pulled By: malfet fbshipit-source-id: d9b5e74d5ddeb2474b575af2e4f33d022efcfe53	2020-05-29 12:20:31 -07:00
Wanchao Liang	f41742ff2f	[autograd] remove spinning for dist engine (#36606) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36606 This PR refactors the continuation logic of the async mode in the autograd engine to avoid launching spinning work. To achieve that: 1. remove the continuation logic in execute_graph_task_with_continuiation 2. separate the usage of execute_graph_task between dist_engine and local engine; now dist_engine universally uses `execute_graph_task_until_ready_queue_empty` (a better name appreciated here). 3. remove enqueue_blocked_task_on_cpu 4. remove the async mode in `execute_with_graph_task` as we don't need to use it in dist_engine Test Plan: Imported from OSS Differential Revision: D21032731 Pulled By: wanchaol fbshipit-source-id: 708ea3bc14815bdc151b56afa15eb85b4ac0f4b1	2020-04-26 22:23:30 -07:00
anjali411	6e92579883	Added autograd support for C->C functions and enabled requires_grad=True for complex (#36932) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36932 Differential Revision: D21181230 Pulled By: anjali411 fbshipit-source-id: 295f2cd1e2b9918a8b2cb88cab0536b2407dc455	2020-04-24 12:30:49 -07:00
Wanchao Liang	618104185b	[autograd] enable graph level thread parallelism on CPU (#33157) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33157 This PR enables graph level thread parallelism on CPU for the Autograd Engine. It replaces https://github.com/pytorch/pytorch/pull/29574 because of the drawbacks of task level parallelism with the existing autograd system. Fixes https://github.com/pytorch/pytorch/issues/18333 The graph level parallelism on CPU design: 1. Remove the single CPU thread that is initialized in the Engine itself and allow the owning thread (which calls Engine::execute) to drive the Engine execution, so that outer threading can provide thread parallelism. 2. Maintain a separate ReadyQueue per CPU thread, and stash the ReadyQueue for different devices/threads into a thread local shared_ptr; the Engine itself memorizes the shared_ptr of the ReadyQueue for each device (other than CPU). 3. The CPU thread local ReadyQueue is initialized on each Engine::execute call (or `backward()`, `grad()` call) in a CPU thread, and its shared_ptr is memorized in the GraphTask, since every `backward()` call has its own GraphTask. 4. Cross-device NodeTask push is accomplished by 2 and 3: we can refer to a device's ReadyQueue from the Engine and to the CPU's ReadyQueue from the GraphTask, which means we can push to a different ReadyQueue according to the device. 5. Termination of the CPU thread: if we mark the graph_task as completed, we exit the while loop and terminate the current backward execution, because it's guaranteed that all other NodeTasks are finished before we mark a GraphTask as complete. 6. The re-entrant thread logic stays the same, and reentrant thread detection is similar to before: we set the worker_device to NO_DEVICE initially and set it to CPU afterward to detect whether this is a reentrant call. 7. We still have the reentrant thread pool that creates new threads if it's a deep reentrant case, and reuses the ReadyQueue of the parent thread for performance. Since we introduce thread parallelism on CPU, we have to ensure the thread safety of the GraphTask. This is not a problem if each thread runs its own forward pass, since we build a separate GraphTask in each thread, and each GraphTask is a separate instance that shares nothing, i.e. Hogwild training on CPU should be fine in this case. But there might be cases where the user would like to do some part of the task in a single thread and do the rest of the work in several threads concurrently, so thread safety is crucial in those cases. The thread safety strategy for the multithreaded autograd is as follows: 1. Add a mutex to protect thread safety in Autograd Node/Function, and hold the lock for the different data race cases. 2. Lock the mutex during Node::apply(); this ensures that Nodes writing to shared variables are not racing across threads (i.e. AccumulateGrad, and custom C++ Autograd Nodes if they write to shared variables). 3. Lock the mutex during Node::release_variables(); this ensures that while we release saved_variables from one thread, no other thread can call Node::apply(), so the variable references from other threads aren't dangling. 4. If we don't release any variables and there is no shared data read/write in the Node (i.e. it is purely functional), we don't lock the mutex. This way we can protect the thread safety of Autograd Nodes, but we still cannot protect the thread safety of Node pre/post C++ hooks (Python hooks are automatically thread safe); we rely on the user to write thread-safe C++ hooks if they want the hooks to be correctly applied in a multithreaded environment. User-visible changes: There are not many user-visible changes. Since we use the owning thread to drive the autograd execution, users can write their own threading code without blocking on the Autograd engine. Some behaviors that users should be aware of: Non-determinism: this applies if we call backward() on multiple threads concurrently but with shared inputs (i.e. Hogwild CPU training). Since parameters are automatically shared across threads, gradient accumulation might become non-deterministic across backward calls in different threads, because two backward calls might access and try to accumulate into the same .grad attribute. This is technically not safe: it might result in a race condition, and the result might be invalid to use. But this is an expected pattern if users use the multithreading approach to drive the whole training process while sharing parameters; users who use multithreading should have the threading model in mind and should expect this to happen. Users should use the functional interface `torch.autograd.grad()` to calculate the gradients instead of `backward()` on the loss. Graph retaining: If part of the autograd graph is shared between threads, i.e. the first part of the forward pass runs in a single thread and the second part runs in multiple threads, then the first part of the graph is shared. In this case, different threads executing grad() or backward() on the same graph might hit the issue of one thread destroying the graph while another is still using it, and the other thread will crash in this case. We will error out to the user, similar to calling `backward()` twice without `retain_graph=True`, and let the user know they should use `retain_graph=True`. TODOs: [ ] benchmark the PR with example models and datasets to demonstrate the performance gain in CPU training [ ] ensure that we don't regress the single thread autograd performance Follow ups: [ ] a correct and tight integration with distributed autograd [ ] try to unify the thread pool between JIT and Autograd, and see if there's a unifying pattern that we could apply universally Test Plan: Imported from OSS Differential Revision: D20236771 Pulled By: wanchaol fbshipit-source-id: 1e0bd4eec14ffebeffdb60b763b8d6f0e427eb64	2020-03-26 17:17:52 -07:00
Nikita Shulga	1c958f8ef9	`Engine::~Engine()` should wait for non-reentrant threads to shut down (#34529) Summary: Because `this` must be valid while `Engine::main_thread` is running, at least for non-reentrant worker threads. Pull Request resolved: https://github.com/pytorch/pytorch/pull/34529 Test Plan: Run `test_api --gtest_filter=ModulesTest.InstanceNorm1d` in a loop Differential Revision: D20552717 Pulled By: malfet fbshipit-source-id: a0197671db1b7b1499dda675e43e0826f368bf0d	2020-03-20 00:49:48 -07:00
Nikita Shulga	a22008f91e	Prohibit copying autograd engines (#34567) Summary: Make sure that there could not be more than one instance of either `torch::autograd::Engine` or `torch::autograd::python::PythonEngine` Pull Request resolved: https://github.com/pytorch/pytorch/pull/34567 Test Plan: CI Differential Revision: D20390622 Pulled By: malfet fbshipit-source-id: c90595032afc88f552dee52901361b58b282dc1a	2020-03-12 08:06:53 -07:00
Pritam Damania	d30fa4837e	Unify gradient accumulation between distributed autograd and local autograd (#33214) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33214 Distributed autograd had some custom logic in terms of how we accumulated gradients. This was mostly done early on to enable basic functionality. However, in the long term we should merge this logic with what we have in the local autograd engine. A lot of work has gone into ensuring we accumulate grads correctly and efficiently, and we should reuse that as a starting point. We can investigate later on whether distributed autograd needs further custom logic for additional optimizations. In this PR I've merged the gradient accumulation logic and also the gradient hooks. As a result, gradient hooks are now called in distributed autograd as well. ghstack-source-id: 99838019 Test Plan: waitforbuildbot Differential Revision: D19843284 fbshipit-source-id: 7923d7e871fb6afd3e98dba7de96606264dcb5f3	2020-03-10 01:56:08 -07:00
albanD	02aa3ba331	Raise error for code that risks deadlock (#32295) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32295 Fix for https://github.com/pytorch/pytorch/issues/32045 Calling into the engine with the GIL can deadlock because: - worker thread initialization acquires the GIL - Any Node / hook can be a python function that will acquire the GIL The choice was made here to raise an error, as one of the advantages of using cpp extensions with python is being able to release the GIL. So we prefer to educate users to do it rather than doing it under the hood. Test Plan: Imported from OSS Differential Revision: D19430979 Pulled By: albanD fbshipit-source-id: e43f57631885f12e573da0fc569c03a943cec519	2020-01-23 08:53:59 -08:00
Pritam Damania	fde94e7556	Provide async mode for local autograd engine. (#31230) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31230 A major issue with distributed autograd currently is that we block an RPC thread when we call Engine::execute_with_graph_task. To resolve this issue, I've made modifications to the local autograd engine such that `execute_with_graph_task` returns a Future instead. The `execute()` methods for Engine::execute() and DistEngine::execute() still wait() on this Future which ensures there is no change in behavior yet. In follow up PRs we can modify the distributed autograd engine to take advantage of this Future. Closes #26359 ghstack-source-id: 96298057 Test Plan: waitforbuildbot Differential Revision: D18999709 fbshipit-source-id: 388f54467fd2415a0acb7df17bd063aedc105229	2020-01-05 00:29:28 -08:00
Richard Zou	bcb0bb7e0e	Remove unnecessary ATen/core/EnableNamedTensor.h (#31117) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31117 After this diff, we will have completely removed the named tensor feature flagging. This means that named tensors are always on and that there is no mechanism to turn them off. There should be no more follow-up diffs. I performed the deletion of the header with ``` find . -type f -print0 | xargs -0 sed -i '/#include <ATen\/core\/EnableNamedTensor.h>/d' ``` Test Plan: - wait for CI Differential Revision: D18934952 Pulled By: zou3519 fbshipit-source-id: 253d059074b910fef15bdf885ebf71e0edf5bea5	2019-12-12 09:53:07 -08:00
Richard Zou	e05ee4c421	Remove BUILD_NAMEDTENSOR macros (#30894) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30894 This PR begins the process of removing BUILD_NAMEDTENSOR macros. There will be followups. Reasons for removing the macros: - BUILD_NAMEDTENSOR is always on and has been on since pytorch 1.3.0. - Since we don't test building without it, it is useless to keep around. - Code becomes nicer to read without the macros Reasons for not removing the macros: - potential for feature flagging Now, I argue against needing to feature flag. The main reason why we might want to feature flag is if we need to disable the feature. We'd need a fast switch to disable the feature if someone discovers in the future that named tensors caused some regression in some existing workflows. In https://github.com/pytorch/pytorch/pull/25798, I did a variety of macro- and micro- benchmarks to determine the performance impact of named tensors on regular tensors. [The microbenchmarks](https://github.com/pytorch/pytorch/pull/25798#issuecomment-529014810) were not very stable, and running the microbenchmarks for more iterations doesn't actually help because the noise is not distributed in a nice way. Instead of microbenchmarks I ran a [profiler (perf)](https://github.com/pytorch/pytorch/pull/25798#issuecomment-555707645) to estimate how much overhead named tensors add to unnamed code. I estimated the overhead to be less than 100ns for `add` and even smaller for `mm`; there are ways to optimize even further if we find this to be a problem. [Initial macrobenchmarks](https://github.com/pytorch/pytorch/pull/25798#issuecomment-530539104) were also not very stable. I ran imagenet for some number of epochs. To make them more stable, I got rid of the data loading (which seemed to vary between runs). [In some benchmarks without data loading](https://github.com/pytorch/pytorch/pull/25798#issuecomment-562214053), we can see that the results are less noisy now. These results suggest no noticeable regressions in speed. Test Plan: - wait for CI Differential Revision: D18858543 Pulled By: zou3519 fbshipit-source-id: 08bf3853a9f506c6b084808dc9ddd1e835f48c13	2019-12-10 07:54:05 -08:00
Edward Yang	1111a6b810	Use pybind11::gil_scoped_* functions instead of AutoGIL/AutoNoGIL (#30274) Summary: Reland of https://github.com/pytorch/pytorch/pull/29095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/30274 Differential Revision: D18762293 Pulled By: ezyang fbshipit-source-id: d3d50c2dd12bcb678ab25fa708eb6587cc4b66f9	2019-12-02 12:19:58 -08:00
Mike Ruberry	eff4c4d7c1	Revert D18301806: Use pybind11::gil_scoped_* functions instead of AutoGIL/AutoNoGIL Test Plan: revert-hammer Differential Revision: D18301806 Original commit changeset: 03da6a26c41e fbshipit-source-id: c1324ee8d154e7e16f5dd4f1cf3625aaa566cd39	2019-11-21 14:50:07 -08:00
Alan Du	f4b9690f2d	Use pybind11::gil_scoped_* functions instead of AutoGIL/AutoNoGIL (#29095) Summary: Given that pybind11 implements these gil functions, I don't think it makes sense for PyTorch to have its own bespoke versions. Fixes https://github.com/pytorch/pytorch/issues/29065 Pull Request resolved: https://github.com/pytorch/pytorch/pull/29095 Differential Revision: D18301806 Pulled By: ezyang fbshipit-source-id: 03da6a26c41ee65aaadf7b67b9f0b14d2def2a5a	2019-11-21 13:44:40 -08:00
Edward Yang	1ab2f043ba	Move most methods off Variable into torch::autograd::impl functions. (#29665 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29665 Our intention is to merge the static distinction between Tensor and Variable. Ordinarily, this would entail merging the methods of Tensor and Variable. But there are a lot of "private"-ish methods on Variable that we don't actually want to dump onto the Tensor class. So, as prep work, we move all of those methods off of Variable and into the torch::autograd::impl namespace (impl as in, please don't use this, end users). This ends up being a fairly large patch because all of the call sites have to play ball too. While I was on the topic, I also moved any of the touched functions into the C++ file, so that modifying them would not trigger a recompilation of all of torch. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Differential Revision: D18496169 Pulled By: ezyang fbshipit-source-id: afb203252620ec274be596b3e7b1d84d321bad3a	2019-11-18 08:12:12 -08:00
vishwakftw	86c64440c9	Make PyTorch Python 3.8 compatible (#29302 ) Summary: PEP 590 modifies the `tp_print` offset to `tp_vectorcall_offset` - which requires a Py_ssize_t object. Passing a nullptr caused compatibility issues for Python 3.8. Changelog: - Modify all occurrences of `nullptr /* tp_print */` to `0 /* tp_vectorcall_offset */` - Minor formatting changes Pull Request resolved: https://github.com/pytorch/pytorch/pull/29302 Test Plan: - Local fresh build with Python 3.8 completed successfully. Fixes https://github.com/pytorch/pytorch/issues/28060. Fixes https://github.com/pytorch/pytorch/issues/29162. Supersedes https://github.com/pytorch/pytorch/pull/28364 Differential Revision: D18372022 Pulled By: ezyang fbshipit-source-id: 8e9a15b0d0f72101ccc69bd489f5efa216b880bb	2019-11-07 09:20:19 -08:00
Pritam Damania	e8e7d93293	Additional autograd unit tests for Python UDFs. (#29041 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29041 1) Enhanced autograd unit tests to test the torch.distributed.autograd.backward() API more thoroughly on Python UDFs. 2) Enhanced `python_error` to override `what` such that it returns an appropriate error string if we call `what()` on this error. This ensures we can propagate exceptions over the wire during RPCs (since we get the error string by calling what() on the exception) ghstack-source-id: 93098679 Test Plan: waitforbuildbot Reviewed By: mrshenli Differential Revision: D18273041 fbshipit-source-id: 85d3932fed6337668a812367fdfce233c1b3ff8e	2019-11-01 18:30:09 -07:00
Edward Yang	08860721ad	Revert D18195584: Additional autograd unit tests for Python UDFs. Test Plan: revert-hammer Differential Revision: D18195584 Original commit changeset: b795daf644ba fbshipit-source-id: 413dac34f1a28e0a591893f43e116f006fd3f2be	2019-11-01 06:59:54 -07:00
Pritam Damania	3bba751cd6	Additional autograd unit tests for Python UDFs. (#28824 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28824 1) Enhanced autograd unit tests to test the torch.distributed.autograd.backward() API more thoroughly on Python UDFs. 2) Enhanced `python_error` to override `what` such that it returns an appropriate error string if we call `what()` on this error. This ensures we can propagate exceptions over the wire during RPCs (since we get the error string by calling what() on the exception) ghstack-source-id: 92972494 Test Plan: waitforbuildbot Differential Revision: D18195584 fbshipit-source-id: b795daf644ba1816fdec484545192ab55a2f71e7	2019-10-31 14:03:00 -07:00
Pritam Damania	1322daa506	Improve error handling for distributed autograd engine. (#27940 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27940 1) If we receive an error for outstanding rpcs, we enqueue an appropriate error on the local autograd engine. 2) Add an `exit_on_error` mode for the local autograd engine, where the computation stops if we see an error. ghstack-source-id: 92603377 Test Plan: Added unit tests to test failures. Differential Revision: D17916844 fbshipit-source-id: 199a7832f1033c36a9bbcc1e80d86576c04965d0	2019-10-25 12:07:27 -07:00
Richard Zou	caed485873	Turn on BUILD_NAMEDTENSOR permanently (#26060 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26060 This PR enables BUILD_NAMEDTENSOR by default. This is done via including a header, `c10/core/EnableNamedTensor`, that sets `BUILD_NAMEDTENSOR`. In the future, the plan is to get rid of the flag entirely: we can incrementally delete usages after this PR goes in. This PR also maintains the namedtensor ci vs regular ci distinction. `test/test_namedtensor.py` only runs if TEST_NAMEDTENSOR=1 is specified. TEST_NAMEDTENSOR=1 is set on the namedtensor ci. I'll remove this distinction later and send out an announcement about it; devs will be responsible for named tensor failures after that. The initial reason why we had the BUILD_NAMEDTENSOR flag was so that we could quickly prototype named tensor features without worrying about adding overhead to the framework. The overheads can be categorized as memory overhead and performance overhead. Memory overhead: named tensors add 1 additional word per Tensor. This is because TensorImpl stores a `unique_ptr<NamedTensorMetaInterface>` field. This is not a lot of overhead. Performance overhead: At all entry points to name inference, we check if inputs to an op are named. If inputs are not named, we short-circuit and don't do name inference. These calls should therefore be as efficient as error-checking code and not take up a lot of time. My plan is to benchmark a few functions and then post the results in a comment to this PR. Test Plan: - [namedtensor ci] Differential Revision: D17331635 Pulled By: zou3519 fbshipit-source-id: deed901347448ae2c26066c1fa432e3dc0cadb92	2019-09-17 08:25:00 -07:00
Ralf Gommers	1b4951d3a5	Fix remaining invalid function cast warnings that show up with GCC 8/9 (#26104 ) Summary: Follow-up to gh-25483, more of the same fixes for warnings like: ``` ../torch/csrc/autograd/python_variable.cpp:503:31: warning: cast between incompatible function types from ‘PyObject* (*)(THPVariable*)’ {aka ‘_object* (*)(THPVariable*)’} to ‘getter’ {aka ‘_object* (*)(_object*, void*)’} [-Wcast-function-type] 503 | {"_backward_hooks", (getter)THPVariable_get_backwards_hooks, (setter)THPVariable_set_backwards_hooks, nullptr, nullptr}, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` This takes the build log output for a full rebuild with GCC 9.1 from ~10,000 to ~7,000 lines. `clang-tidy` is going to complain, no way around that - see discussion at the end of gh-25483. Pull Request resolved: https://github.com/pytorch/pytorch/pull/26104 Differential Revision: D17396831 Pulled By: ezyang fbshipit-source-id: d71696bfe4dbe25519e4bcb7753151c118bd39f7	2019-09-17 07:43:37 -07:00
Richard Zou	47cee2dd22	Implement initial version of autograd with named tensors (#25604 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25604 In this initial version: - autograd ignores all names. - tensor.grad is unnamed, unless the user manually assigns to it. - if a grad tensor has any names, perhaps the user was hoping for some alignment-checking behavior that named tensor offers for other ops. We raise a warning in this case. Future: do some more extensive checking to see if this actually works in all cases. Test Plan: - [namedtensor ci] - Check a warning is raised if a grad tensor has names. - Check tensor.grad field is unnamed. - Check that we can perform backward on an op that doesn't explicitly support names in backward. `sigmoid` is one such op. Differential Revision: D17171788 Pulled By: zou3519 fbshipit-source-id: 64837fde94d8269610b6d3539ac025516dbe1df4	2019-09-04 06:36:54 -07:00
mal	e7a9b0d62f	Rename torch::autograd::Function to torch::autograd::Node Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/23269 Test Plan: Imported from OSS Differential Revision: D16454878 fbshipit-source-id: b1e840fc2d3901955280d141e5ad6efd5e9d66af	2019-07-23 20:52:22 -07:00
Mikhail Zolotukhin	6ca38d9840	Cleanup includes in torch/csrc/autograd/* (#19923 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/19923 ghimport-source-id: 54debdd21ca0f4230b1915905673de274807a2e5 Differential Revision: D15125016 Pulled By: ZolotukhinM fbshipit-source-id: 8d54f436e4508067089a1d05ce192093220aa1bb	2019-05-06 13:48:42 -07:00
Edward Yang	517c7c9861	Canonicalize all includes in PyTorch. (#14849 ) Summary: Anywhere we used #include "foo.h", we now say #include <foo.h> Paths are adjusted to be rooted out of aten/src, torch/lib, or the root level directory. I modified CMakeLists.txt by hand to remove TH and THC from the include paths. I used the following script to do the canonicalization: ``` import subprocess import re import os.path files = subprocess.check_output(['git', 'ls-files']).decode('utf-8').rstrip().split('\n') for fn in files: if not any(fn.endswith(suff) for suff in ['.cu', '.cpp', '.in', '.h', '.hpp', '.cu', '.cuh', '.cc']): continue if not any(fn.startswith(pref) for pref in ["aten/", "torch/"]): continue with open(fn, 'r') as f: c = f.read() def fmt(p): return "#include <{}>".format(p) def repl(m): p = m.group(1) if p in ["dlfcn.h", "unistd.h", "nvrtc.h", "cuda.h", "cuda_runtime.h", "cstdint", "cudnn.h", "Python.h", "cusparse.h", "cuda_runtime_api.h", "cuda_fp16.h", "cublas_v2.h", "stdint.h", "curand_kernel.h"]: return fmt(p) if any(p.startswith(pref) for pref in ["torch/csrc", "c10/", "ATen/", "caffe2/", "TH/", "THC/", "Eigen/", "gtest/", "zdl/", "gloo/", "onnx/", "miopen/"]): return fmt(p) for root in ["aten/src", "torch/lib", ""]: for bad_root in [os.path.dirname(fn), "aten/src/TH", "aten/src/THC", "torch/csrc"]: new_p = os.path.relpath(os.path.join(bad_root, p), root) if not new_p.startswith("../") and (os.path.exists(os.path.join(root, new_p)) or os.path.exists(os.path.join(root, new_p + ".in"))): return fmt(new_p) print("ERROR: ", fn, p) return m.group(0) new_c = re.sub(r'#include "([^"]+)"', repl, c) if new_c != c: print(fn) with open(fn, 'w') as f: f.write(new_c) ``` Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/14849 Reviewed By: dzhulgakov Differential Revision: D13363445 Pulled By: ezyang fbshipit-source-id: 52361f878a672785f9306c9e9ab2513128092b68	2018-12-08 19:38:30 -08:00
Peter Goldsborough	d6c53328f9	Large scale fix of python-related files in torch/csrc/ Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14515 Differential Revision: D13247966 Pulled By: goldsborough fbshipit-source-id: 7a127c508fc576a7a92626dd6b729f660162d628	2018-12-07 13:04:46 -08:00
albanD	78e3259bbe	Add autograd automatic anomaly detection (#7677 ) * add autograd automatic anomaly detection * python 3 string support * Fix non python build * fix typo in doc * better test and naming fix * fix no python build and python object handling * fix missing checks * clean NO_PYTHON build * Remove unwanted changes	2018-06-11 21:26:17 -04:00
Sam Gross	12229afd00	Record shape and type in autograd to validate gradients (#8168 ) The check that the gradient is defined is currently disabled because TestJit.test_ge_optimized will trigger the error.	2018-06-06 18:09:53 -04:00
Zachary DeVito	23dd033b51	Factor python dependency out of interpreter (#7970 ) * Factor python dependency out of interpreter * Remove NO_PYTHON for the autograd engine If there are no python bindings, then a default Engine is constructed the first time it is requested. If the python libraries are loaded, then they override the default accessor and the default engine becomes a python Engine. Note: it is possible for two engines to be generated if a non-python one gets created before the python bindings are loaded. This case is rare, and just results in additional threads being spawned. * Fixing AlexNet test which is skipped in CI	2018-06-01 16:07:21 -04:00
Peter Goldsborough	28b1a3852c	Add backward() to Tensor and Variable (#7774 ) * Add backward() to Tensor and Variable * Add at:: in front of Tensor * Trying to not move optional to appease windows? * Move implementation into cpp file * Undo some formatting changes	2018-05-24 17:31:41 -07:00
Will Feng	60745b3380	Revert #7750 and #7762 to fix Windows CI on master (#7772 ) * Revert "Add missing brace (#7762)" This reverts commit ea27c5af50f6bc8ba82068e6d36ade9c773dc101. * Revert "[C++ API] Add backward() to Tensor and Variable (#7750)" This reverts commit 1e2762796f33123d86782936089dbeda37bdcc92.	2018-05-22 15:42:52 -07:00
Peter Goldsborough	1e2762796f	[C++ API] Add backward() to Tensor and Variable (#7750 ) * Add backward() to Tensor and Variable * Added a couple tests	2018-05-22 10:43:04 -07:00
Tongzhou Wang	1c01eabd3c	Codemod to update our codebase to 0.4 standard (#6641 ) * Codemod to update our codebase to 0.4 standard * Update some of the test scripts * remove Variable in test_clip_grad_value * fix _symbolic_override_wrapper_maker	2018-04-17 22:06:54 -04:00
Tongzhou Wang	e01569afd7	Restore allow_unused functionality (#6553 )	2018-04-12 21:30:42 +02:00
Priya Goyal	e3196e0ea8	[Re-checkpointing] Autograd container for trading compute for memory (#6467 ) * Autograd container for trading compute for memory * add a unit test for checkpoint * address comments * address review comments * adding some docs for the checkpoint api * more comments * more comments * repro bug * Fix a subtle bug/apply some review comments * Update checkpoint.py * Run everything in grad mode * fix flake and chunk=1 * use imperative backward as per discussion * remove Variable and also add models and test for models * Add a simple thread local variable to check for autograd grad mode * remove models and models test after debugging * address review comments * address more comments * address more comments	2018-04-10 15:26:24 -04:00

91 Commits