pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-22 06:11:27 +08:00

Author	SHA1	Message	Date
Michael Wootton	67dcd62310	Don't split oversize cached blocks (#44742 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/35901 This change is designed to prevent fragmentation in the Caching Allocator. Permissive block splitting in the allocator allows very large blocks to be split into many pieces. Once split too finely it is unlikely all pieces will be 'free' at that same time so the original allocation can never be returned. Anecdotally, we've seen a model run out of memory failing to alloc a 50 MB block on a 32 GB card while the caching allocator is holding 13 GB of 'split free blocks' Approach: - Large blocks above a certain size are designated "oversize". This limit is currently set 1 decade above large, 200 MB - Oversize blocks can not be split - Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block) - In lieu of splitting oversize blocks there is a mechanism to quickly free a single oversize block (to the system allocator) to allow an appropriate size block to be allocated. This will be activated under memory pressure and will prevent _release_cached_blocks()_ from triggering Initial performance tests show this is similar or quicker than the original strategy. Additional tests are ongoing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742 Reviewed By: ngimel Differential Revision: D23752058 Pulled By: ezyang fbshipit-source-id: ccb7c13e3cf8ef2707706726ac9aaac3a5e3d5c8	2021-04-14 03:04:41 -07:00
Nikita Shulga	dea529a779	Add torch.cuda.can_device_access_peer (#50446 ) Summary: Also add the underlying torch._C._cuda_canDeviceAccessPeer, which is a wrapper around cudaDeviceCanAccessPeer Pull Request resolved: https://github.com/pytorch/pytorch/pull/50446 Reviewed By: mrshenli Differential Revision: D25890405 Pulled By: malfet fbshipit-source-id: ef09405f115bbe73ba301d608d56cd8f8453201b	2021-01-12 20:30:45 -08:00
peterjc123	815d38395a	PyLong_{As/From}{Long/UnsignedLong} lint checks (#49280 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/45581 Pull Request resolved: https://github.com/pytorch/pytorch/pull/49280 Reviewed By: mruberry Differential Revision: D25592330 Pulled By: ezyang fbshipit-source-id: 5c16d6aed88ad1feaa7f129b4cd44c0561be2de2	2020-12-17 09:32:08 -08:00
x00480351	47aa253632	[Feature] Allow user to specify a fraction of the GPU memory. (#48172 ) Summary: Add a new function, torch.cuda.set_per_process_memory_fraction(fraction, device), to torch.cuda. Related: https://github.com/pytorch/pytorch/issues/18626 The fraction (a float from 0 to 1) limits the memory the caching allocator may use on a GPU device. It can be set on any visible GPU. The allowed memory equals total memory * fraction. An OOM error is raised when an allocation would exceed the allowed value. This function is similar to TensorFlow's per_process_gpu_memory_fraction. Note: this setting only limits the caching allocator in one process. If you are using multiple processes, you need to apply this setting in each subprocess to limit its GPU memory, because each subprocess can have its own allocator. ## usage In some cases, one needs to split a GPU device into two parts. The limit can be set before any GPU memory is used. E.g. for device 0, with each part taking half the memory, the code is as follows: ``` torch.cuda.set_per_process_memory_fraction(0.5, 0) ``` The following example shows the behavior. ```python import torch torch.cuda.set_per_process_memory_fraction(0.5, 0) torch.cuda.empty_cache() total_memory = torch.cuda.get_device_properties(0).total_memory # less than 0.5 will be ok: tmp_tensor = torch.empty(int(total_memory * 0.499), dtype=torch.int8, device='cuda') del tmp_tensor torch.cuda.empty_cache() # this allocation will raise an OOM: torch.empty(total_memory // 2, dtype=torch.int8, device='cuda') """ It raises an error as follows: RuntimeError: CUDA out of memory. Tried to allocate 5.59 GiB (GPU 0; 11.17 GiB total capacity; 0 bytes already allocated; 10.91 GiB free; 5.59 GiB allowed; 0 bytes reserved in total by PyTorch) """ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/48172 Reviewed By: bdhirsh Differential Revision: D25275381 Pulled By: VitalyFedyunin fbshipit-source-id: d8e7af31902c2eb795d416b57011cc8a22891b8f	2020-12-03 11:45:56 -08:00
Pritam Damania	2b221a9599	Remove PyCFunction casts as much as possible. (#46227 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46227 Follow up from https://github.com/pytorch/pytorch/issues/45419, in this PR I've removed as many PyCFunction casts as I could from the codebase. The only ones I didn't remove were the ones with `METH_VARARGS \| METH_KEYWORDS` which have 3 parameters instead of 2 and had to be casted. Example: ` {"copy_", (PyCFunction)(void(*)(void))THPStorage_(copy_), METH_VARARGS \| METH_KEYWORDS, nullptr},` ghstack-source-id: 114632704 Test Plan: waitforbuildbot Reviewed By: albanD Differential Revision: D24269435 fbshipit-source-id: 025cfd43a9a2a3e59f6b2951c1a78749193d77cf	2020-10-20 15:01:51 -07:00
Dmytro Dzhulgakov	06d978a9ad	[c10/cuda] Reorganize device_count() and robustly surface ASAN warnings (#42249 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42249 Main change is to bring Caffe2's superior error messages for cuda initialization into c10 and use them in all code paths. Basic logic: \| Case \| Call to device_count() \| init_cuda, e.g. allocating tensor \| \| -- \| -- \| -- \| \| all good \| non-zero \| just works \| \| no gpus \| 0, no warning \| throw exception with good message \| \| driver issues \| 0, produce warning \| throw exception with good message \| \| out of memory with ASAN \| 0, produce warning\| throw exception with ASAN message \| Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs. Other clean up changes: * cache device_count() always in a static variable * move all asan macros in c10 Test Plan: Hard to unittest because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=): ``` print('before import') import torch print('after import') print('devices: ', torch.cuda.device_count()) x = torch.tensor([1,2,3]) print('tensor creation') x = x.cuda() print('moved to cuda') ``` Reviewed By: ngimel Differential Revision: D22824329 fbshipit-source-id: 5314007313a3897fc955b02f8b21b661ae35fdf5	2020-08-05 11:39:31 -07:00
ziab	1c8217a7a6	Abstract cuda calls made from `torch_python` (#42251 ) Summary: * Make c10::cuda functions regular non-inlined functions * Add driver_version() and device_synchronize() functions With this change I no longer see any direct calls to the CUDA API when looking at Modules.cpp.obj FYI malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/42251 Reviewed By: malfet Differential Revision: D22826505 Pulled By: ziab fbshipit-source-id: 8dc2f3e209d3710e2ce78411982a10e8c727573c	2020-07-30 19:18:33 -07:00
maokaiyu	9ed825746a	Use c10::cuda:: primitives rather than make CUDA runtime calls directly (#41405 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41405 Test Plan: Imported from GitHub: all checks have passed {F244195355} The Intern Builds & Tests have 127 success, 5 no signals, and 1 failure. Double check the failed test log file, the failure is result differences: - AssertionError: 0.435608434677124 != 0.4356083869934082 - AssertionError: 0.4393022060394287 != 0.4393021583557129 - AssertionError: 0.44707541465759276 != 0.44707536697387695 These are all very small numerical errors (within 0.0000001). Reviewed By: malfet Differential Revision: D22531486 Pulled By: threekindoms fbshipit-source-id: 21543ec76bb9b502885b5146c8ba5ede719be9ff	2020-07-16 15:11:57 -07:00
Nikita Shulga	b952eaf668	Preserve CUDA gencode flags (#41173 ) Summary: Add `torch._C._cuda_getArchFlags()` that returns the list of architectures `torch_cuda` was compiled with Add `torch.cuda.get_arch_list()` and `torch.cuda.get_gencode_flags()` methods that return the architecture list and gencode flags PyTorch was compiled with Print a warning if some of the GPUs are not compatible with any of the CUBINs Pull Request resolved: https://github.com/pytorch/pytorch/pull/41173 Differential Revision: D22459998 Pulled By: malfet fbshipit-source-id: 65d40ae29e54a0ba0f3f2da11b821fdb4d452d95	2020-07-09 14:59:35 -07:00
lixinyu	4a235b87be	pop warning message for cuda module when asan is built in (#35088 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35088 Test Plan: Imported from OSS Differential Revision: D20552708 Pulled By: glaringlee fbshipit-source-id: 0b809712378596ccf83211bf8ae39cd71c27dbba	2020-06-30 08:00:37 -07:00
Nikita Shulga	8b5732e8ad	Move `torch.cuda` annotations inline (#40075 ) Summary: Also enable `torch.cuda` typechecking Pull Request resolved: https://github.com/pytorch/pytorch/pull/40075 Differential Revision: D22121275 Pulled By: malfet fbshipit-source-id: dbecef09911334e8f3d87f5ecab66349da9f2325	2020-06-18 15:52:29 -07:00
anjali411	1f09f7ea44	Python API for Complex Storage and storage copy logic (#35771 ) Summary: Following up on this: https://github.com/pytorch/pytorch/pull/35851 cross dtype storage copy is not being used internally, so I have not included cross dtype copy for complex. Pull Request resolved: https://github.com/pytorch/pytorch/pull/35771 Differential Revision: D21319650 Pulled By: anjali411 fbshipit-source-id: 07c72996ee598eba0cf401ad61534494d6f5b5b3	2020-05-01 11:47:22 -07:00
HC Zhu	ea97fa1f2a	[PyTorch][Dist] Trigger pre/post hooks of output function nodes under distributed autograd (#34501 ) Summary: # Goals Do the following things during a distributed backward pass. 1. Accumulate the gradient of a variable to RPC context once the gradient is ready instead of at the very end of the backward pass. 2. Run post/pre hooks installed in `AccumulateGrad` nodes once the gradient is ready for the variable. Currently, the hooks in `AccumulateGrad` are not executed just because the function `AccumulateGrad` itself is not even evaluated by the local engine. 3. Make it extensible to support post hooks installed by DDP's reducer. # Introduce GradCapturePreHook ## Why do we need this? ### Root issue: * dist engine uses the autograd.grad-like API on the vanilla engine and then in the Future callback populates the context with the gradients. This is a bad emulation of the .backward() call on the vanilla engine. ### Practical issue: * The leaf's hooks are not called (because they are associated with the AccumulateGrad node, which is not called in the autograd.grad-like API). Modules like DDP rely on these hooks. * The Future is marked as completed before the context is actually populated with the grads, leading to unexpected behavior on the user side. * The Future callback is only called at the very end of the backward pass, which is too late for DDP if it wants to overlap compute/transfer. ### Proposed solution: * Provide hooks in the autograd.grad-like API that will allow the distributed engine to populate the context and call the hooks to better emulate the .backward call. ## Who can install a grad capture pre-hook? This will be an internal hook at C++ level and it won't be exposed to Python code. Only call-sites directly interacting with the local engine can install such hooks. ## Signature The returned `grad` will be captured. ``` virtual const torch::Tensor& grad operator()(const torch::Tensor& grads) = 0; ``` ## Where are hooks installed? Grad capture pre-hooks are installed in GraphTask::ExecInfo::Capture. ExecInfo is per node. Every backward run will have its own GraphTask instance. ## When/How will hooks be called? When the local engine captures the grads for a node, all grad capture pre hooks are called one by one in the order they are added. The output grads of the hooks will replace the original grads. The output of the last hook will be used for grad capturing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/34501 Test Plan: All existing tests should pass. ``` python setup.py develop python test/distributed/rpc/test_dist_autograd_spawn.py DistAutogradTestWithSpawn.test_post_hooks ``` Differential Revision: D20953673 Pulled By: hczhu fbshipit-source-id: 543b3844823330ea9f9856bab7c5cb2679290a53	2020-04-21 13:23:18 -07:00
Jithun Nair	dc1f9eee53	Avoid printing erroneous warning about "MIOpen not found" for ROCm builds (#33837 ) Summary: Older versions of MIOpen (<=2.2) don't have the `miopenGetVersion` api, but MIOpen is always a part of the ROCm builds, so do NOT set `lib` to None for ROCm builds. `__cudnn_version` will be `None` for older versions of MIOpen. Setting `lib` to `None` ends up printing the following erroneous warning when running unit tests: ``` /root/.local/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py:120: UserWarning: cuDNN/MIOpen library not found. Check your LD_LIBRARY_PATH }.get(sys.platform, 'LD_LIBRARY_PATH'))) ``` Eg.: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-test2/18387/consoleFull Pull Request resolved: https://github.com/pytorch/pytorch/pull/33837 Differential Revision: D20369285 Pulled By: xw285cornell fbshipit-source-id: e82e6f8f5bccb486213cf868f40aece41ce11f98	2020-04-17 20:31:01 -07:00
Nikita Shulga	2458f6c63e	Move all nccl from torch_python to torch_cuda (#36193 ) Summary: Because `torch_python` is supposed to be thin wrapper around `torch` In this PR, all invocation of functions from nccl library are moved from python_nccl.cpp (which is part of torch_python) to nccl.cpp (which is part of torch_cuda) Pull Request resolved: https://github.com/pytorch/pytorch/pull/36193 Test Plan: CI Differential Revision: D20930047 Pulled By: malfet fbshipit-source-id: 7f278610077df6ac5dc3471c1a1b5d51e653ef9c	2020-04-08 18:01:47 -07:00
Pavel Belevich	3328a2f903	Rename CPUGenerator to CPUGeneratorImpl and CUDAGenerator to CUDAGeneratorImpl (#36026 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36026 Differential Revision: D20856458 Pulled By: pbelevich fbshipit-source-id: 6d105593dca67640d508a4aebf7edf028d52af32	2020-04-07 08:05:23 -07:00
Peter Bell	5fc5cf6571	Stop using ctypes to interface with CUDA libraries. (#33678 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/33016, Continuation of https://github.com/pytorch/pytorch/issues/31160 Pull Request resolved: https://github.com/pytorch/pytorch/pull/33678 Differential Revision: D20249187 Pulled By: ezyang fbshipit-source-id: 172ce4a0fee7fbe01436a421d1af22ef6173b6ed	2020-03-11 07:22:46 -07:00
Emilio Castillo	31cc311143	Expose `CUDACachingAllocator` `raw_alloc` and `raw_delete` to python (#33860 ) Summary: This PR aims to improve the interoperability with [CuPy](https://github.com/cupy/cupy/pulls). Instead of having two separate and conflicting memory pools. With this PR, CuPy can directly alloc memory from the PyTorch allocator by means of this proposal https://github.com/cupy/cupy/pull/3126 We would like to gather feedback to know if this approach makes sense for PyTorch, or other alternative designs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/33860 Differential Revision: D20212788 Pulled By: ngimel fbshipit-source-id: bc1e08a66da1992d26021147bf645dc65239581c	2020-03-03 17:50:11 -08:00
Edward Yang	1111a6b810	Use pybind11::gil_scoped_* functions instead of AutoGIL/AutoNoGIL (#30274 ) Summary: Reland of https://github.com/pytorch/pytorch/pull/29095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/30274 Differential Revision: D18762293 Pulled By: ezyang fbshipit-source-id: d3d50c2dd12bcb678ab25fa708eb6587cc4b66f9	2019-12-02 12:19:58 -08:00
Mike Ruberry	eff4c4d7c1	Revert D18301806: Use pybind11::gil_scoped_* functions instead of AutoGIL/AutoNoGIL Test Plan: revert-hammer Differential Revision: D18301806 Original commit changeset: 03da6a26c41e fbshipit-source-id: c1324ee8d154e7e16f5dd4f1cf3625aaa566cd39	2019-11-21 14:50:07 -08:00
Alan Du	f4b9690f2d	Use pybind11::gil_scoped_* functions instead of AutoGIL/AutoNoGIL (#29095 ) Summary: Given that pybind11 implements these gil functions, I don't think it makes sense for Pytorch to have its own bespoke versions. Fixes https://github.com/pytorch/pytorch/issues/29065 Pull Request resolved: https://github.com/pytorch/pytorch/pull/29095 Differential Revision: D18301806 Pulled By: ezyang fbshipit-source-id: 03da6a26c41ee65aaadf7b67b9f0b14d2def2a5a	2019-11-21 13:44:40 -08:00
Peter Bell	bb119d957e	Move torch.cuda's atfork handler into C++ (#29101 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/23401 We cannot rely on `multiprocessing.util.register_after_fork` since it is only called for processes created by the `multiprocessing` module and not `os.fork()`. Moving to `pthread_atfork` does always get called. However, I don't think it's safe to call python functions inside of the `atfork` handler so the python code has to be a bit more careful when checking `_initialized`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/29101 Differential Revision: D18355451 Pulled By: ezyang fbshipit-source-id: 4d4253a3669796212c099dad4e5bdfdb0df40469	2019-11-11 07:34:27 -08:00
Xiang Gao	02921e7985	Use cuDNN's handle pool mechanism to manage cublas handles (#29233 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/6962 The PR implements the handle pool mechanism for cublas as suggested by mcarilli in https://github.com/pytorch/pytorch/issues/6962#issuecomment-530563872. ~~I didn't add any unit test here yet because as mcarilli mentioned:~~ > ~~On my local machine, out of curiosity I also rewrote that test to use gemms instead of convolutions. The race condition seemed rarer, but the test did show that cublas use is not thread safe. I can share the script if you want.~~ ~~Please share your script with me mcarilli. And if the race condition is rare, would it still be possible for the CI to detect it?~~ cc: colesbury Pull Request resolved: https://github.com/pytorch/pytorch/pull/29233 Differential Revision: D18372007 Pulled By: ezyang fbshipit-source-id: 3492bf13410598e8452e89cf4e3e63e8df9c8c3d	2019-11-07 12:50:18 -08:00
Gao, Xiang	2d2fe14a60	Install CUDA for clang-tidy (#27967 ) Summary: fixes: https://github.com/pytorch/pytorch/issues/28009 clang-tidy is reporting `'cuda_runtime_api.h' file not found` when a PR modifies a file that includes this header. Installation script taken from the official site: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=debnetwork Pull Request resolved: https://github.com/pytorch/pytorch/pull/27967 Differential Revision: D17952383 Pulled By: ezyang fbshipit-source-id: 85807d93bd46eb902a84b2126784349ce3a01cfa	2019-10-16 10:02:19 -07:00
Jerry Ma	1610ea8ef8	Comprehensive-ish instrumentation for CUDA memory allocator (#27361 ) Summary: Adds comprehensive memory instrumentation to the CUDA caching memory allocator. # Counters Added comprehensive instrumentation for the following stats: - Allocation requests (`allocation`) - Allocated memory (`allocated_bytes`) - Reserved segments from cudaMalloc (`segment`) - Reserved memory (`reserved_bytes`) - Active memory blocks (`active`) - Active memory (`active_bytes`) - Inactive, non-releasable blocks (`inactive_split`) - Inactive, non-releasable memory (`inactive_split_bytes`) - Number of failed cudaMalloc calls that result in a cache flush and retry (`cuda_malloc_retries`) - Number of OOMs (`num_ooms`) Except for the last two, these stats are segmented between all memory, large blocks, and small blocks. Along with the current value of each stat, historical counts of allocs/frees as well as peak usage are tracked by the allocator. # Snapshots Added the capability to get a "memory snapshot" – that is, to generate a complete dump of the allocator block/segment state. # Implementation: major changes - Added `torch.cuda.memory_stats()` (and associated C++ changes) which returns all instrumented stats as a dictionary. - Added `torch.cuda.snapshot()` (and associated C++ changes) which returns a complete dump of the allocator block/segment state as a list of segments. - Added memory summary generator in `torch.cuda.memory_summary()` for ease of client access to the instrumentation stats. Potentially useful to dump when catching OOMs. Sample output here: https://pastebin.com/uKZjtupq # Implementation: minor changes - Add error-checking helper functions for Python dicts and lists in `torch/csrc/utils/`. - Existing memory management functions in `torch.cuda` moved from `__init__.py` to `memory.py` and star-imported to the main CUDA module. - Add various helper functions to `torch.cuda` to return individual items from `torch.cuda.memory_stats()`. 
- `torch.cuda.reset_max_memory_cached()` and `torch.cuda.reset_max_memory_allocated()` are deprecated in favor of `reset_peak_stats`. It's a bit difficult to think of a case where only one of those stats should be reset, and IMO this makes the peak stats collectively more consistent. - `torch.cuda.memory_cached()` and `torch.cuda.max_memory_cached()` are deprecated in favor of `*memory_reserved()`. - Style (add access modifiers in the allocator class, random nit fixes, etc.) # Testing - Added consistency check for stats in `test_cuda.py`. This verifies that the data from `memory_stats()` is faithful to the data from `snapshot()`. - Ran on various basic workflows (toy example, CIFAR) # Performance Running the following speed benchmark: https://pastebin.com/UNndQg50 - Before this PR: 45.98 microseconds per tensor creation - After this PR: 46.65 microseconds per tensor creation Pull Request resolved: https://github.com/pytorch/pytorch/pull/27361 Differential Revision: D17758747 Pulled By: jma127 fbshipit-source-id: 5a84e82d696c40c505646b9a1b4e0c3bba38aeb6	2019-10-08 15:42:48 -07:00
Ralf Gommers	1b4951d3a5	Fix remaining invalid function cast warnings that show up with GCC 8/9 (#26104 ) Summary: Follow-up to gh-25483, more of the same fixes for warnings like: ``` ../torch/csrc/autograd/python_variable.cpp:503:31: warning: cast between incompatible function types from ‘PyObject* ()(THPVariable)’ {aka ‘_object* ()(THPVariable)’} to ‘getter’ {aka ‘_object* ()(_object, void*)’} [-Wcast-function-type] 503 \| {"_backward_hooks", (getter)THPVariable_get_backwards_hooks, (setter)THPVariable_set_backwards_hooks, nullptr, nullptr}, \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` This takes the build log output for a full rebuild with GCC 9.1 from ~10,000 to ~7,000 lines. `clang-tidy` is going to complain, no way around that - see discussion at the end of gh-25483. Pull Request resolved: https://github.com/pytorch/pytorch/pull/26104 Differential Revision: D17396831 Pulled By: ezyang fbshipit-source-id: d71696bfe4dbe25519e4bcb7753151c118bd39f7	2019-09-17 07:43:37 -07:00
Guanheng Zhang	b22adeb007	Fix error message for a wrong fork CUDA (#23322 ) Summary: Re-land https://github.com/pytorch/pytorch/pull/23030 Pull Request resolved: https://github.com/pytorch/pytorch/pull/23322 Differential Revision: D16469442 Pulled By: zhangguanheng66 fbshipit-source-id: 70b63ab6265efa3f289111ef0ce46bb3c0d353bc	2019-07-25 12:58:14 -07:00
Edward Yang	1f608d09cf	Revert D16440000: [pytorch][PR] Re-land "Fix error message for a wrong fork CUDA" Differential Revision: D16440000 Original commit changeset: e05683275522 fbshipit-source-id: b688f24c1e6d3d8f63c2d415262a3f0ab1b85914	2019-07-24 12:05:36 -07:00
Guanheng Zhang	aa660b8eb7	Re-land "Fix error message for a wrong fork CUDA" (#23209 ) Summary: Re-land https://github.com/pytorch/pytorch/pull/23030 Pull Request resolved: https://github.com/pytorch/pytorch/pull/23209 Differential Revision: D16440000 Pulled By: zhangguanheng66 fbshipit-source-id: e05683275522835a33d5a7e6d76b7e94774e4d98	2019-07-24 07:01:04 -07:00
Jesse Hellemn	06d11f0434	Revert D16368004: [pytorch][PR] Fix error message for a wrong fork CUDA Differential Revision: D16368004 Original commit changeset: 44b6977790ce fbshipit-source-id: c81a232bd52219e56a19c64650c4b6dedeb167cb	2019-07-22 18:46:48 -07:00
Guanheng Zhang	a6e45a69a8	Fix error message for a wrong fork CUDA (#23030 ) Summary: Fix https://github.com/pytorch/pytorch/issues/17357 Unblock 1.2 release. Pull Request resolved: https://github.com/pytorch/pytorch/pull/23030 Differential Revision: D16368004 Pulled By: zhangguanheng66 fbshipit-source-id: 44b6977790ce768efa4777bae41d4b26dae5f288	2019-07-22 15:04:32 -07:00
SsnL	8482efb203	pin_memory malloc now uses existing context if available. (#22229 ) Summary: This is achieved by using `cuDevicePrimaryCtxGetState` as a way to check whether a primary context exists on a device. It is not too slow, from this benchmark of a single call to it on CUDA 10.1, Titan Xp, driver 415.27: ``` --------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------- BM_cuDevicePrimaryCtxGetState 301 ns 301 ns 2319746 ``` Commits: 1. Add `CUDAHooks::getDeviceWithPrimaryContext` which returns a device index with primary context (if exists). Link `c10/cuda` against `libcuda` for device API calls. 2. Use `getDeviceWithPrimaryContext` to check primary context in `pin_memory`. Fix `OptionalDeviceGuard` doc. 3. Refactor `test_cuda_primary_ctx.py` to support multiple tests. Add test for this in that file. Fixes https://github.com/pytorch/pytorch/issues/21081. Pull Request resolved: https://github.com/pytorch/pytorch/pull/22229 Differential Revision: D16170194 Pulled By: zou3519 fbshipit-source-id: 485a45f211b7844c9e69c63f3b3b75194a796c5d	2019-07-16 10:18:30 -07:00
Iurii Zdebskyi	3a8d7463bd	Enabled BFloat16 storage (#21523 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/21523 ghimport-source-id: 698b3cbd6b21c09b9ff8bf8011980df8e35c33b0 Test Plan: Imported from OSS Differential Revision: D15819368 Pulled By: izdeby fbshipit-source-id: f6b3bba7b3ca8ee677bd80a231dbb3920c07d61c	2019-07-09 21:51:06 -07:00
Syed Tousif Ahmed	effcc398c4	Refactor Random Number Generators in ATen (#21555 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/21555 ghimport-source-id: dd900a8c3e1ef9ef1e011b8bb5476626d18cc462 Test Plan: Imported from OSS Differential Revision: D15875780 Pulled By: ezyang fbshipit-source-id: 6e04e90af62ab9c9593d74f344a3a084aaaf6f43	2019-06-19 13:54:09 -07:00
Will Feng	8cde4c4d22	Remove Variable::Impl and DifferentiableViewImpl (#17072 ) Summary: As part of the Variable/Tensor merge work: https://github.com/pytorch/pytorch/issues/13638, we make the following changes in this PR: 1. Remove the `Variable::Impl` class and the `DifferentiableViewImpl` class 2. Change all `Variable.data()` call sites to either use `Variable` directly, or use `Variable.tensor_data()` 3. Remove `Variable.data()` API 3. Add `Variable.variable_data()` that matches `tensor.data` in Python API, which creates a new `Variable` that shares the same storage and tensor metadata with the original `Variable`, but with a completely new autograd history. After this PR, Variable doesn't wrap a Tensor internally anymore, and both Variable and Tensor use the same TensorImpl class as its `impl_`. The only difference is that Variable always has AutogradMeta in its TensorImpl, but Tensor doesn't. Note that this PR is BC-breaking in the following use cases: Use Case 1: Previously, `x.data = y` works even if `x` and `y` are of different TensorImpl type (e.g. `x` is a CPU dense tensor whose impl is of type TensorImpl, while `y` is a CPU sparse tensor whose impl is of type SparseTensorImpl). However, after this PR, `x.data = y` doesn't work anymore if `x` and `y` are of different TensorImpl type, because the underlying implementation `variable.set_data(tensor)` no longer works if `variable` and `tensor` have different TensorImpl type. Use Case 2: If a tensor `x`'s `grad` is sparse, accumulating dense gradients to `x` will change the tensor that `x.grad` is pointing to. 
This is better illustrated with the following example: ```python params = torch.tensor([1.5, 1.5]).requires_grad_() with torch.no_grad(): # Change gradient to a sparse tensor params.grad = torch.sparse_coo_tensor(torch.tensor([[1, 1]]).long(), torch.tensor([1., 1.])) grad_saved = params.grad params.backward(torch.tensor([1.5, 1.5])) assert id(grad_saved) == id(params.grad) # This will fail after this PR ``` The assertion in the last line will fail after this PR, because adding dense gradients to sparse gradients will change the `params.grad` tensor reference. Pull Request resolved: https://github.com/pytorch/pytorch/pull/17072 Differential Revision: D14075257 Pulled By: yf225 fbshipit-source-id: 0e681df641270dea586042dd26db59f2e76b5957	2019-05-23 21:09:04 -07:00
Roy Li	53bb739b67	Remove uses of TypeID (#19452 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/19452 ghimport-source-id: 816ae7fe1a18d76f064d5796dec44dca6a138a21 Differential Revision: D15009920 Pulled By: li-roy fbshipit-source-id: 722f05a927528148555561da62839f84dba645c6	2019-04-19 12:07:35 -07:00
Pieter Noordhuis	563de88aa5	Revert D14909203: Remove usages of TypeID Differential Revision: D14909203 Original commit changeset: d716179c484a fbshipit-source-id: 992ff1fcd6d35d3f2ae768c7e164b7a0ba871914	2019-04-18 17:47:39 -07:00
Roy Li	01d7d3de46	Remove usages of TypeID (#19183 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/19183 ghimport-source-id: 9af190b072523459fa61e5e79419b88ac8586a4d Differential Revision: D14909203 Pulled By: li-roy fbshipit-source-id: d716179c484aebfe3ec30087c5ecd4a11848ffc3	2019-04-17 23:55:47 -07:00
Soumith Chintala	b5d8844bbe	push magma init into lazyInitCUDA (#18527 ) Summary: Tries to fix C++ API's usage of MAGMA-based functions. Attempts to Fix https://github.com/pytorch/pytorch/issues/18074 Pull Request resolved: https://github.com/pytorch/pytorch/pull/18527 Differential Revision: D14691694 Pulled By: soumith fbshipit-source-id: dd04e74418e486d73ea4a92193ddf79352ed71ba	2019-04-03 12:47:34 -07:00
Edward Yang	515238e0a5	Unify cudaGetDeviceCount implementations. (#18445 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18445 ghimport-source-id: 30d018737bf6989bc68b7e3676f44e0ca6141fde Stack from [ghstack](https://github.com/ezyang/ghstack): * #18242 Test running a CUDA build on CPU machine. * #18445 Unify cudaGetDeviceCount implementations. I went about doing this by searching for calls to cudaGetDeviceCount, and then methodically replacing them with references to c10::cuda::device_count() or at::cuda::device_count(). There is a point to doing this: the various implementations wildly differed in their handling of what to do when cudaGetDeviceCount returns an error. The final standardized behavior is that all errors are swallowed and we return device count of zero. This indirectly fixes running CUDA builds on CPU, which was broken in #17847. I added 'noexcept' to the 'deviceCount' virtual method on DeviceGuardImpl. This is a BC-breaking change for anyone inheriting from DeviceGuardImpl but all you need to do is put 'noexcept' on your method and it is backwards compatible with older libtorch. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: D14612189 fbshipit-source-id: 3c8d186e3dd623c0e27625212c7ce30f75d943cb	2019-03-26 09:50:14 -07:00
Vitaly Fedyunin	5653a914f7	Implement reference counting for shared IPC CUDA tensors (#16854 ) Summary: This is to fix #16141 and similar issues. The idea is to track a reference to every shared CUDA Storage and deallocate memory only after a consumer process deallocates received Storage. ezyang Done with cleanup. Same (insignificantly better) performance as in file-per-share solution, but handles millions of shared tensors easily. Note [ ] documentation in progress. Pull Request resolved: https://github.com/pytorch/pytorch/pull/16854 Differential Revision: D13994490 Pulled By: VitalyFedyunin fbshipit-source-id: 565148ec3ac4fafb32d37fde0486b325bed6fbd1	2019-03-25 10:24:38 -07:00
Roy Li	7aae51cded	Replace tensor.type().scalarType() calls with tensor.scalar_type() Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17515 Reviewed By: ezyang Differential Revision: D14233250 fbshipit-source-id: 6c7af8d2291c0c2b148001b30cf03834f34366c0	2019-03-08 14:08:18 -08:00
Will Feng	393c97fda7	Fix variable checking in THCPModule_setRNGState (#17474 ) Summary: See https://github.com/pytorch/pytorch/pull/16325/files#r259576901 Pull Request resolved: https://github.com/pytorch/pytorch/pull/17474 Differential Revision: D14209549 Pulled By: yf225 fbshipit-source-id: 2ae091955ae17f5d1540f7d465739c4809c327f8	2019-02-25 11:05:51 -08:00
Iurii Zdebskyi	444039c47b	Bool tensor. Part 0: Boolean storage implementation (#16810 ) Summary: This is the first commit from a series of planned changes in order to add boolean tensors to PyTorch. The whole plan looks like this: 0. Storage Implementation (this change) 1. Tensor Creation. 2. Tensor Conversions. 3. Tensor Indexing. 4. Tensor Operations. 5. Back compatibility related changes. This feature was requested by the community: https://github.com/pytorch/pytorch/issues/4764 https://github.com/pytorch/pytorch/issues/4219 https://github.com/pytorch/pytorch/issues/4288 Change: Added boolean type to the Storage class for CPU and CUDA backends. Tested via: 1. unit tests 2. running this: -> import torch -> torch.BoolStorage <class 'torch.BoolStorage'> -> torch.cuda.BoolStorage <class 'torch.cuda.BoolStorage'> Pull Request resolved: https://github.com/pytorch/pytorch/pull/16810 Reviewed By: gchanan Differential Revision: D14087246 Pulled By: izdeby fbshipit-source-id: 042642ced1cb0fd1bb6bff05f9ca871a5c54ee5e	2019-02-19 08:22:13 -08:00
Will Feng	202eaa4ef4	Use non-Variable type for callsites that check type equality (#16325 ) Summary: When Variable and Tensor are merged, the dynamic type of the tensors passed to certain functions will become variables, and expecting `type()` on those variables to still return non-Variable types will cause type mismatch error. One way to fix this problem is to use the thread-local guard `at::AutoNonVariableTypeMode` to force `type()` to return non-Variable type, but ideally we want to limit the use of `at::AutoNonVariableTypeMode` to be only in VariableType.cpp. Another way to fix the problem is to use `at::globalContext().getNonVariableType()` instead to get the non-Variable type of the tensor, which is what this PR is trying to achieve. Pull Request resolved: https://github.com/pytorch/pytorch/pull/16325 Differential Revision: D14012022 Pulled By: yf225 fbshipit-source-id: 77ef1d2a02f78bff0063bdd72596e34046f1e00d	2019-02-10 09:47:50 -08:00
Edward Yang	e936a69085	Move THCCachingAllocator to c10_cuda. (#16119 ) Summary: Some renaming and renamespacing also took place. I was originally planning not to do anything, but it turns out that it was easier to make HIPify work by using a namespace CUDACachingAllocator:: rather than THCCachingAllocator_, since :: is a word boundary but _ is not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/16119 Reviewed By: smessmer Differential Revision: D13718768 fbshipit-source-id: 884a481d99027fd3e34471c020f826aa12225656	2019-01-24 12:06:56 -08:00
Edward Yang	24b50f1411	Remove unnecessary includes and headers from THCCachingAllocator, move to at::cuda:: namespace (#16117 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16117 This means I can move it to c10_cuda with minimal fuss. Reviewed By: smessmer Differential Revision: D13717836 fbshipit-source-id: a94c7dc649af64542480fc1c226b289588886c00	2019-01-24 12:06:54 -08:00
Shen Li	2235fb256e	Add default_stream() and enhance current_stream() (#16200 ) Summary: Closes #16156 Pull Request resolved: https://github.com/pytorch/pytorch/pull/16200 Differential Revision: D13747455 Pulled By: mrshenli fbshipit-source-id: 00c0d5f341c3ac7a757bdb4631a17e11fbc6d3ec	2019-01-22 14:35:19 -08:00
Shen Li	292edfb087	Change current device in stream context manager if necessary (#16128 ) Summary: Fixes #16019 Pull Request resolved: https://github.com/pytorch/pytorch/pull/16128 Differential Revision: D13721850 Pulled By: mrshenli fbshipit-source-id: 422c6c0b97c1cd46e127e265b532cb8c74a3aac5	2019-01-18 12:39:51 -08:00
Edward Yang	411173757e	Rename away uses of THAllocator and THCDeviceAllocator (#16061 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16061 I discovered I needed to delete these names in preparation of moving THCCachingAllocator to c10_cuda; might as well also fix all the other sites too. Reviewed By: dzhulgakov Differential Revision: D13686869 fbshipit-source-id: e8cc55d39ac4bfd3e3a22c761f89a7a111ce5f5e	2019-01-16 05:36:47 -08:00


365 Commits