pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-27 00:54:52 +08:00

Author	SHA1	Message	Date
Can Balioglu	081b56fd41	Improve readability of cuda_lazy_init (#80788 ) This PR cleans up the implementation of `cuda_lazy_init.cpp` and improves its readability. No behavioral changes are introduced. Pull Request resolved: https://github.com/pytorch/pytorch/pull/80788 Approved by: https://github.com/ezyang	2022-07-04 16:47:11 +00:00
jjsjann123	9e86796fe3	simple c10 implementation for std::call_once (#78051 ) A long standing bug on std::call_once: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66146 It could hang during re-entry after an exception handling. Added a c10 implementation yielding a bulky mutex. Not the most efficient thing but at least it shouldn't hang. Pull Request resolved: https://github.com/pytorch/pytorch/pull/78051 Approved by: https://github.com/albanD	2022-06-28 15:47:03 +00:00
Michael Suo	30fb2c4aba	[lint] autoformat test/cpp and torch/csrc Let's have some fun. Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828 Approved by: https://github.com/ezyang	2022-06-11 21:11:16 +00:00
Sherlock Huang	6db8440f35	Python Jiterator supports multiple outputs (#78139 ) This PR is part3. Part1: https://github.com/pytorch/pytorch/pull/77902 Part2: https://github.com/pytorch/pytorch/pull/77921 Python Jiterator now supports returning multiple outputs ``` fn = torch.cuda.jiterator._create_multi_output_jit_fn( """ template <typename T> T binary_2outputs(T i0, T i1, T& out0, T& out1) { out0 = i0 + i1; out1 = i0 - i1; } """, num_outputs=2) x = torch.rand(3, device='cuda') y = torch.rand(3, device='cuda') out0, out1 = fn(x, y) torch.allclose(out0, x+y) torch.allclose(out1, x-y) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/78139 Approved by: https://github.com/ngimel	2022-05-24 21:52:56 +00:00
Natalia Gimelshein	4ea176ea57	expose fast get_current_stream (#78165 ) Expose fast no-frills version of getting raw `cudaStream_t` in python (200 ns instead of 4 us) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78165 Approved by: https://github.com/SherlockNoMad, https://github.com/soumith, https://github.com/gchanan	2022-05-24 15:54:47 +00:00
Kurt Mohler	aea6e2c396	Merge torch.cuda._UntypedStorage into torch._UntypedStorage (#75459 ) Fixes #74933 Pull Request resolved: https://github.com/pytorch/pytorch/pull/75459 Approved by: https://github.com/ezyang	2022-05-19 13:54:39 +00:00
Michael Carilli	929f1d5317	[RELAND] Adds torch.cuda.is_current_stream_capturing (#77789 ) Resubmit of https://github.com/pytorch/pytorch/pull/77673, which was reverted due to Windows test failures: https://github.com/pytorch/pytorch/pull/77673#issuecomment-1130425845. I suspect these failures happened because I don't explicitly set a side stream for graph capture in the new test. Not setting a side stream explicitly is alright on Linux because cuda tests implicitly use a side stream. I think Windows cuda tests implicitly use the default stream, breaking capture and leaving the backend in a bad state. Other graphs tests explicitly set side streams and don't error in Windows builds, so i'm 95% sure doing the same for the new test will work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/77789 Approved by: https://github.com/ezyang	2022-05-18 23:18:53 +00:00
Jeff Daily	de86146c61	rocblas alt impl during backward pass only (#71881 ) In preparation of adopting future rocblas library options, it is necessary to track when the backward pass of training is executing. The scope-based helper class `BackwardPassGuard` is provided to toggle state. Pull Request resolved: https://github.com/pytorch/pytorch/pull/71881 Approved by: https://github.com/albanD	2022-05-18 19:42:58 +00:00
PyTorch MergeBot	0d8a0f186b	Revert "Adds torch.cuda.is_current_stream_capturing (#77673 )" This reverts commit d03d43df527e48771875537ad20212d5cb333215. Reverted https://github.com/pytorch/pytorch/pull/77673 on behalf of https://github.com/suo	2022-05-18 19:31:49 +00:00
Michael Carilli	d03d43df52	Adds torch.cuda.is_current_stream_capturing (#77673 ) Exposes a way to query if CUDA graph capture is underway on the current stream. Pull Request resolved: https://github.com/pytorch/pytorch/pull/77673 Approved by: https://github.com/ezyang	2022-05-18 16:46:35 +00:00
Sherlockk Huang	8b6a78f39f	Python Interface for Jiterator This PR allows user to author a CUDA kernel in python. ``` from torch.cuda.jiterator import create_jit_fn code_string = "template <typename T> T my_kernel(T x, T y, T alpha) { return -x * y + x - y + alpha; }" jitted_fn = create_jit_fn(code_string, alpha=0) a = torch.rand(3, device='cuda') b = torch.rand(3, device='cuda') result = jitted_fn(a, b, alpha=1.0) ``` Limitations: - Only supports elementwise kernel - 1~8 tensor inputs (empty input, e.g. factory methods, is not supported) - inputs tensors must live in cuda device - cpu Scalar is not supported - kwargs must be pre-declared when calling create_jit_fn - kwargs must be convertible to at::Scalar, one of float64, int64_t, bool. (complex not support for now) TODOs: - [x] consolidate union and c10::variant implementation - [x] plug into existing op testing framework - [ ] rename files, place files in the right folder - [ ] place util functions in the right file - [x] enforce assumptions in python interface e.g <8 inputs, kwargs types - [x] Add user-facing documentation Pull Request resolved: https://github.com/pytorch/pytorch/pull/76394 Approved by: https://github.com/mruberry	2022-05-06 18:44:28 +00:00
Andrey Talman	efd274bbcb	Fix for windows builds with python 3.10 , getting rid of ssize_t (ssize_t is not a C++ defined type) (#71390 ) Summary: Fix for windows builds with python 3.10 , getting rid of ssize_t Here is the completed bin build : https://app.circleci.com/pipelines/github/pytorch/pytorch/441527/workflows/144edb79-b398-4d70-92fe-b63158c1b439/jobs/16954881 Pull Request resolved: https://github.com/pytorch/pytorch/pull/71390 Reviewed By: samdow Differential Revision: D33637686 Pulled By: atalman fbshipit-source-id: fcdfca672dc20385a3d2339c20e69bd2d1717e88 (cherry picked from commit 2ac58b0dc13f152bea180dd3d64b7c36fe0ba755)	2022-01-18 22:12:41 +00:00
Shintaro Iwasaki	5cae40c169	[pytorch][aten][cuda] move CUDAGeneratorImpl.h to ATen/cuda (#70650 ) Summary: This patch moves a CUDA-specific file, `CUDAGeneratorImpl.h` to `ATen/cuda` as the following TODO comment in `CUDAGeneratorImpl.h` suggests: ``` // TODO: this file should be in ATen/cuda, not top level ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/70650 Reviewed By: jianyuh, xw285cornell Differential Revision: D33414890 Pulled By: shintaro-iwasaki fbshipit-source-id: 4ff839205f4e4ea4c8767f164d583eb7072f1b8b	2022-01-10 22:27:04 -08:00
Peter Bell	b08d64202a	Remove THGeneral (#69041 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69041 `TH_CONCAT_{N}` is still being used by THP so I've moved that into it's own header but all the compiled code is gone. Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D32872477 Pulled By: ngimel fbshipit-source-id: 06c82d8f96dbcee0715be407c61dfc7d7e8be47a	2021-12-13 16:14:28 -08:00
Peter Bell	e279963eef	Remove remaining THC code (#69039 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69039 Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D32872476 Pulled By: ngimel fbshipit-source-id: 7972aacc24aef9450fb59b707ed6396c501bcb31	2021-12-08 12:18:08 -08:00
Peter Bell	bf01cd5228	Move THC_sleep to ATen (#69038 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69038 Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D32872479 Pulled By: ngimel fbshipit-source-id: 97c7592b16eee2ecc66c42507c358aa92cc8ee50	2021-12-06 10:20:43 -08:00
Kurt Mohler	5883523c1d	Remove dtype from torch.Storage and use only torch.ByteStorage (#62030 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62030 Remove dtype tracking from Python Storage interface, remove all the different `<type>Storage` classes except for `ByteStorage`, and update serialization accordingly, while maintaining as much FC/BC as possible Fixes https://github.com/pytorch/pytorch/issues/47442 * THE SERIALIZATION FORMAT IS FULLY FC/BC. We worked very hard to make sure this is the case. We will probably want to break FC at some point to make the serialization structure of tensors make more sense, but not today. * There is now only a single torch.ByteStorage class. Methods like `Tensor.set_` no longer check that the dtype of storage is appropriate. * As we no longer know what dtype of a storage is, we've removed the size method from Storage, replacing it with nbytes. This is to help catch otherwise silent errors where you confuse number of elements with number of bytes. * `Storage._new_shared` takes a `nbytes` kwarg and will reject previous positional only calls. `Storage._new_with_file` and `_set_from_file` require explicit element size arguments. * It's no longer possible to convert storages to different types using the float/double/etc methods. Instead, do the conversion using a tensor. * It's no longer possible to allocate a typed storage directly using FloatStorage/DoubleStorage/etc constructors. Instead, construct a tensor and extract its storage. The classes still exist but they are used purely for unpickling. * The preexisting serialization format stores dtype with storage, and in fact this dtype is used to determine the dtype of the tensor overall. To accommodate this case, we introduce a new TypedStorage concept that exists only during unpickling time which is used to temporarily store the dtype so we can construct a tensor. If you overrode the handling of pickling/unpickling, you MUST add handling for TypedStorage or your serialization code will degrade to standard file-based serialization. Original pull request: https://github.com/pytorch/pytorch/pull/59671 Reviewed By: soulitzer, ngimel Differential Revision: D29466819 Pulled By: ezyang fbshipit-source-id: 4a14e5d3c2b08e06e558683d97f7378a3180b00e	2021-10-05 13:50:34 -07:00
Peter Bell	f6dfac6974	Migrate THCCachingHostAllocator to ATen (#65746 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65746 This also removes the cudaHostAllocator field on THCState, since there doesn't seem to be an API anywhere for customizing it. Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D31236630 Pulled By: ngimel fbshipit-source-id: 2a8e756222ae70565e77f8e7139d60ec5be32276	2021-09-30 21:26:38 -07:00
Pruthvi Madugundu	085e2f7bdd	[ROCm] Changes not to rely on CUDA_VERSION or HIP_VERSION (#65610 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610 - Replace HIP_PLATFORM_HCC with USE_ROCM - Dont rely on CUDA_VERSION or HIP_VERSION and use USE_ROCM and ROCM_VERSION. - In the next PR - Will be removing the mapping from CUDA_VERSION to HIP_VERSION and CUDA to HIP in hipify. - HIP_PLATFORM_HCC is deprecated, so will add HIP_PLATFORM_AMD to support HIP host code compilation on gcc. cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd Reviewed By: jbschlosser Differential Revision: D30909053 Pulled By: ezyang fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06	2021-09-29 09:55:43 -07:00
Natalia Gimelshein	d783617216	enable warnings on cuda synchronization (#62092 ) Summary: This creates `torch.cuda.set_warn_on_synchronization()` function that would warn or error when synchronizing operation is performed. We could wrap it in a context manager for ease of use, but it would be a lie, because it sets global, and not thread-local state. Since it's intended for debugging, maybe that's ok though. As all `torch.cuda.*` functions, it's going through CPython, not pybind, so the argument is converted to long before being passed to c10 function. I'll make python argument a python enum class, but without pybind it'll still have to go thourgh long conversion. For a test script ``` import torch torch.cuda.set_warn_on_synchronization(1) x=torch.randn(10, device="cuda") x.nonzero() y=torch.randn((), device="cuda") if y: print("something") torch.multinomial(x.abs(), 10, replacement=False) torch.randperm(20000, device="cuda") ind = torch.randint(10, (3,), device="cuda") mask = torch.randint(2, (10,), device="cuda", dtype=torch.bool) val = torch.randn((), device="cuda") x[mask]=1. x[mask] = val torch.cuda.synchronize() ``` the output is ``` /../playground/sync_warn_test.py:4: UserWarning: called a synchronizing operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:145.) x.nonzero() /../playground/sync_warn_test.py:7: UserWarning: called a synchronizing operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:145.) if y: something /../playground/sync_warn_test.py:9: UserWarning: called a synchronizing operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:145.) torch.multinomial(x.abs(), 10, replacement=False) /../playground/sync_warn_test.py:15: UserWarning: called a synchronizing operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:145.) x[mask] = val ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/62092 Reviewed By: mruberry Differential Revision: D29968792 Pulled By: ngimel fbshipit-source-id: cc6f817212c164727ed99ecf6ab050dc29631b9e	2021-07-30 09:13:01 -07:00
Nikita Shulga	a9b0a921d5	Disable `avoid-non-const-global-variables` lint check (#62008 ) Summary: As GoogleTest `TEST` macro is non-compliant with it as well as `DEFINE_DISPATCH` All changes but the ones to `.clang-tidy` are generated using following script: ``` for i in `find . -type f -iname ".c" -or -iname "*.h"\|xargs grep cppcoreguidelines-avoid-non-const-global-variables\|cut -f1 -d:\|sort\|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008 Reviewed By: driazati, r-barnes Differential Revision: D29838584 Pulled By: malfet fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13	2021-07-22 18:04:40 -07:00
Mike Guo	6ecc1a4c4f	Make pytorch clang-tidy clean (#60649 ) Summary: This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master. I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver): ```bash python3 setup.py develop # Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options python3 tools/clang_tidy.py \ -j \ -s \ -k \ -v \ --paths torch/csrc/ \ -g"-torch/csrc/jit/passes/onnx/helper.cpp" \ -g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \ -g"-torch/csrc/jit/serialization/onnx.cpp" \ -g"-torch/csrc/jit/serialization/export.cpp" \ -g"-torch/csrc/jit/serialization/import.cpp" \ -g"-torch/csrc/jit/serialization/import_legacy.cpp" \ -g"-torch/csrc/onnx/init.cpp" \ -g"-torch/csrc/cuda/nccl." \ -g"-torch/csrc/cuda/python_nccl.cpp" \ -g"-torch/csrc/autograd/FunctionsManual.cpp" \ -g"-torch/csrc/generic/.cpp" \ -g"-torch/csrc/jit/codegen/cuda/runtime/*" \ -g"-torch/csrc/deploy/interpreter/interpreter.cpp" \ -g"-torch/csrc/deploy/interpreter/interpreter.h" \ -g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \ -g"-torch/csrc/deploy/interpreter/test_main.cpp" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649 Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors. Reviewed By: walterddr, janeyx99 Differential Revision: D29504258 Pulled By: 1ntEgr8 fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e	2021-07-01 12:21:07 -07:00
Edward Yang	85af24f52b	Remove some unnecessary functions from CUDAHooks (#59655 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59655 CUDAHooks is to be used solely when you need to call into CUDA functionality from a context where you cannot directly link to CUDA libraries. Neither of hasPrimaryContext nor getDevceIndexWithPrimaryContext (sic) needs to be used in such contexts. By moving them out of CUDAHooks and calling them directly a dynamic dispatch can be skipped. I also fixed the typo in getDev(i)ceIndexWithPrimaryContext Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D28972946 Pulled By: ezyang fbshipit-source-id: edcd7a7b62aec97928f07fbf3bf413b9fb027517	2021-06-28 10:38:51 -07:00
Michael Wootton	2f3be2735f	Don't split oversize cached blocks (#44742 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/35901 This change is designed to prevent fragmentation in the Caching Allocator. Permissive block splitting in the allocator allows very large blocks to be split into many pieces. Once split too finely it is unlikely all pieces will be 'free' at that same time so the original allocation can never be returned. Anecdotally, we've seen a model run out of memory failing to alloc a 50 MB block on a 32 GB card while the caching allocator is holding 13 GB of 'split free blocks' Approach: - Large blocks above a certain size are designated "oversize". This limit is currently set 1 decade above large, 200 MB - Oversize blocks can not be split - Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block) - In lieu of splitting oversize blocks there is a mechanism to quickly free a single oversize block (to the system allocator) to allow an appropriate size block to be allocated. This will be activated under memory pressure and will prevent _release_cached_blocks()_ from triggering Initial performance tests show this is similar or quicker than the original strategy. Additional tests are ongoing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742 Reviewed By: zou3519 Differential Revision: D29186394 Pulled By: ezyang fbshipit-source-id: c88918836db3f51df59de6d1b3e03602ebe306a9	2021-06-21 11:46:08 -07:00
Richard Barnes	e3d75b8475	irange for PyTorch sans jit (#59481 ) Summary: Switches most of the simple for loops outside of `jit` directories to use `c10::irange`. Generated with D28874212. Pull Request resolved: https://github.com/pytorch/pytorch/pull/59481 Test Plan: Sandcastle Reviewed By: ngimel Differential Revision: D28909681 fbshipit-source-id: ec9ab1bd602933238d9d0f73d4d8d027b75d9d85	2021-06-09 14:46:11 -07:00
Richard Barnes	f914ab193e	Use irange in a few places in torch/csrc (#55100 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55100 Test Plan: Sandcastle Reviewed By: ngimel Differential Revision: D27447708 fbshipit-source-id: 4f21133bd76f29d73a51befcae649ab55637b36e	2021-06-03 00:58:51 -07:00
Nikita Shulga	3a66a1cb99	[clang-tidy] Exclude cppcoreguidelines-avoid-magic-numbers (#57841 ) Summary: Add cppcoreguidelines-avoid-magic-numbers exclusion to clang-tidy Remove existing nolint warnings using following script: ``` for file in `git ls-files \| grep -v \.py`; do gsed '/^ *\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)/d' -i $file; done ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/57841 Reviewed By: samestep Differential Revision: D28295045 Pulled By: malfet fbshipit-source-id: 7c6e8d1213c9593f169ed3df6a916498f1a97163	2021-05-07 20:02:33 -07:00
Nikita Shulga	eac02f85cf	Fix more clang-tidy errors (#57235 ) Summary: In my last PR I've missed CUDA and distributed folders, fixing this now This change is autogenerated by `python tool/clang_tidy.py -s` Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235 Reviewed By: janeyx99 Differential Revision: D28084444 Pulled By: malfet fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda	2021-04-28 23:29:10 -07:00
Natalia Gimelshein	f94c95a2dd	Revert D23752058: [pytorch][PR] Don't split oversize cached blocks Test Plan: revert-hammer Differential Revision: D23752058 (`67dcd62310`) Original commit changeset: ccb7c13e3cf8 fbshipit-source-id: 12ae9702135ea510e9714ed97fb75ca3b9f97c27	2021-04-14 09:24:08 -07:00
Michael Wootton	67dcd62310	Don't split oversize cached blocks (#44742 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/35901 This change is designed to prevent fragmentation in the Caching Allocator. Permissive block splitting in the allocator allows very large blocks to be split into many pieces. Once split too finely it is unlikely all pieces will be 'free' at that same time so the original allocation can never be returned. Anecdotally, we've seen a model run out of memory failing to alloc a 50 MB block on a 32 GB card while the caching allocator is holding 13 GB of 'split free blocks' Approach: - Large blocks above a certain size are designated "oversize". This limit is currently set 1 decade above large, 200 MB - Oversize blocks can not be split - Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block) - In lieu of splitting oversize blocks there is a mechanism to quickly free a single oversize block (to the system allocator) to allow an appropriate size block to be allocated. This will be activated under memory pressure and will prevent _release_cached_blocks()_ from triggering Initial performance tests show this is similar or quicker than the original strategy. Additional tests are ongoing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742 Reviewed By: ngimel Differential Revision: D23752058 Pulled By: ezyang fbshipit-source-id: ccb7c13e3cf8ef2707706726ac9aaac3a5e3d5c8	2021-04-14 03:04:41 -07:00
Nikita Shulga	dea529a779	Add torch.cuda.can_device_access_peer (#50446 ) Summary: And unrelying torch._C._cuda_canDeviceAccessPeer, which is a wrapper around cudaDeviceCanAccessPeer Pull Request resolved: https://github.com/pytorch/pytorch/pull/50446 Reviewed By: mrshenli Differential Revision: D25890405 Pulled By: malfet fbshipit-source-id: ef09405f115bbe73ba301d608d56cd8f8453201b	2021-01-12 20:30:45 -08:00
peterjc123	815d38395a	PyLong_{As/From}{Long/UnsignedLong} lint checks (#49280 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/45581 Pull Request resolved: https://github.com/pytorch/pytorch/pull/49280 Reviewed By: mruberry Differential Revision: D25592330 Pulled By: ezyang fbshipit-source-id: 5c16d6aed88ad1feaa7f129b4cd44c0561be2de2	2020-12-17 09:32:08 -08:00
x00480351	47aa253632	[Feature] Allow user to specify a fraction of the GPU memory. (#48172 ) Summary: Add a new function, torch.cuda.set_per_process_memory_fraction(fraction, device), to torch.cuda. Related: https://github.com/pytorch/pytorch/issues/18626 The fraction (float type, from 0 to 1) is used to limit memory of cashing allocator on GPU device . One can set it on any visible GPU. The allowed memory equals total memory * fraction. It will raise an OOM error when try to apply GPU memory more than the allowed value. This function is similar to Tensorflow's per_process_gpu_memory_fraction Note， this setting is just limit the cashing allocator in one process. If you are using multiprocess, you need to put this setting in to the subprocess to limit its GPU memory, because subprocess could have its own allocator. ## usage In some cases, one needs to split a GPU device as two parts. Can set limitation before GPU memory using. Eg. device: 0, each part takes half memory, the code as follows: ``` torch.cuda.set_per_process_memory_fraction(0.5, 0) ``` There is an example to show what it is. ```python import torch torch.cuda.set_per_process_memory_fraction(0.5, 0) torch.cuda.empty_cache() total_memory = torch.cuda.get_device_properties(0).total_memory # less than 0.5 will be ok: tmp_tensor = torch.empty(int(total_memory * 0.499), dtype=torch.int8, device='cuda') del tmp_tensordel tmp_tensor torch.cuda.empty_cache() # this allocation will raise a OOM: torch.empty(total_memory // 2, dtype=torch.int8, device='cuda') """ It raises an error as follows: RuntimeError: CUDA out of memory. Tried to allocate 5.59 GiB (GPU 0; 11.17 GiB total capacity; 0 bytes already allocated; 10.91 GiB free; 5.59 GiB allowed; 0 bytes reserved in total by PyTorch) """ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/48172 Reviewed By: bdhirsh Differential Revision: D25275381 Pulled By: VitalyFedyunin fbshipit-source-id: d8e7af31902c2eb795d416b57011cc8a22891b8f	2020-12-03 11:45:56 -08:00
Pritam Damania	2b221a9599	Remove PyCFunction casts as much as possible. (#46227 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46227 Follow up from https://github.com/pytorch/pytorch/issues/45419, in this PR I've removed as many PyCFunction casts as I could from the codebase. The only ones I didn't remove were the ones with `METH_VARARGS \| METH_KEYWORDS` which have 3 parameters instead of 2 and had to be casted. Example: ` {"copy_", (PyCFunction)(void(*)(void))THPStorage_(copy_), METH_VARARGS \| METH_KEYWORDS, nullptr},` ghstack-source-id: 114632704 Test Plan: waitforbuildbot Reviewed By: albanD Differential Revision: D24269435 fbshipit-source-id: 025cfd43a9a2a3e59f6b2951c1a78749193d77cf	2020-10-20 15:01:51 -07:00
Dmytro Dzhulgakov	06d978a9ad	[c10/cuda] Reorganize device_count() and robustly surface ASAN warnings (#42249 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42249 Main change is to bring Caffe2's superior error messages for cuda initialization into c10 and use them in all code paths. Basic logic: \| Case \| Call to device_count() \| init_cuda, e.g. allocating tensor \| \| -- \| -- \| -- \| \| all good \| non-zero \| just works \| \| no gpus \| 0, no warning \| throw exception with good message \| \| driver issues \| 0, produce warning \| throw exception with good message \| \| out of memory with ASAN \| 0, produce warning\| throw exception with ASAN message \| Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs. Other clean up changes: * cache device_count() always in a static variable * move all asan macros in c10 Test Plan: Hard to unittest because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=): ``` print('before import') import torch print('after import') print('devices: ', torch.cuda.device_count()) x = torch.tensor([1,2,3]) print('tensor creation') x = x.cuda() print('moved to cuda') ``` Reviewed By: ngimel Differential Revision: D22824329 fbshipit-source-id: 5314007313a3897fc955b02f8b21b661ae35fdf5	2020-08-05 11:39:31 -07:00
ziab	1c8217a7a6	Abstract cuda calls made from `torch_python` (#42251 ) Summary: * Make c10::cuda functions regular non-inlined functions * Add driver_version() and device_synchronize() functions With this change I don't see anymore direct calls to CUDA API when look at Modules.cpp.obj FYI malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/42251 Reviewed By: malfet Differential Revision: D22826505 Pulled By: ziab fbshipit-source-id: 8dc2f3e209d3710e2ce78411982a10e8c727573c	2020-07-30 19:18:33 -07:00
maokaiyu	9ed825746a	Use c10::cuda:: primitives rather than make CUDA runtime calls directly (#41405 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41405 Test Plan: Imported from GitHub: all checks have passed {F244195355} The Intern Builds & Tests have 127 success, 5 no signals, and 1 failure. Double check the failed test log file, the failure is result differences: - AssertionError: 0.435608434677124 != 0.4356083869934082 - AssertionError: 0.4393022060394287 != 0.4393021583557129 - AssertionError: 0.44707541465759276 != 0.44707536697387695 These are all very small numerical errors (within 0.0000001). Reviewed By: malfet Differential Revision: D22531486 Pulled By: threekindoms fbshipit-source-id: 21543ec76bb9b502885b5146c8ba5ede719be9ff	2020-07-16 15:11:57 -07:00
Nikita Shulga	b952eaf668	Preserve CUDA gencode flags (#41173 ) Summary: Add `torch._C._cuda_getArchFlags()` that returns list of architecture `torch_cuda` were compiled with Add `torch.cuda.get_arch_list()` and `torch.cuda.get_gencode_flags()` methods that returns architecture list and gencode flags PyTorch were compiled with Print warning if some of GPUs is not compatible with any of the CUBINs Pull Request resolved: https://github.com/pytorch/pytorch/pull/41173 Differential Revision: D22459998 Pulled By: malfet fbshipit-source-id: 65d40ae29e54a0ba0f3f2da11b821fdb4d452d95	2020-07-09 14:59:35 -07:00
lixinyu	4a235b87be	pop warning message for cuda module when asan is built in (#35088 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35088 Test Plan: Imported from OSS Differential Revision: D20552708 Pulled By: glaringlee fbshipit-source-id: 0b809712378596ccf83211bf8ae39cd71c27dbba	2020-06-30 08:00:37 -07:00
Nikita Shulga	8b5732e8ad	Move `torch.cuda` annotations inline (#40075 ) Summary: Also enable `torch.cuda` typechecking Pull Request resolved: https://github.com/pytorch/pytorch/pull/40075 Differential Revision: D22121275 Pulled By: malfet fbshipit-source-id: dbecef09911334e8f3d87f5ecab66349da9f2325	2020-06-18 15:52:29 -07:00
anjali411	1f09f7ea44	Python API for Complex Storage and storage copy logic (#35771 ) Summary: Following up on this: https://github.com/pytorch/pytorch/pull/35851 cross dtype storage copy is not being used internally, so I have not included cross dtype copy for complex. Pull Request resolved: https://github.com/pytorch/pytorch/pull/35771 Differential Revision: D21319650 Pulled By: anjali411 fbshipit-source-id: 07c72996ee598eba0cf401ad61534494d6f5b5b3	2020-05-01 11:47:22 -07:00
HC Zhu	ea97fa1f2a	[PyTorch][Dist] Trigger pre/post hooks of output function nodes under distributed autograd (#34501 ) Summary: # Goals Do the following things during a distributed backward pass. 1. Accumulate the gradient of a variable to RPC context once the gradient is ready instead of at the very end of the backward pass. 2. Run post/pre hooks installed in`AccumulateGrad` nodes once the gradient is ready for the variable. Currently, the hooks in `AccumulateGrad` are not executed just because the function `AccumulateGrad` itself is not even evaluated by the local engine. 3. Make it extensible to support post hooks installed by DDP's reducer. # Introduce GradCapturePreHook ## Why do we need this? ### Root issue: * dist engine uses the autograd.grad-like API on the vanilla engine and then in the Future callback populates the context with the gradients. This is a bad emulation of the .backward() call on the vanilla engine. ### Practical issue: * The leaf’s hook are not called (because associated with the AccumulateGrad that is not call in the autograd.grad-like API). Modules like DDP rely on these hooks. * The Future is marked as completed before the context is actually populated with the grads leading to unexpected behavior on the user side. * The Future callback is only called at the complete end of the backward and so too late for DDP if they want to overlap compute/transfert. ### Proposed solution: * Provide hooks in the autograd.grad-like API that will allow the distributed engine to populate the context and call the hooks to better emulate the .backward call. ## Who can install a grad capture pre-hook? This will be an internal hook at C++ level and it won’t be exposed to PyThon code. Only call-sites directly interacting with the local engine can install such hooks. ## Signature The returned `grad` will be captured. ``` virtual const torch::Tensor& grad operator()(const torch::Tensor& grads) = 0; ``` ## Where are hooks installed? Grad capture pre-hooks are install in GraphTask::ExecInfo::Capture. ExecInfo is per node. Every backward run will have its own GraphTask instance. ## When/How will hooks be called? When the local engine captures the grads for a node, all grad capture pre hooks are called one by one in the order they are added. The output grads of the hooks will replace the original grads. The output of the last hook will be used for grad capturing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/34501 Test Plan: All existing tests should pass. ``` python setup.py develop python test/distributed/rpc/test_dist_autograd_spawn.py DistAutogradTestWithSpawn.test_post_hooks ``` Differential Revision: D20953673 Pulled By: hczhu fbshipit-source-id: 543b3844823330ea9f9856bab7c5cb2679290a53	2020-04-21 13:23:18 -07:00
Jithun Nair	dc1f9eee53	Avoid printing erroneous warning about "MIOpen not found" for ROCm builds (#33837 ) Summary: Older versions of MIOpen (<=2.2) don't have the `miopenGetVersion` api, but MIOpen is always a part of the ROCm builds, so do NOT set `lib` to None for ROCm builds. `__cudnn_version` will be `None` for older versions of MIOpen. Setting `lib` to `None` ends up printing the following erroneous warning when running unit tests: ``` /root/.local/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py:120: UserWarning: cuDNN/MIOpen library not found. Check your LD_LIBRARY_PATH }.get(sys.platform, 'LD_LIBRARY_PATH'))) ``` Eg.: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-test2/18387/consoleFull Pull Request resolved: https://github.com/pytorch/pytorch/pull/33837 Differential Revision: D20369285 Pulled By: xw285cornell fbshipit-source-id: e82e6f8f5bccb486213cf868f40aece41ce11f98	2020-04-17 20:31:01 -07:00
Nikita Shulga	2458f6c63e	Move all nccl from torch_python to torch_cuda (#36193 ) Summary: Because `torch_python` is supposed to be thin wrapper around `torch` In this PR, all invocation of functions from nccl library are moved from python_nccl.cpp (which is part of torch_python) to nccl.cpp (which is part of torch_cuda) Pull Request resolved: https://github.com/pytorch/pytorch/pull/36193 Test Plan: CI Differential Revision: D20930047 Pulled By: malfet fbshipit-source-id: 7f278610077df6ac5dc3471c1a1b5d51e653ef9c	2020-04-08 18:01:47 -07:00
Pavel Belevich	3328a2f903	Rename CPUGenerator to CPUGeneratorImpl and CUDAGenerator to CUDAGeneratorImpl (#36026 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36026 Differential Revision: D20856458 Pulled By: pbelevich fbshipit-source-id: 6d105593dca67640d508a4aebf7edf028d52af32	2020-04-07 08:05:23 -07:00
Peter Bell	5fc5cf6571	Stop using ctypes to interface with CUDA libraries. (#33678 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/33016, Continuation of https://github.com/pytorch/pytorch/issues/31160 Pull Request resolved: https://github.com/pytorch/pytorch/pull/33678 Differential Revision: D20249187 Pulled By: ezyang fbshipit-source-id: 172ce4a0fee7fbe01436a421d1af22ef6173b6ed	2020-03-11 07:22:46 -07:00
Emilio Castillo	31cc311143	Expose `CUDACachingAllocator` `raw_alloc` and `raw_delete` to python (#33860 ) Summary: This PR aims to improve the interoperability with [CuPy](https://github.com/cupy/cupy/pulls). Instead of having two separate and conflicting memory pools. With this PR, CuPy can directly alloc memory from the PyTorch allocator by means of this proposal https://github.com/cupy/cupy/pull/3126 We would like to gather feedback to know if this approach makes sense for PyTorch, or other alternative designs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/33860 Differential Revision: D20212788 Pulled By: ngimel fbshipit-source-id: bc1e08a66da1992d26021147bf645dc65239581c	2020-03-03 17:50:11 -08:00
Edward Yang	1111a6b810	Use pybind11::gil_scoped_* functions instead of AutoGIL/AutoNoGIL (#30274 ) Summary: Reland of https://github.com/pytorch/pytorch/pull/29095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/30274 Differential Revision: D18762293 Pulled By: ezyang fbshipit-source-id: d3d50c2dd12bcb678ab25fa708eb6587cc4b66f9	2019-12-02 12:19:58 -08:00
Mike Ruberry	eff4c4d7c1	Revert D18301806: Use pybind11::gil_scoped_* functions instead of AutoGIL/AutoNoGIL Test Plan: revert-hammer Differential Revision: D18301806 Original commit changeset: 03da6a26c41e fbshipit-source-id: c1324ee8d154e7e16f5dd4f1cf3625aaa566cd39	2019-11-21 14:50:07 -08:00
Alan Du	f4b9690f2d	Use pybind11::gil_scoped_* functions instead of AutoGIL/AutoNoGIL (#29095 ) Summary: Given that pybind11 implements these gil functions, I don't think it makes sense for Pytorch to have its own bespoke versions. Fixes https://github.com/pytorch/pytorch/issues/29065 Pull Request resolved: https://github.com/pytorch/pytorch/pull/29095 Differential Revision: D18301806 Pulled By: ezyang fbshipit-source-id: 03da6a26c41ee65aaadf7b67b9f0b14d2def2a5a	2019-11-21 13:44:40 -08:00

1 2 3 4

194 Commits