These are created by passing cudaEventRecordExternal and
cudaEventWaitExternal to cudaEventRecordWithFlags() and
cudaStreamWaitEvent(), respectively.
We do this by allowing the user to specify external=True when
constructing a torch.cuda.Event().
If external=False, the cudaEventRecord and cudaStreamWaitEvent APIs
have a different meaning, described here:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events
In short, with external=False they are used to express fork and join
operations in the graph.
External events can be used for expressing a fine-grained dependency
on the outcome of some nodes in a cuda graph (rather than all
nodes). They can also be used for timing parts of a cuda graph's
execution, rather than timing the entire graph's execution.
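As a sketch, timing only part of a captured graph might look like the following. This assumes a CUDA device and a torch build with the external= kwarg from this PR; warmup and memory-pool details are elided.

```python
import torch

def time_graph_region():
    """Sketch: time one region of a CUDA graph using external events.

    Assumes torch.cuda.Event accepts external= (added in this PR) and
    that a CUDA device is present; returns None otherwise.
    """
    if not torch.cuda.is_available():
        return None
    start = torch.cuda.Event(enable_timing=True, external=True)
    end = torch.cuda.Event(enable_timing=True, external=True)
    x = torch.ones(1 << 20, device="cuda")
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        start.record()   # captured as cudaEventRecordExternal
        y = x * 2.0      # only this region is timed
        end.record()
        y = y + 1.0      # runs in the graph but is not timed
    g.replay()
    torch.cuda.synchronize()
    return start.elapsed_time(end)  # milliseconds for the timed region
```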
Finishes #146145
I haven't figured out how to use ghstack yet, so this lands as a single PR. The first commit is a bug fix for _CudaKernel, which previously always launched work on the NULL stream rather than on the user-passed stream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155372
Approved by: https://github.com/ngimel
Followup work on top https://github.com/pytorch/pytorch/pull/149480
A wrapper on top of NVRTC, inspired by https://gist.github.com/malfet/2c9a25976dd7396430c38af603f791da from @malfet.
Compiling toy kernels with this setup takes 0.01s vs 90s using `load_inline()` on my local H100. This was primarily motivated by the timeouts I was seeing on the popcorn leaderboard, but it would also be useful to integrate into KernelBench.
This PR is in the same spirit as https://github.com/pytorch/pytorch/pull/148972 which was a similar UX for Metal
For now we are planning to land this as a private function because we expect to iterate on both the user-facing API and the internal implementation. We will open a separate issue to discuss the path towards making this work public and to give a broader overview of the state of custom CUDA kernel authoring in PyTorch.
Future work, as a prerequisite to making this public:
* divup primitive
* support multiple kernels
* Expose _get_nvrtc_version from native code
* interop with torch.compile
* AMD support
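The `divup` primitive mentioned above is just ceiling division, typically used to compute a grid size from a problem size and block size; a minimal sketch:

```python
def divup(a: int, b: int) -> int:
    """Ceiling division: smallest integer >= a / b, for positive ints."""
    return (a + b - 1) // b

# e.g. launching one thread per element with 256-thread blocks:
# grid = divup(n, 256)
```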
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151484
Approved by: https://github.com/malfet
Fixes #112589
Fixed errors relating to pydocstyle in the following files. The remaining errors relate to docstrings at the module level and on methods within each module (see details below).
`pydocstyle torch/cuda/_utils.py --count`
before: 3
after: 0
`pydocstyle torch/cuda/jiterator.py --count`
before: 3
after: 1
**remaining errors:**
```
torch/cuda/jiterator.py:1 at module level:
D100: Missing docstring in public module
```
`pydocstyle torch/cuda/graphs.py --count`
before: 25
after: 7
**remaining errors:**
```
torch/cuda/graphs.py:1 at module level:
D100: Missing docstring in public module
torch/cuda/graphs.py:54 in public method `__new__`:
D102: Missing docstring in public method
torch/cuda/graphs.py:108 in public method `debug_dump`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/graphs.py:108 in public method `debug_dump`:
D400: First line should end with a period (not ':')
torch/cuda/graphs.py:150 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/graphs.py:172 in public method `__enter__`:
D105: Missing docstring in magic method
torch/cuda/graphs.py:186 in public method `__exit__`:
D105: Missing docstring in magic method
```
`pydocstyle torch/cuda/_sanitizer.py --count`
before: 35
after: 31
**remaining errors:**
```
torch/cuda/_sanitizer.py:43 in public class `AccessType`:
D101: Missing docstring in public class
torch/cuda/_sanitizer.py:47 in public method `__str__`:
D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:84 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:96 in public method `__str__`:
D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:139 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:142 in public method `__str__`:
D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:218 in public class `StreamSynchronizations`:
D101: Missing docstring in public class
torch/cuda/_sanitizer.py:219 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:256 in public method `create_stream`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:268 in public method `create_event`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:272 in public method `delete_event`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:276 in public method `update_seq_num`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:280 in public method `record_state`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:291 in public method `stream_wait_for_event`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:298 in public method `all_streams_wait_for_event`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:307 in public method `all_streams_wait_for_stream`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:316 in public method `sync_all_streams`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:323 in public method `is_ordered_after`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:339 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:460 in public function `zip_by_key`:
D103: Missing docstring in public function
torch/cuda/_sanitizer.py:466 in public function `zip_arguments`:
D103: Missing docstring in public function
torch/cuda/_sanitizer.py:478 in public class `ArgumentHandler`:
D101: Missing docstring in public class
torch/cuda/_sanitizer.py:479 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:505 in public method `parse_inputs`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:520 in public method `parse_outputs`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:527 in public class `CUDASanitizerDispatchMode`:
D101: Missing docstring in public class
torch/cuda/_sanitizer.py:528 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:562 in public method `__torch_dispatch__`:
D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:597 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:601 in public method `enable`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:605 in public method `__del__`:
D105: Missing docstring in magic method
```
`pydocstyle torch/storage.py --count`
before: 90
after: 37
**remaining errors:**
```
torch/storage.py:1 at module level:
D100: Missing docstring in public module
torch/storage.py:310 in public class `UntypedStorage`:
D101: Missing docstring in public class
torch/storage.py:311 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/storage.py:317 in public method `is_cuda`:
D102: Missing docstring in public method
torch/storage.py:321 in public method `is_hpu`:
D102: Missing docstring in public method
torch/storage.py:325 in public method `share_memory_`:
D102: Missing docstring in public method
torch/storage.py:444 in public class `TypedStorage`:
D101: Missing docstring in public class
torch/storage.py:453 in public method `fill_`:
D102: Missing docstring in public method
torch/storage.py:458 in public method `__new__`:
D102: Missing docstring in public method
torch/storage.py:530 in public method `__init__`:
D107: Missing docstring in __init__
torch/storage.py:599 in public method `is_cuda`:
D102: Missing docstring in public method
torch/storage.py:604 in public method `is_hpu`:
D102: Missing docstring in public method
torch/storage.py:624 in public method `__len__`:
D105: Missing docstring in magic method
torch/storage.py:653 in public method `__setitem__`:
D105: Missing docstring in magic method
torch/storage.py:681 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/storage.py:715 in public method `copy_`:
D102: Missing docstring in public method
torch/storage.py:723 in public method `nbytes`:
D102: Missing docstring in public method
torch/storage.py:731 in public method `type`:
D102: Missing docstring in public method
torch/storage.py:744 in public method `cuda`:
D102: Missing docstring in public method
torch/storage.py:751 in public method `hpu`:
D102: Missing docstring in public method
torch/storage.py:758 in public method `element_size`:
D102: Missing docstring in public method
torch/storage.py:766 in public method `get_device`:
D102: Missing docstring in public method
torch/storage.py:770 in public method `__str__`:
D105: Missing docstring in magic method
torch/storage.py:781 in public method `__repr__`:
D105: Missing docstring in magic method
torch/storage.py:785 in public method `__iter__`:
D105: Missing docstring in magic method
torch/storage.py:789 in public method `__copy__`:
D105: Missing docstring in magic method
torch/storage.py:793 in public method `__deepcopy__`:
D105: Missing docstring in magic method
torch/storage.py:801 in public method `__sizeof__`:
D105: Missing docstring in magic method
torch/storage.py:877 in public method `device`:
D102: Missing docstring in public method
torch/storage.py:881 in public method `size`:
D102: Missing docstring in public method
torch/storage.py:891 in public method `pickle_storage_type`:
D102: Missing docstring in public method
torch/storage.py:902 in public method `__reduce__`:
D105: Missing docstring in magic method
torch/storage.py:907 in public method `data_ptr`:
D102: Missing docstring in public method
torch/storage.py:915 in public method `resize_`:
D102: Missing docstring in public method
torch/storage.py:931 in public method `from_buffer`:
D102: Missing docstring in public method
torch/storage.py:1032 in public method `from_file`:
D402: First line should not be the function's "signature"
torch/storage.py:1075 in public method `is_shared`:
D102: Missing docstring in public method
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113227
Approved by: https://github.com/kit1980
This adds `torch.cuda._DeviceGuard` which is a stripped down version of
`torch.cuda.device` with lower overhead. To do this, it only accepts `int` as
the device so we don't need to call `_get_device_index` and is implemented
with a new C++ helper `torch._C._cuda_exchangeDevice` that allows
`_DeviceGuard.__enter__` to be just a single function call. On my machine,
I see a drop from 3.8us of overhead to 0.94 us with this simple benchmark:
```python
def set_device():
    with torch.cuda.device(0):
        pass

%timeit set_device()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91045
Approved by: https://github.com/ngimel, https://github.com/anijain2305
Avoids
```
$ python foo.py
Traceback (most recent call last):
File "foo.py", line 3, in <module>
a = torch.cuda.Stream()
File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/streams.py", line 34, in __new__
return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
TypeError: object.__new__() takes exactly one argument (the type to instantiate)
```
And now gets
```
$ python foo.py
Traceback (most recent call last):
File "foo.py", line 3, in <module>
a = torch.cuda.Stream()
File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/streams.py", line 34, in __new__
return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/_utils.py", line 44, in err_fn
raise RuntimeError(
RuntimeError: Tried to instantiate dummy base class Stream
```
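The fix relies on the dummy-base-class pattern: when CUDA is unavailable, the real C++ base classes are replaced with dummies whose `__new__` raises a clear error instead of `object.__new__`'s confusing TypeError. A toy version (the factory name is illustrative, modeled on `_dummy_type`):

```python
def make_dummy_type(name: str) -> type:
    """Build a placeholder class that refuses to be instantiated."""
    def err_new(cls, *args, **kwargs):
        raise RuntimeError(f"Tried to instantiate dummy base class {name}")
    return type(name, (object,), {"__new__": err_new})

# Stand-in for the missing C++ base class when CUDA is absent.
_CudaStreamBase = make_dummy_type("Stream")

class Stream(_CudaStreamBase):
    def __new__(cls, device=None, priority=0, **kwargs):
        # Reaches err_new and raises the descriptive RuntimeError above.
        return super().__new__(cls, priority=priority, **kwargs)
```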
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89592
Approved by: https://github.com/soumith
Summary:
Decouple DataParallel/DistributedDataParallel from CUDA to support more device types.
- Move torch/cuda/comm.py to torch/nn/parallel/comm.py with minor changes for common device support. torch.cuda.comm is kept as-is for backward compatibility.
- Provide common APIs for arbitrary device types without changing existing CUDA APIs in the torch.cuda namespace.
- Replace the torch.cuda calls in DataParallel/DistributedDataParallel with the new APIs.
Related RFC: [https://github.com/pytorch/pytorch/issues/36160](https://github.com/pytorch/pytorch/issues/36160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38454
Differential Revision: D22051557
Pulled By: mrshenli
fbshipit-source-id: 7842dad0e5d3ca0f6fb760bda49182dcf6653af8
Summary:
I.e. do not accept `bytes` as possible type of `device` argument in
`torch.cuda._get_device_index`
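For context, a small sketch of what the helper normalizes; the wrapper name below is illustrative and the examples show the accepted spellings after this change.

```python
import torch
from torch.cuda._utils import _get_device_index

def to_index(device):
    """Normalize accepted device spellings to a plain int index.

    int, str, and torch.device remain valid inputs; bytes such as
    b"cuda:0" are rejected after this change.
    """
    return _get_device_index(device)
```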
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40322
Differential Revision: D22176885
Pulled By: malfet
fbshipit-source-id: 2f3a46174161f1cdcf6a6ad94a31e54b18ad6186
Summary:
- Use it from both `__init__` and `streams` to define dummy types when CUDA is missing
- Fix an accidental reference to the global `storage_name` from `_dummy_type`
- Add type annotations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40177
Differential Revision: D22106922
Pulled By: malfet
fbshipit-source-id: 52fbfd91d70a78eb14d7ffda109c02ad1231497e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39483
I fixed all of the new errors that occurred because of the upgrade.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21884575
Pulled By: ezyang
fbshipit-source-id: 45c8e1f1ecb410c8d7c46dd3922ad70e982a0685