These are created by passing cudaEventRecordExternal and
cudaEventWaitExternal to cudaEventRecordWithFlags() and
cudaStreamWaitEvent(), respectively.
We do this by allowing the user to specify external=True when
constructing a torch.cuda.Event().
If external=False, the cudaEventRecord and cudaStreamWaitEvent APIs
have a different meaning, described here:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events
In short, with external=False they are used to express fork and join
operations in the graph.
External events can be used for expressing a fine-grained dependency
on the outcome of some nodes in a cuda graph (rather than all
nodes). They can also be used for timing parts of a cuda graph's
execution, rather than timing the entire graph's execution.
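As a sketch, timing only part of a captured graph might look like the following. This assumes a CUDA device and a torch build with the external= kwarg from this PR; warmup and memory-pool details are elided.

```python
import torch

def time_graph_region():
    """Sketch: time one region of a CUDA graph using external events.

    Assumes torch.cuda.Event accepts external= (added in this PR) and
    that a CUDA device is present; returns None otherwise.
    """
    if not torch.cuda.is_available():
        return None
    start = torch.cuda.Event(enable_timing=True, external=True)
    end = torch.cuda.Event(enable_timing=True, external=True)
    x = torch.ones(1 << 20, device="cuda")
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        start.record()   # captured as cudaEventRecordExternal
        y = x * 2.0      # only this region is timed
        end.record()
        y = y + 1.0      # runs in the graph but is not timed
    g.replay()
    torch.cuda.synchronize()
    return start.elapsed_time(end)  # milliseconds for the timed region
```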
Finishes #146145
I haven't figured out how to use ghstack yet, so this lands as a single PR. The first commit is a bug fix for _CudaKernel, which previously always launched work on the NULL stream rather than on the user-passed stream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155372
Approved by: https://github.com/ngimel
Followup work on top https://github.com/pytorch/pytorch/pull/149480
A wrapper on top of NVRTC, inspired by https://gist.github.com/malfet/2c9a25976dd7396430c38af603f791da from @malfet.
Compiling toy kernels with this setup takes 0.01s vs 90s using `load_inline()` on my local H100. This was primarily motivated by the timeouts I was seeing on the popcorn leaderboard, but it would also be useful to integrate into KernelBench.
This PR is in the same spirit as https://github.com/pytorch/pytorch/pull/148972 which was a similar UX for Metal
For now we are planning to land this as a private function because we expect to iterate on both the user-facing API and the internal implementation. We will open a separate issue to discuss the path towards making this work public and to give a broader overview of the state of custom CUDA kernel authoring in PyTorch.
Future work, as a prerequisite to making this public:
* divup primitive
* support multiple kernels
* Expose _get_nvrtc_version from native code
* interop with torch.compile
* AMD support
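The `divup` primitive mentioned above is just ceiling division, typically used to compute a grid size from a problem size and block size; a minimal sketch:

```python
def divup(a: int, b: int) -> int:
    """Ceiling division: smallest integer >= a / b, for positive ints."""
    return (a + b - 1) // b

# e.g. launching one thread per element with 256-thread blocks:
# grid = divup(n, 256)
```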
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151484
Approved by: https://github.com/malfet
Fixes #112589
Fixed errors relating to pydocstyle in the following files. The remaining errors relate to docstrings at the module level and on methods within each module (see details below).
`pydocstyle torch/cuda/_utils.py --count`
before: 3
after: 0
`pydocstyle torch/cuda/jiterator.py --count`
before: 3
after: 1
**remaining errors:**
```
torch/cuda/jiterator.py:1 at module level:
D100: Missing docstring in public module
```
`pydocstyle torch/cuda/graphs.py --count`
before: 25
after: 7
**remaining errors:**
```
torch/cuda/graphs.py:1 at module level:
D100: Missing docstring in public module
torch/cuda/graphs.py:54 in public method `__new__`:
D102: Missing docstring in public method
torch/cuda/graphs.py:108 in public method `debug_dump`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/graphs.py:108 in public method `debug_dump`:
D400: First line should end with a period (not ':')
torch/cuda/graphs.py:150 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/graphs.py:172 in public method `__enter__`:
D105: Missing docstring in magic method
torch/cuda/graphs.py:186 in public method `__exit__`:
D105: Missing docstring in magic method
```
`pydocstyle torch/cuda/_sanitizer.py --count`
before: 35
after: 31
**remaining errors:**
```
torch/cuda/_sanitizer.py:43 in public class `AccessType`:
D101: Missing docstring in public class
torch/cuda/_sanitizer.py:47 in public method `__str__`:
D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:84 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:96 in public method `__str__`:
D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:139 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:142 in public method `__str__`:
D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:218 in public class `StreamSynchronizations`:
D101: Missing docstring in public class
torch/cuda/_sanitizer.py:219 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:256 in public method `create_stream`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:268 in public method `create_event`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:272 in public method `delete_event`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:276 in public method `update_seq_num`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:280 in public method `record_state`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:291 in public method `stream_wait_for_event`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:298 in public method `all_streams_wait_for_event`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:307 in public method `all_streams_wait_for_stream`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:316 in public method `sync_all_streams`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:323 in public method `is_ordered_after`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:339 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:460 in public function `zip_by_key`:
D103: Missing docstring in public function
torch/cuda/_sanitizer.py:466 in public function `zip_arguments`:
D103: Missing docstring in public function
torch/cuda/_sanitizer.py:478 in public class `ArgumentHandler`:
D101: Missing docstring in public class
torch/cuda/_sanitizer.py:479 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:505 in public method `parse_inputs`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:520 in public method `parse_outputs`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:527 in public class `CUDASanitizerDispatchMode`:
D101: Missing docstring in public class
torch/cuda/_sanitizer.py:528 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:562 in public method `__torch_dispatch__`:
D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:597 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:601 in public method `enable`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:605 in public method `__del__`:
D105: Missing docstring in magic method
```
`pydocstyle torch/storage.py --count`
before: 90
after: 37
**remaining errors:**
```
torch/storage.py:1 at module level:
D100: Missing docstring in public module
torch/storage.py:310 in public class `UntypedStorage`:
D101: Missing docstring in public class
torch/storage.py:311 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/storage.py:317 in public method `is_cuda`:
D102: Missing docstring in public method
torch/storage.py:321 in public method `is_hpu`:
D102: Missing docstring in public method
torch/storage.py:325 in public method `share_memory_`:
D102: Missing docstring in public method
torch/storage.py:444 in public class `TypedStorage`:
D101: Missing docstring in public class
torch/storage.py:453 in public method `fill_`:
D102: Missing docstring in public method
torch/storage.py:458 in public method `__new__`:
D102: Missing docstring in public method
torch/storage.py:530 in public method `__init__`:
D107: Missing docstring in __init__
torch/storage.py:599 in public method `is_cuda`:
D102: Missing docstring in public method
torch/storage.py:604 in public method `is_hpu`:
D102: Missing docstring in public method
torch/storage.py:624 in public method `__len__`:
D105: Missing docstring in magic method
torch/storage.py:653 in public method `__setitem__`:
D105: Missing docstring in magic method
torch/storage.py:681 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/storage.py:715 in public method `copy_`:
D102: Missing docstring in public method
torch/storage.py:723 in public method `nbytes`:
D102: Missing docstring in public method
torch/storage.py:731 in public method `type`:
D102: Missing docstring in public method
torch/storage.py:744 in public method `cuda`:
D102: Missing docstring in public method
torch/storage.py:751 in public method `hpu`:
D102: Missing docstring in public method
torch/storage.py:758 in public method `element_size`:
D102: Missing docstring in public method
torch/storage.py:766 in public method `get_device`:
D102: Missing docstring in public method
torch/storage.py:770 in public method `__str__`:
D105: Missing docstring in magic method
torch/storage.py:781 in public method `__repr__`:
D105: Missing docstring in magic method
torch/storage.py:785 in public method `__iter__`:
D105: Missing docstring in magic method
torch/storage.py:789 in public method `__copy__`:
D105: Missing docstring in magic method
torch/storage.py:793 in public method `__deepcopy__`:
D105: Missing docstring in magic method
torch/storage.py:801 in public method `__sizeof__`:
D105: Missing docstring in magic method
torch/storage.py:877 in public method `device`:
D102: Missing docstring in public method
torch/storage.py:881 in public method `size`:
D102: Missing docstring in public method
torch/storage.py:891 in public method `pickle_storage_type`:
D102: Missing docstring in public method
torch/storage.py:902 in public method `__reduce__`:
D105: Missing docstring in magic method
torch/storage.py:907 in public method `data_ptr`:
D102: Missing docstring in public method
torch/storage.py:915 in public method `resize_`:
D102: Missing docstring in public method
torch/storage.py:931 in public method `from_buffer`:
D102: Missing docstring in public method
torch/storage.py:1032 in public method `from_file`:
D402: First line should not be the function's "signature"
torch/storage.py:1075 in public method `is_shared`:
D102: Missing docstring in public method
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113227
Approved by: https://github.com/kit1980
This adds `torch.cuda._DeviceGuard` which is a stripped down version of
`torch.cuda.device` with lower overhead. To do this, it only accepts `int` as
the device so we don't need to call `_get_device_index` and is implemented
with a new C++ helper `torch._C._cuda_exchangeDevice` that allows
`_DeviceGuard.__enter__` to be just a single function call. On my machine,
I see a drop from 3.8us of overhead to 0.94 us with this simple benchmark:
```python
def set_device():
    with torch.cuda.device(0):
        pass

%timeit set_device()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91045
Approved by: https://github.com/ngimel, https://github.com/anijain2305
Avoids
```
$ python foo.py
Traceback (most recent call last):
File "foo.py", line 3, in <module>
a = torch.cuda.Stream()
File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/streams.py", line 34, in __new__
return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
TypeError: object.__new__() takes exactly one argument (the type to instantiate)
```
And now gets
```
$ python foo.py
Traceback (most recent call last):
File "foo.py", line 3, in <module>
a = torch.cuda.Stream()
File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/streams.py", line 34, in __new__
return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/_utils.py", line 44, in err_fn
raise RuntimeError(
RuntimeError: Tried to instantiate dummy base class Stream
```
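The fix relies on the dummy-base-class pattern: when CUDA is unavailable, the real C++ base classes are replaced with dummies whose `__new__` raises a clear error instead of `object.__new__`'s confusing TypeError. A toy version (the factory name is illustrative, modeled on `_dummy_type`):

```python
def make_dummy_type(name: str) -> type:
    """Build a placeholder class that refuses to be instantiated."""
    def err_new(cls, *args, **kwargs):
        raise RuntimeError(f"Tried to instantiate dummy base class {name}")
    return type(name, (object,), {"__new__": err_new})

# Stand-in for the missing C++ base class when CUDA is absent.
_CudaStreamBase = make_dummy_type("Stream")

class Stream(_CudaStreamBase):
    def __new__(cls, device=None, priority=0, **kwargs):
        # Reaches err_new and raises the descriptive RuntimeError above.
        return super().__new__(cls, priority=priority, **kwargs)
```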
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89592
Approved by: https://github.com/soumith
Summary:
Decouple DataParallel/DistributedDataParallel from CUDA to support more device types.
- Move torch/cuda/comm.py to torch/nn/parallel/comm.py with minor changes for common device support. torch.cuda.comm is kept as-is for backward compatibility.
- Provide common APIs for arbitrary device types without changing existing CUDA APIs in the torch.cuda namespace.
- Replace the torch.cuda calls in DataParallel/DistributedDataParallel with the new APIs.
Related RFC: [https://github.com/pytorch/pytorch/issues/36160](https://github.com/pytorch/pytorch/issues/36160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38454
Differential Revision: D22051557
Pulled By: mrshenli
fbshipit-source-id: 7842dad0e5d3ca0f6fb760bda49182dcf6653af8
Summary:
I.e. do not accept `bytes` as possible type of `device` argument in
`torch.cuda._get_device_index`
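For context, a small sketch of what the helper normalizes; the wrapper name below is illustrative and the examples show the accepted spellings after this change.

```python
import torch
from torch.cuda._utils import _get_device_index

def to_index(device):
    """Normalize accepted device spellings to a plain int index.

    int, str, and torch.device remain valid inputs; bytes such as
    b"cuda:0" are rejected after this change.
    """
    return _get_device_index(device)
```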
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40322
Differential Revision: D22176885
Pulled By: malfet
fbshipit-source-id: 2f3a46174161f1cdcf6a6ad94a31e54b18ad6186
Summary:
- Use it from both `__init__` and `streams` to define dummy types when CUDA is missing
- Fix an accidental reference to the global `storage_name` from `_dummy_type`
- Add type annotations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40177
Differential Revision: D22106922
Pulled By: malfet
fbshipit-source-id: 52fbfd91d70a78eb14d7ffda109c02ad1231497e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39483
I fixed all of the new errors that occurred because of the upgrade.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21884575
Pulled By: ezyang
fbshipit-source-id: 45c8e1f1ecb410c8d7c46dd3922ad70e982a0685