Commit Graph

23 Commits

4e8dd11be1 simplify nvrtc discovery logic in compile_kernel (#156674)
Followup from https://github.com/pytorch/pytorch/pull/156332

Tested a bunch while I was working on https://github.com/pytorch/pytorch/pull/156380

Works just fine on dev GPUs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156674
Approved by: https://github.com/malfet
2025-06-24 08:55:40 +00:00
4c0aa37dda Support stream capture of event record and wait nodes in cuda graphs (#155372)
These are created by the user passing cudaEventRecordExternal and
cudaEventWaitExternal to cudaEventRecordWithFlags() and
cudaStreamWaitEvent() respectively.

We do this by allowing the user to specify external=True when
constructing a torch.cuda.Event().

If external=False, the cudaEventRecord and cudaStreamWaitEvent APIs
have a different meaning, described here:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events

In short, they will be used to express fork and join operations in
the graph if external=False.

External events can be used for expressing a fine-grained dependency
on the outcome of some nodes in a cuda graph (rather than all
nodes). They can also be used for timing parts of a cuda graph's
execution, rather than timing the entire graph's execution.
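
For illustration, a minimal timing sketch, assuming that external=True composes with enable_timing=True the way this description suggests:

```python
import torch

# Sketch only: relies on the external=True keyword on torch.cuda.Event() added here,
# and assumes it can be combined with enable_timing=True to time a sub-region.
start = torch.cuda.Event(enable_timing=True, external=True)
end = torch.cuda.Event(enable_timing=True, external=True)

g = torch.cuda.CUDAGraph()
x = torch.zeros(1 << 20, device="cuda")
with torch.cuda.graph(g):
    start.record()   # captured as an explicit cudaEventRecordExternal node
    y = x * 2        # the sub-region we want to time
    end.record()
    y += 1           # later work, outside the timed region

g.replay()
torch.cuda.synchronize()
print(f"timed region: {start.elapsed_time(end):.3f} ms")
```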

Finishes #146145

I'm a dummy and don't know how to use ghstack at this time. The first commit is a bug fix for _CudaKernel, which would previously always launch work on the NULL stream, rather than the user-passed stream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155372
Approved by: https://github.com/ngimel
2025-06-17 21:44:51 +00:00
5b368fa0b7 Add torch.cuda._compile_kernel() (#151484)
Followup work on top of https://github.com/pytorch/pytorch/pull/149480

Wrapper on top of nvrtc inspired by https://gist.github.com/malfet/2c9a25976dd7396430c38af603f791da from @malfet

Compiling toy kernels with this setup takes 0.01s vs 90s using `load_inline()` on my local H100. This was primarily motivated by the timeouts I was seeing in the popcorn leaderboard, but it would also be useful to integrate into KernelBench.
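
For context, a rough usage sketch; the exact private signature and launch convention shown below are assumptions, not the landed API:

```python
import torch

# Usage sketch only; names and keyword arguments are assumptions.
kernel_src = r"""
extern "C" __global__ void add_one(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}
"""

add_one = torch.cuda._compile_kernel(kernel_src, "add_one")  # nvrtc compile, no nvcc round trip
x = torch.zeros(1024, device="cuda")
add_one(grid=(4, 1, 1), block=(256, 1, 1), args=[x, x.numel()])
assert torch.all(x == 1)
```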

This PR is in the same spirit as https://github.com/pytorch/pytorch/pull/148972 which was a similar UX for Metal

For now we are planning on landing this as a private function because we expect to iterate on both the user-facing API and the internal implementation. We will open a separate issue to discuss the path towards making this work public and to give a broader overview of the state of custom CUDA kernel authoring in PyTorch.

Future work, as a prerequisite to making this public:
* divup primitive
* support multiple kernels
* Expose _get_nvrtc_version from native code
* interop with torch.compile
* AMD support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151484
Approved by: https://github.com/malfet
2025-04-24 07:14:31 +00:00
46e3f670b4 refactor code to share across different devices (#120602)
# Motivation
Refactor utils code to make it possible to share across CUDA, XPU, and other backends.

# Solution
Move `_dummy_type` and `_LazySeedTracker` to `torch._utils`.

# Additional Context
When upstreaming, these code changes are isolated into an additional PR to minimize their impact on the CUDA code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120602
Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/EikanWang
2024-02-28 09:42:58 +00:00
01478f1afa Fix pydocstyle errors listed in issue 112589 (#113227)
Fixes #112589

Fixed errors relating to pydocstyle in the following files. The remaining errors are related to docstrings at the module level and at methods within each module (see details below).
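
For reference, the shape of a typical D205/D400 fix in this pass (illustrative docstrings, not the actual diff):

```python
# Before: D205 (no blank line after the summary) and D400 (summary ends with ':' instead of '.')
def debug_dump(debug_path):
    """Dump the captured graph to debug_path for inspection:
    only meaningful if debug mode was enabled."""


# After: one-line summary ending in a period, a blank line, then the description
def debug_dump(debug_path):
    """Dump the captured graph to a file.

    Only meaningful if debug mode was enabled when the graph was captured.
    """
```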

pydocstyle torch/cuda/_utils.py --count
before: 3
after: 0

pydocstyle torch/cuda/jiterator.py --count
before: 3
after: 1

**remaining errors:**
```
torch/cuda/jiterator.py:1 at module level:
        D100: Missing docstring in public module
```

pydocstyle torch/cuda/graphs.py --count
before: 25
after: 7

**remaining errors:**
```
torch/cuda/graphs.py:1 at module level:
        D100: Missing docstring in public module
torch/cuda/graphs.py:54 in public method `__new__`:
        D102: Missing docstring in public method
torch/cuda/graphs.py:108 in public method `debug_dump`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/graphs.py:108 in public method `debug_dump`:
        D400: First line should end with a period (not ':')
torch/cuda/graphs.py:150 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/graphs.py:172 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/cuda/graphs.py:186 in public method `__exit__`:
        D105: Missing docstring in magic method
```

pydocstyle torch/cuda/_sanitizer.py --count
before: 35
after: 31

**remaining errors:**
```
torch/cuda/_sanitizer.py:43 in public class `AccessType`:
        D101: Missing docstring in public class
torch/cuda/_sanitizer.py:47 in public method `__str__`:
        D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:84 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:96 in public method `__str__`:
        D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:139 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:142 in public method `__str__`:
        D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:218 in public class `StreamSynchronizations`:
        D101: Missing docstring in public class
torch/cuda/_sanitizer.py:219 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:256 in public method `create_stream`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:268 in public method `create_event`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:272 in public method `delete_event`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:276 in public method `update_seq_num`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:280 in public method `record_state`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:291 in public method `stream_wait_for_event`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:298 in public method `all_streams_wait_for_event`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:307 in public method `all_streams_wait_for_stream`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:316 in public method `sync_all_streams`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:323 in public method `is_ordered_after`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:339 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:460 in public function `zip_by_key`:
        D103: Missing docstring in public function
torch/cuda/_sanitizer.py:466 in public function `zip_arguments`:
        D103: Missing docstring in public function
torch/cuda/_sanitizer.py:478 in public class `ArgumentHandler`:
        D101: Missing docstring in public class
torch/cuda/_sanitizer.py:479 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:505 in public method `parse_inputs`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:520 in public method `parse_outputs`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:527 in public class `CUDASanitizerDispatchMode`:
        D101: Missing docstring in public class
torch/cuda/_sanitizer.py:528 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:562 in public method `__torch_dispatch__`:
        D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:597 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:601 in public method `enable`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:605 in public method `__del__`:
        D105: Missing docstring in magic method
```

pydocstyle torch/storage.py --count
before: 90
after: 37

**remaining errors:**
```
torch/storage.py:1 at module level:
        D100: Missing docstring in public module
torch/storage.py:310 in public class `UntypedStorage`:
        D101: Missing docstring in public class
torch/storage.py:311 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/storage.py:317 in public method `is_cuda`:
        D102: Missing docstring in public method
torch/storage.py:321 in public method `is_hpu`:
        D102: Missing docstring in public method
torch/storage.py:325 in public method `share_memory_`:
        D102: Missing docstring in public method
torch/storage.py:444 in public class `TypedStorage`:
        D101: Missing docstring in public class
torch/storage.py:453 in public method `fill_`:
        D102: Missing docstring in public method
torch/storage.py:458 in public method `__new__`:
        D102: Missing docstring in public method
torch/storage.py:530 in public method `__init__`:
        D107: Missing docstring in __init__
torch/storage.py:599 in public method `is_cuda`:
        D102: Missing docstring in public method
torch/storage.py:604 in public method `is_hpu`:
        D102: Missing docstring in public method
torch/storage.py:624 in public method `__len__`:
        D105: Missing docstring in magic method
torch/storage.py:653 in public method `__setitem__`:
        D105: Missing docstring in magic method
torch/storage.py:681 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/storage.py:715 in public method `copy_`:
        D102: Missing docstring in public method
torch/storage.py:723 in public method `nbytes`:
        D102: Missing docstring in public method
torch/storage.py:731 in public method `type`:
        D102: Missing docstring in public method
torch/storage.py:744 in public method `cuda`:
        D102: Missing docstring in public method
torch/storage.py:751 in public method `hpu`:
        D102: Missing docstring in public method
torch/storage.py:758 in public method `element_size`:
        D102: Missing docstring in public method
torch/storage.py:766 in public method `get_device`:
        D102: Missing docstring in public method
torch/storage.py:770 in public method `__str__`:
        D105: Missing docstring in magic method
torch/storage.py:781 in public method `__repr__`:
        D105: Missing docstring in magic method
torch/storage.py:785 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/storage.py:789 in public method `__copy__`:
        D105: Missing docstring in magic method
torch/storage.py:793 in public method `__deepcopy__`:
        D105: Missing docstring in magic method
torch/storage.py:801 in public method `__sizeof__`:
        D105: Missing docstring in magic method
torch/storage.py:877 in public method `device`:
        D102: Missing docstring in public method
torch/storage.py:881 in public method `size`:
        D102: Missing docstring in public method
torch/storage.py:891 in public method `pickle_storage_type`:
        D102: Missing docstring in public method
torch/storage.py:902 in public method `__reduce__`:
        D105: Missing docstring in magic method
torch/storage.py:907 in public method `data_ptr`:
        D102: Missing docstring in public method
torch/storage.py:915 in public method `resize_`:
        D102: Missing docstring in public method
torch/storage.py:931 in public method `from_buffer`:
        D102: Missing docstring in public method
torch/storage.py:1032 in public method `from_file`:
        D402: First line should not be the function's "signature"
torch/storage.py:1075 in public method `is_shared`:
        D102: Missing docstring in public method

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113227
Approved by: https://github.com/kit1980
2023-11-13 22:05:45 +00:00
3bf922a6ce Apply UFMT to low traffic torch modules (#106249)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106249
Approved by: https://github.com/Skylion007
2023-07-29 23:37:30 +00:00
79c5e33349 [BE] Enable ruff's UP rules and autoformat nn/ mps/ and torch/ (#105436)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105436
Approved by: https://github.com/malfet, https://github.com/albanD
2023-07-21 07:38:46 +00:00
eece6da162 [inductor] Reduce device context manager overhead (#91045)
This adds `torch.cuda._DeviceGuard` which is a stripped down version of
`torch.cuda.device` with lower overhead. To do this, it only accepts `int` as
the device so we don't need to call `_get_device_index` and is implemented
with a new C++ helper `torch._C._cuda_exchangeDevice` that allows
`_DeviceGuard.__enter__` to be just a single function call. On my machine,
I see a drop from 3.8 us of overhead to 0.94 us with this simple benchmark:

```python
def set_device():
    with torch.cuda.device(0):
        pass

%timeit set_device()
```
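
Conceptually, the guard exchanges the device once on entry and restores it on exit. A sketch of the idea (not the landed implementation; apart from `torch._C._cuda_exchangeDevice`, the names here are placeholders):

```python
import torch

class _DeviceGuardSketch:
    def __init__(self, index: int):
        self.idx = index      # already an int, so no _get_device_index parsing needed
        self.prev_idx = -1

    def __enter__(self):
        # one call both switches to self.idx and returns the previous device
        self.prev_idx = torch._C._cuda_exchangeDevice(self.idx)

    def __exit__(self, *exc):
        if self.prev_idx >= 0 and self.prev_idx != self.idx:
            torch.cuda.set_device(self.prev_idx)
        return False
```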

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91045
Approved by: https://github.com/ngimel, https://github.com/anijain2305
2023-01-12 16:51:59 +00:00
8713119c89 Stream actually overrides __new__ so we need to patch it as well (#89592)
Avoids
```
$ python foo.py
Traceback (most recent call last):
  File "foo.py", line 3, in <module>
    a = torch.cuda.Stream()
  File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/streams.py", line 34, in __new__
    return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
TypeError: object.__new__() takes exactly one argument (the type to instantiate)
```
And now gets
```
$ python foo.py
Traceback (most recent call last):
  File "foo.py", line 3, in <module>
    a = torch.cuda.Stream()
  File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/streams.py", line 34, in __new__
    return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
  File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/_utils.py", line 44, in err_fn
    raise RuntimeError(
RuntimeError: Tried to instantiate dummy base class Stream

```
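
The underlying issue: the dummy classes generated when CUDA is unavailable patched only `__init__`, so classes like `Stream` that override `__new__` hit `object.__new__` first and fail with the unhelpful error above. A sketch of the pattern after the fix (approximating, not quoting, `torch/cuda/_utils.py`):

```python
def _dummy_type_sketch(name: str) -> type:
    def get_err_fn(is_init: bool):
        def err_fn(obj, *args, **kwargs):
            class_name = obj.__class__.__name__ if is_init else obj.__name__
            raise RuntimeError(f"Tried to instantiate dummy base class {class_name}")
        return err_fn

    # Patching __new__ as well as __init__ is the point of this commit:
    # Stream overrides __new__, so the dummy class must intercept it too.
    return type(
        name,
        (object,),
        {"__init__": get_err_fn(True), "__new__": get_err_fn(False)},
    )
```
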
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89592
Approved by: https://github.com/soumith
2022-11-29 21:43:23 +00:00
197f9f0826 Merge CUDA Streams and Events (#53902)
Summary:
-----------
- Updates the current_stream and default_stream APIs to take an `optional[device]` argument
- Adds parsing logic to replace `torch.cuda.Stream` and `torch.cuda.Event` -> `torch.classes.cuda.Stream` and `torch.classes.cuda.Event` for JIT
- Merges StreamContext manager for both Eager and JIT (eager usage sketched after this list).
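
For orientation, the eager-side surface this touches (usage sketch, not code from the PR):

```python
import torch

dev = torch.device("cuda:0")
s = torch.cuda.Stream(device=dev)

# current_stream / default_stream accept an optional device argument
assert torch.cuda.current_stream(dev) == torch.cuda.default_stream(dev)

# the shared StreamContext manager; after this PR the same context also works under torch.jit.script
with torch.cuda.stream(s):
    y = torch.ones(8, device=dev) * 2
s.synchronize()
```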

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53902

Test Plan:
------
Run JIT tests:
python test/test_jit.py -v TestCUDA

Run eager tests:
python test/test_cuda.py -v TestCuda

Reviewed By: glaringlee

Differential Revision: D27494627

Pulled By: nikithamalgifb

fbshipit-source-id: b30b0570e38a33fb335c83762eb06ffd46a44b5c
2021-04-05 08:19:55 -07:00
7fc03dd7c9 Back out "[pytorch][PR] Merge CUDA Streams and Events" (#54996)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54996

Original commit changeset: 45d9fee9a582

Test Plan: CI

Reviewed By: jspark1105

Differential Revision: D27444718

fbshipit-source-id: deb627230817923eaf84ade50ecb14bfbce4e779
2021-03-31 10:21:35 -07:00
416ba5c48f Merge CUDA Streams and Events (#53902)
Summary:
-----------
- Updates the current_stream and default_stream APIs to take an `optional[device]` argument
- Adds parsing logic to replace `torch.cuda.Stream` and `torch.cuda.Event` -> `torch.classes.cuda.Stream` and `torch.classes.cuda.Event` for JIT
- Merges StreamContext manager for both Eager and JIT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53902

Test Plan:
------
Run JIT tests:
python test/test_jit.py -v TestCUDA

Run eager tests:
python test/test_cuda.py -v TestCuda

Reviewed By: SplitInfinity

Differential Revision: D27285996

Pulled By: nikithamalgifb

fbshipit-source-id: 45d9fee9a582b5f4c82330f5f99eb88584804270
2021-03-26 14:19:39 -07:00
43f0ccd1ec torch.cuda.memory_allocated to return {} if not initialized (#51179)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51179

Reviewed By: ngimel

Differential Revision: D26094932

Pulled By: malfet

fbshipit-source-id: 0ec28ef9b0604245753d3f2b0e3536286700668d
2021-01-28 20:38:17 -08:00
4f9d0757f3 Add type information to torch.cuda (#47134)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47133

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47134

Reviewed By: smessmer

Differential Revision: D24955031

Pulled By: ezyang

fbshipit-source-id: 87f4623643715baa6ac0627383f009956f80cd46
2020-11-13 21:34:35 -08:00
8d570bc708 Decouple DataParallel/DistributedDataParallel from CUDA (#38454)
Summary:
Decouple DataParallel/DistributedDataParallel from CUDA to support more device types.
- Move torch/cuda/comm.py to torch/nn/parallel/comm.py with minor changes for common device support. torch.cuda.comm is kept as-is for backward compatibility (both import paths are sketched after this list).
- Provide common APIs to arbitrary device types without changing existing CUDA APIs in torch.cuda space.
- Replace the torch.cuda calls in DataParallel/DistributedDataParallel with the new APIs.
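
A small illustration of the backward-compatible surface (sketch; assumes at least one visible CUDA device):

```python
import torch
import torch.cuda.comm          # old location, kept for backward compatibility
import torch.nn.parallel.comm   # new, device-decoupled location

t = torch.arange(4, device="cuda:0")
# Both entry points expose the same helpers after the move.
old_copies = torch.cuda.comm.broadcast(t, devices=[0])
new_copies = torch.nn.parallel.comm.broadcast(t, devices=[0])
```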

Related RFC: [https://github.com/pytorch/pytorch/issues/36160](https://github.com/pytorch/pytorch/issues/36160)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38454

Differential Revision: D22051557

Pulled By: mrshenli

fbshipit-source-id: 7842dad0e5d3ca0f6fb760bda49182dcf6653af8
2020-07-07 12:48:16 -07:00
de7ac60cf4 Add out= variants for cuda.comm.broadcast/gather/scatter (#39681)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/38911
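
Rough sketch of the new out= form (my reading of the PR title; the exact keyword handling is an assumption):

```python
import torch
import torch.cuda.comm

src = torch.arange(6, dtype=torch.float32, device="cuda:0")
dst = [torch.empty_like(src)]            # caller-preallocated destination(s)
torch.cuda.comm.broadcast(src, out=dst)  # writes into dst instead of allocating new copies
```
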
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39681

Differential Revision: D22161342

Pulled By: mrshenli

fbshipit-source-id: 60295077159b02087823e93bb6ebac9d70adea0a
2020-06-24 12:58:19 -07:00
5766da503b Device name should be a string, not bytes (#40322)
Summary:
I.e., do not accept `bytes` as a possible type for the `device` argument in
`torch.cuda._get_device_index`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40322

Differential Revision: D22176885

Pulled By: malfet

fbshipit-source-id: 2f3a46174161f1cdcf6a6ad94a31e54b18ad6186
2020-06-22 19:27:25 -07:00
8b5732e8ad Move torch.cuda annotations inline (#40075)
Summary:
Also enable `torch.cuda` typechecking
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40075

Differential Revision: D22121275

Pulled By: malfet

fbshipit-source-id: dbecef09911334e8f3d87f5ecab66349da9f2325
2020-06-18 15:52:29 -07:00
76fbfba644 Move _dummy_type to _utils.py (#40177)
Summary:
Use it from both __init__ and streams to define dummy types when CUDA is missing
Fix accidental reference of global `storage_name` from `_dummy_type`
Add type annotations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40177

Differential Revision: D22106922

Pulled By: malfet

fbshipit-source-id: 52fbfd91d70a78eb14d7ffda109c02ad1231497e
2020-06-17 22:50:02 -07:00
da2004e132 Upgrade lint. (#39483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39483

I fixed all of the new errors that occurred because of the upgrade.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21884575

Pulled By: ezyang

fbshipit-source-id: 45c8e1f1ecb410c8d7c46dd3922ad70e982a0685
2020-06-04 12:56:43 -07:00
fbdafb006e Fix trivial typos in torch.cuda._utils (#16026)
Summary:
Trivial typo fixes.

Maybe the indefinite article "an" is needed before each "specified index" but I'm not perfectly sure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16026

Differential Revision: D13709499

Pulled By: ezyang

fbshipit-source-id: 698b000bb8aa063afd81db6e67046456a439b2ce
2019-01-17 10:40:43 -08:00
fab8085111 _get_device_index supports parsing device strings
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14929

Reviewed By: weiyangfb

Differential Revision: D13394498

Pulled By: soumith

fbshipit-source-id: 948c6118abdf6c1e1a8a17709333954cafb2345e
2018-12-09 21:12:46 -08:00
8e33451e2e Make torch.cuda.* take device objects; Update distributed docs (#10833)
Summary:
Commits:

1. Make `torch.cuda.*` take device objects
2. Update `torch.distributed` docs to emphasize calling `torch.cuda.set_device` before `init_process_group` (both points sketched below)
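
A usage sketch of both points (the init_process_group arguments below are placeholder values for a single-process run):

```python
import torch
import torch.distributed as dist

# 1. torch.cuda.* accepts torch.device objects, not just integer indices
dev = torch.device("cuda", 0)
torch.cuda.set_device(dev)
torch.cuda.synchronize(dev)

# 2. pin the device *before* creating the process group, as the updated docs emphasize
dist.init_process_group(
    backend="nccl",
    init_method="tcp://127.0.0.1:23456",  # placeholder endpoint for illustration
    rank=0,
    world_size=1,
)
```
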
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10833

Differential Revision: D9514241

Pulled By: SsnL

fbshipit-source-id: 2497464305fb1e63d6c495291a5744aaa7e2696e
2018-08-27 15:24:42 -07:00