Commit Graph

231 Commits

Author SHA1 Message Date
e1ea584b1c Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)"
This reverts commit 4e1060c609c094fd5f58041ebed803f74410ee36.

Reverted https://github.com/pytorch/pytorch/pull/95541 on behalf of https://github.com/DanilBaibak due to breaking internal builds
2023-03-15 13:28:41 +00:00
85639c1a88 [allocator] Generalize recording to a pool (#96542)
Previously the allocator would query whether a stream was recording a graph,
and look up the pool associated with a graph. This change has the allocator
directly associate a stream with a mempool, decoupling "record this stream to a pool"
from the action of "record all actions to a cuda graph".
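For context, a minimal sketch (not part of this PR) of the user-visible pool-sharing behavior this decoupling serves, using the existing `torch.cuda.graph_pool_handle()` / `torch.cuda.graph(..., pool=...)` API; the usual warmup on a side stream is omitted for brevity:

```python
import torch

# Two separate graph captures that record their allocations into one shared mempool.
pool = torch.cuda.graph_pool_handle()
static_x = torch.zeros(8, device="cuda")

g1, g2 = torch.cuda.CUDAGraph(), torch.cuda.CUDAGraph()
with torch.cuda.graph(g1, pool=pool):
    y = static_x * 2
with torch.cuda.graph(g2, pool=pool):  # second capture reuses the same pool
    z = static_x + 1

g1.replay()
g2.replay()
```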
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96542
Approved by: https://github.com/eellison
2023-03-15 04:28:49 +00:00
4e1060c609 [memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)
This refactors the stack trace facility specific to memory profiling
in python+cuda into a generic facility for generating combined stack
traces.

The generic facility (combined_traceback.h) does not require
Python to be present in order to work, but will return Python stacks if it is.

This facility is then used to add support for stack trace gathering in memory profiling that
happens directly from C++.

It is also used to expose a Python API for gathering and symbolizing
combined stacks.
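A rough sketch of how the Python side of this is exercised through the memory profiling entry points; the private `_record_memory_history` / `_snapshot` helpers exist, but their option names for enabling the combined C++/TorchScript stacks have varied across versions and are not spelled out here:

```python
import torch

# Enable memory-history recording; blocks in the snapshot carry recorded stack frames.
torch.cuda.memory._record_memory_history(True)

x = torch.randn(1024, 1024, device="cuda")
y = x @ x

snapshot = torch.cuda.memory._snapshot()
print(len(snapshot["segments"]))  # each segment's blocks include the gathered frames
```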

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95541
Approved by: https://github.com/ezyang
2023-03-14 18:26:05 +00:00
da265652d6 Return Live Data Pointers from Checkpoint, swap onto tensors (#95020)
When we checkpoint the state of the private pool allocator, we will need to make sure that its current live allocated blocks will get properly cleaned up when the tensors they correspond to die. Return DataPtrs for these newly allocated blocks so that the callee can swap them onto live Tensors.

The exact api for setting the checkpoint can be manipulated after this as the cudagraph implementation is built out, but this at least shows it's sufficiently general.

This should be the last PR touching the CUDA caching allocator necessary for the new cudagraphs integration.

Differential Revision: [D43999888](https://our.internmc.facebook.com/intern/diff/D43999888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95020
Approved by: https://github.com/zdevito
2023-03-14 01:22:19 +00:00
1cc32aedb0 Handle additional live allocations not in checkpointed state (#94943)
We choose to ignore certain blocks that are currently allocated when we set the pool to its checkpoint. For those blocks, we need to swap out their deleter function so that a deallocation is not triggered when they die.

Differential Revision: [D43999886](https://our.internmc.facebook.com/intern/diff/D43999886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94943
Approved by: https://github.com/zdevito
2023-03-14 01:00:47 +00:00
d798de2b05 Checkpoint CUDA Allocator Private Pool State (#94653)
Copying note from cuda caching allocator:

```
   * Note [Checkpointing PrivatePoolState]
   *
   * Refer above to Note [Interaction with CUDA graph capture]. Allocations made
   * during graph capture are made from a separate private pool. During graph
   * capture allocations behave as usual. During graph replay the allocator
   * state does not change even as new tensors are created. The private pool
   * will not free its blocks to the main caching allocator until cuda graph use
   * is finished to prevent an allocation from eager clobbering the memory from
   * a live but unaccounted for tensor that was created during replay.
   *
   * `make_graphed_callables`, a series of separate callables chained in
   * successive cuda graphs, can share a memory pool because after a cuda graph
   * recording the allocations in the shared private pool exactly reflect the
   * tensors that are allocated.
   *
   * We would like to extend callable chaining to support a graphed callable
   * tree. In this scenario, we have a tree of callable chains which will be
   * captured with cuda graphs. In the diagram below, we have a tree with four
   * callables, A, B, C, and D. Suppose we have captured, and subsequently
   * replayed, A, B, and C. Then on a new invocation, we replay A and B, but
   * would now like to record D. At this point the private pool will not reflect
   * any of the live tensors created during graph replay. Allocations made
   * during a new recording with the pool could overwrite those live tensors.
   *
   * In order to record a new graph capture after replaying prior callables in
   * the tree, we need the allocator to reflect the state of the live tensors.
   * We checkpoint the state of the private pool after each recording, and then
   * reapply it when we are starting a new recording chain. Additionally, we
   * must free the allocations for any tensors that died between the end of our
   * previous graph replaying and our new recording (TODO). All of the allocated
   * segments that existed in the checkpointed state must still exist in the
   * pool. There may also exist new segments, which we will free (TODO : link
   * note [live tensors between iterations] when it exists).
   *
   *
   *  ---------------> A ---------------> B ---------------> C
   *                                |
   *                                |
   *                                |
   *                                |
   *                                  ---------------> D
```

A few TODOs:
- need to add logic for freeing tensors that have died between a last replay and current new recording
- Add logic for free that might be called on a pointer multiple times (because we are manually freeing live tensors)

The two scenarios above have not been exercised in the tests yet.
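A rough sketch of the callable-chaining setup the note above describes, using the existing public `torch.cuda.make_graphed_callables` API (the graphed callables share one private memory pool internally); the tree/branching case with a new recording after replay is what the checkpointing added here is building toward:

```python
import torch

a = torch.nn.Linear(16, 16).cuda()
b = torch.nn.Linear(16, 16).cuda()

xa = torch.randn(4, 16, device="cuda", requires_grad=True)
xb = torch.randn(4, 16, device="cuda", requires_grad=True)

# Capture A and B into CUDA graphs that share a single private mempool.
graphed_a, graphed_b = torch.cuda.make_graphed_callables((a, b), ((xa,), (xb,)))

out = graphed_b(graphed_a(xa))  # replays A, then B
out.sum().backward()
```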

Differential Revision: [D43999889](https://our.internmc.facebook.com/intern/diff/D43999889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94653
Approved by: https://github.com/zdevito
2023-03-14 00:47:30 +00:00
4b372e3958 [memory profiling] C++ tracing support (#95357)
Adds the ability to quickly generate stack traces for C++,
and combine Python, TorchScript, and C++ frames into a single trace.

This makes it possible for the memory tracer to record allocations inside
C++ code (e.g. convolution temporaries, backward operators).

The unwinder code is ~10x faster than execinfo.h's backtrace because it
caches fast unwinder routines for instruction pointers that have already been seen.
It is also only 1.2--2x slower than copying the entire stack (the approach perf takes),
while using 2 orders of magnitude less space per stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95357
Approved by: https://github.com/bertmaher
2023-03-12 07:24:14 +00:00
48490cec28 [memory profiling] Move Context object to c10 (#96280)
Minor refactor so that a follow-up PR can have objects that meet the GatheredContext
interface without having to depend on CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96280
Approved by: https://github.com/eellison
2023-03-12 07:24:14 +00:00
266089a3fe [memory snapshots] record scripted stack traces (#95356)
Adds support for seeing both python and script stack traces in memory
debugging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95356
Approved by: https://github.com/aaronenyeshi
2023-03-12 07:24:14 +00:00
cyy
6786a24fd2 fix some tiny code issues (#95757)
This PR tries to fix:
1. a misspelled NDEBUG preprocessing condition.
2. get rid of all writable-strings warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95757
Approved by: https://github.com/soulitzer
2023-03-01 23:27:32 +00:00
4f84c57c87 Fix potential deadlock when recording memory traces (#95273)
See comment in the diff

Differential Revision: [D43490668](https://our.internmc.facebook.com/intern/diff/D43490668)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95273
Approved by: https://github.com/eellison
2023-02-27 19:04:47 +00:00
54b7c7d5e9 Added requested_bytes to CUDA Caching Allocator Stats (#88575)
Summary:
The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes, however, the overhead from rounding can be higher than the fragmentation it helps reduce.

We have added a new stat to CUDA caching allocator stats to help track if rounding is adding too much overhead and help tune the roundup_power2_divisions flag:
    - "requested_bytes.{current,peak,allocated,freed}": memory requested by client code, compare this with allocated_bytes to check if allocation rounding adds too much overhead

Test Plan: Added test case in caffe2/test/test_cuda.py

Differential Revision: D40810674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88575
Approved by: https://github.com/zdevito
2023-02-09 21:37:25 +00:00
cyy
27efdc5eed fix writable-strings warnings (#93246)
clang reports "ISO C++11 does not allow conversion from string
literal to 'char *'"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93246
Approved by: https://github.com/malfet
2023-02-04 02:11:15 +00:00
cyy
bfe5e1258b avoid unnecessary static_cast (#93898)
avoid unnecessary static_cast
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93898
Approved by: https://github.com/Skylion007
2023-02-03 03:44:43 +00:00
cyy
e292ddff4e More clang-tidy fixes (#92944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92944
Approved by: https://github.com/Skylion007
2023-01-25 19:11:51 +00:00
523d4f2562 Revert "[cuDNN][cuDNN V8 API] Always build assuming cuDNN >= 8.0 (#91527)"
This reverts commit 4d07ad74f1c11efa55501433d6cf1f06840f5207.

Reverted https://github.com/pytorch/pytorch/pull/91527 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-01-16 13:28:09 +00:00
4d07ad74f1 [cuDNN][cuDNN V8 API] Always build assuming cuDNN >= 8.0 (#91527)
We've been building with V8 (incl. V8 API) by default for a while now; this PR cleans up some guards for cuDNN < 8.0.

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91527
Approved by: https://github.com/ngimel
2023-01-13 18:55:37 +00:00
eece6da162 [inductor] Reduce device context manager overhead (#91045)
This adds `torch.cuda._DeviceGuard` which is a stripped down version of
`torch.cuda.device` with lower overhead. To do this, it only accepts `int` as
the device so we don't need to call `_get_device_index` and is implemented
with a new C++ helper `torch._C._cuda_exchangeDevice` that allows
`_DeviceGuard.__enter__` to be just a single function call. On my machine,
I see a drop from 3.8us of overhead to 0.94 us with this simple benchmark:

```python
def set_device():
    with torch.cuda.device(0):
        pass

%timeit set_device()
```
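For comparison, a sketch of the same micro-benchmark against the new guard (a private API; per the description above its constructor takes a plain integer device index):

```python
def set_device_fast():
    # torch.cuda._DeviceGuard accepts an int device index directly,
    # skipping _get_device_index and the extra Python-level work.
    with torch.cuda._DeviceGuard(0):
        pass

%timeit set_device_fast()
```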

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91045
Approved by: https://github.com/ngimel, https://github.com/anijain2305
2023-01-12 16:51:59 +00:00
e096d2db5a [BC-Breaking] Separate stream_id, device_index, and device_type in pack and unpack for Streams (#81596)
#75854

A naive attempt at working around the limitations of using a single 64-bit integer to pack `stream_id`, `device_index`, and `device_type`.

Still needs sanity checks, testing, and minimization of BC-breaking changes.

Currently a Holder for the `StreamData3` struct is used for `IValue` compatibility. While this seems to work for `ivalue.h` and `ivalue_inl.h`, it doesn't seem to work naively for the JIT CUDA stream wrapper (something about ambiguous calls if an `intrusive_ptr` to `c10::ivalue::StreamData3Holder` is used as the return type for `pack()`). It turns out that the methods required to access the fields for rematerializing a CUDA Stream are basically already present anyway, so `pack` is simply removed in the wrapper for now and the methods to access the required fields are called directly.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81596
Approved by: https://github.com/ezyang
2023-01-12 14:16:49 +00:00
07e595e88a Add device_idx to free_fn in CUDAPluggableAllocator (#91398)
This was requested by NVIDIA folks: also track the device_id in the free function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91398
Approved by: https://github.com/albanD
2023-01-12 05:03:48 +00:00
c9d4390d13 Add Pluggable CUDA allocator backend (#86786)
Fixes #43144

This uses the Backend system added by [82682](https://github.com/pytorch/pytorch/pull/82682) to change allocators dynamically during code execution. This will allow us to use RMM, to use CUDA managed memory for portions of the code that do not fit in GPU memory, to write static memory allocators that reduce fragmentation while training models, and to improve interoperability with external DL compilers/libraries.

For example, we could have the following allocator in c++

```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
   void *ptr;
   std::cout<<"alloc "<< size<<std::endl;
   cudaMalloc(&ptr, size);
   return ptr;
}

void my_free(void* ptr) {
   std::cout<<"free "<<std::endl;
   cudaFree(ptr);
}
}
```

Compile it as a shared library
```
nvcc allocator.cc -o alloc.so -shared --compiler-options '-fPIC'
```

And use it from PyTorch as follows

```python
import torch

# Init caching
# b = torch.zeros(10, device='cuda')
new_alloc = torch.cuda.memory.CUDAPluggableAllocator('alloc.so', 'my_malloc', 'my_free')
old = torch.cuda.memory.get_current_allocator()
torch.cuda.memory.change_current_allocator(new_alloc)
b = torch.zeros(10, device='cuda')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(old)
```

Things to discuss
- How to test this, needs compiling external code ...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86786
Approved by: https://github.com/albanD
2022-11-23 17:54:36 +00:00
0d2c2110f1 [allocator] Introduce the abstract class CUDACachingAllocator (#87251)
This replaces the manual function pointers, making it easier to write
new drop-in allocators.

Note that most allocation goes through the Allocator interface, which
CUDAAllocator inherits from, and this arrangement avoids adding an
additional layer of dispatch along this pathway compared to what existed before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87251
Approved by: https://github.com/wconstab
2022-10-20 01:17:00 +00:00
f56ce8dbad [allocator] Move getFreeMutex (#87237)
It isn't used at all by the allocators, and this change makes that more clear.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87237
Approved by: https://github.com/wconstab
2022-10-19 18:00:40 +00:00
25725fd624 (Re-open) Adds cudaMallocAsync as an alternative backend for the CUDA allocator (#82682)
Rebased version of @mcarilli 's cudaMallocAsync #65365 for continued testing
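As a usage note (not part of this commit message), the backend is selected through the allocator config environment variable, which needs to be set before the caching allocator is initialized:

```python
import os

# Must be set before the first CUDA allocation initializes the allocator.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"

import torch

x = torch.ones(4, device="cuda")  # served by the cudaMallocAsync backend
```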
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82682
Approved by: https://github.com/ngimel
2022-10-12 03:44:21 +00:00
eqy
352d926482 [CUBLAS][CUDA GRAPHS] (re-re-re-re-open of #83461) Explicitly set the workspace for cuBLAS handles (#86645)
re-opening (again) in hopes of working around failed/stuck CLA check

CC @ptrblck @ngimel @huydhn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86645
Approved by: https://github.com/zdevito
2022-10-11 16:03:49 +00:00
91b1bae1df Caching allocator tracing (#86241)
We currently can take snapshots of the state of the allocated cuda memory, but we do not have a way to correlate these snapshots with the actions the allocator took between snapshots. This PR adds a simple fixed-sized buffer that records the major actions that the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes these with the snapshot information. Capturing periodic snapshots with a big enough trace buffer makes it possible to see how the allocator state changes over time.

We plan to use this functionality to guide how settings in the allocator can be adjusted and eventually have a more robust overall algorithm.

As a component of this functionality, we also add the ability to get a callback when the allocator will throw an OOM, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught).

This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots distinguishing between internal and external fragmentation.
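A rough sketch of reading the trace buffer back out of a snapshot; `trace_alloc_max_entries` and the `device_traces` key reflect the private API of roughly this era and are assumptions about the exact names:

```python
import torch

# Record allocator actions (ALLOC, FREE, SEGMENT_ALLOC, ...) into a bounded trace buffer.
torch.cuda.memory._record_memory_history(True, trace_alloc_max_entries=10000)

x = torch.randn(256, 256, device="cuda")
del x

snap = torch.cuda.memory._snapshot()
for event in snap["device_traces"][0][:5]:  # first few recorded events on device 0
    print(event["action"], event["size"])
```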
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86241
Approved by: https://github.com/ngimel
2022-10-07 23:19:54 +00:00
adf5919720 Add option to record C++ backtraces in _record_memory_history (#86145)
I used this to debug https://github.com/pytorch/pytorch/issues/86136 so it is useful. The implementation is not so fast so it is not enabled by default.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86145
Approved by: https://github.com/albanD, https://github.com/zdevito
2022-10-06 04:07:37 +00:00
97d6b5bbf8 Refactor _cuda_recordMemoryHistory to use pybind11 (#86139)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86139
Approved by: https://github.com/albanD
2022-10-06 04:07:37 +00:00
db13049b88 [allocator tracing] missing GIL acquire (#86254)
Bug where the context destructor needs to hold the GIL to free the context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86254
Approved by: https://github.com/ezyang
2022-10-05 05:54:05 +00:00
71eb04403c Revert "[CUBLAS][CUDA GRAPHS] (re-re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85447)"
This reverts commit b04b2fa9aa52cacbdc9aaaf477d55b0af845ce81.

Reverted https://github.com/pytorch/pytorch/pull/85447 on behalf of https://github.com/seemethere due to Caused a CUDA memory leak, detected by our performance benchmark suite
2022-09-30 20:53:41 +00:00
b04b2fa9aa [CUBLAS][CUDA GRAPHS] (re-re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85447)
Now includes @dagitses 's optimizations and fixes for teardown

CC @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85447
Approved by: https://github.com/malfet
2022-09-28 16:04:58 +00:00
0ac6311356 Revert "[CUBLAS][CUDA GRAPHS] (re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85292)"
This reverts commit 4012e623e8689b873cae94590766d990d155017c.

Reverted https://github.com/pytorch/pytorch/pull/85292 on behalf of https://github.com/dagitses due to broke an internal test during shutdown. Re-submit with #85399 in stack
2022-09-21 17:57:49 +00:00
eqy
4012e623e8 [CUBLAS][CUDA GRAPHS] (re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85292)
re-open of #83461 with fix for 10.2 build

CC @ngimel @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85292
Approved by: https://github.com/malfet
2022-09-20 16:31:54 +00:00
d23ce29761 allow changing the cuda allocator settings even after the process started (#84970)
Summary:
- expose a Python call to set the allocator settings; it uses the same format as the value for `PYTORCH_CUDA_ALLOC_CONF` (see the sketch below)
- keep the implementation contained within the cpp file to avoid increasing build times; only expose a function to call the setting
- make some of the Allocator Config methods public; now it looks more like a singleton
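A minimal sketch of the runtime call; the helper name `torch.cuda.memory._set_allocator_settings` is an assumption about where the binding described above is exposed:

```python
import torch

# Same "key:value,key:value" format as PYTORCH_CUDA_ALLOC_CONF, applied at runtime.
torch.cuda.memory._set_allocator_settings("max_split_size_mb:128,roundup_power2_divisions:4")
```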

Test Plan: added the unit test

Differential Revision: D39487522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84970
Approved by: https://github.com/zdevito
2022-09-17 09:42:42 +00:00
726d040692 annotated allocator snapshots (#82146)
Record stack trace information for each allocated segment in the allocator.
It takes around 1.5us to record 50 stack frames of context.
Since invoking a Pytorch operator is around 8us, this adds minimal overhead but we still leave it disabled by default so that we can test it more on real workloads first.

Stack information is kept both for allocated blocks and for the last allocation that used now-inactive blocks. We could potentially keep around the _first_ allocation that caused the block to get allocated from cuda as well.

Potential Followups:
* stack frame entries are small (16 bytes), but the list of Frames is not compressed even though most frames will share some entries. So far this doesn't produce huge dumps (7MB for one real workload that uses all memory on the GPU), but it could be made much smaller through compression.
* Code to format the information is slow (a few seconds) because it uses python and FlameGraph.pl
* Things allocated during the backward pass have no stack frames because they are run on another C++ thread.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82146
Approved by: https://github.com/albanD
2022-08-09 17:21:35 +00:00
56dea92d97 Fix set_requires_cuda_init (#81183)
Fixes the buggy `set_requires_cuda_init` introduced in #80788.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81183
Approved by: https://github.com/ezyang
2022-07-11 15:36:42 +00:00
ae6dd20ba7 [cuDNN V8 API] (reopen 2) Allow the number of kernels profiled under torch.backends.cudnn.benchmark = True to be limited (#78299)
Reopen of #77002 to address comments by @malfet

CC @ngimel @ptrblck
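A short sketch of the knob this adds; `torch.backends.cudnn.benchmark_limit` is the documented attribute for this feature, with 0 meaning "try every available algorithm":

```python
import torch

torch.backends.cudnn.benchmark = True
# Profile at most 10 candidate cuDNN v8 kernels per unique convolution problem.
torch.backends.cudnn.benchmark_limit = 10
```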
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78299
Approved by: https://github.com/ngimel
2022-07-07 23:25:23 +00:00
081b56fd41 Improve readability of cuda_lazy_init (#80788)
This PR cleans up the implementation of `cuda_lazy_init.cpp` and improves its readability. No behavioral changes are introduced.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80788
Approved by: https://github.com/ezyang
2022-07-04 16:47:11 +00:00
9e86796fe3 simple c10 implementation for std::call_once (#78051)
A long-standing bug in std::call_once: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66146
It could hang during re-entry after exception handling.

Added a c10 implementation using a bulky mutex. Not the most efficient thing, but at least it shouldn't hang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78051
Approved by: https://github.com/albanD
2022-06-28 15:47:03 +00:00
30fb2c4aba [lint] autoformat test/cpp and torch/csrc
Let's have some fun.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828

Approved by: https://github.com/ezyang
2022-06-11 21:11:16 +00:00
6db8440f35 Python Jiterator supports multiple outputs (#78139)
This PR is part3.
Part1: https://github.com/pytorch/pytorch/pull/77902
Part2: https://github.com/pytorch/pytorch/pull/77921

Python Jiterator now supports returning multiple outputs

```
fn = torch.cuda.jiterator._create_multi_output_jit_fn(
"""
template <typename T>
T binary_2outputs(T i0, T i1, T& out0, T& out1) {
    out0 = i0 + i1;
    out1 = i0 - i1;
}
""",
num_outputs=2)

x = torch.rand(3, device='cuda')
y = torch.rand(3, device='cuda')
out0, out1 = fn(x, y)

torch.allclose(out0, x+y)
torch.allclose(out1, x-y)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78139
Approved by: https://github.com/ngimel
2022-05-24 21:52:56 +00:00
4ea176ea57 expose fast get_current_stream (#78165)
Expose a fast, no-frills way of getting the raw `cudaStream_t` in Python (200 ns instead of 4 us).
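A sketch comparing the two paths; the private binding name shown here is the one later imported by inductor and is taken as an assumption:

```python
import torch

torch.cuda.init()

# Full-featured path: builds a torch.cuda.Stream wrapper, then reads its raw handle.
slow = torch.cuda.current_stream().cuda_stream

# No-frills path: returns the raw cudaStream_t for a given device index directly.
fast = torch._C._cuda_getCurrentRawStream(0)

print(slow, fast)  # same underlying stream handle
```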

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78165
Approved by: https://github.com/SherlockNoMad, https://github.com/soumith, https://github.com/gchanan
2022-05-24 15:54:47 +00:00
aea6e2c396 Merge torch.cuda._UntypedStorage into torch._UntypedStorage (#75459)
Fixes #74933

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75459
Approved by: https://github.com/ezyang
2022-05-19 13:54:39 +00:00
929f1d5317 [RELAND] Adds torch.cuda.is_current_stream_capturing (#77789)
Resubmit of https://github.com/pytorch/pytorch/pull/77673, which was reverted due to Windows test failures: https://github.com/pytorch/pytorch/pull/77673#issuecomment-1130425845.

I suspect these failures happened because I don't explicitly set a side stream for graph capture in the new test.
Not setting a side stream explicitly is alright on Linux because cuda tests implicitly use a side stream.
I think Windows cuda tests implicitly use the default stream, breaking capture and leaving the backend in a bad state.
Other graph tests explicitly set side streams and don't error in Windows builds, so I'm 95% sure doing the same for the new test will work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77789
Approved by: https://github.com/ezyang
2022-05-18 23:18:53 +00:00
de86146c61 rocblas alt impl during backward pass only (#71881)
In preparation for adopting future rocblas library options, it is necessary to track when the backward pass of training is executing.  The scope-based helper class `BackwardPassGuard` is provided to toggle state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71881
Approved by: https://github.com/albanD
2022-05-18 19:42:58 +00:00
0d8a0f186b Revert "Adds torch.cuda.is_current_stream_capturing (#77673)"
This reverts commit d03d43df527e48771875537ad20212d5cb333215.

Reverted https://github.com/pytorch/pytorch/pull/77673 on behalf of https://github.com/suo
2022-05-18 19:31:49 +00:00
d03d43df52 Adds torch.cuda.is_current_stream_capturing (#77673)
Exposes a way to query if CUDA graph capture is underway on the current stream.
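A minimal sketch of the query (warmup before capture omitted for brevity):

```python
import torch

print(torch.cuda.is_current_stream_capturing())  # False outside capture

x = torch.zeros(4, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = x * 2
    capturing = torch.cuda.is_current_stream_capturing()  # True during capture

g.replay()
print(capturing)
```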
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77673
Approved by: https://github.com/ezyang
2022-05-18 16:46:35 +00:00
8b6a78f39f Python Interface for Jiterator
This PR allows users to author a CUDA kernel in Python.

```
from torch.cuda.jiterator import create_jit_fn

code_string = "template <typename T> T my_kernel(T x, T y, T alpha) { return  -x * y + x - y + alpha; }"
jitted_fn = create_jit_fn(code_string, alpha=0)

a = torch.rand(3, device='cuda')
b = torch.rand(3, device='cuda')
result = jitted_fn(a, b, alpha=1.0)
```

Limitations:
- Only supports elementwise kernels
- 1~8 tensor inputs (empty input, e.g. factory methods, is not supported)
- input tensors must live on a CUDA device
- cpu Scalars are not supported
- kwargs must be pre-declared when calling create_jit_fn
- kwargs must be convertible to at::Scalar, one of float64, int64_t, bool. (complex is not supported for now)

TODOs:
- [x] consolidate union and c10::variant implementation
- [x] plug into existing op testing framework
- [ ] rename files, place files in the right folder
- [ ] place util functions in the right file
- [x] enforce assumptions in python interface e.g <8 inputs, kwargs types
- [x] Add user-facing documentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76394
Approved by: https://github.com/mruberry
2022-05-06 18:44:28 +00:00
efd274bbcb Fix for windows builds with python 3.10, getting rid of ssize_t (ssize_t is not a C++ defined type) (#71390)
Summary:
Fix for windows builds with python 3.10, getting rid of ssize_t

Here is the completed bin build : https://app.circleci.com/pipelines/github/pytorch/pytorch/441527/workflows/144edb79-b398-4d70-92fe-b63158c1b439/jobs/16954881

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71390

Reviewed By: samdow

Differential Revision: D33637686

Pulled By: atalman

fbshipit-source-id: fcdfca672dc20385a3d2339c20e69bd2d1717e88
(cherry picked from commit 2ac58b0dc13f152bea180dd3d64b7c36fe0ba755)
2022-01-18 22:12:41 +00:00
5cae40c169 [pytorch][aten][cuda] move CUDAGeneratorImpl.h to ATen/cuda (#70650)
Summary:
This patch moves a CUDA-specific file, `CUDAGeneratorImpl.h` to `ATen/cuda` as the following TODO comment in  `CUDAGeneratorImpl.h` suggests:
```
// TODO: this file should be in ATen/cuda, not top level
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70650

Reviewed By: jianyuh, xw285cornell

Differential Revision: D33414890

Pulled By: shintaro-iwasaki

fbshipit-source-id: 4ff839205f4e4ea4c8767f164d583eb7072f1b8b
2022-01-10 22:27:04 -08:00