Add a stable TORCH_LIBRARY to C shim (#148124)

This PR adds two main parts:
- stable C APIs in shim.h that wrap the torch::Library APIs
- a higher-level API in torch/csrc/stable/library.h that calls into this shim.h and is otherwise self-contained

Goal: custom kernel writers should be able to call the APIs in the files above to register their library in a way that allows their custom extension to run with a different libtorch version than the one it was built with.
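
For illustration, here is a minimal sketch of what registration through the stable surface might look like from an extension, using the `identity` op exercised in the test below. The names (`STABLE_TORCH_LIBRARY`, `STABLE_TORCH_LIBRARY_IMPL`, `StableIValue`, the boxed signature) are assumptions about the torch/csrc/stable/library.h API rather than an excerpt from this PR:

```cpp
// Sketch only: macro and type names assumed, not quoted from the PR.
#include <torch/csrc/stable/library.h>

// Boxed calling convention: every argument and return travels through the
// stack as a StableIValue (a uint64_t slot), so no C++ object layout ever
// crosses the extension/libtorch boundary.
void boxed_identity(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
  // identity: one input, one output; the input StableIValue at stack[0]
  // already is the result, so the stack can be left untouched.
}

STABLE_TORCH_LIBRARY(libtorch_agnostic, m) {
  m.def("identity(Tensor t) -> Tensor");
}

STABLE_TORCH_LIBRARY_IMPL(libtorch_agnostic, CompositeExplicitAutograd, m) {
  m.impl("identity", &boxed_identity);
}
```

Because only C-compatible types and fixed-width slots cross the boundary, the compiled extension does not take a dependency on libtorch's C++ ABI.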

Subplots resolved:

- Do we want a whole separate StableLibrary, or do we want to freeze torch::Library and add `m.stable_impl(cstring, void (*fn)(void **, int64_t, int64_t))` into it?
    - Yes, we want a separate StableLibrary. We cannot freeze Library, and it is NOT header-only.
- Should I use uint64_t as the common denominator instead of void* to better support 32-bit architectures?
    - Yes, and done (see the sketch after this list).
- Should I add a stable `def` and `fragment` when those can be done in Python?
    - I think we do want these --- and now they're done.
- Where should library_stable_impl.cpp live? -- no longer relevant
- I need some solid test cases to make sure everything's going OK. I've intentionally thrown a bunch of random dtypes into the signature, but I still haven't tested returning multiple things, returning nothing, complex dtypes, etc.
    - Have since tested all the torch library endpoints. The others can be tested in a followup that separates the components that need to be in shim.h from those that can be added later.
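
To make the uint64_t decision above concrete, here is a hedged sketch of the bit-packing a `StableIValue = uint64_t` scheme implies. The `from`/`to` helper names are illustrative assumptions, not necessarily the ones shipped in library.h:

```cpp
#include <cstdint>
#include <cstring>

using StableIValue = uint64_t;

// Pack any small, trivially-copyable value into the 64-bit slot.
template <typename T>
StableIValue from(T val) {
  static_assert(sizeof(T) <= sizeof(StableIValue), "must fit in 64 bits");
  StableIValue result = 0;
  std::memcpy(&result, &val, sizeof(val));
  return result;
}

// Unpack it on the other side of the ABI boundary.
template <typename T>
T to(StableIValue val) {
  static_assert(sizeof(T) <= sizeof(StableIValue), "must fit in 64 bits");
  T result;
  std::memcpy(&result, &val, sizeof(result));
  return result;
}

// e.g. a boxed SGD kernel could unpack its scalar hyperparameters:
//   double lr = to<double>(stack[3]);
//   bool maximize = to<bool>(stack[4]);
```

A fixed 64-bit slot works identically on 32-bit and 64-bit platforms, which is why it beats void* as the common denominator.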

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148124
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/atalman
Author: Jane Xu
Date: 2025-03-11 07:44:21 -07:00
Committed by: PyTorch MergeBot
Parent: 4d10da731b
Commit: 971606befa

15 changed files with 765 additions and 9 deletions


@@ -227,6 +227,49 @@ class TestCppExtensionAOT(common.TestCase):
            if return_code != 0:
                return return_code

    @unittest.skipIf(not TEST_CUDA, "some aspects of this test require CUDA")
    def test_libtorch_agnostic(self):
        import libtorch_agnostic

        # (1) first test that SGD CPU kernel works
        param = torch.rand(5, device="cpu")
        grad = torch.rand_like(param)
        weight_decay = 0.01
        lr = 0.001
        maximize = False

        new_param = libtorch_agnostic.ops.sgd_out_of_place(
            param, grad, weight_decay, lr, maximize
        )
        torch._fused_sgd_(
            (param,),
            (grad,),
            (),
            weight_decay=weight_decay,
            momentum=0.0,
            lr=lr,
            dampening=0.0,
            nesterov=False,
            maximize=maximize,
            is_first_step=False,
        )
        self.assertEqual(new_param, param)

        # (2) then test that we don't hog unnecessary memory
        def _run_identity(prior_mem, device):
            t = torch.rand(32, 32, device=device)
            self.assertGreater(torch.cuda.memory_allocated(device), prior_mem)
            identi_t = libtorch_agnostic.ops.identity(t)
            assert identi_t is t

        device = torch.cuda.current_device()
        init_mem = torch.cuda.memory_allocated(device)
        for _ in range(3):
            _run_identity(init_mem, device)

        curr_mem = torch.cuda.memory_allocated(device)
        self.assertEqual(curr_mem, init_mem)


@torch.testing._internal.common_utils.markDynamoStrictTest
class TestPybindTypeCasters(common.TestCase):