Set version to 0.8.1

Set version to 0.8.1.dev0 (#115)
Add support for project-wide locking of layers (#114)
2025-11-06 07:04:32 +08:00 · 2025-07-23 14:43:31 +02:00 · 2025-07-23 14:42:24 +02:00 · 2025-07-23 09:37:05 +02:00 · 2025-07-22 17:02:39 +02:00 · 2025-07-22 10:03:34 +02:00
19 changed files with 1909 additions and 194 deletions
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -63,7 +63,7 @@ jobs:
      - name: Check README generation
        # For now, just checks that generation doesn't fail.
        run: |
-          uv run kernels generate-readme kernels-community/triton-layer-norm --revision docs
+          uv run kernels generate-readme kernels-community/triton-layer-norm

      - name: Import check without torch
        run: |
--- a/README.md
+++ b/README.md
@@ -57,7 +57,7 @@ the Hub.
 ## 📚 Documentation

 - [Using layers](docs/layers.md)
-- [Locking kernel versions](docs/locking.md)
+- [Locking kernel/layer versions](docs/locking.md)
 - [Environment variables](docs/env.md)
 - [Using kernels in a Docker container](docs/docker.md)
 - [Kernel requirements](docs/kernel-requirements.md)
--- a/docs/layers.md
+++ b/docs/layers.md
@@ -49,15 +49,46 @@ A model will not use Hub kernels by default, even if it contains extensible
 layers. To enable the use of Hub kernels in the model, it needs to be
 'kernelized' using the `kernelize` function. This function traverses the
 model graph and replaces the `forward` methods of extensible layers for which
-Hub kernels are registered. Kernelize can be used as follows:
+Hub kernels are registered. `kernelize` can be used as follows:

 ```python
 model = MyModel(...)
-model = kernelize(model)
+model = kernelize(model, mode=Mode.INFERENCE)
 ```

-**Note:** the `kernelize` function modifies the model in-place, the model
-itself is returned as a convenience.
+The `kernelize` function modifies the model in-place; the model itself is
+returned as a convenience. The `mode` argument specifies that the model will
+be used for inference. Similarly, you can ask `kernelize` to prepare the
+model for training:
+
+```python
+model = MyModel(...)
+model = kernelize(model, mode=Mode.TRAINING)
+```
+
+A model that is kernelized for training can also be used for inference, but
+not the other way around. If you want to change the mode of the kernelized
+model, you can just run `kernelize` on the model again with the new mode.
+
+If you want to compile a model with `torch.compile`, this should be indicated
+in the mode as well. You can do this by combining `Mode.INFERENCE` or
+`Mode.TRAINING` with `Mode.TORCH_COMPILE` using the set union (`|`) operator:
+
+```python
+model = MyModel(...)
+
+# Inference
+model = kernelize(model, mode=Mode.INFERENCE | Mode.TORCH_COMPILE)
+
+# Training
+model = kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
+```
+
+When the `mode` argument is not specified,
+`Mode.TRAINING | Mode.TORCH_COMPILE` is used as the default. This mode
+aligns most closely with pure PyTorch layers which also support training
+and `torch.compile`. However, to select the most performant kernels, it
+is often best to make the mode as specific as possible.

 ### Kernel device

@@ -69,36 +100,32 @@ inferred (e.g. because the model has no parameters):

 ```python
 model = MyModel(...)
-model = kernelize(model, device="cuda")
+model = kernelize(model, device="cuda", mode=Mode.INFERENCE)
 ```

-### `torch.compile`
+### Fallback `forward`

-Not all Hub kernels support `torch.compile`. If you want to compile a model
-after kernelizing it, pass the `needs_torch_compile` argument to ensure that
-only kernels that support `torch.compile` will be loaded:
-
-```python
-model = MyModel(...)
-model = kernelize(model, needs_torch_compile=True)
-```
-
-### Fallback forward
-
-The `needs_torch_compile` argument will fall back to the layer's original
-`forward` if the registered kernels does not support `torch.compile`. You
+If the `TRAINING` and/or `TORCH_COMPILE` modes are used, but a registered
+kernel does not support backward passes or `torch.compile` respectively,
+`kernelize` will fall back to the original, non-kernelized, layer. You
 can let `kernelize` raise an exception instead by using `use_fallback=False`:

 ```python
 model = MyModel(...)
-model = kernelize(model, needs_torch_compile=True, use_fallback=False)
+model = kernelize(model, mode=Mode.INFERENCE | Mode.TORCH_COMPILE, use_fallback=False)
 ```

 This can be useful if you want to guarantee that Hub kernels are used.

+### Inspecting which kernels are used
+
+The kernels that are used are logged at the `INFO` level by `kernelize`.
+See the [Python logging](https://docs.python.org/3/library/logging.html)
+documentation for information on how to configure logging.
+
 ## Registering a hub kernel for a layer

-`kernelize`` relies on kernel mappings to find Hub kernels for layers.
+`kernelize` relies on kernel mappings to find Hub kernels for layers.
 Kernel mappings map a kernel name such as `SiluAndMul` to a kernel on
 the Hub. For example:

@@ -108,7 +135,6 @@ kernel_layer_mapping = {
        "cuda": LayerRepository(
            repo_id="kernels-community/activation",
            layer_name="SiluAndMul",
-            revision="layers",
        )
    }
 }
@@ -132,3 +158,115 @@ with use_kernel_mapping(kernel_layer_mapping):

 This ensures that the mapping is not active anymore outside the
 `with`-scope.
+
+### Registering kernels for specific modes
+
+You might want to register two different kernels for a particular layer,
+where one kernel is optimized for a specific mode. You can do so by
+registering layer repositories for specific modes. For example:
+
+```python
+kernel_layer_mapping = {
+    "SiluAndMul": {
+        "cuda": {
+          Mode.INFERENCE: LayerRepository(
+              repo_id="kernels-community/activation-inference-optimized",
+              layer_name="SiluAndMul",
+          ),
+          Mode.TRAINING | Mode.TORCH_COMPILE: LayerRepository(
+              repo_id="kernels-community/activation-training-optimized",
+              layer_name="SiluAndMul",
+          ),
+      }
+    }
+}
+```
+
+The `kernelize` function will attempt to use the following registered
+kernels for a given mode:
+
+- `INFERENCE`: `INFERENCE` → `INFERENCE | TORCH_COMPILE` → `TRAINING` →
+  `TRAINING | TORCH_COMPILE` → `FALLBACK`
+- `INFERENCE | TORCH_COMPILE`: `INFERENCE | TORCH_COMPILE` →
+  `TRAINING | TORCH_COMPILE` → `FALLBACK`
+- `TRAINING`: `TRAINING` → `TRAINING | TORCH_COMPILE` → `FALLBACK`
+- `TRAINING | TORCH_COMPILE`: `TRAINING | TORCH_COMPILE` → `FALLBACK`
+
+`Mode.FALLBACK` is a special mode that is used when no other mode matches. It
+is also used when a kernel is registered without a mode, as described in the
+previous section.
+
+```python
+kernel_layer_mapping = {
+    "SiluAndMul": {
+        "cuda": {
+            Mode.FALLBACK: LayerRepository(
+                repo_id="kernels-community/activation",
+                layer_name="SiluAndMul",
+            ),
+            Mode.INFERENCE: LayerRepository(
+                repo_id="kernels-community/activation-inference-optimized",
+                layer_name="SiluAndMul",
+            ),
+            Mode.TRAINING: LayerRepository(
+                repo_id="kernels-community/activation-training-optimized",
+                layer_name="SiluAndMul",
+            ),
+        }
+    }
+}
+```
+
+In this case, both `Mode.INFERENCE | Mode.TORCH_COMPILE` and
+`Mode.TRAINING | Mode.TORCH_COMPILE` will use the `Mode.FALLBACK` kernel,
+since the other kernels do not support `torch.compile`.
+
+### Registering kernels for specific CUDA capabilities
+
+Some kernels only work with newer CUDA architectures. For instance, some
+kernels require capability 9.0 for the TMA unit on Hopper GPUs. `kernels`
+supports registering layers for a range of CUDA capabilities. To do so,
+you need to register the layer for a `Device` with type `cuda` and
+set the supported range of CUDA capabilities using `CUDAProperties`:
+
+```python
+kernel_layer_mapping = {
+    "SiluAndMul": {
+        Device(
+            type="cuda",
+            properties=CUDAProperties(
+                min_capability=75, max_capability=89
+            ),
+        ): LayerRepository(
+            repo_id="kernels-community/activation",
+            layer_name="SiluAndMul",
+        ),
+        Device(
+            type="cuda",
+            properties=CUDAProperties(
+                min_capability=90, max_capability=sys.maxsize
+            ),
+        ): LayerRepository(
+            repo_id="kernels-community/activation-hopper",
+            layer_name="SiluAndMul",
+        ),
+    }
+}
+```
+
+Capabilities behave as follows:
+
+- The minimum and maximum capabilities are inclusive.
+- When a new kernel is registered with the same min/max capabilities as
+  an existing kernel, the new kernel will replace the old kernel.
+- When there are multiple kernels that support a capability, the kernel
+  with the smaller capability interval will be used. E.g. given:
+
+  - `KernelA` with `min_capability=80` and `max_capability=89`;
+  - `KernelB` with `min_capability=75` and `max_capability=89`;
+  - `kernelize` runs on a system with capability 8.6.
+
+  Then `KernelA` will be used because the interval 80..89 is smaller
+  than 75..89. The motivation is that kernels with smaller ranges
+  tend to be more optimized for a specific set of GPUs. **This behavior
+  might still change in the future.**
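
The mode lookup order documented in `docs/layers.md` above can be sketched in a few lines. This is a minimal, self-contained illustration and not the library's implementation: `Mode` is a reduced stand-in for `kernels.Mode` (without the validation in its `__or__`), `PRIORITY` is transcribed from the documented list, and `pick_repo` is a hypothetical helper that represents repositories as plain strings. It also leaves out the separate check of whether a selected kernel supports backward passes or `torch.compile`.

```python
from enum import Flag, auto
from typing import Dict, List, Optional


class Mode(Flag):
    # Reduced stand-in for `kernels.Mode`; illustration only.
    FALLBACK = auto()
    TRAINING = auto()
    INFERENCE = auto()
    TORCH_COMPILE = auto()


# Lookup order per requested mode, transcribed from the list in docs/layers.md.
PRIORITY: Dict[Mode, List[Mode]] = {
    Mode.INFERENCE: [
        Mode.INFERENCE,
        Mode.INFERENCE | Mode.TORCH_COMPILE,
        Mode.TRAINING,
        Mode.TRAINING | Mode.TORCH_COMPILE,
        Mode.FALLBACK,
    ],
    Mode.INFERENCE | Mode.TORCH_COMPILE: [
        Mode.INFERENCE | Mode.TORCH_COMPILE,
        Mode.TRAINING | Mode.TORCH_COMPILE,
        Mode.FALLBACK,
    ],
    Mode.TRAINING: [
        Mode.TRAINING,
        Mode.TRAINING | Mode.TORCH_COMPILE,
        Mode.FALLBACK,
    ],
    Mode.TRAINING | Mode.TORCH_COMPILE: [
        Mode.TRAINING | Mode.TORCH_COMPILE,
        Mode.FALLBACK,
    ],
}


def pick_repo(requested: Mode, registered: Dict[Mode, str]) -> Optional[str]:
    """Return the first registered repository in priority order, if any."""
    for candidate in PRIORITY[requested]:
        if candidate in registered:
            return registered[candidate]
    return None


# Mirrors the second mapping example in the docs (repository names as strings).
registered = {
    Mode.FALLBACK: "kernels-community/activation",
    Mode.INFERENCE: "kernels-community/activation-inference-optimized",
    Mode.TRAINING: "kernels-community/activation-training-optimized",
}
```

With this mapping, `pick_repo(Mode.INFERENCE | Mode.TORCH_COMPILE, registered)` resolves to the `Mode.FALLBACK` repository, which matches the behavior the documentation describes for that example.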
--- a/docs/locking.md
+++ b/docs/locking.md
@@ -1,4 +1,4 @@
-# Locking kernel versions
+# Locking kernel/layer versions

 Projects that use `setuptools` can lock the kernel versions that should be
 used. First specify the accepted versions in `pyproject.toml` and make
@@ -26,6 +26,24 @@ activation = get_locked_kernel("kernels-community/activation")
 **Note:** the lock file is included in the package metadata, so it will only be visible
 to `kernels` after doing an (editable or regular) installation of your project.

+## Locked kernel layers
+
+Locking is also supported for kernel layers. To use locked layers, register them
+with the `LockedLayerRepository` class:
+
+```python
+kernel_layer_mapping = {
+    "SiluAndMul": {
+        "cuda": LockedLayerRepository(
+            repo_id="kernels-community/activation",
+            layer_name="SiluAndMul",
+        )
+    }
+}
+
+register_kernel_mapping(kernel_layer_mapping)
+```
+
 ## Pre-downloading locked kernels

 Locked kernels can be pre-downloaded by running `kernels download .` in your
--- a/flake.lock
+++ b/flake.lock
@@ -58,16 +58,15 @@
        "nixpkgs": "nixpkgs"
      },
      "locked": {
-        "lastModified": 1749025620,
-        "narHash": "sha256-V/r5KOp8FRC5n3MINDzTeS3pZz57SasFVzx12WQRQ8U=",
+        "lastModified": 1750775451,
+        "narHash": "sha256-HiGqtwzIgUH7Xkh+wgpvHRZGooqrW0z663E6nauczA4=",
        "owner": "huggingface",
        "repo": "hf-nix",
-        "rev": "7ab84ffad440c530162f528a96fa062530a6c8e4",
+        "rev": "5943c3169e861618a6634bc8dbdb498e413ab9b7",
        "type": "github"
      },
      "original": {
        "owner": "huggingface",
-        "ref": "torch-cxx11",
        "repo": "hf-nix",
        "type": "github"
      }
--- a/flake.nix
+++ b/flake.nix
@@ -1,6 +1,6 @@
 {
  inputs = {
-    hf-nix.url = "github:huggingface/hf-nix/torch-cxx11";
+    hf-nix.url = "github:huggingface/hf-nix";
    nixpkgs.follows = "hf-nix/nixpkgs";
    flake-utils.url = "github:numtide/flake-utils";
  };
@@ -16,7 +16,7 @@
      let
        pkgs = import nixpkgs {
          inherit system;
-          inherit (hf-nix.lib) config;
+          config = hf-nix.lib.config system;
          overlays = [
            hf-nix.overlays.default
          ];
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "kernels"
-version = "0.6.1.dev0"
+version = "0.8.1"
 description = "Download compute kernels"
 authors = [
  { name = "OlivierDehaene", email = "olivier@huggingface.co" },
@@ -24,7 +24,7 @@ build-backend = "setuptools.build_meta"

 [dependency-groups]
 dev = [
-  "mypy == 1.14.1",
+  "mypy >= 1.15.0",
  "pytest >=8",
  # Whatever version is compatible with pytest.
  "pytest-benchmark",
--- a/src/kernels/__init__.py
+++ b/src/kernels/__init__.py
@@ -1,6 +1,8 @@
 from kernels.layer import (
+    CUDAProperties,
    Device,
    LayerRepository,
+    Mode,
    kernelize,
    register_kernel_mapping,
    replace_kernel_forward_from_hub,
@@ -9,6 +11,7 @@ from kernels.layer import (
 )
 from kernels.utils import (
    get_kernel,
+    get_local_kernel,
    get_locked_kernel,
    has_kernel,
    install_kernel,
@@ -16,16 +19,19 @@ from kernels.utils import (
 )

 __all__ = [
+    "CUDAProperties",
+    "Device",
+    "LayerRepository",
+    "Mode",
    "get_kernel",
+    "get_local_kernel",
    "get_locked_kernel",
    "has_kernel",
-    "load_kernel",
    "install_kernel",
-    "use_kernel_forward_from_hub",
-    "use_kernel_mapping",
+    "kernelize",
+    "load_kernel",
    "register_kernel_mapping",
    "replace_kernel_forward_from_hub",
-    "LayerRepository",
-    "Device",
-    "kernelize",
+    "use_kernel_forward_from_hub",
+    "use_kernel_mapping",
 ]
--- a/src/kernels/_interval_tree.py
+++ b/src/kernels/_interval_tree.py
@@ -0,0 +1,200 @@
+# AVL-balanced interval trees. We could use the intervaltree
+# package, but it seems unmaintained and does not have type
+# annotations.
+
+from typing import Generic, List, Optional, Tuple, TypeVar
+
+T = TypeVar("T")
+
+
+class _Node(Generic[T]):
+    """A node in the interval tree."""
+
+    def __init__(self, start: int, end: int, data: T):
+        self.start: int = start
+        self.end: int = end
+        self.data: T = data
+        self.max_end: int = end
+        self.left: Optional["_Node[T]"] = None
+        self.right: Optional["_Node[T]"] = None
+        self.height: int = 1
+
+    def __repr__(self) -> str:
+        return f"Node({self.start}, {self.end})"
+
+
+class IntervalTree(Generic[T]):
+    """A data structure to hold and query (unique) intervals."""
+
+    root: Optional[_Node[T]]
+
+    def __init__(self):
+        self.root = None
+
+    def insert(self, start: int, end: int, data: T) -> None:
+        """
+        Inserts a new interval into the tree.
+
+        Args:
+            start: The starting point of the interval.
+            end: The ending point of the interval.
+            data: The data associated with this interval.
+        """
+        self.root = self._insert(self.root, start, end, data)
+
+    def _get_height(self, node: Optional[_Node[T]]) -> int:
+        if not node:
+            return 0
+        return node.height
+
+    def _get_balance(self, node: Optional[_Node[T]]) -> int:
+        if not node:
+            return 0
+        return self._get_height(node.left) - self._get_height(node.right)
+
+    def _update_node_attributes(self, node: _Node[T]) -> None:
+        node.height = 1 + max(self._get_height(node.left), self._get_height(node.right))
+        node.max_end = node.end
+        if node.left:
+            node.max_end = max(node.max_end, node.left.max_end)
+        if node.right:
+            node.max_end = max(node.max_end, node.right.max_end)
+
+    def _right_rotate(self, y: _Node[T]) -> _Node[T]:
+        """Performs a right rotation."""
+        x = y.left
+        assert x is not None
+        T2 = x.right
+
+        x.right = y
+        y.left = T2
+
+        self._update_node_attributes(y)
+        self._update_node_attributes(x)
+
+        return x
+
+    def _left_rotate(self, x: _Node[T]) -> _Node[T]:
+        """Performs a left rotation."""
+        y = x.right
+        assert y is not None
+        T2 = y.left
+
+        y.left = x
+        x.right = T2
+
+        self._update_node_attributes(x)
+        self._update_node_attributes(y)
+
+        return y
+
+    def _insert(
+        self, node: Optional[_Node[T]], start: int, end: int, data: T
+    ) -> _Node[T]:
+        """Recursive helper to insert a new node and balance the tree."""
+        if not node:
+            return _Node(start, end, data)
+
+        # Replace the data if the interval already exists.
+        if start == node.start and end == node.end:
+            node.data = data
+            return node
+
+        if start < node.start:
+            node.left = self._insert(node.left, start, end, data)
+        else:
+            node.right = self._insert(node.right, start, end, data)
+
+        self._update_node_attributes(node)
+
+        balance = self._get_balance(node)
+
+        # Left Left Case
+        if balance > 1 and node.left and start < node.left.start:
+            return self._right_rotate(node)
+
+        # Right Right Case
+        if balance < -1 and node.right and start >= node.right.start:
+            return self._left_rotate(node)
+
+        # Left Right Case
+        if balance > 1 and node.left and start >= node.left.start:
+            node.left = self._left_rotate(node.left)
+            return self._right_rotate(node)
+
+        # Right Left Case
+        if balance < -1 and node.right and start < node.right.start:
+            node.right = self._right_rotate(node.right)
+            return self._left_rotate(node)
+
+        return node
+
+    def search(self, point: int) -> List[T]:
+        """
+        Searches for all intervals that contain the given point.
+
+        Args:
+            point: The point to search for.
+
+        Returns:
+            A list of data items from all matching intervals.
+        """
+        results: List[T] = []
+        self._search(self.root, point, results)
+        return results
+
+    def _search(self, node: Optional[_Node[T]], point: int, results: List[T]) -> None:
+        """Recursive helper to find all overlapping intervals."""
+        if node is None or point > node.max_end:
+            return
+
+        if node.left:
+            self._search(node.left, point, results)
+
+        if node.start <= point <= node.end:
+            results.append(node.data)
+
+        if point >= node.start and node.right:
+            self._search(node.right, point, results)
+
+    def find_smallest_interval(self, point: int) -> Optional[T]:
+        """
+        Finds the item with the most specific (smallest) range for a given point.
+
+        Args:
+            point: The capability to look up.
+
+        Returns:
+            The data of the best-matching item, or None if no match is found.
+        """
+        matches: List[Tuple[int, int, T]] = []
+        self._find_with_intervals(self.root, point, matches)
+
+        if not matches:
+            return None
+
+        # Return the smallest interval, sort by memory location when
+        # there are multiple matches with the same interval size. This
+        # is just to ensure that we can compare against a trivial
+        # implementation in tests.
+        best_match = min(matches, key=lambda x: (x[1] - x[0], id(x[2])))
+        return best_match[2]
+
+    def _find_with_intervals(
+        self,
+        node: Optional[_Node[T]],
+        point: int,
+        results: List[Tuple[int, int, T]],
+    ) -> None:
+        """A modified search that collects interval ranges along with data."""
+        if node is None or point > node.max_end:
+            return
+
+        if node.left:
+            self._find_with_intervals(node.left, point, results)
+
+        if node.start <= point <= node.end:
+            results.append((node.start, node.end, node.data))
+
+        if point >= node.start and node.right:
+            self._find_with_intervals(node.right, point, results)
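
The comment in `find_smallest_interval` mentions comparing against a trivial implementation in tests. A linear scan along those lines might look like the sketch below. The function name and the `(start, end, data)` tuple layout are assumptions made for illustration, and ties between equally sized intervals are resolved by list order here rather than by `id()`.

```python
from typing import List, Optional, Tuple


def find_smallest_interval_naive(
    intervals: List[Tuple[int, int, str]], point: int
) -> Optional[str]:
    """Linear scan: data of the narrowest interval that contains `point`."""
    # Both bounds are inclusive, matching the documented capability semantics.
    matches = [
        (end - start, data)
        for start, end, data in intervals
        if start <= point <= end
    ]
    if not matches:
        return None
    # `min` keeps the first of several equally narrow matches.
    return min(matches, key=lambda match: match[0])[1]


# The example from docs/layers.md: capability 8.6 is encoded as 86.
intervals = [(75, 89, "KernelB"), (80, 89, "KernelA")]
selected = find_smallest_interval_naive(intervals, 86)
```

Here `selected` is `"KernelA"`, because 80..89 is narrower than 75..89. A scan like this is O(n) per query, whereas the tree can prune subtrees using `max_end`.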
--- a/src/kernels/_versions.py
+++ b/src/kernels/_versions.py
@@ -0,0 +1,52 @@
+from typing import Dict, Optional
+
+from huggingface_hub import HfApi
+from huggingface_hub.hf_api import GitRefInfo
+from packaging.specifiers import SpecifierSet
+from packaging.version import InvalidVersion, Version
+
+
+def _get_available_versions(repo_id: str) -> Dict[Version, GitRefInfo]:
+    """Get kernel versions that are available in the repository."""
+    versions = {}
+    for tag in HfApi().list_repo_refs(repo_id).tags:
+        if not tag.name.startswith("v"):
+            continue
+        try:
+            versions[Version(tag.name[1:])] = tag
+        except InvalidVersion:
+            continue
+
+    return versions
+
+
+def resolve_version_spec_as_ref(repo_id: str, version_spec: str) -> GitRefInfo:
+    """
+    Resolve a version spec to the Git ref of the newest matching version tag.
+
+    The version specifier can be any valid Python version specifier:
+    https://packaging.python.org/en/latest/specifications/version-specifiers/#version-specifiers
+    """
+    versions = _get_available_versions(repo_id)
+    requirement = SpecifierSet(version_spec)
+    accepted_versions = sorted(requirement.filter(versions.keys()))
+
+    if len(accepted_versions) == 0:
+        raise ValueError(
+            f"No version of `{repo_id}` satisfies requirement: {version_spec}"
+        )
+
+    return versions[accepted_versions[-1]]
+
+
+def select_revision_or_version(
+    repo_id: str, revision: Optional[str], version: Optional[str]
+) -> str:
+    if revision is not None and version is not None:
+        raise ValueError("Either a revision or a version must be specified, not both.")
+    elif revision is None and version is None:
+        revision = "main"
+    elif version is not None:
+        revision = resolve_version_spec_as_ref(repo_id, version).target_commit
+    assert revision is not None
+    return revision
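
The three outcomes of `select_revision_or_version` (explicit revision, resolved version, default branch) can be exercised without network access by injecting the resolver. The sketch below is an illustration under that assumption: `select_revision` and the `resolve_version` callable are hypothetical names, and the fake resolver stands in for the Hub tag lookup done by `resolve_version_spec_as_ref`.

```python
from typing import Callable, Optional


def select_revision(
    revision: Optional[str],
    version: Optional[str],
    resolve_version: Callable[[str], str],
) -> str:
    """Pick a revision: explicit revision, resolved version, or "main"."""
    if revision is not None and version is not None:
        raise ValueError("Either a revision or a version must be specified, not both.")
    if version is not None:
        # In the real code this step queries the Hub for matching version tags.
        return resolve_version(version)
    return revision if revision is not None else "main"


def fake_resolver(version_spec: str) -> str:
    # Hypothetical stand-in: pretend every spec resolves to this commit.
    return "0123abc"


default_revision = select_revision(None, None, fake_resolver)
pinned_revision = select_revision("my-branch", None, fake_resolver)
versioned_revision = select_revision(None, ">=1.0.0,<2.0.0", fake_resolver)
```

Passing both `revision` and `version` raises `ValueError`, mirroring the check that `LayerRepository.__init__` performs eagerly before any lookup happens.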
--- a/src/kernels/layer.py
+++ b/src/kernels/layer.py
@@ -1,13 +1,32 @@
+from __future__ import annotations
+
+import functools
 import inspect
+import logging
 import os
+import sys
 import warnings
+from abc import ABC, abstractmethod
 from contextvars import ContextVar
 from copy import deepcopy
-from dataclasses import dataclass, field
+from dataclasses import dataclass
+from enum import Flag, auto
+from functools import lru_cache
+from pathlib import Path
 from types import MethodType
-from typing import TYPE_CHECKING, Dict, Optional, Type, Union
+from typing import (
+    TYPE_CHECKING,
+    Dict,
+    Optional,
+    Protocol,
+    Tuple,
+    Type,
+    Union,
+)

-from .utils import get_kernel
+from ._interval_tree import IntervalTree
+from ._versions import select_revision_or_version
+from .utils import _get_caller_locked_kernel, _get_locked_kernel, get_kernel

 if TYPE_CHECKING:
    import torch
@@ -17,55 +36,292 @@ if TYPE_CHECKING:
 _DISABLE_KERNEL_MAPPING: bool = bool(int(os.environ.get("DISABLE_KERNEL_MAPPING", "0")))


+class Mode(Flag):
+    """
+    Kernelize mode
+
+    The `Mode` flag is used by `kernelize` to select kernels for the given
+    mode. Mappings can be registered for specific modes.
+
+    * `INFERENCE`: The kernel is used for inference.
+    * `TRAINING`: The kernel is used for training.
+    * `TORCH_COMPILE`: The kernel is used with `torch.compile`.
+    * `FALLBACK`: In a kernel mapping, this kernel is used when no other mode
+       matches.
+
+    Different modes can be combined. For instance, `INFERENCE | TORCH_COMPILE`
+    should be used for layers that are used for inference *with* `torch.compile`.
+    """
+
+    _NONE = 0
+    FALLBACK = auto()
+    TRAINING = auto()
+    INFERENCE = auto()
+    TORCH_COMPILE = auto()
+
+    def __or__(self, other: Mode) -> Mode:
+        union = super().__or__(other)
+
+        if Mode.INFERENCE in union and Mode.TRAINING in union:
+            raise ValueError("Mode.INFERENCE and Mode.TRAINING are mutually exclusive.")
+
+        if Mode.FALLBACK in union and union != Mode.FALLBACK:
+            raise ValueError("Mode.FALLBACK cannot be combined with other modes.")
+
+        return union
+
+
 @dataclass(frozen=True)
 class Device:
    type: str
+    properties: Optional[CUDAProperties] = None

-    # In the future we might add compute capabilities, etc.
+    def __post_init__(self):
+        if self.properties is not None and isinstance(self.properties, CUDAProperties):
+            if self.type != "cuda":
+                raise ValueError("CUDAProperties is only supported for 'cuda' devices.")
+
+    def create_repo(self) -> _DeviceRepos:
+        """Create an appropriate repository set for this device type."""
+        if self.type == "cuda":
+            return _CUDARepos()
+        elif self.type == "mps":
+            return _MPSRepos()
+        else:
+            raise ValueError(f"Unknown device type: {self.type}")

    def __eq__(self, other):
-        return isinstance(other, Device) and self.type == other.type
+        if not isinstance(other, Device):
+            return NotImplemented
+        return self.type == other.type and self.properties == other.properties

    def __hash__(self):
-        return hash(self.type)
+        return hash((self.type, self.properties))
+
+
+@dataclass(frozen=True)
+class CUDAProperties:
+    min_capability: int
+    max_capability: int
+
+    def __eq__(self, other):
+        if not isinstance(other, CUDAProperties):
+            return NotImplemented
+        return (
+            self.min_capability == other.min_capability
+            and self.max_capability == other.max_capability
+        )
+
+    def __hash__(self):
+        return hash((self.min_capability, self.max_capability))
+
+
+class LayerRepositoryProtocol(Protocol):
+    @property
+    def layer_name(self) -> str: ...
+
+    @property
+    def repo_id(self) -> str: ...
+
+    @property
+    def revision(self) -> str: ...


-@dataclass
 class LayerRepository:
    """
    Repository and name of a layer.
    """

-    layer_name: str = field(
-        metadata={"help": "The name of the layer in the kernel repository."}
-    )
-    repo_id: str = field(metadata={"help": "The kernel hub repository with the layer."})
-    revision: str = field(
-        default="main", metadata={"help": "The revision of the layer."}
-    )
+    def __init__(
+        self,
+        repo_id: str,
+        *,
+        layer_name: str,
+        revision: Optional[str] = None,
+        version: Optional[str] = None,
+    ):
+        """
+        Construct a layer repository.
+
+        Args:
+            repo_id (`str`): The Hub repository containing the layer.
+            revision (`str`, *optional*, defaults to `"main"`): The specific
+                revision (branch, tag, or commit) to download.
+                Cannot be used together with `version`.
+            version (`str`, *optional*): The kernel version to download. This
+                can be a Python version specifier, such as `">=1.0.0,<2.0.0"`.
+                Cannot be used together with `revision`.
+        """
+
+        if revision is not None and version is not None:
+            raise ValueError(
+                "Either a revision or a version must be specified, not both."
+            )
+
+        self.repo_id = repo_id
+        self.layer_name = layer_name
+
+        # We are going to resolve these lazily, since we do not want
+        # to do a network request for every registered LayerRepository.
+        self._revision = revision
+        self._version = version
+
+    @property
+    @functools.lru_cache()
+    def revision(self) -> str:
+        return select_revision_or_version(
+            repo_id=self.repo_id, revision=self._revision, version=self._version
+        )

    def __eq__(self, other):
        return (
            isinstance(other, LayerRepository)
            and self.layer_name == other.layer_name
            and self.repo_id == other.repo_id
-            and self.revision == other.revision
+            and self._revision == other._revision
+            and self._version == other._version
        )

    def __hash__(self):
-        return hash((self.layer_name, self.repo_id, self.revision))
+        return hash((self.layer_name, self.repo_id, self._revision, self._version))


-_CACHED_LAYER: Dict[LayerRepository, Type["nn.Module"]] = {}
+class LockedLayerRepository:
+    """
+    Repository and name of a layer.
+
+    In contrast to `LayerRepository`, this class uses repositories that
+    are locked inside a project.
+    """
+
+    def __init__(
+        self,
+        repo_id: str,
+        *,
+        lockfile: Optional[Path] = None,
+        layer_name: str,
+    ):
+        """
+        Construct a layer repository.
+
+        Args:
+            repo_id (`str`): The Hub repository containing the layer.
+        """
+        self.repo_id = repo_id
+        self.lockfile = lockfile
+        self.layer_name = layer_name
+
+    @property
+    @functools.lru_cache()
+    def revision(self) -> str:
+        if self.lockfile is None:
+            locked_sha = _get_caller_locked_kernel(self.repo_id)
+        else:
+            with open(self.lockfile, "r") as f:
+                locked_sha = _get_locked_kernel(self.repo_id, f.read())
+
+        if locked_sha is None:
+            raise ValueError(f"Kernel `{self.repo_id}` is not locked")
+
+        return locked_sha
+
+    def __eq__(self, other):
+        return (
+            isinstance(other, LockedLayerRepository)
+            and self.layer_name == other.layer_name
+            and self.repo_id == other.repo_id
+        )
+
+    def __hash__(self):
+        return hash((self.layer_name, self.repo_id))


-_KERNEL_MAPPING: ContextVar[Dict[str, Dict[Device, LayerRepository]]] = ContextVar(
+_CACHED_LAYER: Dict[LayerRepositoryProtocol, Type["nn.Module"]] = {}
+
+
+class _DeviceRepos(ABC):
+    """
+    Device-specific kernel layer repositories.
+    """
+
+    @property
+    @abstractmethod
+    def repos(
+        self,
+    ) -> Optional[Dict[Mode, LayerRepositoryProtocol]]: ...
+
+    @abstractmethod
+    def insert(self, device: Device, repos: Dict[Mode, LayerRepositoryProtocol]):
+        """
+        Insert a repository for a specific device and mode.
+        """
+        ...
+
+
+class _MPSRepos(_DeviceRepos):
+    _repos: Dict[Mode, LayerRepositoryProtocol]
+
+    def __init__(self):
+        super().__init__()
+        self._repos = {}
+
+    @property
+    def repos(
+        self,
+    ) -> Optional[Dict[Mode, LayerRepositoryProtocol]]:
+        return self._repos
+
+    def insert(self, device: Device, repos: Dict[Mode, LayerRepositoryProtocol]):
+        if device.type != "mps":
+            raise ValueError(f"Device type must be 'mps', got {device.type}")
+
+        self._repos = repos
+
+
+class _CUDARepos(_DeviceRepos):
+    repos_by_capability: IntervalTree[Dict[Mode, LayerRepositoryProtocol]]
+
+    def __init__(self):
+        super().__init__()
+        self.repos_by_capability = IntervalTree()
+
+    @property
+    def repos(
+        self,
+    ) -> Optional[Dict[Mode, LayerRepositoryProtocol]]:
+        capability = _find_capability()
+        return self.repos_by_capability.find_smallest_interval(capability)
+
+    def insert(self, device: Device, repos: Dict[Mode, LayerRepositoryProtocol]):
+        assert device.properties is None or isinstance(
+            device.properties, CUDAProperties
+        )
+
+        min_capability = (
+            0 if device.properties is None else device.properties.min_capability
+        )
+        max_capability = (
+            sys.maxsize
+            if device.properties is None
+            else device.properties.max_capability
+        )
+
+        self.repos_by_capability.insert(min_capability, max_capability, repos)
+
+
+_KERNEL_MAPPING: ContextVar[Dict[str, Dict[str, _DeviceRepos]]] = ContextVar(
    "_KERNEL_MAPPING", default={}
 )


 def use_kernel_mapping(
-    mapping: Dict[str, Dict[Union[Device, str], LayerRepository]],
+    mapping: Dict[
+        str,
+        Dict[
+            Union[Device, str],
+            Union[LayerRepositoryProtocol, Dict[Mode, LayerRepositoryProtocol]],
+        ],
+    ],
    *,
    inherit_mapping: bool = True,
 ):
@@ -93,14 +349,20 @@ def use_kernel_mapping(


 def register_kernel_mapping(
-    mapping: Dict[str, Dict[Union[Device, str], LayerRepository]],
+    mapping: Dict[
+        str,
+        Dict[
+            Union[Device, str],
+            Union[LayerRepositoryProtocol, Dict[Mode, LayerRepositoryProtocol]],
+        ],
+    ],
 ):
    """
-    Allows one to register a mapping between a layer name the corresponding
-    kernel to use, depending on the device. This should be use in conjunction
+    Allows one to register a mapping between a layer name and the corresponding
+    kernel(s) to use, depending on the device. This should be used in conjunction
    with `kernelize`.

-    Exemple usage:
+    Example usage:

    ```python
    from kernels import LayerRepository, register_kernel_mapping
@ -121,10 +383,17 @@ def register_kernel_mapping(
    for new_kernel, new_device_repos in mapping.items():
        device_repo = _KERNEL_MAPPING.get().setdefault(new_kernel, {})
        for new_device, new_repo in new_device_repos.items():
-            if isinstance(new_device, str):
-                device_repo[Device(type=new_device)] = new_repo
+            device = (
+                Device(type=new_device) if isinstance(new_device, str) else new_device
+            )
+
+            if isinstance(new_repo, dict):
+                kernel_options = new_repo
            else:
-                device_repo[new_device] = new_repo
+                kernel_options = {Mode.FALLBACK: new_repo}
+
+            feature_repos = device_repo.setdefault(device.type, device.create_repo())
+            feature_repos.insert(device, kernel_options)


 def replace_kernel_forward_from_hub(
@ -145,10 +414,57 @@ def replace_kernel_forward_from_hub(
    cls.kernel_layer_name = layer_name


+_MODE_FALLBACK_PRIORITY = {
+    Mode.INFERENCE: [
+        Mode.INFERENCE,
+        Mode.INFERENCE | Mode.TORCH_COMPILE,
+        Mode.TRAINING,
+        Mode.TRAINING | Mode.TORCH_COMPILE,
+        Mode.FALLBACK,
+    ],
+    Mode.TRAINING: [
+        Mode.TRAINING,
+        Mode.TRAINING | Mode.TORCH_COMPILE,
+        Mode.FALLBACK,
+    ],
+    Mode.INFERENCE
+    | Mode.TORCH_COMPILE: [
+        Mode.INFERENCE | Mode.TORCH_COMPILE,
+        Mode.TRAINING | Mode.TORCH_COMPILE,
+        Mode.FALLBACK,
+    ],
+    Mode.TRAINING
+    | Mode.TORCH_COMPILE: [
+        Mode.TRAINING | Mode.TORCH_COMPILE,
+        Mode.FALLBACK,
+    ],
+}
+
+
+def _select_repository(
+    repositories: Dict[Mode, LayerRepositoryProtocol],
+    *,
+    mode: Mode,
+) -> Optional[Tuple[LayerRepositoryProtocol, Mode]]:
+    # Get the fallback priority list for the requested mode
+    if mode not in _MODE_FALLBACK_PRIORITY:
+        raise ValueError(f"Unsupported mode: {mode}")
+
+    fallback_modes = _MODE_FALLBACK_PRIORITY[mode]
+
+    # Try each mode in priority order
+    for fallback_mode in fallback_modes:
+        if fallback_mode in repositories:
+            return (repositories[fallback_mode], fallback_mode)
+
+    return None
+
+
 def kernelize(
    model: "nn.Module",
+    *,
+    mode: Mode = Mode.TRAINING | Mode.TORCH_COMPILE,
    device: Optional[Union[str, "torch.device"]] = None,
-    needs_torch_compile: bool = False,
    use_fallback: bool = True,
 ):
    """
@ -158,10 +474,11 @@ def kernelize(

    Args:
        model: The PyTorch model to kernelize
+        mode: The mode that the kernel is going to be used in (e.g.
+            `Mode.TRAINING | Mode.TORCH_COMPILE` kernelizes the model for training
+            and `torch.compile`).
        device: The device type to load kernels for. The device type will be inferred
            from the parameters of the model when not provided.
-        needs_torch_compile: When set to `true`, only kernels that support
-            `torch.compile` will be loaded.
        use_fallback: Whether to use the original forward method of modules when no
            compatible kernel could be found. If set to `False`, an exception will
            be raised in such cases.
@ -171,12 +488,22 @@ def kernelize(
    """
    import torch

+    if mode == Mode.FALLBACK:
+        raise ValueError("Mode.FALLBACK can only be used to register kernel mappings.")
+
+    # Type check ignored because this causes a false positive on Python < 3.11.
+    # Looks similar to: https://github.com/python/mypy/issues/9642
+    # Remove once we start doing typing checks on >= 3.11.
+    if Mode.INFERENCE not in mode and Mode.TRAINING not in mode:  # type: ignore[operator]
+        raise ValueError("kernelize mode must contain Mode.INFERENCE or Mode.TRAINING.")
+
    if device is None:
        device_type = _find_device(model)
    elif isinstance(device, str):
        device_type = Device(type=torch.device(device).type)
    else:
        device_type = Device(device.type)
+
    assert isinstance(device_type, Device)

    for _, module in model.named_modules():
@ -203,10 +530,10 @@ def kernelize(
            _replace_forward(module, module_class)
            continue

-        # Use device type string directly instead of Device object
-        repo = kernel.get(device_type)
+        # Get kernel options for the device
+        property_repos = kernel.get(device_type.type)

-        if repo is None:
+        if property_repos is None:
            if not use_fallback:
                raise ValueError(
                    f"No layer mapping for `{layer_name}` with device type `{device_type}`"
@ -214,32 +541,50 @@ def kernelize(
            _replace_forward(module, module_class)
            continue

-        # Short-circuit if we already loaded the layer.
-        layer = _CACHED_LAYER.get(repo, None)
-        if layer is not None:
-            _conditionally_replace_forward(
-                module=module,
-                layer=layer,
-                needs_torch_compile=needs_torch_compile,
-                use_fallback=use_fallback,
-            )
+        repos = property_repos.repos
+
+        if repos is None:
+            if not use_fallback:
+                raise ValueError(
+                    f"No layer mapping for `{layer_name}` device `{device_type}` with the right properties"
+                )
+            _replace_forward(module, module_class)
            continue

-        layer = _get_kernel_layer(
-            repo_id=repo.repo_id,
-            layer_name=repo.layer_name,
-            revision=repo.revision,
+        repo_with_mode = _select_repository(
+            repos,
+            mode=mode,
        )

-        # Validate the replacement layer against the class layer.
-        _validate_layer(check_cls=module_class, cls=layer)
+        if repo_with_mode is None:
+            if not use_fallback:
+                raise ValueError(
+                    f"No repository for `{layer_name}` for configuration mode={mode}"
+                )
+            _replace_forward(module, module_class)
+            continue

-        _CACHED_LAYER[repo] = layer
+        repo, repo_mode = repo_with_mode
+
+        logging.info(
+            f"Using layer `{repo.layer_name}` from repo `{repo.repo_id}` (revision: {repo.revision}) for layer `{layer_name}`"
+        )
+        logging.debug(f"kernelize mode: {mode}, repo mode: {repo_mode}")
+
+        layer = _get_layer_memoize(repo, module_class)
+
+        # Ideally we would do validation on the mapping where we check that
+        # e.g. if a repo class is registered for TRAINING | TORCH_COMPILE,
+        # the actual layer is compatible with that. Unfortunately, this would
+        # mean that we have to pre-download everything.
+        _validate_layer_has_mode(
+            layer_name=layer_name, module=layer, repo=repo, repo_mode=repo_mode
+        )

        _conditionally_replace_forward(
            module=module,
            layer=layer,
-            needs_torch_compile=needs_torch_compile,
+            mode=mode,
            use_fallback=use_fallback,
        )
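The mode resolution that `kernelize` performs through `_select_repository` can be tried in isolation. The sketch below is a simplified stand-in: `Mode` is redefined locally with the member names used in the diff (the values are an assumption), repositories are replaced by plain strings, and `select_repository` and `PRIORITY` are illustrative names.

```python
from enum import Flag, auto
from typing import Dict, Optional, Tuple


class Mode(Flag):
    # Local stand-in for `kernels.Mode`; only the member names come from the diff.
    FALLBACK = auto()
    TRAINING = auto()
    INFERENCE = auto()
    TORCH_COMPILE = auto()


# Same ordering as `_MODE_FALLBACK_PRIORITY`: the requested mode first,
# then progressively more capable registrations, then FALLBACK.
PRIORITY = {
    Mode.INFERENCE: [
        Mode.INFERENCE,
        Mode.INFERENCE | Mode.TORCH_COMPILE,
        Mode.TRAINING,
        Mode.TRAINING | Mode.TORCH_COMPILE,
        Mode.FALLBACK,
    ],
    Mode.TRAINING: [
        Mode.TRAINING,
        Mode.TRAINING | Mode.TORCH_COMPILE,
        Mode.FALLBACK,
    ],
    Mode.INFERENCE | Mode.TORCH_COMPILE: [
        Mode.INFERENCE | Mode.TORCH_COMPILE,
        Mode.TRAINING | Mode.TORCH_COMPILE,
        Mode.FALLBACK,
    ],
    Mode.TRAINING | Mode.TORCH_COMPILE: [
        Mode.TRAINING | Mode.TORCH_COMPILE,
        Mode.FALLBACK,
    ],
}


def select_repository(
    repositories: Dict[Mode, str], *, mode: Mode
) -> Optional[Tuple[str, Mode]]:
    if mode not in PRIORITY:
        raise ValueError(f"Unsupported mode: {mode}")
    # Return the first registration found in priority order.
    for candidate in PRIORITY[mode]:
        if candidate in repositories:
            return repositories[candidate], candidate
    return None


repos = {
    Mode.TRAINING | Mode.TORCH_COMPILE: "train-compile-repo",
    Mode.FALLBACK: "generic-repo",
}
chosen = select_repository(repos, mode=Mode.INFERENCE)
```

With these registrations an inference request resolves to the training and `torch.compile` repository before it reaches `Mode.FALLBACK`, while a repository registered only for `Mode.INFERENCE` is never offered to a training request.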

@ -327,49 +672,87 @@ def _find_device(model: "nn.Module") -> Device:
    return Device(type=param.device.type)


+@lru_cache
+def _find_capability() -> int:
+    import torch
+
+    major, minor = torch.cuda.get_device_capability(device=None)
+    return major * 10 + minor
+
+
 def _conditionally_replace_forward(
    *,
    module: "nn.Module",
    layer: Type["nn.Module"],
-    needs_torch_compile: bool,
+    mode: Mode,
    use_fallback: bool,
 ):
    module_class = type(module)

-    # Switch to fallback when the layer does not support:
-    # compilation/compile when needed.
-    # backward when needed
-    needs_fallback = needs_torch_compile and not getattr(
+    # Switch to fallback if the mode is not supported by the layer.
+    # Note that this is useful even after _validate_layer_has_mode because
+    # layers registered with the FALLBACK mode never get rejected by
+    # _validate_layer_has_mode. For such layers, we want to fall back in
+    # case the layer does not support the given mode.
+    needs_fallback = Mode.TORCH_COMPILE in mode and not getattr(
        layer, "can_torch_compile", False
    )
+    needs_fallback |= Mode.TRAINING in mode and not getattr(layer, "has_backward", True)
+
    if needs_fallback:
        if use_fallback:
            _replace_forward(module, module_class)
        else:
-            raise ValueError(
-                f"Available kernel does not fulfill requirements: needs_torch_compile={needs_torch_compile}"
-            )
+            raise ValueError(f"Available kernel does not support mode: {mode}")
    else:
        _replace_forward(module, layer)


 def _replace_forward(module: "nn.Module", layer: Type["nn.Module"]):
-    import torch.nn as nn
+    module.forward = MethodType(layer.forward, module)  # type: ignore[method-assign]

-    module_class = type(module)
-    layer_with_backward = (
-        layer if getattr(layer, "has_backward", True) else module_class
+
+def _validate_layer_has_mode(
+    *,
+    layer_name: str,
+    module: Type["nn.Module"],
+    repo: LayerRepositoryProtocol,
+    repo_mode: Mode,
+):
+    """
+    Check that a repository supports the mode that it was registered for.
+    """
+
+    if Mode.TRAINING in repo_mode and not getattr(module, "has_backward", True):
+        raise ValueError(
+            f"Layer `{repo.layer_name}` ({repo.repo_id}, revision: {repo.revision}) does not support backward.\n"
+            f"Was registered for `{layer_name}` with mode `{repo_mode}`"
+        )
+
+    if Mode.TORCH_COMPILE in repo_mode and not getattr(
+        module, "can_torch_compile", False
+    ):
+        raise ValueError(
+            f"Layer `{repo.layer_name}` ({repo.repo_id}, revision: {repo.revision}) does not support torch.compile.\n"
+            f"Was registered for `{layer_name}` with mode `{repo_mode}`"
+        )
+
+    return True
+
+
+def _get_layer_memoize(
+    repo: LayerRepositoryProtocol, module_class: Type["nn.Module"]
+) -> Type["nn.Module"]:
+    layer = _CACHED_LAYER.get(repo, None)
+    if layer is not None:
+        return layer
+
+    layer = _get_kernel_layer(
+        repo_id=repo.repo_id,
+        layer_name=repo.layer_name,
+        revision=repo.revision,
    )
+    _validate_layer(check_cls=module_class, cls=layer)
+    _CACHED_LAYER[repo] = layer

-    def train(self, mode: bool = True) -> nn.Module:
-        super(type(self), self).train(mode)
-        if mode:
-            self.forward = MethodType(layer_with_backward.forward, self)
-        else:
-            self.forward = MethodType(layer.forward, self)
-        return self
-
-    module.train = MethodType(train, module)  # type: ignore[method-assign]
-
-    # Trigger setting correct forward for the current state.
-    module.train(module.training)
+    return layer
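The fallback decision in `_conditionally_replace_forward` and the rebinding in `_replace_forward` can be sketched without PyTorch. This is a simplified illustration, not the library code: `Mode` is a local stand-in, `needs_fallback` is a hypothetical helper that mirrors the two `getattr` checks, and `Original` and `HubLayer` are plain classes standing in for `nn.Module` subclasses.

```python
from enum import Flag, auto
from types import MethodType


class Mode(Flag):
    # Local stand-in for `kernels.Mode`; only the member names come from the diff.
    FALLBACK = auto()
    TRAINING = auto()
    INFERENCE = auto()
    TORCH_COMPILE = auto()


def needs_fallback(layer: type, mode: Mode) -> bool:
    # A layer is assumed NOT to support torch.compile unless it says so,
    # and assumed to support backward unless it opts out.
    lacks_compile = Mode.TORCH_COMPILE in mode and not getattr(
        layer, "can_torch_compile", False
    )
    lacks_backward = Mode.TRAINING in mode and not getattr(
        layer, "has_backward", True
    )
    return lacks_compile or lacks_backward


class Original:
    def __init__(self):
        self.scale = 2

    def forward(self, x):
        return x * self.scale


class HubLayer:
    # Opts out of training; does not declare `can_torch_compile`.
    has_backward = False

    def forward(self, x):
        return x * self.scale + 1


module = Original()
if not needs_fallback(HubLayer, Mode.INFERENCE):
    # Bind the replacement `forward` to the existing instance, so that it
    # keeps using the module's own state (`self.scale`).
    module.forward = MethodType(HubLayer.forward, module)

result = module.forward(3)
```

Binding with `MethodType` replaces only the `forward` of this one instance; the class and any other instances keep their original method.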
--- a/src/kernels/lockfile.py
+++ b/src/kernels/lockfile.py
@ -4,10 +4,8 @@ from pathlib import Path
 from typing import Dict, List, Tuple

 from huggingface_hub import HfApi
-from huggingface_hub.hf_api import GitRefInfo
-from packaging.specifiers import SpecifierSet
-from packaging.version import InvalidVersion, Version

+from kernels._versions import resolve_version_spec_as_ref
 from kernels.compat import tomllib


@ -31,20 +29,6 @@ class KernelLock:
        return cls(repo_id=o["repo_id"], sha=o["sha"], variants=variants)


-def _get_available_versions(repo_id: str) -> Dict[Version, GitRefInfo]:
-    """Get kernel versions that are available in the repository."""
-    versions = {}
-    for tag in HfApi().list_repo_refs(repo_id).tags:
-        if not tag.name.startswith("v"):
-            continue
-        try:
-            versions[Version(tag.name[1:])] = tag
-        except InvalidVersion:
-            continue
-
-    return versions
-
-
 def get_kernel_locks(repo_id: str, version_spec: str) -> KernelLock:
    """
    Get the locks for a kernel with the given version spec.
@ -52,16 +36,7 @@ def get_kernel_locks(repo_id: str, version_spec: str) -> KernelLock:
    The version specifier can be any valid Python version specifier:
    https://packaging.python.org/en/latest/specifications/version-specifiers/#version-specifiers
    """
-    versions = _get_available_versions(repo_id)
-    requirement = SpecifierSet(version_spec)
-    accepted_versions = sorted(requirement.filter(versions.keys()))
-
-    if len(accepted_versions) == 0:
-        raise ValueError(
-            f"No version of `{repo_id}` satisfies requirement: {version_spec}"
-        )
-
-    tag_for_newest = versions[accepted_versions[-1]]
+    tag_for_newest = resolve_version_spec_as_ref(repo_id, version_spec)

    r = HfApi().repo_info(
        repo_id=repo_id, revision=tag_for_newest.target_commit, files_metadata=True
--- a/src/kernels/utils.py
+++ b/src/kernels/utils.py
@ -16,6 +16,7 @@ from typing import Dict, List, Optional, Tuple
 from huggingface_hub import file_exists, snapshot_download
 from packaging.version import parse

+from kernels._versions import select_revision_or_version
 from kernels.lockfile import KernelLock, VariantLock


@ -45,9 +46,11 @@ def build_variant() -> str:
        compute_framework = f"rocm{rocm_version.major}{rocm_version.minor}"
    elif torch.backends.mps.is_available():
        compute_framework = "metal"
+    elif hasattr(torch, "xpu") and torch.xpu.is_available():
+        compute_framework = "xpu"
    else:
        raise AssertionError(
-            "Torch was not compiled with CUDA, Metal, or ROCm enabled."
+            "Torch was not compiled with CUDA, Metal, XPU, or ROCm enabled."
        )

    torch_version = parse(torch.__version__)
@ -55,6 +58,7 @@ def build_variant() -> str:
    os = platform.system().lower()

    if os == "darwin":
+        cpu = "aarch64" if cpu == "arm64" else cpu
        return f"torch{torch_version.major}{torch_version.minor}-{compute_framework}-{cpu}-{os}"

    cxxabi = "cxx11" if torch.compiled_with_cxx11_abi() else "cxx98"
@ -109,6 +113,23 @@ def install_kernel(
        )
    )

+    try:
+        return _load_kernel_from_path(repo_path, package_name, variant_locks)
+    except FileNotFoundError:
+        # Redo with more specific error message.
+        raise FileNotFoundError(
+            f"Kernel `{repo_id}` at revision {revision} does not have build: {variant}"
+        )
+
+
+def _load_kernel_from_path(
+    repo_path: Path,
+    package_name: str,
+    variant_locks: Optional[Dict[str, VariantLock]] = None,
+) -> Tuple[str, Path]:
+    variant = build_variant()
+    universal_variant = universal_build_variant()
+
    variant_path = repo_path / "build" / variant
    universal_variant_path = repo_path / "build" / universal_variant

@ -127,7 +148,7 @@ def install_kernel(

    if not os.path.exists(module_init_path):
        raise FileNotFoundError(
-            f"Kernel `{repo_id}` at revision {revision} does not have build: {variant}"
+            f"Kernel at path `{repo_path}` does not have build: {variant}"
        )

    return package_name, variant_path
@ -164,16 +185,63 @@ def install_kernel_all_variants(
    return repo_path / "build"


-def get_kernel(repo_id: str, revision: str = "main") -> ModuleType:
+def get_kernel(
+    repo_id: str, revision: Optional[str] = None, version: Optional[str] = None
+) -> ModuleType:
+    """
+    Load a kernel from the kernel hub.
+
+    This function downloads a kernel to the local Hugging Face Hub cache
+    directory (if it was not downloaded before) and then loads the kernel.
+
+    Args:
+        repo_id (`str`): The Hub repository containing the kernel.
+        revision (`str`, *optional*, defaults to `"main"`): The specific
+            revision (branch, tag, or commit) to download.
+            Cannot be used together with `version`.
+        version (`str`, *optional*): The kernel version to download. This
+            can be a Python version specifier, such as `">=1.0.0,<2.0.0"`.
+            Cannot be used together with `revision`.
+
+    Returns:
+        `ModuleType`: The imported kernel module.
+
+    Example:
+        ```python
+        from kernels import get_kernel
+
+        kernel = get_kernel("username/my-kernel")
+        result = kernel.kernel_function(input_data)
+        ```
+    """
+    revision = select_revision_or_version(repo_id, revision, version)
    package_name, package_path = install_kernel(repo_id, revision=revision)
    return import_from_path(package_name, package_path / package_name / "__init__.py")


-def has_kernel(repo_id: str, revision: str = "main") -> bool:
+def get_local_kernel(repo_path: Path, package_name: str) -> ModuleType:
+    """
+    Import a kernel from a local kernel repository path.
+    """
+    package_name, package_path = _load_kernel_from_path(repo_path, package_name)
+    return import_from_path(package_name, package_path / package_name / "__init__.py")
+
+
+def has_kernel(
+    repo_id: str, revision: Optional[str] = None, version: Optional[str] = None
+) -> bool:
    """
    Check whether a kernel build exists for the current environment
    (Torch version and compute framework).
+
+    Args:
+        repo_id (`str`): The Hub repository containing the kernel.
+        revision (`str`, *optional*, defaults to `"main"`): The specific
+            revision (branch, tag, or commit) to download.
+            Cannot be used together with `version`.
+        version (`str`, *optional*): The kernel version to download. This
+            can be a Python version specifier, such as `">=1.0.0,<2.0.0"`.
+            Cannot be used together with `revision`.
+
+    Returns:
+        `bool`: `True` if a kernel is available for the current environment.
    """
+    revision = select_revision_or_version(repo_id, revision, version)
+
    package_name = package_name_from_repo_id(repo_id)
    variant = build_variant()
    universal_variant = universal_build_variant()
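The implementation of `select_revision_or_version` is not part of this excerpt. The sketch below only illustrates the contract that the docstrings and `test_version` describe: `revision` and `version` are mutually exclusive, a version specifier is resolved to a ref, and `"main"` is used when neither is given. The function name `pick_revision`, the `resolve_version` callback, and the exact error wording are assumptions made for this example.

```python
from typing import Callable, Optional


def pick_revision(
    revision: Optional[str],
    version: Optional[str],
    resolve_version: Callable[[str], str],
) -> str:
    # Hypothetical helper, not the library function.
    if revision is not None and version is not None:
        raise ValueError(
            "Either a revision or a version must be specified, not both."
        )
    if version is not None:
        # Delegate to a resolver that maps a specifier such as "<0.2.0"
        # to a concrete ref; here it is supplied by the caller.
        return resolve_version(version)
    # Documented default when neither argument is provided.
    return revision if revision is not None else "main"


def fake_resolver(spec: str) -> str:
    # Stub resolver for the example; a real one would query the repository tags.
    return "v0.1.1"


default_ref = pick_revision(None, None, fake_resolver)
pinned_ref = pick_revision("v0.1.0", None, fake_resolver)
resolved_ref = pick_revision(None, "<0.2.0", fake_resolver)
```

Rejecting the combination up front avoids having to decide which of the two arguments should take precedence.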
--- a/tests/layer_locking/kernels.lock
+++ b/tests/layer_locking/kernels.lock
@ -0,0 +1,12 @@
+[
+  {
+    "repo_id": "kernels-test/versions",
+    "sha": "dc142fd6c9920c993d32be6358b78957c58681c3",
+    "variants": {
+      "torch-universal": {
+        "hash": "sha256-35ce0ccfe68e392cbc06feef72268f4c41a74b9920496a2c6ee8978db7f7c17c",
+        "hash_type": "git_lfs_concat"
+      }
+    }
+  }
+]
--- a/tests/layer_locking/pyproject.toml
+++ b/tests/layer_locking/pyproject.toml
@ -0,0 +1,2 @@
+[tool.kernels.dependencies]
+"kernels-test/versions" = ">=0.1.0,<0.2.0"
--- a/tests/test_basic.py
+++ b/tests/test_basic.py
@ -1,7 +1,7 @@
 import pytest
 import torch

-from kernels import get_kernel, has_kernel
+from kernels import get_kernel, get_local_kernel, has_kernel, install_kernel


@pytest.fixture
@ -9,6 +9,14 @@ def kernel():
    return get_kernel("kernels-community/activation")


+@pytest.fixture
+def local_kernel():
+    package_name, path = install_kernel("kernels-community/activation", "main")
+    # Path is the build variant path (build/torch-<...>), so the grandparent
+    # is the kernel repository path.
+    return get_local_kernel(path.parent.parent, package_name)
+
+
@pytest.fixture
 def metal_kernel():
    return get_kernel("kernels-test/relu-metal")
@ -42,6 +50,22 @@ def test_gelu_fast(kernel, device):
    assert torch.allclose(y, expected)


+@pytest.mark.linux_only
+def test_local_kernel(local_kernel, device):
+    x = torch.arange(1, 10, dtype=torch.float16, device=device).view(3, 3)
+    y = torch.empty_like(x)
+
+    local_kernel.gelu_fast(y, x)
+
+    expected = torch.tensor(
+        [[0.8408, 1.9551, 2.9961], [4.0000, 5.0000, 6.0000], [7.0000, 8.0000, 9.0000]],
+        device=device,
+        dtype=torch.float16,
+    )
+
+    assert torch.allclose(y, expected)
+
+
@pytest.mark.darwin_only
@pytest.mark.parametrize("dtype", [torch.float16, torch.float32])
 def test_relu_metal(metal_kernel, dtype):
@ -67,6 +91,25 @@ def test_has_kernel(kernel_exists):
    assert has_kernel(repo_id, revision=revision) == kernel


+def test_version():
+    kernel = get_kernel("kernels-test/versions")
+    assert kernel.version() == "0.2.0"
+    kernel = get_kernel("kernels-test/versions", version="<1.0.0")
+    assert kernel.version() == "0.2.0"
+    kernel = get_kernel("kernels-test/versions", version="<0.2.0")
+    assert kernel.version() == "0.1.1"
+    kernel = get_kernel("kernels-test/versions", version=">0.1.0,<0.2.0")
+    assert kernel.version() == "0.1.1"
+
+    with pytest.raises(ValueError, match=r"No version.*satisfies requirement"):
+        get_kernel("kernels-test/versions", version=">0.2.0")
+
+    with pytest.raises(ValueError, match=r"Either a revision or a version.*not both"):
+        kernel = get_kernel(
+            "kernels-test/versions", revision="v0.1.0", version="<1.0.0"
+        )
+
+
@pytest.mark.linux_only
 def test_universal_kernel(universal_kernel):
    torch.manual_seed(0)
--- a/tests/test_interval_tree.py
+++ b/tests/test_interval_tree.py
@ -0,0 +1,230 @@
+import random
+from typing import Generic, List, Optional, Tuple, TypeVar
+
+import pytest
+
+from kernels._interval_tree import IntervalTree, _Node
+
+T = TypeVar("T")
+
+
+class SimpleIntervalStore(Generic[T]):
+    """A simple O(n) implementation that stores intervals in a list."""
+
+    def __init__(self):
+        self.intervals: List[Tuple[int, int, T]] = []
+
+    def insert(self, start: int, end: int, data: T) -> None:
+        """Insert an interval into the store."""
+        # Replace data if the interval already exists.
+        for i, (existing_start, existing_end, existing_data) in enumerate(
+            self.intervals
+        ):
+            if existing_start == start and existing_end == end:
+                self.intervals[i] = (start, end, data)
+                return
+
+        self.intervals.append((start, end, data))
+
+    def find_smallest_interval(self, point: int) -> Optional[T]:
+        """Find the best match using linear search."""
+        matches = []
+        for start, end, data in self.intervals:
+            if start <= point <= end:
+                matches.append((start, end, data))
+
+        if not matches:
+            return None
+
+        # Return the smallest interval, sort by memory location when
+        # there are multiple matches with the same interval size. This
+        # mirrors the ordering in the interval tree.
+        best_match = min(matches, key=lambda x: (x[1] - x[0], id(x[2])))
+        return best_match[2]
+
+
+def is_balanced(tree: IntervalTree[T]) -> bool:
+    """Check if the AVL tree is properly balanced."""
+
+    def check_balance(node: Optional[_Node[T]]) -> Tuple[bool, int]:
+        if node is None:
+            return True, 0
+
+        # Left and right subtrees should be balanced.
+        left_balanced, left_height = check_balance(node.left)
+        if not left_balanced:
+            return False, -1
+
+        right_balanced, right_height = check_balance(node.right)
+        if not right_balanced:
+            return False, -1
+
+        # The difference in height should not exceed 1.
+        if abs(left_height - right_height) > 1:
+            return False, -1
+
+        # Check if the height is correct.
+        expected_height = 1 + max(left_height, right_height)
+        if node.height != expected_height:
+            return False, -1
+
+        return True, expected_height
+
+    balanced, _ = check_balance(tree.root)
+    return balanced
+
+
+@pytest.fixture
+def populated_tree() -> IntervalTree[str]:
+    """Provides a pre-populated IntervalTree for testing."""
+    tree = IntervalTree[str]()
+    kernels = [
+        (80, 89, "Kernel_A_General_80_89"),
+        (86, 89, "Kernel_B_Ampere_86_89"),
+        (80, 86, "Kernel_C_Older_Ampere_80_86"),
+        (70, 75, "Kernel_D_Volta_70_75"),
+        (86, 87, "Kernel_E_Specific_86_87"),
+    ]
+    for start, end, name in kernels:
+        tree.insert(start, end, name)
+    return tree
+
+
+def test_find_smallest_interval_match_with_multiple_overlaps(populated_tree):
+    # Check that the smallest interval is selected when there are
+    # multiple matching intervals.
+    assert populated_tree.find_smallest_interval(86) == "Kernel_E_Specific_86_87"
+
+
+def test_find_single_match(populated_tree):
+    assert populated_tree.find_smallest_interval(72) == "Kernel_D_Volta_70_75"
+    assert populated_tree.find_smallest_interval(75) == "Kernel_D_Volta_70_75"
+
+
+def test_no_match_outside_all_ranges(populated_tree):
+    # Check that no interval is found when the value is out of range
+    # (too small/too large).
+    assert populated_tree.find_smallest_interval(65) is None
+    assert populated_tree.find_smallest_interval(95) is None
+
+
+def test_no_match_in_gap_between_ranges(populated_tree):
+    # Check that no interval is found when the value is between two
+    # intervals.
+    assert populated_tree.find_smallest_interval(78) is None
+
+
+def test_boundary_conditions_start_and_end(populated_tree):
+    # Test exact upper/lower bounds of intervals.
+    assert populated_tree.find_smallest_interval(80) == "Kernel_C_Older_Ampere_80_86"
+    assert populated_tree.find_smallest_interval(89) == "Kernel_B_Ampere_86_89"
+
+
+def test_empty_tree():
+    # Searching in an empty tree should return None.
+    empty_tree = IntervalTree[str]()
+    assert empty_tree.find_smallest_interval(100) is None
+
+
+def test_multiple_equally_specific_matches():
+    # Check that we pick the match in a stable way when there are
+    # multiple matching intervals with the same size.
+    tree = IntervalTree[str]()
+    str1 = "First_Narrow_Kernel"
+    str2 = "Second_Narrow_Kernel"
+    tree.insert(10, 20, "Wide_Kernel")
+    tree.insert(12, 17, str1)
+    tree.insert(14, 19, str2)
+
+    if id(str1) < id(str2):
+        assert tree.find_smallest_interval(15) == str1
+    else:
+        assert tree.find_smallest_interval(15) == str2
+
+
+def test_property_based_interval_tree():
+    # Quick-check property-based testing:
+    #
+    # - Verify that the tree is balanced after each insertion.
+    # - Verify the query against a simple list-based implementation.
+
+    random.seed(42)  # For reproducible tests
+
+    test_points = list(range(0, 101))
+
+    for _ in range(5):
+        tree = IntervalTree[str]()
+        simple = SimpleIntervalStore[str]()
+
+        intervals = []
+        for i in range(100):
+            start = random.randint(0, 90)
+            end = random.randint(start, 100)
+            data = f"interval_{i}_s{start}_e{end}"
+            intervals.append((start, end, data))
+
+        for i, (start, end, data) in enumerate(intervals):
+            tree.insert(start, end, data)
+            simple.insert(start, end, data)
+
+            # Check that tree is still balanced
+            assert is_balanced(
+                tree
+            ), f"Tree became unbalanced after inserting interval {i}: ({start}, {end})"
+
+            for point in test_points:
+                tree_result = tree.find_smallest_interval(point)
+                simple_result = simple.find_smallest_interval(point)
+
+                assert tree_result == simple_result, (
+                    f"Mismatch for point {point} after inserting {i+1} intervals. "
+                    f"Tree: {tree_result}, Simple: {simple_result}. "
+                    f"Last inserted: ({start}, {end})"
+                )
+
+
+def test_property_based_edge_cases():
+    random.seed(123)
+
+    tree = IntervalTree[str]()
+    simple = SimpleIntervalStore[str]()
+
+    # Single-point intervals.
+    for i in range(10):
+        point = random.randint(0, 100)
+        data = f"single_point_{i}_{point}"
+        tree.insert(point, point, data)
+        simple.insert(point, point, data)
+
+        assert is_balanced(
+            tree
+        ), f"Tree unbalanced after inserting single point {point}"
+
+        # Test the exact point and neighbors
+        for test_point in [point - 1, point, point + 1]:
+            if 0 <= test_point <= 100:
+                tree_result = tree.find_smallest_interval(test_point)
+                simple_result = simple.find_smallest_interval(test_point)
+                assert tree_result == simple_result
+
+
+def test_unique_intervals_override():
+    """Test that inserting an interval with the same start/end overrides the previous value."""
+    tree = IntervalTree[str]()
+
+    tree.insert(10, 20, "original_value")
+    assert tree.find_smallest_interval(15) == "original_value"
+
+    tree.insert(10, 20, "new_value")
+    assert tree.find_smallest_interval(15) == "new_value"
+
+    tree.insert(10, 25, "different_interval")
+    results = tree.search(15)
+    assert "new_value" in results
+    assert "different_interval" in results
+    assert len(results) == 2
+
+    tree.insert(10, 20, "final_value")
+    assert tree.find_smallest_interval(15) == "final_value"
+
+    assert is_balanced(tree)
--- a/tests/test_kernel_locking.py
+++ b/tests/test_kernel_locking.py
@ -2,9 +2,17 @@ from dataclasses import dataclass
 from pathlib import Path

 import pytest
+import torch.nn as nn

 from kernels import load_kernel
 from kernels.cli import download_kernels
+from kernels.layer import (
+    LockedLayerRepository,
+    Mode,
+    kernelize,
+    use_kernel_forward_from_hub,
+    use_kernel_mapping,
+)


 # Mock download arguments class.
@ -25,3 +33,28 @@ def test_load_locked():
    # Also validates that hashing works correctly.
    download_kernels(DownloadArgs(all_variants=False, project_dir=project_dir))
    load_kernel("kernels-community/activation", lockfile=project_dir / "kernels.lock")
+
+
+def test_layer_locked():
+    project_dir = Path(__file__).parent / "layer_locking"
+
+    @use_kernel_forward_from_hub("Version")
+    class Version(nn.Module):
+        def forward(self) -> str:
+            return "0.0.0"
+
+    version = Version()
+
+    with use_kernel_mapping(
+        {
+            "Version": {
+                "cuda": LockedLayerRepository(
+                    repo_id="kernels-test/versions",
+                    layer_name="Version",
+                    lockfile=project_dir / "kernels.lock",
+                )
+            },
+        }
+    ):
+        version = kernelize(version, device="cuda", mode=Mode.INFERENCE)
+        assert version() == "0.1.1"
--- a/tests/test_layer.py
+++ b/tests/test_layer.py
@ -1,3 +1,4 @@
+import sys
 from contextlib import nullcontext

 import pytest
@ -8,11 +9,17 @@ from torch.nn import functional as F
 from kernels import (
    Device,
    LayerRepository,
+    Mode,
    kernelize,
    register_kernel_mapping,
    use_kernel_forward_from_hub,
 )
-from kernels.layer import _KERNEL_MAPPING, _validate_layer, use_kernel_mapping
+from kernels.layer import (
+    _KERNEL_MAPPING,
+    CUDAProperties,
+    _validate_layer,
+    use_kernel_mapping,
+)

 kernel_layer_mapping = {
    "SiluAndMul": {
@ -65,6 +72,18 @@ class SiluAndMulStringDevice(SiluAndMul):
    pass


+@use_kernel_forward_from_hub("Linear")
+class TorchLinearWithCounter(nn.Linear):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        # Used to check that we called hub kernel.
+        self.n_calls = 0
+
+    def forward(self, input: torch.Tensor) -> torch.Tensor:
+        self.n_calls += 1
+        return super().forward(input)
+
+
 def test_arg_kinds():
    @use_kernel_forward_from_hub("ArgKind")
    class ArgKind(nn.Module):
@ -93,7 +112,7 @@ def test_hub_forward(cls, device):
    X = torch.randn((32, 64), device=device)
    Y = silu_and_mul(X)

-    silu_and_mul_with_kernel = kernelize(cls(), device=device)
+    silu_and_mul_with_kernel = kernelize(cls(), device=device, mode=Mode.INFERENCE)
    Y_kernel = silu_and_mul_with_kernel(X)

    torch.testing.assert_close(Y_kernel, Y)
@ -105,6 +124,55 @@ def test_hub_forward(cls, device):
        assert silu_and_mul_with_kernel.n_calls == 1


+@pytest.mark.linux_only
+def test_capability():
+    linear = TorchLinearWithCounter(32, 32).to("cuda")
+    with use_kernel_mapping(
+        {
+            "Linear": {
+                Device(
+                    type="cuda",
+                    properties=CUDAProperties(
+                        min_capability=75, max_capability=sys.maxsize
+                    ),
+                ): LayerRepository(
+                    repo_id="kernels-test/backward-marker-test",
+                    layer_name="LinearBackward",
+                )
+            }
+        }
+    ):
+        kernelize(linear, mode=Mode.INFERENCE)
+        X = torch.randn(10, 32, device="cuda")
+        linear(X)
+
+        # Check that we called out to the kernel.
+        assert linear.n_calls == 0
+
+    with use_kernel_mapping(
+        {
+            "Linear": {
+                Device(
+                    type="cuda",
+                    properties=CUDAProperties(
+                        min_capability=sys.maxsize, max_capability=sys.maxsize
+                    ),
+                ): LayerRepository(
+                    repo_id="kernels-test/backward-marker-test",
+                    layer_name="LinearBackward",
+                )
+            }
+        }
+    ):
+        kernelize(linear, mode=Mode.INFERENCE)
+        X = torch.randn(10, 32, device="cuda")
+        linear(X)
+
+        # Check that we didn't call out to the kernel because there
+        # is no kernel with a matching capability.
+        assert linear.n_calls == 1
+
+
 def test_layer_fallback_works():
    @use_kernel_forward_from_hub("SiluAndMulNonExisting")
    class SiluAndMulWithKernelFallback(SiluAndMul):
@ -112,7 +180,7 @@ def test_layer_fallback_works():

    # Check that we don't raise an exception for a non-existing kernel.
    silu_and_mul = SiluAndMulWithKernelFallback()
-    kernelize(silu_and_mul, device="cuda")
+    kernelize(silu_and_mul, device="cuda", mode=Mode.INFERENCE)


 @pytest.mark.linux_only
@ -128,7 +196,7 @@ def test_torch_compile_layer_without_fallback(cls, device):
    silu_and_mul_with_kernel.eval()

    ctx = (
-        pytest.raises(ValueError, match="does not fulfill requirements")
+        pytest.raises(ValueError, match="does not support mode")
        if cls is SiluAndMulNoCompileKernel
        else nullcontext()
    )
@ -136,7 +204,7 @@ def test_torch_compile_layer_without_fallback(cls, device):
        silu_and_mul_with_kernel = kernelize(
            silu_and_mul_with_kernel,
            device=device,
-            needs_torch_compile=True,
+            mode=Mode.INFERENCE | Mode.TORCH_COMPILE,
            use_fallback=False,
        )
    silu_and_mul_compiled = torch.compile(silu_and_mul_with_kernel, fullgraph=True)
@ -160,7 +228,7 @@ def test_torch_compile_layer_with_fallback(cls, device):
    silu_and_mul_with_kernel = kernelize(
        silu_and_mul_with_kernel,
        device=device,
-        needs_torch_compile=True,
+        mode=Mode.INFERENCE | Mode.TORCH_COMPILE,
    )
    silu_and_mul_compiled = torch.compile(silu_and_mul_with_kernel, fullgraph=True)

@ -169,6 +237,7 @@ def test_torch_compile_layer_with_fallback(cls, device):
    torch.testing.assert_close(Y_compiled, Y)


+@pytest.mark.linux_only
 def test_mapping_contexts():
    assert set(_KERNEL_MAPPING.get().keys()) == {
        "SiluAndMul",
@ -212,7 +281,7 @@ def test_mapping_contexts():
                "TestKernel",
            }
            assert (
-                _KERNEL_MAPPING.get()["SiluAndMul"][Device(type="cuda")].repo_id
+                _KERNEL_MAPPING.get()["SiluAndMul"]["cuda"].repos[Mode.FALLBACK].repo_id
                == "kernels-community/non-existing"
            )

@ -223,7 +292,7 @@ def test_mapping_contexts():
            "TestKernel",
        }
        assert (
-            _KERNEL_MAPPING.get()["SiluAndMul"][Device(type="cuda")].repo_id
+            _KERNEL_MAPPING.get()["SiluAndMul"]["cuda"].repos[Mode.FALLBACK].repo_id
            == "kernels-community/activation"
        )

@ -232,7 +301,7 @@ def test_mapping_contexts():
                "SiluAndMul",
            }
            assert (
-                _KERNEL_MAPPING.get()["SiluAndMul"][Device(type="cuda")].repo_id
+                _KERNEL_MAPPING.get()["SiluAndMul"]["cuda"].repos[Mode.FALLBACK].repo_id
                == "kernels-community/non-existing"
            )

@ -243,7 +312,7 @@ def test_mapping_contexts():
            "TestKernel",
        }
        assert (
-            _KERNEL_MAPPING.get()["SiluAndMul"][Device(type="cuda")].repo_id
+            _KERNEL_MAPPING.get()["SiluAndMul"]["cuda"].repos[Mode.FALLBACK].repo_id
            == "kernels-community/activation"
        )

@ -282,20 +351,173 @@ def test_validate_kernel_layer():
        _validate_layer(cls=BadLayer4, check_cls=SiluAndMul)


+@pytest.mark.linux_only
+def test_invalid_mode_for_mapping_rejected():
+    linear = TorchLinearWithCounter(32, 32).to("cuda")
+
+    with use_kernel_mapping(
+        {
+            "Linear": {
+                "cuda": {
+                    Mode.TRAINING: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearNoBackward",
+                    )
+                }
+            }
+        }
+    ):
+        with pytest.raises(ValueError, match="does not support backward"):
+            kernelize(linear, mode=Mode.TRAINING)
+
+
+@pytest.mark.linux_only
+def test_kernel_modes():
+    linear = TorchLinearWithCounter(32, 32).to("cuda")
+
+    # Case 1: layer without further specification, becomes the
+    #         base layer.
+    with use_kernel_mapping(
+        {
+            "Linear": {
+                "cuda": LayerRepository(
+                    repo_id="kernels-test/backward-marker-test",
+                    layer_name="LinearBackward",
+                )
+            }
+        }
+    ):
+        kernelize(linear, mode=Mode.INFERENCE)
+        X = torch.randn(10, 32, device="cuda")
+        linear(X)
+        assert linear.n_calls == 0
+
+        kernelize(linear, mode=Mode.TRAINING)
+        linear(X)
+        assert linear.n_calls == 0
+
+        kernelize(linear, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
+        linear(X)
+        assert linear.n_calls == 0
+
+        # Same as previous, since TRAINING | TORCH_COMPILE is the default.
+        kernelize(linear)
+        linear(X)
+        assert linear.n_calls == 0
+
+    # Case 2: register a kernel just for training. If no base kernel
+    #         layer is registered, we fall back to the original layer.
+    with use_kernel_mapping(
+        {
+            "Linear": {
+                "cuda": {
+                    Mode.TRAINING: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearBackward",
+                    )
+                }
+            }
+        }
+    ):
+        kernelize(linear, mode=Mode.INFERENCE)
+        X = torch.randn(10, 32, device="cuda")
+        linear(X)
+        assert linear.n_calls == 0
+
+        kernelize(linear, mode=Mode.TRAINING)
+        linear(X)
+        # TRAINING has a kernel, so no fallback is needed.
+        assert linear.n_calls == 0
+
+        kernelize(linear, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
+        linear(X)
+        # TRAINING | TORCH_COMPILE cannot fall back to TRAINING kernel, so uses original.
+        assert linear.n_calls == 1
+
+        # Same as previous, since TRAINING | TORCH_COMPILE is the default.
+        kernelize(linear)
+        linear(X)
+        # TRAINING | TORCH_COMPILE cannot fall back to TRAINING kernel, so uses original.
+        assert linear.n_calls == 2
+
+    # Case 3: register a kernel just for training and one for fallback.
+    with use_kernel_mapping(
+        {
+            "Linear": {
+                "cuda": {
+                    Mode.FALLBACK: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearBackward",
+                    ),
+                    Mode.TRAINING: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearBackward",
+                    ),
+                }
+            }
+        }
+    ):
+        kernelize(linear, mode=Mode.INFERENCE)
+        X = torch.randn(10, 32, device="cuda")
+        linear(X)
+        # Falls back to TRAINING.
+        assert linear.n_calls == 2
+
+        kernelize(linear, mode=Mode.TRAINING)
+        linear(X)
+        # Uses the exact TRAINING kernel.
+        assert linear.n_calls == 2
+
+        kernelize(linear, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
+        linear(X)
+        # TRAINING | TORCH_COMPILE falls back to FALLBACK kernel.
+        assert linear.n_calls == 2
+
+        # Same as previous, since TRAINING | TORCH_COMPILE is the default.
+        kernelize(linear)
+        linear(X)
+        # TRAINING | TORCH_COMPILE falls back to FALLBACK kernel.
+        assert linear.n_calls == 2
+
+    # Case 4: register a kernel with two preferences.
+    with use_kernel_mapping(
+        {
+            "Linear": {
+                "cuda": {
+                    Mode.TRAINING
+                    | Mode.TORCH_COMPILE: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearBackward",
+                    )
+                }
+            }
+        }
+    ):
+        kernelize(linear, mode=Mode.INFERENCE)
+        X = torch.randn(10, 32, device="cuda")
+        linear(X)
+        # Falls back to the TRAINING | TORCH_COMPILE kernel.
+        assert linear.n_calls == 2
+
+        kernelize(linear, mode=Mode.TRAINING)
+        linear(X)
+        # TRAINING can fall back to TRAINING | TORCH_COMPILE kernel.
+        assert linear.n_calls == 2
+
+        kernelize(linear, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
+        linear(X)
+        # Uses TRAINING | TORCH_COMPILE kernel.
+        assert linear.n_calls == 2
+
+        kernelize(linear)
+        linear(X)
+        # Same as previous, since TRAINING | TORCH_COMPILE is the default.
+        assert linear.n_calls == 2
+
+
 @pytest.mark.linux_only
 def test_fallback_used_when_training():
-    @use_kernel_forward_from_hub("Linear")
-    class TorchLinear(nn.Linear):
-        def __init__(self, *args, **kwargs):
-            super().__init__(*args, **kwargs)
-            # Used to check that we called hub kernel.
-            self.n_calls = 0
-
-        def forward(self, input: torch.Tensor) -> torch.Tensor:
-            self.n_calls += 1
-            return super().forward(input)
-
-    linear = TorchLinear(32, 32).to("cuda")
+    linear = TorchLinearWithCounter(32, 32).to("cuda")

    # Case 1: kernel with explicit backward support should always
    #         use the kernel.
@ -310,7 +532,7 @@ def test_fallback_used_when_training():
        }
    ):
        linear.train()
-        kernelize(linear)
+        kernelize(linear, mode=Mode.INFERENCE)
        X = torch.randn(10, 32, device="cuda")
        linear(X)
        assert linear.n_calls == 0
@ -332,7 +554,7 @@ def test_fallback_used_when_training():
        }
    ):
        linear.train()
-        kernelize(linear)
+        kernelize(linear, mode=Mode.INFERENCE)
        X = torch.randn(10, 32, device="cuda")
        linear(X)
        assert linear.n_calls == 0
@ -341,57 +563,391 @@ def test_fallback_used_when_training():
        linear(X)
        assert linear.n_calls == 0

-    # Case 3: kernel out backward support should use the kernel in
-    #         eval mode and the fallback in training. Test train ->
-    #         eval -> train.
+
+def test_invalid_mode_rejected():
+    with pytest.raises(ValueError, match="mutually exclusive"):
+        _ = Mode.INFERENCE | Mode.TRAINING
+
+    with pytest.raises(ValueError, match="cannot be combined with other modes"):
+        _ = Mode.FALLBACK | Mode.TORCH_COMPILE
+
+    with pytest.raises(
+        ValueError, match="can only be used to register kernel mappings"
+    ):
+        kernelize(torch.nn.Linear(32, 32), mode=Mode.FALLBACK)
+
+    with pytest.raises(ValueError, match="mode must contain"):
+        kernelize(torch.nn.Linear(32, 32), mode=Mode.TORCH_COMPILE)
+
+
+@pytest.mark.linux_only
+def test_kernel_modes_inference():
+    """Test inference-specific fallback scenarios."""
+    linear = TorchLinearWithCounter(32, 32).to("cuda")
+
+    # Case 1: register a kernel just for inference
    with use_kernel_mapping(
        {
            "Linear": {
-                Device(type="cuda"): LayerRepository(
-                    repo_id="kernels-test/backward-marker-test",
-                    layer_name="LinearNoBackward",
-                )
+                "cuda": {
+                    Mode.INFERENCE: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearBackward",
+                    )
+                }
            }
        }
    ):
-        linear.train()
-        kernelize(linear)
+        kernelize(linear, mode=Mode.INFERENCE)
        X = torch.randn(10, 32, device="cuda")
        linear(X)
+        assert linear.n_calls == 0
+
+        kernelize(linear, mode=Mode.INFERENCE | Mode.TORCH_COMPILE)
+        linear(X)
+        # INFERENCE | TORCH_COMPILE cannot fall back to INFERENCE kernel, so uses original
        assert linear.n_calls == 1

-        # When switching the kernel to eval, forward gets replaced by
-        # the kernel.
-        linear.eval()
-        linear(X)
-        assert linear.n_calls == 1
-
-        ## Let's do it in the other direction to make sure it works as well.
-        linear.train()
+        kernelize(linear, mode=Mode.TRAINING)
        linear(X)
+        # No training kernel, so fall back to the original
        assert linear.n_calls == 2

-    # Case 4: same as case 3, but test eval -> train -> eval.
+    # Case 2: register a kernel just for inference + torch.compile
    with use_kernel_mapping(
        {
            "Linear": {
-                Device(type="cuda"): LayerRepository(
-                    repo_id="kernels-test/backward-marker-test",
-                    layer_name="LinearNoBackward",
-                )
+                "cuda": {
+                    Mode.INFERENCE
+                    | Mode.TORCH_COMPILE: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearBackward",
+                    )
+                }
            }
        }
    ):
-        linear.eval()
-        kernelize(linear)
+        kernelize(linear, mode=Mode.INFERENCE | Mode.TORCH_COMPILE)
        X = torch.randn(10, 32, device="cuda")
        linear(X)
        assert linear.n_calls == 2

-        linear.train()
+        kernelize(linear, mode=Mode.INFERENCE)
        linear(X)
+        # INFERENCE falls back to INFERENCE | TORCH_COMPILE kernel
+        assert linear.n_calls == 2
+
+        kernelize(linear, mode=Mode.TRAINING)
+        linear(X)
+        # No training kernel, so fall back to the original
        assert linear.n_calls == 3

-        linear.eval()
+    # Case 3: register both inference kernels
+    with use_kernel_mapping(
+        {
+            "Linear": {
+                "cuda": {
+                    Mode.INFERENCE: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearBackward",
+                    ),
+                    Mode.INFERENCE
+                    | Mode.TORCH_COMPILE: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearBackward",
+                    ),
+                }
+            }
+        }
+    ):
+        kernelize(linear, mode=Mode.INFERENCE)
+        X = torch.randn(10, 32, device="cuda")
        linear(X)
+        # Uses exact INFERENCE kernel
        assert linear.n_calls == 3
+
+        kernelize(linear, mode=Mode.INFERENCE | Mode.TORCH_COMPILE)
+        linear(X)
+        # Uses exact INFERENCE | TORCH_COMPILE kernel
+        assert linear.n_calls == 3
+
+        kernelize(linear, mode=Mode.TRAINING)
+        linear(X)
+        # No training kernel, so fall back to the original
+        assert linear.n_calls == 4
+
+
+@pytest.mark.linux_only
+def test_kernel_modes_mixed():
+    """Test mixed training and inference kernel scenarios."""
+    linear = TorchLinearWithCounter(32, 32).to("cuda")
+
+    # Case 1: register both base inference and training kernels
+    with use_kernel_mapping(
+        {
+            "Linear": {
+                "cuda": {
+                    Mode.INFERENCE: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearBackward",
+                    ),
+                    Mode.TRAINING: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearBackward",
+                    ),
+                }
+            }
+        }
+    ):
+        kernelize(linear, mode=Mode.INFERENCE)
+        X = torch.randn(10, 32, device="cuda")
+        linear(X)
+        assert linear.n_calls == 0
+
+        kernelize(linear, mode=Mode.TRAINING)
+        linear(X)
+        assert linear.n_calls == 0
+
+        kernelize(linear, mode=Mode.INFERENCE | Mode.TORCH_COMPILE)
+        linear(X)
+        # INFERENCE | TORCH_COMPILE cannot fall back to INFERENCE kernel, so uses original
+        assert linear.n_calls == 1
+
+        kernelize(linear, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
+        linear(X)
+        # TRAINING | TORCH_COMPILE cannot fall back to TRAINING kernel, so uses original
+        assert linear.n_calls == 2
+
+    # Case 2: register all four kernel modes
+    with use_kernel_mapping(
+        {
+            "Linear": {
+                "cuda": {
+                    Mode.INFERENCE: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearBackward",
+                    ),
+                    Mode.TRAINING: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearBackward",
+                    ),
+                    Mode.INFERENCE
+                    | Mode.TORCH_COMPILE: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearBackward",
+                    ),
+                    Mode.TRAINING
+                    | Mode.TORCH_COMPILE: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearBackward",
+                    ),
+                }
+            }
+        }
+    ):
+        kernelize(linear, mode=Mode.INFERENCE)
+        X = torch.randn(10, 32, device="cuda")
+        linear(X)
+        # Uses exact INFERENCE kernel
+        assert linear.n_calls == 2
+
+        kernelize(linear, mode=Mode.TRAINING)
+        linear(X)
+        # Uses exact TRAINING kernel
+        assert linear.n_calls == 2
+
+        kernelize(linear, mode=Mode.INFERENCE | Mode.TORCH_COMPILE)
+        linear(X)
+        # Uses exact INFERENCE | TORCH_COMPILE kernel
+        assert linear.n_calls == 2
+
+        kernelize(linear, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
+        linear(X)
+        # Uses exact TRAINING | TORCH_COMPILE kernel
+        assert linear.n_calls == 2
+
+
+@pytest.mark.linux_only
+def test_kernel_modes_cross_fallback():
+    """Test cross-mode fallback scenarios from inference to training modes."""
+    linear = TorchLinearWithCounter(32, 32).to("cuda")
+
+    # Case 1: Only training kernel registered - inference should fall back to training
+    with use_kernel_mapping(
+        {
+            "Linear": {
+                "cuda": {
+                    Mode.TRAINING: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearBackward",
+                    )
+                }
+            }
+        }
+    ):
+        kernelize(linear, mode=Mode.INFERENCE)
+        X = torch.randn(10, 32, device="cuda")
+        linear(X)
+        # INFERENCE falls back to TRAINING kernel
+        assert linear.n_calls == 0
+
+        kernelize(linear, mode=Mode.TRAINING)
+        linear(X)
+        # TRAINING uses the kernel directly
+        assert linear.n_calls == 0
+
+    # Case 2: Only training + torch.compile kernel registered
+    with use_kernel_mapping(
+        {
+            "Linear": {
+                "cuda": {
+                    Mode.TRAINING
+                    | Mode.TORCH_COMPILE: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearBackward",
+                    )
+                }
+            }
+        }
+    ):
+        kernelize(linear, mode=Mode.INFERENCE)
+        X = torch.randn(10, 32, device="cuda")
+        linear(X)
+        # INFERENCE falls back to TRAINING | TORCH_COMPILE kernel
+        assert linear.n_calls == 0
+
+        kernelize(linear, mode=Mode.INFERENCE | Mode.TORCH_COMPILE)
+        linear(X)
+        # INFERENCE | TORCH_COMPILE falls back to TRAINING | TORCH_COMPILE kernel
+        assert linear.n_calls == 0
+
+        kernelize(linear, mode=Mode.TRAINING)
+        linear(X)
+        # TRAINING falls back to TRAINING | TORCH_COMPILE kernel
+        assert linear.n_calls == 0
+
+        kernelize(linear, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
+        linear(X)
+        # TRAINING | TORCH_COMPILE uses the kernel directly
+        assert linear.n_calls == 0
+
+    # Case 3: Test that training modes don't fall back to inference modes
+    with use_kernel_mapping(
+        {
+            "Linear": {
+                "cuda": {
+                    Mode.INFERENCE: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearBackward",
+                    ),
+                    Mode.INFERENCE
+                    | Mode.TORCH_COMPILE: LayerRepository(
+                        repo_id="kernels-test/backward-marker-test",
+                        layer_name="LinearBackward",
+                    ),
+                }
+            }
+        }
+    ):
+        kernelize(linear, mode=Mode.TRAINING)
+        X = torch.randn(10, 32, device="cuda")
+        linear(X)
+        # TRAINING should NOT fall back to inference kernels, use original
+        assert linear.n_calls == 1
+
+        kernelize(linear, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
+        linear(X)
+        # TRAINING | TORCH_COMPILE should NOT fall back to inference kernels, use original
+        assert linear.n_calls == 2
+
+
+def test_layer_versions():
+    @use_kernel_forward_from_hub("Version")
+    class Version(nn.Module):
+        def forward(self) -> str:
+            return "0.0.0"
+
+    version = Version()
+
+    with use_kernel_mapping(
+        {
+            "Version": {
+                Device(type="cuda"): LayerRepository(
+                    repo_id="kernels-test/versions",
+                    layer_name="Version",
+                )
+            }
+        }
+    ):
+        version = kernelize(version, device="cuda", mode=Mode.INFERENCE)
+        assert version() == "0.2.0"
+
+    with use_kernel_mapping(
+        {
+            "Version": {
+                Device(type="cuda"): LayerRepository(
+                    repo_id="kernels-test/versions",
+                    layer_name="Version",
+                    version="<1.0.0",
+                )
+            }
+        }
+    ):
+        version = kernelize(version, device="cuda", mode=Mode.INFERENCE)
+        assert version() == "0.2.0"
+
+    with use_kernel_mapping(
+        {
+            "Version": {
+                Device(type="cuda"): LayerRepository(
+                    repo_id="kernels-test/versions",
+                    layer_name="Version",
+                    version="<0.2.0",
+                )
+            }
+        }
+    ):
+        version = kernelize(version, device="cuda", mode=Mode.INFERENCE)
+        assert version() == "0.1.1"
+
+    with use_kernel_mapping(
+        {
+            "Version": {
+                Device(type="cuda"): LayerRepository(
+                    repo_id="kernels-test/versions",
+                    layer_name="Version",
+                    version=">0.1.0,<0.2.0",
+                )
+            }
+        }
+    ):
+        version = kernelize(version, device="cuda", mode=Mode.INFERENCE)
+        assert version() == "0.1.1"
+
+    with use_kernel_mapping(
+        {
+            "Version": {
+                Device(type="cuda"): LayerRepository(
+                    repo_id="kernels-test/versions",
+                    layer_name="Version",
+                    version=">0.2.0",
+                )
+            }
+        }
+    ):
+        with pytest.raises(ValueError, match=r"No version.*satisfies requirement"):
+            kernelize(version, device="cuda", mode=Mode.INFERENCE)
+
+    with pytest.raises(ValueError, match=r"Either a revision or a version.*not both"):
+        use_kernel_mapping(
+            {
+                "Version": {
+                    Device(type="cuda"): LayerRepository(
+                        repo_id="kernels-test/versions",
+                        layer_name="Version",
+                        revision="v0.1.0",
+                        version="<1.0.0",
+                    )
+                }
+            }
+        )
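
The version constraints exercised by `test_layer_versions` rely on Python-style specifiers being resolved to repository tags. The following is a rough, standard-library-only sketch of that idea, not the `kernels` implementation: `parse_version`, `satisfies`, and `resolve_tag` are hypothetical helpers, the tag list is the `v0.1.0`/`v0.1.1`/`v0.2.0` example from the #111 commit message, and only three-component versions with the operators `<`, `<=`, `>`, `>=`, and `==` are handled.

```python
from __future__ import annotations

import re

# Comparison operators supported by this sketch (a subset of real specifiers).
_OPS = {
    "<": lambda a, b: a < b,
    "<=": lambda a, b: a <= b,
    ">": lambda a, b: a > b,
    ">=": lambda a, b: a >= b,
    "==": lambda a, b: a == b,
}


def parse_version(text: str) -> tuple[int, ...]:
    """Parse 'v0.1.1' or '0.1.1' into (0, 1, 1)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def satisfies(version: tuple[int, ...], spec: str) -> bool:
    """Return True if `version` meets every comma-separated clause in `spec`."""
    for clause in spec.split(","):
        match = re.fullmatch(r"\s*(<=|>=|==|<|>)\s*v?([\d.]+)\s*", clause)
        if match is None:
            raise ValueError(f"Unsupported specifier clause: {clause!r}")
        op, bound = match.groups()
        if not _OPS[op](version, parse_version(bound)):
            return False
    return True


def resolve_tag(tags: list[str], spec: str | None = None) -> str:
    """Pick the highest tag that satisfies `spec` (or the highest tag overall)."""
    candidates = [
        tag for tag in tags if spec is None or satisfies(parse_version(tag), spec)
    ]
    if not candidates:
        raise ValueError(f"No version satisfies requirement: {spec}")
    return max(candidates, key=parse_version)


tags = ["v0.1.0", "v0.1.1", "v0.2.0"]
print(resolve_tag(tags))                   # highest tag overall
print(resolve_tag(tags, ">0.1.0,<0.2.0"))  # highest tag inside the range
```

Picking the highest matching tag mirrors what the test asserts (`<0.2.0` yields `0.1.1`, no constraint yields `0.2.0`); a real implementation would more likely delegate parsing to a dedicated version library than compare integer tuples.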
Author	SHA1	Message	Date
Daniël de Kok	0429131630	Set version to 0.8.1	2025-07-23 14:43:31 +02:00
Daniël de Kok	967ac581b8	Set version to 0.8.1.dev0 (#115 )	2025-07-23 14:42:24 +02:00
Daniël de Kok	81088d44e8	Add support for project-wide locking of layers (#114 ) This change adds `LockedLayerRepository` as an alternative to `LayerRepository`. `LockedLayerRepository` allows for locking all kernel layers that are used at the project level. Example usage: ``` with use_kernel_mapping( { "SomeLayer": { "cuda": LockedLayerRepository( repo_id="some-org/some-layer", layer_name="SomeLayer", ) }, } ): layer = kernelize(layer, device="cuda", mode=Mode.INFERENCE) ``` This requires that the project has a `pyproject.toml` with kernel version specifications and `kernel.lock` with the locked kernels.	2025-07-23 09:37:05 +02:00
Daniël de Kok	4a04c005e3	Add version support to `LayerRepository` (#113 ) * Add version support to `LayerRepository` * Remove some docs that do not apply * Removed unused member variable	2025-07-22 17:02:39 +02:00
Wang, Yi	6d3c6daf20	triton based kernel could also run in xpu (#112 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-07-22 10:03:34 +02:00
Daniël de Kok	071900fd69	`get_kernel`: allow Python-style version specifiers (#111 ) Use Python-style version specifiers to resolve to tags. E.g., given the presence of the tags `v0.1.0`, `v0.1.1`, and `v0.2.0`, get_kernel("my/kernel", version=">=0.1.0,<0.2.0") would resolve to `v0.1.1`.	2025-07-21 17:18:35 +02:00
Daniël de Kok	2d2c6b14e0	Set version to 0.8.0.dev0 (#110 )	2025-07-15 18:45:03 +02:00
Daniël de Kok	03edc573b1	Log kernel layer selection (#109 )	2025-07-15 18:38:17 +02:00
Daniël de Kok	c841a6c90d	Improve mode handling (#108 ) * Set `kernelize` default mode to `Mode.TRAINING | Mode.TORCH_COMPILE` Also update docs and tests. * Rename `Mode.DEFAULT` to `Mode.FALLBACK` * More fine-grained fallbacks For instance, INFERENCE can fall back to INFERENCE | TORCH_COMPILE, TRAINING, TRAINING | TORCH_COMPILE, and FALLBACK. * Update documentation for mode fallback * Mention that you can rerun `kernelize` to change the mode	2025-07-15 16:10:43 +02:00
Daniël de Kok	c7a343f195	Support registering layers with a range of CUDA capabilities (#106 ) * Add interval tree implementation * Support registering layers with a range of CUDA capabilities This change adds support for registering layers for ranges of CUDA capabilities. This makes it possible to use newer, faster kernels for new GPUs, while falling back to another implementation on older GPUs. * Add docs for registering kernels with CUDA capabilities * Fix typing errors	2025-07-14 16:59:21 +02:00
Daniël de Kok	8d838f947d	Fix macOS tests by marking some CUDA-only tests (#105 )	2025-07-10 12:24:25 +02:00
Daniël de Kok	b87e6fadbe	Set version to 0.7.0.dev0 (#104 )	2025-07-07 14:56:43 +02:00
Daniël de Kok	fc935d9874	Support registering inference/training-specific layers (#103 ) * Support registering inference/training-specific layers This change makes it possible to register kernels specialized for inference, training, and/or `torch.compile`. To do so, the mapping notation is extended to support registering specialized kernels for a specific 'mode'. For instance, the following mapping, ```python kernel_layer_mapping = { "SiluAndMul": { "cuda": { Mode.DEFAULT: LayerRepository( repo_id="kernels-community/activation", layer_name="SiluAndMul", ), Mode.TRAINING | Mode.TORCH_COMPILE: LayerRepository( repo_id="kernels-community/activation-training-optimized", layer_name="SiluAndMul", ), } } } ``` uses `kernels-community/activation` by default, but will switch to using `kernels-community/activation-training-optimized` if a model is kernelized for training and `torch.compile`. To make it easier to add more modes in the future and to unify the `register_kernel_mapping` and `kernelize` signatures, the `training` and `needs_torch_compile` arguments of `kernelize` are replaced by a single `mode` argument: ```python model = MyModel(...) model = kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE) ``` * Documentation fixes Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com> * Add note on when the fallback is used * Tighten up some Mode checks * Fix ruff check * Attempt to fix mypy errors * More typing fixes * Ignore Python < 3.11 type check SNAFU --------- Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>	2025-07-04 19:57:14 +02:00
Daniël de Kok	3622e1f8dd	Add `get_local_kernel` function (#102 ) This function loads a kernel from a local repository (e.g. the output of kernel-builder), which can be handy for testing.	2025-07-01 13:58:47 +02:00
Daniël de Kok	a7f3b2e8ed	Set version to 0.6.2.dev0 (#100 )	2025-06-25 09:48:09 +02:00
Daniël de Kok	a6ab5d83ba	Make the flake work on Darwin (#98 )	2025-06-24 20:35:21 +02:00
Daniël de Kok	4f9f1abfb9	darwin: fix variant CPU for aarch64 (#97 )	2025-06-24 20:35:07 +02:00
Daniël de Kok	f94b7780a6	CI: main triton-layer-norm has docs, branch is gone (#99 )	2025-06-24 16:40:36 +02:00
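
The mode fallback rules described in the #108 commit message and exercised by the `test_kernel_modes*` tests can be summarized as a lookup table. The sketch below is an illustration only: `Mode` here is a stand-in `enum.Flag` (the real `kernels.Mode` also rejects invalid combinations), `FALLBACK_ORDER` and `select_repo` are hypothetical names, and the order in which candidates are tried is an assumption inferred from the test comments.

```python
from __future__ import annotations

from enum import Flag, auto


class Mode(Flag):
    """Stand-in for `kernels.Mode`, for illustration only."""

    FALLBACK = auto()
    TRAINING = auto()
    INFERENCE = auto()
    TORCH_COMPILE = auto()


# For each requested mode: the registered modes to try, in order (assumed).
FALLBACK_ORDER = {
    Mode.INFERENCE: [
        Mode.INFERENCE,
        Mode.INFERENCE | Mode.TORCH_COMPILE,
        Mode.TRAINING,
        Mode.TRAINING | Mode.TORCH_COMPILE,
        Mode.FALLBACK,
    ],
    Mode.INFERENCE | Mode.TORCH_COMPILE: [
        Mode.INFERENCE | Mode.TORCH_COMPILE,
        Mode.TRAINING | Mode.TORCH_COMPILE,
        Mode.FALLBACK,
    ],
    Mode.TRAINING: [
        Mode.TRAINING,
        Mode.TRAINING | Mode.TORCH_COMPILE,
        Mode.FALLBACK,
    ],
    Mode.TRAINING | Mode.TORCH_COMPILE: [
        Mode.TRAINING | Mode.TORCH_COMPILE,
        Mode.FALLBACK,
    ],
}


def select_repo(registered: dict[Mode, str], mode: Mode) -> str | None:
    """Return the first registered entry usable for `mode`.

    None means no kernel applies and the original `forward` would be used.
    """
    for candidate in FALLBACK_ORDER[mode]:
        if candidate in registered:
            return registered[candidate]
    return None


# Only a TRAINING kernel is registered (repo names are placeholders).
registered = {Mode.TRAINING: "training-kernel"}
print(select_repo(registered, Mode.INFERENCE))
print(select_repo(registered, Mode.TRAINING | Mode.TORCH_COMPILE))
```

With only a `TRAINING` kernel registered, `INFERENCE` resolves to it while `TRAINING | TORCH_COMPILE` resolves to nothing, which corresponds to the `n_calls` increments in case 2 of `test_kernel_modes`. A requested mode never falls back to a less capable kernel: compile-requiring modes skip non-compile kernels, and training modes skip inference kernels.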