mirror of
https://github.com/huggingface/kernels.git
synced 2025-11-06 23:24:31 +08:00
Compare commits
1 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| 10a9686434 |
17
.github/workflows/build_documentation.yaml
vendored
17
.github/workflows/build_documentation.yaml
vendored
@ -1,17 +0,0 @@
|
|||||||
name: Build documentation
|
|
||||||
|
|
||||||
on:
|
|
||||||
push:
|
|
||||||
branches:
|
|
||||||
- main
|
|
||||||
- doc-builder*
|
|
||||||
- v*-release
|
|
||||||
|
|
||||||
jobs:
|
|
||||||
build:
|
|
||||||
uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
|
|
||||||
with:
|
|
||||||
commit_sha: ${{ github.sha }}
|
|
||||||
package: kernels
|
|
||||||
secrets:
|
|
||||||
hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
|
|
||||||
15
.github/workflows/build_pr_documentation.yaml
vendored
15
.github/workflows/build_pr_documentation.yaml
vendored
@ -1,15 +0,0 @@
|
|||||||
name: Build PR Documentation
|
|
||||||
|
|
||||||
on: pull_request
|
|
||||||
|
|
||||||
concurrency:
|
|
||||||
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
|
|
||||||
cancel-in-progress: true
|
|
||||||
|
|
||||||
jobs:
|
|
||||||
build:
|
|
||||||
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
|
|
||||||
with:
|
|
||||||
commit_sha: ${{ github.event.pull_request.head.sha }}
|
|
||||||
pr_number: ${{ github.event.number }}
|
|
||||||
package: kernels
|
|
||||||
21
.github/workflows/lint.yml
vendored
21
.github/workflows/lint.yml
vendored
@ -8,24 +8,3 @@ jobs:
|
|||||||
- uses: actions/checkout@v4
|
- uses: actions/checkout@v4
|
||||||
- name: Run ruff
|
- name: Run ruff
|
||||||
uses: astral-sh/ruff-action@v3
|
uses: astral-sh/ruff-action@v3
|
||||||
|
|
||||||
black:
|
|
||||||
name: Run black check
|
|
||||||
runs-on: ubuntu-latest
|
|
||||||
env:
|
|
||||||
UV_PYTHON_PREFERENCE: only-managed
|
|
||||||
steps:
|
|
||||||
- uses: actions/checkout@v4
|
|
||||||
|
|
||||||
- name: Install uv and set the python version
|
|
||||||
uses: astral-sh/setup-uv@v5
|
|
||||||
with:
|
|
||||||
python-version: 3.12
|
|
||||||
|
|
||||||
- name: Install black
|
|
||||||
run: uv pip install black
|
|
||||||
|
|
||||||
- name: Check formatting
|
|
||||||
run: |
|
|
||||||
uv run black --check src
|
|
||||||
uv run black --check tests
|
|
||||||
|
|||||||
16
.github/workflows/upload_pr_documentation.yaml
vendored
16
.github/workflows/upload_pr_documentation.yaml
vendored
@ -1,16 +0,0 @@
|
|||||||
name: Upload PR Documentation
|
|
||||||
|
|
||||||
on:
|
|
||||||
workflow_run:
|
|
||||||
workflows: ["Build PR Documentation"]
|
|
||||||
types:
|
|
||||||
- completed
|
|
||||||
|
|
||||||
jobs:
|
|
||||||
build:
|
|
||||||
uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
|
|
||||||
with:
|
|
||||||
package_name: kernels
|
|
||||||
secrets:
|
|
||||||
hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
|
|
||||||
comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
|
|
||||||
15
README.md
15
README.md
@ -56,13 +56,10 @@ the Hub.
|
|||||||
|
|
||||||
## 📚 Documentation
|
## 📚 Documentation
|
||||||
|
|
||||||
- [Introduction](docs/source/index.md)
|
- [Using layers](docs/layers.md)
|
||||||
- [Installation](docs/source/installation.md)
|
- [Locking kernel versions](docs/locking.md)
|
||||||
- [Basic usage](docs/source/basic-usage.md)
|
- [Environment variables](docs/env.md)
|
||||||
- [Using layers](docs/source/layers.md)
|
- [Using kernels in a Docker container](docs/docker.md)
|
||||||
- [Locking kernel/layer versions](docs/source/locking.md)
|
- [Kernel requirements](docs/kernel-requirements.md)
|
||||||
- [Environment variables](docs/source/env.md)
|
- [Frequently Asked Questions](docs/faq.md)
|
||||||
- [Using kernels in a Docker container](docs/source/docker.md)
|
|
||||||
- [Kernel requirements](docs/source/kernel-requirements.md)
|
|
||||||
- [Frequently Asked Questions](docs/source/faq.md)
|
|
||||||
- [Writing kernels](https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md) using [kernel-builder](https://github.com/huggingface/kernel-builder/)
|
- [Writing kernels](https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md) using [kernel-builder](https://github.com/huggingface/kernel-builder/)
|
||||||
|
|||||||
8
docs/docker.md
Normal file
8
docs/docker.md
Normal file
@ -0,0 +1,8 @@
|
|||||||
|
# Using kernels in a Docker container
|
||||||
|
|
||||||
|
build and run the reference [examples/basic.py](examples/basic.py) in a Docker container with the following commands:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker build --platform linux/amd64 -t kernels-reference -f docker/Dockerfile.reference .
|
||||||
|
docker run --gpus all -it --rm -e HF_TOKEN=$HF_TOKEN kernels-reference
|
||||||
|
```
|
||||||
@ -2,9 +2,9 @@
|
|||||||
|
|
||||||
## Why is the kernelization step needed?
|
## Why is the kernelization step needed?
|
||||||
|
|
||||||
In earlier versions of `kernels`, a layer's `forward` method was replaced
|
In earlier versions of `kernels`, a layer's `forward` was replaced by
|
||||||
by `use_kernel_forward_from_hub` and `replace_kernel_forward_from_hub`.
|
`use_kernel_forward_from_hub` and `replace_kernel_forward_from_hub`. The
|
||||||
The new `forward` would dispatch to a kernel based on the device type,
|
new `forward` would dispatch to a kernel based on the device type,
|
||||||
whether a model was training, etc. However, this approach was
|
whether a model was training, etc. However, this approach was
|
||||||
fundamentally incompatible with `torch.compile` since it relied
|
fundamentally incompatible with `torch.compile` since it relied
|
||||||
on data-dependent branching.
|
on data-dependent branching.
|
||||||
@ -84,6 +84,12 @@ model = kernelize(model, mode=Mode.INFERENCE | Mode.TORCH_COMPILE)
|
|||||||
model = kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
|
model = kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
When the `mode` argument is not specified,
|
||||||
|
`Mode.TRAINING | Mode.TORCH_COMPILE` is used as the default. This mode
|
||||||
|
aligns most closely with pure PyTorch layers which also support training
|
||||||
|
and `torch.compile`. However, to select the most performant kernels, it
|
||||||
|
is often good to make the mode specific as possible.
|
||||||
|
|
||||||
### Kernel device
|
### Kernel device
|
||||||
|
|
||||||
Kernels can be registered per device type. For instance, separate `cuda` and
|
Kernels can be registered per device type. For instance, separate `cuda` and
|
||||||
@ -101,7 +107,7 @@ model = kernelize(model, device="cuda", mode=Mode.INFERENCE)
|
|||||||
|
|
||||||
If the `TRAINING` and/or `TORCH_COMPILE` modes are used, but a registered
|
If the `TRAINING` and/or `TORCH_COMPILE` modes are used, but a registered
|
||||||
kernel does not support backward passes or `torch.compile` respectively,
|
kernel does not support backward passes or `torch.compile` respectively,
|
||||||
`kernelize` will fall back to the original, non-kernelized, layer. You
|
`kernenize` will fall back to the original, non-kernelized, layer. You
|
||||||
can let `kernelize` raise an exception instead by using `use_fallback=False`:
|
can let `kernelize` raise an exception instead by using `use_fallback=False`:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
@ -129,10 +135,6 @@ kernel_layer_mapping = {
|
|||||||
"cuda": LayerRepository(
|
"cuda": LayerRepository(
|
||||||
repo_id="kernels-community/activation",
|
repo_id="kernels-community/activation",
|
||||||
layer_name="SiluAndMul",
|
layer_name="SiluAndMul",
|
||||||
),
|
|
||||||
"rocm": LayerRepository(
|
|
||||||
repo_id="kernels-community/activation",
|
|
||||||
layer_name="SiluAndMul",
|
|
||||||
)
|
)
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@ -151,7 +153,7 @@ used with the `use_kernel_mapping` context manager:
|
|||||||
```python
|
```python
|
||||||
with use_kernel_mapping(kernel_layer_mapping):
|
with use_kernel_mapping(kernel_layer_mapping):
|
||||||
# Use the layer for which the mapping is applied.
|
# Use the layer for which the mapping is applied.
|
||||||
model = kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
|
model = kernelize(model)
|
||||||
```
|
```
|
||||||
|
|
||||||
This ensures that the mapping is not active anymore outside the
|
This ensures that the mapping is not active anymore outside the
|
||||||
@ -259,6 +261,7 @@ Capabilities behave as follows:
|
|||||||
an existing kernel, the new kernel will replace the old kernel.
|
an existing kernel, the new kernel will replace the old kernel.
|
||||||
- When there are multiple kernels that support a capability, the kernel
|
- When there are multiple kernels that support a capability, the kernel
|
||||||
with the smaller capability interval will be used. E.g. given:
|
with the smaller capability interval will be used. E.g. given:
|
||||||
|
|
||||||
- `KernelA` with `min_capability=80` and `max_capability=89`;
|
- `KernelA` with `min_capability=80` and `max_capability=89`;
|
||||||
- `KernelB` with `min_capability=75` and `max_capability=89`;
|
- `KernelB` with `min_capability=75` and `max_capability=89`;
|
||||||
- `kernelize` runs on a system with capability 8.6.
|
- `kernelize` runs on a system with capability 8.6.
|
||||||
@ -267,30 +270,3 @@ Capabilities behave as follows:
|
|||||||
than 75..89. The motivation is that kernels with smaller ranges
|
than 75..89. The motivation is that kernels with smaller ranges
|
||||||
tend to be more optimized for a specific set of GPUs. **This behavior
|
tend to be more optimized for a specific set of GPUs. **This behavior
|
||||||
might still change in the future.**
|
might still change in the future.**
|
||||||
|
|
||||||
### Registering kernels for specific ROCm capabilities
|
|
||||||
|
|
||||||
Registering kernels for the ROCm architecture follows the exact same
|
|
||||||
pattern as CUDA kernels, using `min_capability` and `max_capability` to restrict
|
|
||||||
a kernel to a range of ROCm capabilities.
|
|
||||||
|
|
||||||
### Loading from a local repository for testing
|
|
||||||
|
|
||||||
The `LocalLayerRepository` class is provided to load a repository from
|
|
||||||
a local directory. For example:
|
|
||||||
|
|
||||||
```python
|
|
||||||
with use_kernel_mapping(
|
|
||||||
{
|
|
||||||
"SiluAndMul": {
|
|
||||||
"cuda": LocalLayerRepository(
|
|
||||||
repo_path="/home/daniel/kernels/activation",
|
|
||||||
package_name="activation",
|
|
||||||
layer_name="SiluAndMul",
|
|
||||||
)
|
|
||||||
}
|
|
||||||
},
|
|
||||||
inherit_mapping=False,
|
|
||||||
):
|
|
||||||
kernelize(linear, mode=Mode.INFERENCE)
|
|
||||||
```
|
|
||||||
@ -1,4 +1,4 @@
|
|||||||
# Locking kernel/layer versions
|
# Locking kernel versions
|
||||||
|
|
||||||
Projects that use `setuptools` can lock the kernel versions that should be
|
Projects that use `setuptools` can lock the kernel versions that should be
|
||||||
used. First specify the accepted versions in `pyproject.toml` and make
|
used. First specify the accepted versions in `pyproject.toml` and make
|
||||||
@ -26,24 +26,6 @@ activation = get_locked_kernel("kernels-community/activation")
|
|||||||
**Note:** the lock file is included in the package metadata, so it will only be visible
|
**Note:** the lock file is included in the package metadata, so it will only be visible
|
||||||
to `kernels` after doing an (editable or regular) installation of your project.
|
to `kernels` after doing an (editable or regular) installation of your project.
|
||||||
|
|
||||||
## Locked kernel layers
|
|
||||||
|
|
||||||
Locking is also supported for kernel layers. To use locked layers, register them
|
|
||||||
with the `LockedLayerRepository` class:
|
|
||||||
|
|
||||||
```python
|
|
||||||
kernel_layer_mapping = {
|
|
||||||
"SiluAndMul": {
|
|
||||||
"cuda": LockedLayerRepository(
|
|
||||||
repo_id="kernels-community/activation",
|
|
||||||
layer_name="SiluAndMul",
|
|
||||||
)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
register_kernel_mapping(kernel_layer_mapping)
|
|
||||||
```
|
|
||||||
|
|
||||||
## Pre-downloading locked kernels
|
## Pre-downloading locked kernels
|
||||||
|
|
||||||
Locked kernels can be pre-downloaded by running `kernels download .` in your
|
Locked kernels can be pre-downloaded by running `kernels download .` in your
|
||||||
@ -1,28 +0,0 @@
|
|||||||
- sections:
|
|
||||||
- local: index
|
|
||||||
title: Introduction
|
|
||||||
- local: installation
|
|
||||||
title: Installation
|
|
||||||
title: Getting started
|
|
||||||
- sections:
|
|
||||||
- local: basic-usage
|
|
||||||
title: Basic Usage
|
|
||||||
- local: layers
|
|
||||||
title: Using Layers
|
|
||||||
- local: locking
|
|
||||||
title: Locking Kernel Versions
|
|
||||||
- local: env
|
|
||||||
title: Environment Variables
|
|
||||||
- local: faq
|
|
||||||
title: FAQ
|
|
||||||
title: Usage Guide
|
|
||||||
- sections:
|
|
||||||
- local: api/kernels
|
|
||||||
title: Kernels
|
|
||||||
- local: api/layers
|
|
||||||
title: Layers
|
|
||||||
title: API Reference
|
|
||||||
- sections:
|
|
||||||
- local: kernel-requirements
|
|
||||||
title: Kernel Requirements
|
|
||||||
title: Developer Guide
|
|
||||||
@ -1,21 +0,0 @@
|
|||||||
# Kernels API Reference
|
|
||||||
|
|
||||||
## Main Functions
|
|
||||||
|
|
||||||
### get_kernel
|
|
||||||
|
|
||||||
[[autodoc]] kernels.get_kernel
|
|
||||||
|
|
||||||
### has_kernel
|
|
||||||
|
|
||||||
[[autodoc]] kernels.has_kernel
|
|
||||||
|
|
||||||
## Loading locked kernels
|
|
||||||
|
|
||||||
### load_kernel
|
|
||||||
|
|
||||||
[[autodoc]] kernels.load_kernel
|
|
||||||
|
|
||||||
### get_locked_kernel
|
|
||||||
|
|
||||||
[[autodoc]] kernels.get_locked_kernel
|
|
||||||
@ -1,41 +0,0 @@
|
|||||||
# Layers API Reference
|
|
||||||
|
|
||||||
## Making layers kernel-aware
|
|
||||||
|
|
||||||
### use_kernel_forward_from_hub
|
|
||||||
|
|
||||||
[[autodoc]] kernels.use_kernel_forward_from_hub
|
|
||||||
|
|
||||||
### replace_kernel_forward_from_hub
|
|
||||||
|
|
||||||
[[autodoc]] kernels.replace_kernel_forward_from_hub
|
|
||||||
|
|
||||||
## Registering kernel mappings
|
|
||||||
|
|
||||||
### use_kernel_mapping
|
|
||||||
|
|
||||||
[[autodoc]] kernels.use_kernel_mapping
|
|
||||||
|
|
||||||
### register_kernel_mapping
|
|
||||||
|
|
||||||
[[autodoc]] kernels.register_kernel_mapping
|
|
||||||
|
|
||||||
## Kernelizing a model
|
|
||||||
|
|
||||||
### kernelize
|
|
||||||
|
|
||||||
[[autodoc]] kernels.kernelize
|
|
||||||
|
|
||||||
## Classes
|
|
||||||
|
|
||||||
### Device
|
|
||||||
|
|
||||||
[[autodoc]] kernels.Device
|
|
||||||
|
|
||||||
### Mode
|
|
||||||
|
|
||||||
[[autodoc]] kernels.Mode
|
|
||||||
|
|
||||||
### LayerRepository
|
|
||||||
|
|
||||||
[[autodoc]] kernels.LayerRepository
|
|
||||||
@ -1,34 +0,0 @@
|
|||||||
# Basic Usage
|
|
||||||
|
|
||||||
## Loading Kernels
|
|
||||||
|
|
||||||
Here is how you would use the [activation](https://huggingface.co/kernels-community/activation) kernels from the Hugging Face Hub:
|
|
||||||
|
|
||||||
```python
|
|
||||||
import torch
|
|
||||||
from kernels import get_kernel
|
|
||||||
|
|
||||||
# Download optimized kernels from the Hugging Face hub
|
|
||||||
activation = get_kernel("kernels-community/activation")
|
|
||||||
|
|
||||||
# Create a random tensor
|
|
||||||
x = torch.randn((10, 10), dtype=torch.float16, device="cuda")
|
|
||||||
|
|
||||||
# Run the kernel
|
|
||||||
y = torch.empty_like(x)
|
|
||||||
activation.gelu_fast(y, x)
|
|
||||||
|
|
||||||
print(y)
|
|
||||||
```
|
|
||||||
|
|
||||||
## Checking Kernel Availability
|
|
||||||
|
|
||||||
You can check if a specific kernel is available for your environment:
|
|
||||||
|
|
||||||
```python
|
|
||||||
from kernels import has_kernel
|
|
||||||
|
|
||||||
# Check if kernel is available for current environment
|
|
||||||
is_available = has_kernel("kernels-community/activation")
|
|
||||||
print(f"Kernel available: {is_available}")
|
|
||||||
```
|
|
||||||
@ -1,20 +0,0 @@
|
|||||||
# Kernels
|
|
||||||
|
|
||||||
<div align="center">
|
|
||||||
<img src="https://github.com/user-attachments/assets/64a652f3-0cd3-4829-b3c1-df13f7933569" width="450" height="450" alt="kernel-builder logo">
|
|
||||||
</div>
|
|
||||||
|
|
||||||
The Kernel Hub allows Python libraries and applications to load compute
|
|
||||||
kernels directly from the [Hub](https://hf.co/). To support this kind
|
|
||||||
of dynamic loading, Hub kernels differ from traditional Python kernel
|
|
||||||
packages in that they are made to be:
|
|
||||||
|
|
||||||
- **Portable**: a kernel can be loaded from paths outside `PYTHONPATH`.
|
|
||||||
- **Unique**: multiple versions of the same kernel can be loaded in the
|
|
||||||
same Python process.
|
|
||||||
- **Compatible**: kernels must support all recent versions of Python and
|
|
||||||
the different PyTorch build configurations (various CUDA versions
|
|
||||||
and C++ ABIs). Furthermore, older C library versions must be supported.
|
|
||||||
|
|
||||||
You can [search for kernels](https://huggingface.co/models?other=kernel) on
|
|
||||||
the Hub.
|
|
||||||
@ -1,16 +0,0 @@
|
|||||||
# Installation
|
|
||||||
|
|
||||||
Install the `kernels` package with `pip` (requires `torch>=2.5` and CUDA):
|
|
||||||
|
|
||||||
```bash
|
|
||||||
pip install kernels
|
|
||||||
```
|
|
||||||
|
|
||||||
# Using kernels in a Docker container
|
|
||||||
|
|
||||||
Build and run the reference `examples/basic.py` in a Docker container with the following commands:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
docker build --platform linux/amd64 -t kernels-reference -f docker/Dockerfile.reference .
|
|
||||||
docker run --gpus all -it --rm -e HF_TOKEN=$HF_TOKEN kernels-reference
|
|
||||||
```
|
|
||||||
18
flake.lock
generated
18
flake.lock
generated
@ -58,11 +58,11 @@
|
|||||||
"nixpkgs": "nixpkgs"
|
"nixpkgs": "nixpkgs"
|
||||||
},
|
},
|
||||||
"locked": {
|
"locked": {
|
||||||
"lastModified": 1754038838,
|
"lastModified": 1750775451,
|
||||||
"narHash": "sha256-oHigCT4z0ayyLyEuxdZooSXRAZP8lfOkZHzY1lx1U50=",
|
"narHash": "sha256-HiGqtwzIgUH7Xkh+wgpvHRZGooqrW0z663E6nauczA4=",
|
||||||
"owner": "huggingface",
|
"owner": "huggingface",
|
||||||
"repo": "hf-nix",
|
"repo": "hf-nix",
|
||||||
"rev": "336f781fa284e193baa3d4c3ce3f95fb34e9ffad",
|
"rev": "5943c3169e861618a6634bc8dbdb498e413ab9b7",
|
||||||
"type": "github"
|
"type": "github"
|
||||||
},
|
},
|
||||||
"original": {
|
"original": {
|
||||||
@ -73,17 +73,17 @@
|
|||||||
},
|
},
|
||||||
"nixpkgs": {
|
"nixpkgs": {
|
||||||
"locked": {
|
"locked": {
|
||||||
"lastModified": 1752785354,
|
"lastModified": 1747820358,
|
||||||
"narHash": "sha256-Y33ryUz7MPqKrZwlbQcsYCUz2jAJCacRf8jbs0tYUlA=",
|
"narHash": "sha256-fTqsZsUX6M3yeEvgyQvXcbGmT2CaRVyVwsi8eK29Oj4=",
|
||||||
"owner": "nixos",
|
"owner": "danieldk",
|
||||||
"repo": "nixpkgs",
|
"repo": "nixpkgs",
|
||||||
"rev": "d38025438a6ee456758dc03188ca6873a415463b",
|
"rev": "d3c1681180717528068082103bf323147de6ab0b",
|
||||||
"type": "github"
|
"type": "github"
|
||||||
},
|
},
|
||||||
"original": {
|
"original": {
|
||||||
"owner": "nixos",
|
"owner": "danieldk",
|
||||||
|
"ref": "cudatoolkit-12.9-kernel-builder",
|
||||||
"repo": "nixpkgs",
|
"repo": "nixpkgs",
|
||||||
"rev": "d38025438a6ee456758dc03188ca6873a415463b",
|
|
||||||
"type": "github"
|
"type": "github"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
|||||||
@ -26,10 +26,6 @@
|
|||||||
formatter = pkgs.nixfmt-tree;
|
formatter = pkgs.nixfmt-tree;
|
||||||
devShells = with pkgs; rec {
|
devShells = with pkgs; rec {
|
||||||
default = mkShell {
|
default = mkShell {
|
||||||
nativeBuildInputs = [
|
|
||||||
# For hf-doc-builder.
|
|
||||||
nodejs
|
|
||||||
];
|
|
||||||
buildInputs =
|
buildInputs =
|
||||||
[
|
[
|
||||||
black
|
black
|
||||||
@ -40,7 +36,6 @@
|
|||||||
++ (with python3.pkgs; [
|
++ (with python3.pkgs; [
|
||||||
docutils
|
docutils
|
||||||
huggingface-hub
|
huggingface-hub
|
||||||
mktestdocs
|
|
||||||
pytest
|
pytest
|
||||||
pytest-benchmark
|
pytest-benchmark
|
||||||
pyyaml
|
pyyaml
|
||||||
|
|||||||
@ -1,6 +1,6 @@
|
|||||||
[project]
|
[project]
|
||||||
name = "kernels"
|
name = "kernels"
|
||||||
version = "0.10.1"
|
version = "0.8.0"
|
||||||
description = "Download compute kernels"
|
description = "Download compute kernels"
|
||||||
authors = [
|
authors = [
|
||||||
{ name = "OlivierDehaene", email = "olivier@huggingface.co" },
|
{ name = "OlivierDehaene", email = "olivier@huggingface.co" },
|
||||||
@ -24,20 +24,16 @@ build-backend = "setuptools.build_meta"
|
|||||||
|
|
||||||
[dependency-groups]
|
[dependency-groups]
|
||||||
dev = [
|
dev = [
|
||||||
"mktestdocs>=0.2.5",
|
"mypy >= 1.15.0",
|
||||||
"mypy>=1.15.0",
|
"pytest >=8",
|
||||||
"pytest>=8",
|
|
||||||
# Whatever version is compatible with pytest.
|
# Whatever version is compatible with pytest.
|
||||||
"pytest-benchmark",
|
"pytest-benchmark",
|
||||||
"torch>=2.5",
|
"torch >=2.5",
|
||||||
"types-pyyaml"
|
"types-pyyaml"
|
||||||
]
|
]
|
||||||
|
|
||||||
[project.optional-dependencies]
|
[project.optional-dependencies]
|
||||||
torch = ["torch"]
|
torch = ["torch"]
|
||||||
docs = [
|
|
||||||
"hf-doc-builder",
|
|
||||||
]
|
|
||||||
|
|
||||||
[project.scripts]
|
[project.scripts]
|
||||||
kernels = "kernels.cli:main"
|
kernels = "kernels.cli:main"
|
||||||
|
|||||||
@ -1,5 +1,4 @@
|
|||||||
[pytest]
|
[pytest]
|
||||||
markers =
|
markers =
|
||||||
cuda_only: marks tests that should only hosts with CUDA GPUs
|
|
||||||
rocm_only: marks tests that should only run on hosts with ROCm GPUs
|
|
||||||
darwin_only: marks tests that should only run on macOS
|
darwin_only: marks tests that should only run on macOS
|
||||||
|
linux_only: marks tests that should only run on Linux
|
||||||
@ -1,13 +1,7 @@
|
|||||||
import importlib.metadata
|
|
||||||
|
|
||||||
__version__ = importlib.metadata.version("kernels")
|
|
||||||
|
|
||||||
from kernels.layer import (
|
from kernels.layer import (
|
||||||
CUDAProperties,
|
CUDAProperties,
|
||||||
Device,
|
Device,
|
||||||
LayerRepository,
|
LayerRepository,
|
||||||
LocalLayerRepository,
|
|
||||||
LockedLayerRepository,
|
|
||||||
Mode,
|
Mode,
|
||||||
kernelize,
|
kernelize,
|
||||||
register_kernel_mapping,
|
register_kernel_mapping,
|
||||||
@ -25,12 +19,9 @@ from kernels.utils import (
|
|||||||
)
|
)
|
||||||
|
|
||||||
__all__ = [
|
__all__ = [
|
||||||
"__version__",
|
|
||||||
"CUDAProperties",
|
"CUDAProperties",
|
||||||
"Device",
|
"Device",
|
||||||
"LayerRepository",
|
"LayerRepository",
|
||||||
"LocalLayerRepository",
|
|
||||||
"LockedLayerRepository",
|
|
||||||
"Mode",
|
"Mode",
|
||||||
"get_kernel",
|
"get_kernel",
|
||||||
"get_local_kernel",
|
"get_local_kernel",
|
||||||
|
|||||||
@ -1,52 +0,0 @@
|
|||||||
from typing import Dict, Optional
|
|
||||||
|
|
||||||
from huggingface_hub import HfApi
|
|
||||||
from huggingface_hub.hf_api import GitRefInfo
|
|
||||||
from packaging.specifiers import SpecifierSet
|
|
||||||
from packaging.version import InvalidVersion, Version
|
|
||||||
|
|
||||||
|
|
||||||
def _get_available_versions(repo_id: str) -> Dict[Version, GitRefInfo]:
|
|
||||||
"""Get kernel versions that are available in the repository."""
|
|
||||||
versions = {}
|
|
||||||
for tag in HfApi().list_repo_refs(repo_id).tags:
|
|
||||||
if not tag.name.startswith("v"):
|
|
||||||
continue
|
|
||||||
try:
|
|
||||||
versions[Version(tag.name[1:])] = tag
|
|
||||||
except InvalidVersion:
|
|
||||||
continue
|
|
||||||
|
|
||||||
return versions
|
|
||||||
|
|
||||||
|
|
||||||
def resolve_version_spec_as_ref(repo_id: str, version_spec: str) -> GitRefInfo:
|
|
||||||
"""
|
|
||||||
Get the locks for a kernel with the given version spec.
|
|
||||||
|
|
||||||
The version specifier can be any valid Python version specifier:
|
|
||||||
https://packaging.python.org/en/latest/specifications/version-specifiers/#version-specifiers
|
|
||||||
"""
|
|
||||||
versions = _get_available_versions(repo_id)
|
|
||||||
requirement = SpecifierSet(version_spec)
|
|
||||||
accepted_versions = sorted(requirement.filter(versions.keys()))
|
|
||||||
|
|
||||||
if len(accepted_versions) == 0:
|
|
||||||
raise ValueError(
|
|
||||||
f"No version of `{repo_id}` satisfies requirement: {version_spec}"
|
|
||||||
)
|
|
||||||
|
|
||||||
return versions[accepted_versions[-1]]
|
|
||||||
|
|
||||||
|
|
||||||
def select_revision_or_version(
|
|
||||||
repo_id: str, revision: Optional[str], version: Optional[str]
|
|
||||||
) -> str:
|
|
||||||
if revision is not None and version is not None:
|
|
||||||
raise ValueError("Either a revision or a version must be specified, not both.")
|
|
||||||
elif revision is None and version is None:
|
|
||||||
revision = "main"
|
|
||||||
elif version is not None:
|
|
||||||
revision = resolve_version_spec_as_ref(repo_id, version).target_commit
|
|
||||||
assert revision is not None
|
|
||||||
return revision
|
|
||||||
@ -1,6 +1,5 @@
|
|||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
import functools
|
|
||||||
import inspect
|
import inspect
|
||||||
import logging
|
import logging
|
||||||
import os
|
import os
|
||||||
@ -9,34 +8,27 @@ import warnings
|
|||||||
from abc import ABC, abstractmethod
|
from abc import ABC, abstractmethod
|
||||||
from contextvars import ContextVar
|
from contextvars import ContextVar
|
||||||
from copy import deepcopy
|
from copy import deepcopy
|
||||||
from dataclasses import dataclass
|
from dataclasses import dataclass, field
|
||||||
from enum import Flag, auto
|
from enum import Flag, auto
|
||||||
from functools import lru_cache
|
from functools import lru_cache
|
||||||
from pathlib import Path
|
from types import MethodType
|
||||||
from types import MethodType, ModuleType
|
|
||||||
from typing import (
|
from typing import (
|
||||||
TYPE_CHECKING,
|
TYPE_CHECKING,
|
||||||
Dict,
|
Dict,
|
||||||
Optional,
|
Optional,
|
||||||
Protocol,
|
|
||||||
Tuple,
|
Tuple,
|
||||||
Type,
|
Type,
|
||||||
Union,
|
Union,
|
||||||
)
|
)
|
||||||
|
|
||||||
from ._interval_tree import IntervalTree
|
from ._interval_tree import IntervalTree
|
||||||
from ._versions import select_revision_or_version
|
from .utils import get_kernel
|
||||||
from .utils import (
|
|
||||||
_get_caller_locked_kernel,
|
|
||||||
_get_locked_kernel,
|
|
||||||
get_kernel,
|
|
||||||
get_local_kernel,
|
|
||||||
)
|
|
||||||
|
|
||||||
if TYPE_CHECKING:
|
if TYPE_CHECKING:
|
||||||
import torch
|
import torch
|
||||||
from torch import nn
|
from torch import nn
|
||||||
|
|
||||||
|
|
||||||
_DISABLE_KERNEL_MAPPING: bool = bool(int(os.environ.get("DISABLE_KERNEL_MAPPING", "0")))
|
_DISABLE_KERNEL_MAPPING: bool = bool(int(os.environ.get("DISABLE_KERNEL_MAPPING", "0")))
|
||||||
|
|
||||||
|
|
||||||
@ -44,19 +36,17 @@ class Mode(Flag):
|
|||||||
"""
|
"""
|
||||||
Kernelize mode
|
Kernelize mode
|
||||||
|
|
||||||
The `Mode` flag is used by [`kernelize`] to select kernels for the given mode. Mappings can be registered for
|
The `Mode` flag is used by `kernelize` to select kernels for the given
|
||||||
specific modes.
|
mode. Mappings can be registered for specific modes.
|
||||||
|
|
||||||
Attributes:
|
* `INFERENCE`: The kernel is used for inference.
|
||||||
INFERENCE: The kernel is used for inference.
|
* `TRAINING`: The kernel is used for training.
|
||||||
TRAINING: The kernel is used for training.
|
* `TORCH_COMPILE`: The kernel is used with `torch.compile`.
|
||||||
TORCH_COMPILE: The kernel is used with `torch.compile`.
|
* `FALLBACK`: In a kernel mapping, this kernel is used when no other mode
|
||||||
FALLBACK: In a kernel mapping, this kernel is used when no other mode matches.
|
matches.
|
||||||
|
|
||||||
Note:
|
|
||||||
Different modes can be combined. For instance, `INFERENCE | TORCH_COMPILE` should be used for layers that
|
|
||||||
are used for inference *with* `torch.compile`.
|
|
||||||
|
|
||||||
|
Different modes can be combined. For instance, `INFERENCE | TORCH_COMPILE`
|
||||||
|
should be used for layers that are used for inference *with* `torch.compile`.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
_NONE = 0
|
_NONE = 0
|
||||||
@ -79,36 +69,6 @@ class Mode(Flag):
|
|||||||
|
|
||||||
@dataclass(frozen=True)
|
@dataclass(frozen=True)
|
||||||
class Device:
|
class Device:
|
||||||
"""
|
|
||||||
Represents a compute device with optional properties.
|
|
||||||
|
|
||||||
This class encapsulates device information including device type and optional device-specific properties
|
|
||||||
like CUDA capabilities.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
type (`str`):
|
|
||||||
The device type (e.g., "cuda", "mps", "rocm").
|
|
||||||
properties ([`CUDAProperties`], *optional*):
|
|
||||||
Device-specific properties. Currently only [`CUDAProperties`] is supported for CUDA devices.
|
|
||||||
|
|
||||||
Example:
|
|
||||||
```python
|
|
||||||
from kernels import Device, CUDAProperties
|
|
||||||
|
|
||||||
# Basic CUDA device
|
|
||||||
cuda_device = Device(type="cuda")
|
|
||||||
|
|
||||||
# CUDA device with specific capability requirements
|
|
||||||
cuda_device_with_props = Device(
|
|
||||||
type="cuda",
|
|
||||||
properties=CUDAProperties(min_capability=75, max_capability=90)
|
|
||||||
)
|
|
||||||
|
|
||||||
# MPS device for Apple Silicon
|
|
||||||
mps_device = Device(type="mps")
|
|
||||||
```
|
|
||||||
"""
|
|
||||||
|
|
||||||
type: str
|
type: str
|
||||||
properties: Optional[CUDAProperties] = None
|
properties: Optional[CUDAProperties] = None
|
||||||
|
|
||||||
@ -121,8 +81,6 @@ class Device:
|
|||||||
"""Create an appropriate repository set for this device type."""
|
"""Create an appropriate repository set for this device type."""
|
||||||
if self.type == "cuda":
|
if self.type == "cuda":
|
||||||
return _CUDARepos()
|
return _CUDARepos()
|
||||||
elif self.type == "rocm":
|
|
||||||
return _ROCMRepos()
|
|
||||||
elif self.type == "mps":
|
elif self.type == "mps":
|
||||||
return _MPSRepos()
|
return _MPSRepos()
|
||||||
else:
|
else:
|
||||||
@ -139,34 +97,6 @@ class Device:
|
|||||||
|
|
||||||
@dataclass(frozen=True)
|
@dataclass(frozen=True)
|
||||||
class CUDAProperties:
|
class CUDAProperties:
|
||||||
"""
|
|
||||||
CUDA-specific device properties for capability-based kernel selection.
|
|
||||||
|
|
||||||
This class defines CUDA compute capability constraints for kernel selection, allowing kernels to specify
|
|
||||||
minimum and maximum CUDA compute capabilities they support.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
min_capability (`int`):
|
|
||||||
Minimum CUDA compute capability required (e.g., 75 for compute capability 7.5).
|
|
||||||
max_capability (`int`):
|
|
||||||
Maximum CUDA compute capability supported (e.g., 90 for compute capability 9.0).
|
|
||||||
|
|
||||||
Example:
|
|
||||||
```python
|
|
||||||
from kernels import CUDAProperties, Device
|
|
||||||
|
|
||||||
# Define CUDA properties for modern GPUs (compute capability 7.5 to 9.0)
|
|
||||||
cuda_props = CUDAProperties(min_capability=75, max_capability=90)
|
|
||||||
|
|
||||||
# Create a device with these properties
|
|
||||||
device = Device(type="cuda", properties=cuda_props)
|
|
||||||
```
|
|
||||||
|
|
||||||
Note:
|
|
||||||
CUDA compute capabilities are represented as integers where the major and minor versions are concatenated.
|
|
||||||
For example, compute capability 7.5 is represented as 75, and 8.6 is represented as 86.
|
|
||||||
"""
|
|
||||||
|
|
||||||
min_capability: int
|
min_capability: int
|
||||||
max_capability: int
|
max_capability: int
|
||||||
|
|
||||||
@ -182,250 +112,33 @@ class CUDAProperties:
|
|||||||
return hash((self.min_capability, self.max_capability))
|
return hash((self.min_capability, self.max_capability))
|
||||||
|
|
||||||
|
|
||||||
@dataclass(frozen=True)
|
@dataclass
|
||||||
class ROCMProperties:
|
|
||||||
"""
|
|
||||||
ROCM-specific device properties for capability-based kernel selection.
|
|
||||||
|
|
||||||
This class defines ROCM compute capability constraints for kernel selection, allowing kernels to specify
|
|
||||||
minimum and maximum ROCM compute capabilities they support.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
min_capability (`int`):
|
|
||||||
Minimum ROCM compute capability required (e.g., 75 for compute capability 7.5).
|
|
||||||
max_capability (`int`):
|
|
||||||
Maximum ROCM compute capability supported (e.g., 90 for compute capability 9.0).
|
|
||||||
|
|
||||||
Example:
|
|
||||||
```python
|
|
||||||
from kernels import ROCMProperties, Device
|
|
||||||
|
|
||||||
# Define ROCM properties for modern GPUs (compute capability 7.5 to 9.0)
|
|
||||||
rocm_props = ROCMProperties(min_capability=75, max_capability=90)
|
|
||||||
|
|
||||||
# Create a device with these properties
|
|
||||||
device = Device(type="rocm", properties=rocm_props)
|
|
||||||
```
|
|
||||||
|
|
||||||
Note:
|
|
||||||
ROCM compute capabilities are represented as integers where the major and minor versions are concatenated.
|
|
||||||
For example, compute capability 7.5 is represented as 75, and 8.6 is represented as 86.
|
|
||||||
"""
|
|
||||||
|
|
||||||
min_capability: int
|
|
||||||
max_capability: int
|
|
||||||
|
|
||||||
def __eq__(self, other):
|
|
||||||
if not isinstance(other, ROCMProperties):
|
|
||||||
return NotImplemented
|
|
||||||
return (
|
|
||||||
self.min_capability == other.min_capability
|
|
||||||
and self.max_capability == other.max_capability
|
|
||||||
)
|
|
||||||
|
|
||||||
def __hash__(self):
|
|
||||||
return hash((self.min_capability, self.max_capability))
|
|
||||||
|
|
||||||
|
|
||||||
class LayerRepositoryProtocol(Protocol):
|
|
||||||
@property
|
|
||||||
def layer_name(self) -> str: ...
|
|
||||||
|
|
||||||
def load(self) -> ModuleType: ...
|
|
||||||
|
|
||||||
|
|
||||||
class LayerRepository:
|
class LayerRepository:
|
||||||
"""
|
"""
|
||||||
Repository and name of a layer for kernel mapping.
|
Repository and name of a layer.
|
||||||
|
|
||||||
Args:
|
|
||||||
repo_id (`str`):
|
|
||||||
The Hub repository containing the layer.
|
|
||||||
layer_name (`str`):
|
|
||||||
The name of the layer within the kernel repository.
|
|
||||||
revision (`str`, *optional*, defaults to `"main"`):
|
|
||||||
The specific revision (branch, tag, or commit) to download. Cannot be used together with `version`.
|
|
||||||
version (`str`, *optional*):
|
|
||||||
The kernel version to download. This can be a Python version specifier, such as `">=1.0.0,<2.0.0"`.
|
|
||||||
Cannot be used together with `revision`.
|
|
||||||
|
|
||||||
Example:
|
|
||||||
```python
|
|
||||||
from kernels import LayerRepository
|
|
||||||
|
|
||||||
# Reference a specific layer by revision
|
|
||||||
layer_repo = LayerRepository(
|
|
||||||
repo_id="kernels-community/activation",
|
|
||||||
layer_name="SiluAndMul",
|
|
||||||
)
|
|
||||||
|
|
||||||
# Reference a layer by version constraint
|
|
||||||
layer_repo_versioned = LayerRepository(
|
|
||||||
repo_id="kernels-community/activation",
|
|
||||||
layer_name="SiluAndMul",
|
|
||||||
version=">=0.0.3,<0.1"
|
|
||||||
)
|
|
||||||
```
|
|
||||||
"""
|
"""
|
||||||
|
|
||||||
def __init__(
|
layer_name: str = field(
|
||||||
self,
|
metadata={"help": "The name of the layer in the kernel repository."}
|
||||||
repo_id: str,
|
)
|
||||||
*,
|
repo_id: str = field(metadata={"help": "The kernel hub repository with the layer."})
|
||||||
layer_name: str,
|
revision: str = field(
|
||||||
revision: Optional[str] = None,
|
default="main", metadata={"help": "The revision of the layer."}
|
||||||
version: Optional[str] = None,
|
)
|
||||||
):
|
|
||||||
if revision is not None and version is not None:
|
|
||||||
raise ValueError(
|
|
||||||
"Either a revision or a version must be specified, not both."
|
|
||||||
)
|
|
||||||
|
|
||||||
self._repo_id = repo_id
|
|
||||||
self.layer_name = layer_name
|
|
||||||
|
|
||||||
# We are going to resolve these lazily, since we do not want
|
|
||||||
# to do a network request for every registered LayerRepository.
|
|
||||||
self._revision = revision
|
|
||||||
self._version = version
|
|
||||||
|
|
||||||
@functools.lru_cache()
|
|
||||||
def _resolve_revision(self) -> str:
|
|
||||||
return select_revision_or_version(
|
|
||||||
repo_id=self._repo_id, revision=self._revision, version=self._version
|
|
||||||
)
|
|
||||||
|
|
||||||
def load(self) -> ModuleType:
|
|
||||||
return get_kernel(self._repo_id, revision=self._resolve_revision())
|
|
||||||
|
|
||||||
def __eq__(self, other):
|
def __eq__(self, other):
|
||||||
return (
|
return (
|
||||||
isinstance(other, LayerRepository)
|
isinstance(other, LayerRepository)
|
||||||
and self.layer_name == other.layer_name
|
and self.layer_name == other.layer_name
|
||||||
and self._repo_id == other._repo_id
|
and self.repo_id == other.repo_id
|
||||||
and self._revision == other._revision
|
and self.revision == other.revision
|
||||||
and self._version == other._version
|
|
||||||
)
|
)
|
||||||
|
|
||||||
def __hash__(self):
|
def __hash__(self):
|
||||||
return hash((self.layer_name, self._repo_id, self._revision, self._version))
|
return hash((self.layer_name, self.repo_id, self.revision))
|
||||||
|
|
||||||
def __str__(self) -> str:
|
|
||||||
return f"`{self._repo_id}` (revision: {self._resolve_revision()}) for layer `{self.layer_name}`"
|
|
||||||
|
|
||||||
|
|
||||||
class LocalLayerRepository:
|
_CACHED_LAYER: Dict[LayerRepository, Type["nn.Module"]] = {}
|
||||||
"""
|
|
||||||
Repository from a local directory for kernel mapping.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
repo_path (`Path`):
|
|
||||||
The local repository containing the layer.
|
|
||||||
package_name (`str`):
|
|
||||||
Package name of the kernel.
|
|
||||||
layer_name (`str`):
|
|
||||||
The name of the layer within the kernel repository.
|
|
||||||
|
|
||||||
Example:
|
|
||||||
```python
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
from kernels import LocalLayerRepository
|
|
||||||
|
|
||||||
# Reference a specific layer by revision
|
|
||||||
layer_repo = LocalLayerRepository(
|
|
||||||
repo_path=Path("/home/daniel/kernels/activation"),
|
|
||||||
package_name="activation",
|
|
||||||
layer_name="SiluAndMul",
|
|
||||||
)
|
|
||||||
```
|
|
||||||
"""
|
|
||||||
|
|
||||||
def __init__(
|
|
||||||
self,
|
|
||||||
repo_path: Path,
|
|
||||||
*,
|
|
||||||
package_name: str,
|
|
||||||
layer_name: str,
|
|
||||||
):
|
|
||||||
self._repo_path = repo_path
|
|
||||||
self._package_name = package_name
|
|
||||||
self.layer_name = layer_name
|
|
||||||
|
|
||||||
def load(self) -> ModuleType:
|
|
||||||
return get_local_kernel(self._repo_path, self._package_name)
|
|
||||||
|
|
||||||
def __eq__(self, other):
|
|
||||||
return (
|
|
||||||
isinstance(other, LocalLayerRepository)
|
|
||||||
and self.layer_name == other.layer_name
|
|
||||||
and self._repo_path == other._repo_path
|
|
||||||
and self._package_name == other._package_name
|
|
||||||
)
|
|
||||||
|
|
||||||
def __hash__(self):
|
|
||||||
return hash((self.layer_name, self._repo_path, self._package_name))
|
|
||||||
|
|
||||||
def __str__(self) -> str:
|
|
||||||
return f"`{self._repo_path}` (package: {self._package_name}) for layer `{self.layer_name}`"
|
|
||||||
|
|
||||||
|
|
||||||
class LockedLayerRepository:
|
|
||||||
"""
|
|
||||||
Repository and name of a layer.
|
|
||||||
|
|
||||||
In contrast to `LayerRepository`, this class uses repositories that
|
|
||||||
are locked inside a project.
|
|
||||||
"""
|
|
||||||
|
|
||||||
def __init__(
|
|
||||||
self,
|
|
||||||
repo_id: str,
|
|
||||||
*,
|
|
||||||
lockfile: Optional[Path] = None,
|
|
||||||
layer_name: str,
|
|
||||||
):
|
|
||||||
"""
|
|
||||||
Construct a layer repository.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
repo_id (`str`): The Hub repository containing the layer.
|
|
||||||
"""
|
|
||||||
self._repo_id = repo_id
|
|
||||||
self._lockfile = lockfile
|
|
||||||
self.layer_name = layer_name
|
|
||||||
|
|
||||||
@functools.lru_cache()
|
|
||||||
def _resolve_revision(self) -> str:
|
|
||||||
if self._lockfile is None:
|
|
||||||
locked_sha = _get_caller_locked_kernel(self._repo_id)
|
|
||||||
else:
|
|
||||||
with open(self._lockfile, "r") as f:
|
|
||||||
locked_sha = _get_locked_kernel(self._repo_id, f.read())
|
|
||||||
|
|
||||||
if locked_sha is None:
|
|
||||||
raise ValueError(f"Kernel `{self._repo_id}` is not locked")
|
|
||||||
|
|
||||||
return locked_sha
|
|
||||||
|
|
||||||
def load(self) -> ModuleType:
|
|
||||||
return get_kernel(repo_id=self._repo_id, revision=self._resolve_revision())
|
|
||||||
|
|
||||||
def __eq__(self, other):
|
|
||||||
return (
|
|
||||||
isinstance(other, LockedLayerRepository)
|
|
||||||
and self.layer_name == other.layer_name
|
|
||||||
and self._repo_id == other._repo_id
|
|
||||||
)
|
|
||||||
|
|
||||||
def __hash__(self):
|
|
||||||
return hash((self.layer_name, self._repo_id))
|
|
||||||
|
|
||||||
def __str__(self) -> str:
|
|
||||||
return f"`{self._repo_id}` (revision: {self._resolve_revision()}) for layer `{self.layer_name}`"
|
|
||||||
|
|
||||||
|
|
||||||
_CACHED_LAYER: Dict[LayerRepositoryProtocol, Type["nn.Module"]] = {}
|
|
||||||
|
|
||||||
|
|
||||||
class _DeviceRepos(ABC):
|
class _DeviceRepos(ABC):
|
||||||
@ -437,10 +150,10 @@ class _DeviceRepos(ABC):
|
|||||||
@abstractmethod
|
@abstractmethod
|
||||||
def repos(
|
def repos(
|
||||||
self,
|
self,
|
||||||
) -> Optional[Dict[Mode, LayerRepositoryProtocol]]: ...
|
) -> Optional[Dict[Mode, LayerRepository]]: ...
|
||||||
|
|
||||||
@abstractmethod
|
@abstractmethod
|
||||||
def insert(self, device: Device, repos: Dict[Mode, LayerRepositoryProtocol]):
|
def insert(self, device: Device, repos: Dict[Mode, LayerRepository]):
|
||||||
"""
|
"""
|
||||||
Insert a repository for a specific device and mode.
|
Insert a repository for a specific device and mode.
|
||||||
"""
|
"""
|
||||||
@ -448,7 +161,7 @@ class _DeviceRepos(ABC):
|
|||||||
|
|
||||||
|
|
||||||
class _MPSRepos(_DeviceRepos):
|
class _MPSRepos(_DeviceRepos):
|
||||||
_repos: Dict[Mode, LayerRepositoryProtocol]
|
_repos: Dict[Mode, LayerRepository]
|
||||||
|
|
||||||
def __init__(self):
|
def __init__(self):
|
||||||
super().__init__()
|
super().__init__()
|
||||||
@ -457,10 +170,10 @@ class _MPSRepos(_DeviceRepos):
|
|||||||
@property
|
@property
|
||||||
def repos(
|
def repos(
|
||||||
self,
|
self,
|
||||||
) -> Optional[Dict[Mode, LayerRepositoryProtocol]]:
|
) -> Optional[Dict[Mode, LayerRepository]]:
|
||||||
return self._repos
|
return self._repos
|
||||||
|
|
||||||
def insert(self, device: Device, repos: Dict[Mode, LayerRepositoryProtocol]):
|
def insert(self, device: Device, repos: Dict[Mode, LayerRepository]):
|
||||||
if device.type != "mps":
|
if device.type != "mps":
|
||||||
raise ValueError(f"Device type must be 'mps', got {device.type}")
|
raise ValueError(f"Device type must be 'mps', got {device.type}")
|
||||||
|
|
||||||
@ -468,7 +181,7 @@ class _MPSRepos(_DeviceRepos):
|
|||||||
|
|
||||||
|
|
||||||
class _CUDARepos(_DeviceRepos):
|
class _CUDARepos(_DeviceRepos):
|
||||||
_repos: IntervalTree[Dict[Mode, LayerRepositoryProtocol]]
|
_repos: IntervalTree[Dict[Mode, LayerRepository]]
|
||||||
|
|
||||||
def __init__(self):
|
def __init__(self):
|
||||||
super().__init__()
|
super().__init__()
|
||||||
@ -477,11 +190,11 @@ class _CUDARepos(_DeviceRepos):
|
|||||||
@property
|
@property
|
||||||
def repos(
|
def repos(
|
||||||
self,
|
self,
|
||||||
) -> Optional[Dict[Mode, LayerRepositoryProtocol]]:
|
) -> Optional[Dict[Mode, LayerRepository]]:
|
||||||
capability = _find_capability()
|
capability = _find_capability()
|
||||||
return self.repos_by_capability.find_smallest_interval(capability)
|
return self.repos_by_capability.find_smallest_interval(capability)
|
||||||
|
|
||||||
def insert(self, device: Device, repos: Dict[Mode, LayerRepositoryProtocol]):
|
def insert(self, device: Device, repos: Dict[Mode, LayerRepository]):
|
||||||
assert device.properties is None or isinstance(
|
assert device.properties is None or isinstance(
|
||||||
device.properties, CUDAProperties
|
device.properties, CUDAProperties
|
||||||
)
|
)
|
||||||
@ -498,46 +211,6 @@ class _CUDARepos(_DeviceRepos):
|
|||||||
self.repos_by_capability.insert(min_capability, max_capability, repos)
|
self.repos_by_capability.insert(min_capability, max_capability, repos)
|
||||||
|
|
||||||
|
|
||||||
class _ROCMRepos(_DeviceRepos):
|
|
||||||
_repos: IntervalTree[Dict[Mode, LayerRepositoryProtocol]]
|
|
||||||
|
|
||||||
def __init__(self):
|
|
||||||
super().__init__()
|
|
||||||
self.repos_by_capability = IntervalTree()
|
|
||||||
|
|
||||||
@property
|
|
||||||
def repos(
|
|
||||||
self,
|
|
||||||
) -> Optional[Dict[Mode, LayerRepositoryProtocol]]:
|
|
||||||
capability = _find_capability()
|
|
||||||
return self.repos_by_capability.find_smallest_interval(capability)
|
|
||||||
|
|
||||||
def insert(self, device: Device, repos: Dict[Mode, LayerRepositoryProtocol]):
|
|
||||||
assert device.properties is None or isinstance(
|
|
||||||
device.properties, ROCMProperties
|
|
||||||
)
|
|
||||||
|
|
||||||
min_capability = (
|
|
||||||
0 if device.properties is None else device.properties.min_capability
|
|
||||||
)
|
|
||||||
max_capability = (
|
|
||||||
sys.maxsize
|
|
||||||
if device.properties is None
|
|
||||||
else device.properties.max_capability
|
|
||||||
)
|
|
||||||
|
|
||||||
self.repos_by_capability.insert(min_capability, max_capability, repos)
|
|
||||||
|
|
||||||
|
|
||||||
def _validate_device_type(device_type: str) -> None:
|
|
||||||
"""Validate that the device type is supported."""
|
|
||||||
supported_devices = {"cuda", "rocm", "mps"}
|
|
||||||
if device_type not in supported_devices:
|
|
||||||
raise ValueError(
|
|
||||||
f"Unsupported device type '{device_type}'. Supported device types are: {', '.join(sorted(supported_devices))}"
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
_KERNEL_MAPPING: ContextVar[Dict[str, Dict[str, _DeviceRepos]]] = ContextVar(
|
_KERNEL_MAPPING: ContextVar[Dict[str, Dict[str, _DeviceRepos]]] = ContextVar(
|
||||||
"_KERNEL_MAPPING", default={}
|
"_KERNEL_MAPPING", default={}
|
||||||
)
|
)
|
||||||
@ -546,65 +219,17 @@ _KERNEL_MAPPING: ContextVar[Dict[str, Dict[str, _DeviceRepos]]] = ContextVar(
|
|||||||
def use_kernel_mapping(
|
def use_kernel_mapping(
|
||||||
mapping: Dict[
|
mapping: Dict[
|
||||||
str,
|
str,
|
||||||
Dict[
|
Dict[Union[Device, str], Union[LayerRepository, Dict[Mode, LayerRepository]]],
|
||||||
Union[Device, str],
|
|
||||||
Union[LayerRepositoryProtocol, Dict[Mode, LayerRepositoryProtocol]],
|
|
||||||
],
|
|
||||||
],
|
],
|
||||||
*,
|
*,
|
||||||
inherit_mapping: bool = True,
|
inherit_mapping: bool = True,
|
||||||
):
|
):
|
||||||
"""
|
"""
|
||||||
Context manager that sets a kernel mapping for the duration of the context.
|
Context manager that sets a mapping for a duration of the context.
|
||||||
|
|
||||||
This function allows temporary kernel mappings to be applied within a specific context, enabling different
|
When `inherit_mapping` is set to `True` the current mapping will be
|
||||||
kernel configurations for different parts of your code.
|
extended by `mapping` inside the context. If it is `False`, only
|
||||||
|
`mapping` is used inside the context.
|
||||||
Args:
|
|
||||||
mapping (`Dict[str, Dict[Union[Device, str], Union[LayerRepositoryProtocol, Dict[Mode, LayerRepositoryProtocol]]]]`):
|
|
||||||
The kernel mapping to apply. Maps layer names to device-specific kernel configurations.
|
|
||||||
inherit_mapping (`bool`, *optional*, defaults to `True`):
|
|
||||||
When `True`, the current mapping will be extended by `mapping` inside the context. When `False`,
|
|
||||||
only `mapping` is used inside the context.
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
Context manager that handles the temporary kernel mapping.
|
|
||||||
|
|
||||||
Example:
|
|
||||||
```python
|
|
||||||
import torch
|
|
||||||
import torch.nn as nn
|
|
||||||
from torch.nn import functional as F
|
|
||||||
|
|
||||||
from kernels import use_kernel_forward_from_hub
|
|
||||||
from kernels import use_kernel_mapping, LayerRepository, Device
|
|
||||||
from kernels import Mode, kernelize
|
|
||||||
|
|
||||||
# Define a mapping
|
|
||||||
mapping = {
|
|
||||||
"SiluAndMul": {
|
|
||||||
"cuda": LayerRepository(
|
|
||||||
repo_id="kernels-community/activation",
|
|
||||||
layer_name="SiluAndMul",
|
|
||||||
)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
@use_kernel_forward_from_hub("SiluAndMul")
|
|
||||||
class SiluAndMul(nn.Module):
|
|
||||||
def forward(self, x: torch.Tensor) -> torch.Tensor:
|
|
||||||
d = x.shape[-1] // 2
|
|
||||||
return F.silu(x[..., :d]) * x[..., d:]
|
|
||||||
|
|
||||||
model = SiluAndMul()
|
|
||||||
|
|
||||||
# Use the mapping for the duration of the context.
|
|
||||||
with use_kernel_mapping(mapping):
|
|
||||||
# kernelize uses the temporary mapping
|
|
||||||
model = kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE, device="cuda")
|
|
||||||
|
|
||||||
# Outside the context, original mappings are restored
|
|
||||||
```
|
|
||||||
"""
|
"""
|
||||||
|
|
||||||
class ContextManager:
|
class ContextManager:
|
||||||
@ -625,64 +250,31 @@ def use_kernel_mapping(
|
|||||||
def register_kernel_mapping(
|
def register_kernel_mapping(
|
||||||
mapping: Dict[
|
mapping: Dict[
|
||||||
str,
|
str,
|
||||||
Dict[
|
Dict[Union[Device, str], Union[LayerRepository, Dict[Mode, LayerRepository]]],
|
||||||
Union[Device, str],
|
|
||||||
Union[LayerRepositoryProtocol, Dict[Mode, LayerRepositoryProtocol]],
|
|
||||||
],
|
|
||||||
],
|
],
|
||||||
inherit_mapping: bool = True,
|
|
||||||
):
|
):
|
||||||
"""
|
"""
|
||||||
Register a global mapping between layer names and their corresponding kernel implementations.
|
Allows one to register a mapping between a layer name and the corresponding
|
||||||
|
kernel(s) to use, depending on the device. This should be used in conjunction
|
||||||
|
with `kernelize`.
|
||||||
|
|
||||||
This function allows you to register a mapping between a layer name and the corresponding kernel(s) to use,
|
Example usage:
|
||||||
depending on the device and mode. This should be used in conjunction with [`kernelize`].
|
|
||||||
|
|
||||||
Args:
|
```python
|
||||||
mapping (`Dict[str, Dict[Union[Device, str], Union[LayerRepositoryProtocol, Dict[Mode, LayerRepositoryProtocol]]]]`):
|
from kernels import LayerRepository, register_kernel_mapping
|
||||||
The kernel mapping to register globally. Maps layer names to device-specific kernels.
|
|
||||||
The mapping can specify different kernels for different modes (training, inference, etc.).
|
|
||||||
inherit_mapping (`bool`, *optional*, defaults to `True`):
|
|
||||||
When `True`, the current mapping will be extended by `mapping`. When `False`, the existing mappings
|
|
||||||
are erased before adding `mapping`.
|
|
||||||
|
|
||||||
Example:
|
kernel_layer_mapping = {
|
||||||
```python
|
"LlamaRMSNorm": {
|
||||||
from kernels import LayerRepository, register_kernel_mapping, Mode
|
"cuda": LayerRepository(
|
||||||
|
repo_id="kernels-community/activation",
|
||||||
# Simple mapping for a single kernel per device
|
layer_name="RmsNorm",
|
||||||
kernel_layer_mapping = {
|
revision="layers",
|
||||||
"LlamaRMSNorm": {
|
),
|
||||||
"cuda": LayerRepository(
|
},
|
||||||
repo_id="kernels-community/activation",
|
}
|
||||||
layer_name="RmsNorm",
|
register_kernel_mapping(kernel_layer_mapping)
|
||||||
revision="layers",
|
```
|
||||||
),
|
|
||||||
},
|
|
||||||
}
|
|
||||||
register_kernel_mapping(kernel_layer_mapping)
|
|
||||||
|
|
||||||
# Advanced mapping with mode-specific kernels
|
|
||||||
advanced_mapping = {
|
|
||||||
"MultiHeadAttention": {
|
|
||||||
"cuda": {
|
|
||||||
Mode.TRAINING: LayerRepository(
|
|
||||||
repo_id="username/training-kernels",
|
|
||||||
layer_name="TrainingAttention"
|
|
||||||
),
|
|
||||||
Mode.INFERENCE: LayerRepository(
|
|
||||||
repo_id="username/inference-kernels",
|
|
||||||
layer_name="FastAttention"
|
|
||||||
),
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
register_kernel_mapping(advanced_mapping)
|
|
||||||
```
|
|
||||||
"""
|
"""
|
||||||
if not inherit_mapping:
|
|
||||||
_KERNEL_MAPPING.set({})
|
|
||||||
|
|
||||||
# Merge with existing mappings.
|
# Merge with existing mappings.
|
||||||
for new_kernel, new_device_repos in mapping.items():
|
for new_kernel, new_device_repos in mapping.items():
|
||||||
device_repo = _KERNEL_MAPPING.get().setdefault(new_kernel, {})
|
device_repo = _KERNEL_MAPPING.get().setdefault(new_kernel, {})
|
||||||
@ -691,10 +283,10 @@ def register_kernel_mapping(
|
|||||||
Device(type=new_device) if isinstance(new_device, str) else new_device
|
Device(type=new_device) if isinstance(new_device, str) else new_device
|
||||||
)
|
)
|
||||||
|
|
||||||
if isinstance(new_repo, dict):
|
if isinstance(new_repo, LayerRepository):
|
||||||
kernel_options = new_repo
|
|
||||||
else:
|
|
||||||
kernel_options = {Mode.FALLBACK: new_repo}
|
kernel_options = {Mode.FALLBACK: new_repo}
|
||||||
|
else:
|
||||||
|
kernel_options = new_repo
|
||||||
|
|
||||||
feature_repos = device_repo.setdefault(device.type, device.create_repo())
|
+            feature_repos = device_repo.setdefault(device.type, device.create_repo())
             feature_repos.insert(device, kernel_options)

@@ -705,20 +297,15 @@ def replace_kernel_forward_from_hub(
     layer_name: str,
 ):
     """
-    Function that prepares a layer class to use kernels from the Hugging Face Hub.
+    Decorator that prepares a layer class to use a kernel from the Hugging Face Hub.
 
-    It is recommended to use [`use_kernel_forward_from_hub`] decorator instead.
-    This function should only be used as a last resort to extend third-party layers,
-    it is inherently fragile since the member variables and `forward` signature
-    of usch a layer can change.
-
-    Example:
-        ```python
-        from kernels import replace_kernel_forward_from_hub
-        import torch.nn as nn
-
-        replace_kernel_forward_from_hub(nn.LayerNorm, "LayerNorm")
-        ```
+    This decorator stores the layer name and original forward method, which will be used
+    by the kernelize function to replace the forward implementation with the appropriate
+    kernel from the hub.
+
+    Args:
+        cls: The layer class to decorate
+        layer_name: The name of the layer to use for kernel lookup
     """
     cls.kernel_layer_name = layer_name
 

@@ -751,10 +338,10 @@ _MODE_FALLBACK_PRIORITY = {
 
 
 def _select_repository(
-    repositories: Dict[Mode, LayerRepositoryProtocol],
+    repositories: Dict[Mode, LayerRepository],
     *,
     mode: Mode,
-) -> Optional[Tuple[LayerRepositoryProtocol, Mode]]:
+) -> Optional[Tuple[LayerRepository, Mode]]:
     # Get the fallback priority list for the requested mode
     if mode not in _MODE_FALLBACK_PRIORITY:
         raise ValueError(f"Unsupported mode: {mode}")

@@ -772,66 +359,30 @@ def _select_repository(
 def kernelize(
     model: "nn.Module",
     *,
-    mode: Mode,
+    mode: Mode = Mode.TRAINING | Mode.TORCH_COMPILE,
     device: Optional[Union[str, "torch.device"]] = None,
     use_fallback: bool = True,
 ):
     """
-    Replace layer forward methods with optimized kernel implementations.
-
-    This function iterates over all modules in the model and replaces the `forward` method of extensible layers
-    for which kernels are registered using [`register_kernel_mapping`] or [`use_kernel_mapping`].
+    Iterate over all modules in the model and replace the `forward` method of
+    extensible layers for which kernels are registered using `register_kernel_mapping`
+    or `use_kernel_mapping`.
 
     Args:
-        model (`nn.Module`):
-            The PyTorch model to kernelize.
-        mode ([`Mode`]): The mode that the kernel is going to be used in. For example,
-            `Mode.TRAINING | Mode.TORCH_COMPILE` kernelizes the model for training with
-            `torch.compile`.
-        device (`Union[str, torch.device]`, *optional*):
-            The device type to load kernels for. Supported device types are: "cuda", "mps", "rocm".
-            The device type will be inferred from the model parameters when not provided.
-        use_fallback (`bool`, *optional*, defaults to `True`):
-            Whether to use the original forward method of modules when no compatible kernel could be found.
-            If set to `False`, an exception will be raised in such cases.
+        model: The PyTorch model to kernelize
+        mode: the mode that the kernel is going to be used in (e.g.
+            `Mode.TRAINING | Mode.TORCH_COMPILE` kernelizes the model for training
+            and `torch.compile`).
+        device: The device type to load kernels for. The device type will be inferred
+            from the parameters of the model when not provided.
+        use_fallback: Whether to use the original forward method of modules when no
+            compatible kernel could be found. If set to `False`, an exception will
+            be raised in such cases.
 
     Returns:
-        `nn.Module`: The kernelized model with optimized kernel implementations.
-
-    Example:
-        ```python
-        import torch
-        import torch.nn as nn
-
-        from kernels import kernelize, Mode, register_kernel_mapping, LayerRepository
-        from kernels import use_kernel_forward_from_hub
-
-        @use_kernel_forward_from_hub("SiluAndMul")
-        class SiluAndMul(nn.Module):
-            def forward(self, x: torch.Tensor) -> torch.Tensor:
-                d = x.shape[-1] // 2
-                return F.silu(x[..., :d]) * x[..., d:]
-
-        mapping = {
-            "LayerNorm": {
-                "cuda": LayerRepository(
-                    repo_id="kernels-community/activation",
-                    layer_name="SiluAndMul",
-                )
-            }
-        }
-        register_kernel_mapping(mapping)
-
-        # Create and kernelize a model
-        model = nn.Sequential(
-            nn.Linear(1024, 2048, device="cuda"),
-            SiluAndMul(),
-        )
-
-        # Kernelize for inference
-        kernelized_model = kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
-        ```
+        The kernelized model
     """
+    import torch
 
     if mode == Mode.FALLBACK:
         raise ValueError("Mode.FALLBACK can only be used to register kernel mappings.")
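
Since `mode` now defaults to `Mode.TRAINING | Mode.TORCH_COMPILE`, the explicit argument becomes optional. A minimal usage sketch, adapted from the docstring example deleted above (with the mapping keyed on `"SiluAndMul"` rather than the example's mismatched `"LayerNorm"` key, and assuming a CUDA device):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

from kernels import (
    LayerRepository,
    kernelize,
    register_kernel_mapping,
    use_kernel_forward_from_hub,
)


@use_kernel_forward_from_hub("SiluAndMul")
class SiluAndMul(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d = x.shape[-1] // 2
        return F.silu(x[..., :d]) * x[..., d:]


register_kernel_mapping(
    {
        "SiluAndMul": {
            "cuda": LayerRepository(
                repo_id="kernels-community/activation",
                layer_name="SiluAndMul",
            )
        }
    }
)

model = nn.Sequential(nn.Linear(1024, 2048, device="cuda"), SiluAndMul())

# Equivalent to kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE).
kernelized_model = kernelize(model)
```
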
@@ -845,8 +396,7 @@ def kernelize(
     if device is None:
         device_type = _find_device(model)
     elif isinstance(device, str):
-        _validate_device_type(device)
-        device_type = Device(type=device)
+        device_type = Device(type=torch.device(device).type)
     else:
         device_type = Device(device.type)
 

@@ -912,7 +462,9 @@ def kernelize(
 
         repo, repo_mode = repo_with_mode
 
-        logging.info(f"Using layer `{repo.layer_name}` from repo {repo}")
+        logging.info(
+            f"Using layer `{repo.layer_name}` from repo `{repo.repo_id}` (revision: {repo.revision}) for layer `{layer_name}`"
+        )
         logging.debug(f"kernelize mode: {mode}, repo mode: {repo_mode}")
 
         layer = _get_layer_memoize(repo, module_class)

@@ -937,41 +489,7 @@ def kernelize(
 
 def use_kernel_forward_from_hub(layer_name: str):
     """
-    Decorator factory that makes a layer extensible using the specified layer name.
-
-    This is a decorator factory that returns a decorator which prepares a layer class to use kernels from the
-    Hugging Face Hub.
-
-    Args:
-        layer_name (`str`):
-            The name of the layer to use for kernel lookup in registered mappings.
-
-    Returns:
-        `Callable`: A decorator function that can be applied to layer classes.
-
-    Example:
-        ```python
-        import torch
-        import torch.nn as nn
-
-        from kernels import use_kernel_forward_from_hub
-        from kernels import Mode, kernelize
-
-        @use_kernel_forward_from_hub("MyCustomLayer")
-        class MyCustomLayer(nn.Module):
-            def __init__(self, hidden_size):
-                super().__init__()
-                self.hidden_size = hidden_size
-
-            def forward(self, x: torch.Tensor):
-                # original implementation
-                return x
-
-        model = MyCustomLayer(768)
-
-        # The layer can now be kernelized:
-        # model = kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE, device="cuda")
-        ```
+    Make a layer extensible using the name `layer_name`.
     """
 
     def decorator(cls):
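
With the docstring example gone, a short sketch of the decorator in practice (the layer body is a placeholder; the kernel is looked up under the registered name at `kernelize` time):

```python
import torch
import torch.nn as nn

from kernels import Mode, kernelize, use_kernel_forward_from_hub


@use_kernel_forward_from_hub("MyCustomLayer")
class MyCustomLayer(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x  # original (fallback) implementation


model = MyCustomLayer()
# The layer can now be swapped for a Hub kernel registered under "MyCustomLayer":
# model = kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE, device="cuda")
```
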
@@ -981,17 +499,21 @@ def use_kernel_forward_from_hub(layer_name: str):
     return decorator
 
 
-def _get_kernel_layer(repo: LayerRepositoryProtocol) -> Type["nn.Module"]:
+def _get_kernel_layer(
+    *, repo_id: str, layer_name: str, revision: str
+) -> Type["nn.Module"]:
     """Get a layer from a kernel."""
 
-    kernel = repo.load()
+    kernel = get_kernel(repo_id, revision=revision)
 
     if getattr(kernel, "layers", None) is None:
-        raise ValueError(f"Kernel repo {repo} does not define any layers.")
+        raise ValueError(
+            f"Kernel `{repo_id}` at revision `{revision}` does not define any layers."
+        )
 
-    layer = getattr(kernel.layers, repo.layer_name, None)
+    layer = getattr(kernel.layers, layer_name, None)
     if layer is None:
-        raise ValueError(f"Layer `{repo.layer_name}` not found in kernel repo {repo}.")
+        raise ValueError(f"Layer `{layer_name}` not found in kernel `{repo_id}`.")
     return layer
 
 

@@ -1035,18 +557,6 @@ def _validate_layer(*, check_cls, cls):
     )
 
 
-def _is_cuda_platform():
-    import torch
-
-    return torch.version.cuda is not None
-
-
-def _is_rocm_platform():
-    import torch
-
-    return torch.version.hip is not None
-
-
 def _find_device(model: "nn.Module") -> Device:
     try:
         param = next(model.parameters())

@@ -1055,15 +565,7 @@ def _find_device(model: "nn.Module") -> Device:
             "Cannot determine model device, provide as `device` argument to `kernelize`."
         )
 
-    dev_type = param.device.type
-    if dev_type == "cuda":
-        # Refine based on actual platform
-        if _is_rocm_platform():
-            return Device(type="rocm")
-        elif _is_cuda_platform():
-            return Device(type="cuda")
-
-    return Device(type=dev_type)
+    return Device(type=param.device.type)
 
 
 @lru_cache

@@ -1088,19 +590,13 @@ def _conditionally_replace_forward(
     # layers registered with the FALLBACK mode never get rejected by
     # _validate_layer_has_mode. For such layers, we want to fall back in
     # case the layer does not support the given mode.
-    needs_fallback_for_compile = Mode.TORCH_COMPILE in mode and not getattr(
+    needs_fallback = Mode.TORCH_COMPILE in mode and not getattr(
         layer, "can_torch_compile", False
     )
-    needs_fallback_for_backward = Mode.TRAINING in mode and not getattr(
-        layer, "has_backward", True
-    )
-
-    if needs_fallback_for_compile or needs_fallback_for_backward:
+    needs_fallback |= Mode.TRAINING in mode and not getattr(layer, "has_backward", True)
+
+    if needs_fallback:
         if use_fallback:
-            if needs_fallback_for_compile:
-                logging.info("Layer does not support torch.compile, using fallback")
-            if needs_fallback_for_backward:
-                logging.info("Layer does not support backward, using fallback")
             _replace_forward(module, module_class)
         else:
             raise ValueError(f"Available kernel does not support mode: {mode}")
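
The rewritten check folds both capability tests into one `needs_fallback` flag. A minimal sketch of why this composes, assuming `Mode` is an `enum.Flag` (consistent with the `Mode.TRAINING | Mode.TORCH_COMPILE` unions used throughout this diff); `LayerWithoutCompile` is a hypothetical stand-in for a kernel layer:

```python
from enum import Flag, auto


# Stand-in for kernels.layer.Mode (the real enum also defines FALLBACK).
class Mode(Flag):
    INFERENCE = auto()
    TRAINING = auto()
    TORCH_COMPILE = auto()


class LayerWithoutCompile:
    """Hypothetical kernel layer: supports backward but not torch.compile."""

    has_backward = True
    can_torch_compile = False


mode = Mode.TRAINING | Mode.TORCH_COMPILE
layer = LayerWithoutCompile

# Mirrors _conditionally_replace_forward: membership on a Flag tests
# whether that bit is set in the combined mode.
needs_fallback = Mode.TORCH_COMPILE in mode and not getattr(layer, "can_torch_compile", False)
needs_fallback |= Mode.TRAINING in mode and not getattr(layer, "has_backward", True)
print(needs_fallback)  # True: the layer cannot be used under torch.compile
```
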
@@ -1116,7 +612,7 @@ def _validate_layer_has_mode(
     *,
     layer_name: str,
     module: Type["nn.Module"],
-    repo: LayerRepositoryProtocol,
+    repo: LayerRepository,
     repo_mode: Mode,
 ):
     """

@@ -1125,7 +621,7 @@ def _validate_layer_has_mode(
 
     if Mode.TRAINING in repo_mode and not getattr(module, "has_backward", True):
         raise ValueError(
-            f"Layer `{repo.layer_name}` from repo {repo} does not support backward.\n"
+            f"Layer `{repo.layer_name}` ({repo.repo_id}, revision: {repo.revision}) does not support backward.\n"
             f"Was registered for `{layer_name}` with mode `{repo_mode}`"
         )
 

@@ -1133,7 +629,7 @@ def _validate_layer_has_mode(
         module, "can_torch_compile", False
     ):
         raise ValueError(
-            f"Layer `{repo.layer_name}` from repo {repo} does not support torch.compile.\n"
+            f"Layer `{repo.layer_name}` ({repo.repo_id}, revision: {repo.revision}) does not support torch.compile.\n"
             f"Was registered for `{layer_name}` with mode `{repo_mode}`"
         )
 

@@ -1141,13 +637,17 @@ def _validate_layer_has_mode(
 
 
 def _get_layer_memoize(
-    repo: LayerRepositoryProtocol, module_class: Type["nn.Module"]
+    repo: LayerRepository, module_class: Type["nn.Module"]
 ) -> Type["nn.Module"]:
     layer = _CACHED_LAYER.get(repo, None)
     if layer is not None:
         return layer
 
-    layer = _get_kernel_layer(repo)
+    layer = _get_kernel_layer(
+        repo_id=repo.repo_id,
+        layer_name=repo.layer_name,
+        revision=repo.revision,
+    )
     _validate_layer(check_cls=module_class, cls=layer)
     _CACHED_LAYER[repo] = layer
 

@@ -4,8 +4,10 @@ from pathlib import Path
 from typing import Dict, List, Tuple
 
 from huggingface_hub import HfApi
+from huggingface_hub.hf_api import GitRefInfo
+from packaging.specifiers import SpecifierSet
+from packaging.version import InvalidVersion, Version
 
-from kernels._versions import resolve_version_spec_as_ref
 from kernels.compat import tomllib
 
 

@@ -29,6 +31,20 @@ class KernelLock:
         return cls(repo_id=o["repo_id"], sha=o["sha"], variants=variants)
 
 
+def _get_available_versions(repo_id: str) -> Dict[Version, GitRefInfo]:
+    """Get kernel versions that are available in the repository."""
+    versions = {}
+    for tag in HfApi().list_repo_refs(repo_id).tags:
+        if not tag.name.startswith("v"):
+            continue
+        try:
+            versions[Version(tag.name[1:])] = tag
+        except InvalidVersion:
+            continue
+
+    return versions
+
+
 def get_kernel_locks(repo_id: str, version_spec: str) -> KernelLock:
     """
     Get the locks for a kernel with the given version spec.

@@ -36,7 +52,16 @@ def get_kernel_locks(repo_id: str, version_spec: str) -> KernelLock:
     The version specifier can be any valid Python version specifier:
     https://packaging.python.org/en/latest/specifications/version-specifiers/#version-specifiers
     """
-    tag_for_newest = resolve_version_spec_as_ref(repo_id, version_spec)
+    versions = _get_available_versions(repo_id)
+    requirement = SpecifierSet(version_spec)
+    accepted_versions = sorted(requirement.filter(versions.keys()))
+
+    if len(accepted_versions) == 0:
+        raise ValueError(
+            f"No version of `{repo_id}` satisfies requirement: {version_spec}"
+        )
+
+    tag_for_newest = versions[accepted_versions[-1]]
+
     r = HfApi().repo_info(
         repo_id=repo_id, revision=tag_for_newest.target_commit, files_metadata=True
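
The new resolution path is plain `packaging` machinery over `v`-prefixed tags. A self-contained sketch, with hypothetical tag names standing in for `list_repo_refs(...).tags`:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import InvalidVersion, Version

# Hypothetical tags as they might come back from the repository's refs.
tag_names = ["v0.1.0", "v0.1.1", "v0.2.0", "v1.0.0-broken", "main"]

versions = {}
for name in tag_names:
    if not name.startswith("v"):
        continue  # only `v`-prefixed tags participate in resolution
    try:
        versions[Version(name[1:])] = name
    except InvalidVersion:
        continue  # e.g. "v1.0.0-broken" is not PEP 440 and is skipped

requirement = SpecifierSet(">=0.1.0,<0.2.0")
accepted = sorted(requirement.filter(versions.keys()))
print(versions[accepted[-1]])  # v0.1.1 — the newest tag satisfying the spec
```
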
@@ -16,7 +16,6 @@ from typing import Dict, List, Optional, Tuple
 from huggingface_hub import file_exists, snapshot_download
 from packaging.version import parse
 
-from kernels._versions import select_revision_or_version
 from kernels.lockfile import KernelLock, VariantLock
 
 

@@ -46,12 +45,9 @@ def build_variant() -> str:
         compute_framework = f"rocm{rocm_version.major}{rocm_version.minor}"
     elif torch.backends.mps.is_available():
         compute_framework = "metal"
-    elif torch.version.xpu is not None:
-        version = torch.version.xpu
-        compute_framework = f"xpu{version[0:4]}{version[5:6]}"
     else:
         raise AssertionError(
-            "Torch was not compiled with CUDA, Metal, XPU, or ROCm enabled."
+            "Torch was not compiled with CUDA, Metal, or ROCm enabled."
        )
 
     torch_version = parse(torch.__version__)
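
For orientation, `build_variant` assembles names like `torch28-cxx11-cu128-x86_64-linux` (this exact variant appears in a test comment removed further down). A rough sketch of the assembly under that assumed naming scheme — the `cxx98` fallback label and exact field order are assumptions, not the library's verbatim code, and the ROCm branch from the diff is elided:

```python
import platform

import torch
from packaging.version import parse

torch_version = parse(torch.__version__)

if torch.version.cuda is not None:
    cuda_version = parse(torch.version.cuda)
    compute_framework = f"cu{cuda_version.major}{cuda_version.minor}"
elif torch.backends.mps.is_available():
    compute_framework = "metal"
else:
    raise AssertionError("Torch was not compiled with CUDA, Metal, or ROCm enabled.")

# Assumption: non-CXX11-ABI builds are labelled "cxx98".
cxxabi = "cxx11" if torch.compiled_with_cxx11_abi() else "cxx98"
variant = (
    f"torch{torch_version.major}{torch_version.minor}-{cxxabi}"
    f"-{compute_framework}-{platform.machine()}-{platform.system().lower()}"
)
print(variant)  # e.g. torch28-cxx11-cu128-x86_64-linux
```
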
@@ -99,20 +95,7 @@ def install_kernel(
     """
     Download a kernel for the current environment to the cache.
 
-    The output path is validated against the hashes in `variant_locks` when provided.
-
-    Args:
-        repo_id (`str`):
-            The Hub repository containing the kernel.
-        revision (`str`):
-            The specific revision (branch, tag, or commit) to download.
-        local_files_only (`bool`, *optional*, defaults to `False`):
-            Whether to only use local files and not download from the Hub.
-        variant_locks (`Dict[str, VariantLock]`, *optional*):
-            Optional dictionary of variant locks for validation.
-
-    Returns:
-        `Tuple[str, Path]`: A tuple containing the package name and the path to the variant directory.
+    The output path is validated againt `hash` when set.
     """
     package_name = package_name_from_repo_id(repo_id)
     variant = build_variant()

@@ -199,39 +182,13 @@ def install_kernel_all_variants(
     return repo_path / "build"
 
 
-def get_kernel(
-    repo_id: str, revision: Optional[str] = None, version: Optional[str] = None
-) -> ModuleType:
+def get_kernel(repo_id: str, revision: str = "main") -> ModuleType:
     """
-    Load a kernel from the kernel hub.
-
-    This function downloads a kernel to the local Hugging Face Hub cache directory (if it was not downloaded before)
-    and then loads the kernel.
-
-    Args:
-        repo_id (`str`):
-            The Hub repository containing the kernel.
-        revision (`str`, *optional*, defaults to `"main"`):
-            The specific revision (branch, tag, or commit) to download. Cannot be used together with `version`.
-        version (`str`, *optional*):
-            The kernel version to download. This can be a Python version specifier, such as `">=1.0.0,<2.0.0"`.
-            Cannot be used together with `revision`.
-
-    Returns:
-        `ModuleType`: The imported kernel module.
-
-    Example:
-        ```python
-        import torch
-        from kernels import get_kernel
-
-        activation = get_kernel("kernels-community/activation")
-        x = torch.randn(10, 20, device="cuda")
-        out = torch.empty_like(x)
-        result = activation.silu_and_mul(out, x)
-        ```
+    Download and import a kernel from the Hugging Face Hub.
+
+    The kernel is downloaded from the repository `repo_id` at
+    branch/commit/tag `revision`.
     """
-    revision = select_revision_or_version(repo_id, revision, version)
     package_name, package_path = install_kernel(repo_id, revision=revision)
     return import_from_path(package_name, package_path / package_name / "__init__.py")
 
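
The removed docstring example still reflects the call pattern after this change (minus the `version` argument). A minimal sketch using the shape-preserving `gelu_fast` op that the tests below exercise, assuming a CUDA device:

```python
import torch

from kernels import get_kernel

# Downloads the kernel into the Hub cache on first use, then imports it.
activation = get_kernel("kernels-community/activation")

x = torch.arange(1, 10, dtype=torch.float16, device="cuda").view(3, 3)
y = torch.empty_like(x)
activation.gelu_fast(y, x)  # writes the activation into `y`
```
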
@@ -239,56 +196,16 @@ def get_kernel(
 def get_local_kernel(repo_path: Path, package_name: str) -> ModuleType:
     """
     Import a kernel from a local kernel repository path.
-
-    Args:
-        repo_path (`Path`):
-            The local path to the kernel repository.
-        package_name (`str`):
-            The name of the package to import from the repository.
-
-    Returns:
-        `ModuleType`: The imported kernel module.
     """
-    variant = build_variant()
-    universal_variant = universal_build_variant()
-
-    # Presume we were given the top level path of the kernel repository.
-    for base_path in [repo_path, repo_path / "build"]:
-        # Prefer the universal variant if it exists.
-        for v in [universal_variant, variant]:
-            package_path = base_path / v / package_name / "__init__.py"
-            if package_path.exists():
-                return import_from_path(package_name, package_path)
-
-    # If we didn't find the package in the repo we may have a explicit
-    # package path.
-    package_path = repo_path / package_name / "__init__.py"
-    if package_path.exists():
-        return import_from_path(package_name, package_path)
-
-    raise FileNotFoundError(f"Could not find package '{package_name}' in {repo_path}")
+    package_name, package_path = _load_kernel_from_path(repo_path, package_name)
+    return import_from_path(package_name, package_path / package_name / "__init__.py")
 
 
-def has_kernel(
-    repo_id: str, revision: Optional[str] = None, version: Optional[str] = None
-) -> bool:
+def has_kernel(repo_id: str, revision: str = "main") -> bool:
     """
-    Check whether a kernel build exists for the current environment (Torch version and compute framework).
-
-    Args:
-        repo_id (`str`):
-            The Hub repository containing the kernel.
-        revision (`str`, *optional*, defaults to `"main"`):
-            The specific revision (branch, tag, or commit) to download. Cannot be used together with `version`.
-        version (`str`, *optional*):
-            The kernel version to download. This can be a Python version specifier, such as `">=1.0.0,<2.0.0"`.
-            Cannot be used together with `revision`.
-
-    Returns:
-        `bool`: `True` if a kernel is available for the current environment.
+    Check whether a kernel build exists for the current environment
+    (Torch version and compute framework).
     """
-    revision = select_revision_or_version(repo_id, revision, version)
-
     package_name = package_name_from_repo_id(repo_id)
     variant = build_variant()
     universal_variant = universal_build_variant()
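
Because `has_kernel` keys off the current Torch version and compute framework, it works as a guard before fetching. A small hedged sketch:

```python
from kernels import get_kernel, has_kernel

repo_id = "kernels-community/activation"

# Only fetch the kernel when a build exists for this Torch version and
# compute framework; otherwise stay on the reference implementation.
if has_kernel(repo_id):
    activation = get_kernel(repo_id)
else:
    activation = None  # fall back to plain PyTorch ops
```
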
@@ -311,16 +228,8 @@ def load_kernel(repo_id: str, *, lockfile: Optional[Path] = None) -> ModuleType:
     """
     Get a pre-downloaded, locked kernel.
 
-    If `lockfile` is not specified, the lockfile will be loaded from the caller's package metadata.
-
-    Args:
-        repo_id (`str`):
-            The Hub repository containing the kernel.
-        lockfile (`Path`, *optional*):
-            Path to the lockfile. If not provided, the lockfile will be loaded from the caller's package metadata.
-
-    Returns:
-        `ModuleType`: The imported kernel module.
+    If `lockfile` is not specified, the lockfile will be loaded from the
+    caller's package metadata.
     """
     if lockfile is None:
         locked_sha = _get_caller_locked_kernel(repo_id)

@@ -365,18 +274,7 @@ def load_kernel(repo_id: str, *, lockfile: Optional[Path] = None) -> ModuleType:
 
 
 def get_locked_kernel(repo_id: str, local_files_only: bool = False) -> ModuleType:
-    """
-    Get a kernel using a lock file.
-
-    Args:
-        repo_id (`str`):
-            The Hub repository containing the kernel.
-        local_files_only (`bool`, *optional*, defaults to `False`):
-            Whether to only use local files and not download from the Hub.
-
-    Returns:
-        `ModuleType`: The imported kernel module.
-    """
+    """Get a kernel using a lock file."""
     locked_sha = _get_caller_locked_kernel(repo_id)
 
     if locked_sha is None:

@@ -1,24 +1,10 @@
 import sys
 
 import pytest
-import torch
-
-has_cuda = (
-    hasattr(torch.version, "cuda")
-    and torch.version.cuda is not None
-    and torch.cuda.device_count() > 0
-)
-has_rocm = (
-    hasattr(torch.version, "hip")
-    and torch.version.hip is not None
-    and torch.cuda.device_count() > 0
-)
 
 
 def pytest_runtest_setup(item):
-    if "cuda_only" in item.keywords and not has_cuda:
-        pytest.skip("skipping CUDA-only test on host without CUDA")
-    if "rocm_only" in item.keywords and not has_rocm:
-        pytest.skip("skipping ROCm-only test on host without ROCm")
+    if "linux_only" in item.keywords and not sys.platform.startswith("linux"):
+        pytest.skip("skipping Linux-only test on non-Linux platform")
     if "darwin_only" in item.keywords and not sys.platform.startswith("darwin"):
         pytest.skip("skipping macOS-only test on non-macOS platform")
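
The hook above only implements the skipping; the markers are applied per test, as the rest of this diff shows. A tiny illustration (the test body is hypothetical):

```python
import sys

import pytest


@pytest.mark.linux_only
def test_something_linux_specific():
    # pytest_runtest_setup skips this test unless sys.platform is "linux".
    assert sys.platform.startswith("linux")
```
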
@@ -1,12 +0,0 @@
-[
-  {
-    "repo_id": "kernels-test/versions",
-    "sha": "dc142fd6c9920c993d32be6358b78957c58681c3",
-    "variants": {
-      "torch-universal": {
-        "hash": "sha256-35ce0ccfe68e392cbc06feef72268f4c41a74b9920496a2c6ee8978db7f7c17c",
-        "hash_type": "git_lfs_concat"
-      }
-    }
-  }
-]

@@ -1,2 +0,0 @@
-[tool.kernels.dependencies]
-"kernels-test/versions" = ">=0.1.0,<0.2.0"
@@ -10,16 +10,10 @@ def kernel():
 
 
 @pytest.fixture
-def local_kernel_path():
+def local_kernel():
     package_name, path = install_kernel("kernels-community/activation", "main")
     # Path is the build variant path (build/torch-<...>), so the grandparent
     # is the kernel repository path.
-    return package_name, path
-
-
-@pytest.fixture
-def local_kernel(local_kernel_path):
-    package_name, path = local_kernel_path
     return get_local_kernel(path.parent.parent, package_name)
 
 

@@ -40,7 +34,7 @@ def device():
     return "cuda"
 
 
-@pytest.mark.cuda_only
+@pytest.mark.linux_only
 def test_gelu_fast(kernel, device):
     x = torch.arange(1, 10, dtype=torch.float16, device=device).view(3, 3)
     y = torch.empty_like(x)

@@ -56,7 +50,7 @@ def test_gelu_fast(kernel, device):
     assert torch.allclose(y, expected)
 
 
-@pytest.mark.cuda_only
+@pytest.mark.linux_only
 def test_local_kernel(local_kernel, device):
     x = torch.arange(1, 10, dtype=torch.float16, device=device).view(3, 3)
     y = torch.empty_like(x)

@@ -72,39 +66,6 @@ def test_local_kernel(local_kernel, device):
     assert torch.allclose(y, expected)
 
 
-@pytest.mark.cuda_only
-def test_local_kernel_path_types(local_kernel_path, device):
-    package_name, path = local_kernel_path
-
-    # Top-level repo path
-    # ie: /home/ubuntu/.cache/huggingface/hub/models--kernels-community--activation/snapshots/2fafa6a3a38ccb57a1a98419047cf7816ecbc071
-    kernel = get_local_kernel(path.parent.parent, package_name)
-    x = torch.arange(1, 10, dtype=torch.float16, device=device).view(3, 3)
-    y = torch.empty_like(x)
-
-    kernel.gelu_fast(y, x)
-    expected = torch.tensor(
-        [[0.8408, 1.9551, 2.9961], [4.0000, 5.0000, 6.0000], [7.0000, 8.0000, 9.0000]],
-        device=device,
-        dtype=torch.float16,
-    )
-    assert torch.allclose(y, expected)
-
-    # Build directory path
-    # ie: /home/ubuntu/.cache/huggingface/hub/models--kernels-community--activation/snapshots/2fafa6a3a38ccb57a1a98419047cf7816ecbc071/build
-    kernel = get_local_kernel(path.parent.parent / "build", package_name)
-    y = torch.empty_like(x)
-    kernel.gelu_fast(y, x)
-    assert torch.allclose(y, expected)
-
-    # Explicit package path
-    # ie: /home/ubuntu/.cache/huggingface/hub/models--kernels-community--activation/snapshots/2fafa6a3a38ccb57a1a98419047cf7816ecbc071/build/torch28-cxx11-cu128-x86_64-linux
-    kernel = get_local_kernel(path, package_name)
-    y = torch.empty_like(x)
-    kernel.gelu_fast(y, x)
-    assert torch.allclose(y, expected)
-
-
 @pytest.mark.darwin_only
 @pytest.mark.parametrize("dtype", [torch.float16, torch.float32])
 def test_relu_metal(metal_kernel, dtype):
@@ -113,7 +74,7 @@ def test_relu_metal(metal_kernel, dtype):
     assert torch.allclose(y, torch.relu(x))
 
 
-@pytest.mark.cuda_only
+@pytest.mark.linux_only
 @pytest.mark.parametrize(
     "kernel_exists",
     [

@@ -130,26 +91,7 @@ def test_has_kernel(kernel_exists):
     assert has_kernel(repo_id, revision=revision) == kernel
 
 
-def test_version():
-    kernel = get_kernel("kernels-test/versions")
-    assert kernel.version() == "0.2.0"
-    kernel = get_kernel("kernels-test/versions", version="<1.0.0")
-    assert kernel.version() == "0.2.0"
-    kernel = get_kernel("kernels-test/versions", version="<0.2.0")
-    assert kernel.version() == "0.1.1"
-    kernel = get_kernel("kernels-test/versions", version=">0.1.0,<0.2.0")
-    assert kernel.version() == "0.1.1"
-
-    with pytest.raises(ValueError, match=r"No version.*satisfies requirement"):
-        get_kernel("kernels-test/versions", version=">0.2.0")
-
-    with pytest.raises(ValueError, match=r"Either a revision or a version.*not both"):
-        kernel = get_kernel(
-            "kernels-test/versions", revision="v0.1.0", version="<1.0.0"
-        )
-
-
-@pytest.mark.cuda_only
+@pytest.mark.linux_only
 def test_universal_kernel(universal_kernel):
     torch.manual_seed(0)
     A = torch.randint(-10, 10, (64, 128), dtype=torch.int8, device="cuda")
@@ -16,21 +16,21 @@ def device():
     return "cuda"
 
 
-@pytest.mark.cuda_only
+@pytest.mark.linux_only
 def test_gelu_small(kernel, device, benchmark):
     x = torch.randn(32, 32, dtype=torch.float16, device=device)
     y = torch.empty_like(x)
     benchmark(kernel.gelu_fast, y, x)
 
 
-@pytest.mark.cuda_only
+@pytest.mark.linux_only
 def test_gelu_medium(kernel, device, benchmark):
     x = torch.randn(128, 128, dtype=torch.float16, device=device)
     y = torch.empty_like(x)
     benchmark(kernel.gelu_fast, y, x)
 
 
-@pytest.mark.cuda_only
+@pytest.mark.linux_only
 def test_gelu_large(kernel, device, benchmark):
     x = torch.randn(512, 512, dtype=torch.float16, device=device)
     y = torch.empty_like(x)
@@ -1,49 +0,0 @@
-import inspect
-
-import pytest
-from mktestdocs import check_docstring, get_codeblock_members
-
-import kernels
-
-
-def all_public_functions():
-    function_list = inspect.getmembers(kernels, inspect.isfunction)
-    return [func for _, func in function_list]
-
-
-def all_public_classes():
-    class_list = inspect.getmembers(kernels, inspect.isclass)
-    return [cls for _, cls in class_list]
-
-
-def all_public_class_members():
-    members = get_codeblock_members(*all_public_classes())
-    return members
-
-
-@pytest.mark.cuda_only
-@pytest.mark.parametrize(
-    "func",
-    all_public_functions(),
-    ids=lambda d: d.__name__,
-)
-def test_func_docstring(func):
-    check_docstring(obj=func)
-
-
-@pytest.mark.cuda_only
-@pytest.mark.parametrize(
-    "cls",
-    all_public_classes(),
-    ids=lambda d: d.__name__,
-)
-def test_class_docstring(cls):
-    check_docstring(obj=cls)
-
-
-@pytest.mark.cuda_only
-@pytest.mark.parametrize(
-    "member", all_public_class_members(), ids=lambda d: d.__qualname__
-)
-def test_member_docstring(member):
-    check_docstring(member)
@@ -2,17 +2,9 @@ from dataclasses import dataclass
 from pathlib import Path
 
 import pytest
-import torch.nn as nn
 
 from kernels import load_kernel
 from kernels.cli import download_kernels
-from kernels.layer import (
-    LockedLayerRepository,
-    Mode,
-    kernelize,
-    use_kernel_forward_from_hub,
-    use_kernel_mapping,
-)
 
 
 # Mock download arguments class.

@@ -27,34 +19,9 @@ def test_download_all_hash_validation():
     download_kernels(DownloadArgs(all_variants=True, project_dir=project_dir))
 
 
-@pytest.mark.cuda_only
+@pytest.mark.linux_only
 def test_load_locked():
     project_dir = Path(__file__).parent / "kernel_locking"
     # Also validates that hashing works correctly.
     download_kernels(DownloadArgs(all_variants=False, project_dir=project_dir))
     load_kernel("kernels-community/activation", lockfile=project_dir / "kernels.lock")
-
-
-def test_layer_locked():
-    project_dir = Path(__file__).parent / "layer_locking"
-
-    @use_kernel_forward_from_hub("Version")
-    class Version(nn.Module):
-        def forward(self) -> str:
-            return "0.0.0"
-
-    version = Version()
-
-    with use_kernel_mapping(
-        {
-            "Version": {
-                "cuda": LockedLayerRepository(
-                    repo_id="kernels-test/versions",
-                    layer_name="Version",
-                    lockfile=project_dir / "kernels.lock",
-                )
-            },
-        }
-    ):
-        version = kernelize(version, device="cuda", mode=Mode.INFERENCE)
-        assert version() == "0.1.1"
|
|||||||
from torch.nn import functional as F
|
from torch.nn import functional as F
|
||||||
|
|
||||||
from kernels import (
|
from kernels import (
|
||||||
CUDAProperties,
|
|
||||||
Device,
|
Device,
|
||||||
LayerRepository,
|
LayerRepository,
|
||||||
LocalLayerRepository,
|
|
||||||
Mode,
|
Mode,
|
||||||
kernelize,
|
kernelize,
|
||||||
register_kernel_mapping,
|
register_kernel_mapping,
|
||||||
use_kernel_forward_from_hub,
|
use_kernel_forward_from_hub,
|
||||||
use_kernel_mapping,
|
|
||||||
)
|
)
|
||||||
from kernels.layer import (
|
from kernels.layer import (
|
||||||
_KERNEL_MAPPING,
|
_KERNEL_MAPPING,
|
||||||
|
CUDAProperties,
|
||||||
_validate_layer,
|
_validate_layer,
|
||||||
|
use_kernel_mapping,
|
||||||
)
|
)
|
||||||
from kernels.utils import install_kernel
|
|
||||||
|
|
||||||
kernel_layer_mapping = {
|
kernel_layer_mapping = {
|
||||||
"SiluAndMul": {
|
"SiluAndMul": {
|
||||||
@@ -34,11 +32,7 @@ kernel_layer_mapping = {
         "cuda": LayerRepository(
             repo_id="kernels-test/op-without-fake-test",
             layer_name="SiluAndMul",
-        ),
-        "rocm": LayerRepository(
-            repo_id="kernels-test/op-without-fake-test",
-            layer_name="SiluAndMul",
-        ),
+        )
     },
     "SiluAndMulStringDevice": {
         "cuda": LayerRepository(

@@ -108,74 +102,29 @@ def test_arg_kinds():
     assert arg_kind("foo", "bar", kwarg1="baz", kwarg2=5) == ("foo", "bar", "baz", 5)
 
 
-@pytest.mark.cuda_only
+@pytest.mark.linux_only
 @pytest.mark.parametrize("cls", [SiluAndMulWithKernel, SiluAndMulStringDevice])
-def test_hub_forward(cls):
+@pytest.mark.parametrize("device", ["cuda", "cpu"])
+def test_hub_forward(cls, device):
     torch.random.manual_seed(0)
 
     silu_and_mul = SiluAndMul()
-    X = torch.randn((32, 64), device="cuda")
+    X = torch.randn((32, 64), device=device)
     Y = silu_and_mul(X)
 
-    silu_and_mul_with_kernel = kernelize(cls(), device="cuda", mode=Mode.INFERENCE)
+    silu_and_mul_with_kernel = kernelize(cls(), device=device, mode=Mode.INFERENCE)
     Y_kernel = silu_and_mul_with_kernel(X)
 
     torch.testing.assert_close(Y_kernel, Y)
 
     assert silu_and_mul.n_calls == 1
-    assert silu_and_mul_with_kernel.n_calls == 0
-
-
-@pytest.mark.rocm_only
-def test_hub_forward_rocm():
-    torch.manual_seed(0)
-
-    silu_and_mul = SiluAndMul()
-    X = torch.randn((32, 64))
-    Y = silu_and_mul(X)
-
-    silu_and_mul_with_kernel = kernelize(
-        SiluAndMulNoCompileKernel(), device="rocm", mode=Mode.INFERENCE
-    )
-    Y_kernel = silu_and_mul_with_kernel(X)
-
-    torch.testing.assert_close(Y_kernel, Y)
-
-    assert silu_and_mul.n_calls == 1
-    # Should use kernel (n_calls == 0) if ROCm kernel is available, otherwise fallback (n_calls == 1)
-    # The exact behavior depends on whether the test kernel exists for ROCm
-    assert silu_and_mul_with_kernel.n_calls in [0, 1]
-
-
-def test_rocm_kernel_mapping():
-    """Test that ROCm shorthand device mapping works correctly."""
-    kernel_layer_mapping = {
-        "SiluAndMul": {
-            "rocm": LayerRepository(
-                repo_id="kernels-community/activation",
-                layer_name="SiluAndMul",
-            )
-        }
-    }
-
-    # Test that the mapping is processed correctly
-    with use_kernel_mapping(kernel_layer_mapping, inherit_mapping=False):
-        mapping = _KERNEL_MAPPING.get()
-
-        # Verify the mapping exists
-        assert "SiluAndMul" in mapping
-        assert "rocm" in mapping["SiluAndMul"]
-
-        # Verify the repository is correctly stored
-        rocm_repos = mapping["SiluAndMul"]["rocm"]
-        assert rocm_repos is not None
-        assert (
-            rocm_repos.repos[Mode.FALLBACK]._repo_id == "kernels-community/activation"
-        )
-        assert rocm_repos.repos[Mode.FALLBACK].layer_name == "SiluAndMul"
-
-
-@pytest.mark.cuda_only
+    if device == "cuda":
+        assert silu_and_mul_with_kernel.n_calls == 0
+    else:
+        assert silu_and_mul_with_kernel.n_calls == 1
+
+
+@pytest.mark.linux_only
 def test_capability():
     linear = TorchLinearWithCounter(32, 32).to("cuda")
     with use_kernel_mapping(
@@ -234,33 +183,7 @@ def test_layer_fallback_works():
     kernelize(silu_and_mul, device="cuda", mode=Mode.INFERENCE)
 
 
-def test_local_layer_repo():
-    # Fetch a kernel to the local cache.
-    package_name, path = install_kernel("kernels-test/backward-marker-test", "main")
-
-    linear = TorchLinearWithCounter(32, 32).to("cuda")
-
-    with use_kernel_mapping(
-        {
-            "Linear": {
-                "cuda": LocalLayerRepository(
-                    # install_kernel will give the fully-resolved path.
-                    repo_path=path.parent.parent,
-                    package_name=package_name,
-                    layer_name="LinearBackward",
-                )
-            }
-        },
-        inherit_mapping=False,
-    ):
-        kernelize(linear, mode=Mode.INFERENCE)
-
-        X = torch.randn(10, 32, device="cuda")
-        linear(X)
-        assert linear.n_calls == 0
-
-
-@pytest.mark.cuda_only
+@pytest.mark.linux_only
 @pytest.mark.parametrize("cls", [SiluAndMulWithKernel, SiluAndMulNoCompileKernel])
 @pytest.mark.parametrize("device", ["cuda"])
 def test_torch_compile_layer_without_fallback(cls, device):
@@ -291,7 +214,7 @@ def test_torch_compile_layer_without_fallback(cls, device):
     torch.testing.assert_close(Y_compiled, Y)
 
 
-@pytest.mark.cuda_only
+@pytest.mark.linux_only
 @pytest.mark.parametrize("cls", [SiluAndMulWithKernel, SiluAndMulNoCompileKernel])
 @pytest.mark.parametrize("device", ["cuda"])
 def test_torch_compile_layer_with_fallback(cls, device):

@@ -314,11 +237,8 @@ def test_torch_compile_layer_with_fallback(cls, device):
     torch.testing.assert_close(Y_compiled, Y)
 
 
-@pytest.mark.cuda_only
+@pytest.mark.linux_only
 def test_mapping_contexts():
-    # Make sure we start from scratch.
-    register_kernel_mapping(kernel_layer_mapping, inherit_mapping=False)
-
     assert set(_KERNEL_MAPPING.get().keys()) == {
         "SiluAndMul",
         "SiluAndMulStringDevice",

@@ -361,9 +281,7 @@ def test_mapping_contexts():
         "TestKernel",
     }
     assert (
-        _KERNEL_MAPPING.get()["SiluAndMul"]["cuda"]
-        .repos[Mode.FALLBACK]
-        ._repo_id
+        _KERNEL_MAPPING.get()["SiluAndMul"]["cuda"].repos[Mode.FALLBACK].repo_id
         == "kernels-community/non-existing"
     )
 

@@ -374,7 +292,7 @@ def test_mapping_contexts():
         "TestKernel",
     }
     assert (
-        _KERNEL_MAPPING.get()["SiluAndMul"]["cuda"].repos[Mode.FALLBACK]._repo_id
+        _KERNEL_MAPPING.get()["SiluAndMul"]["cuda"].repos[Mode.FALLBACK].repo_id
         == "kernels-community/activation"
     )
 

@@ -383,9 +301,7 @@ def test_mapping_contexts():
         "SiluAndMul",
     }
     assert (
-        _KERNEL_MAPPING.get()["SiluAndMul"]["cuda"]
-        .repos[Mode.FALLBACK]
-        ._repo_id
+        _KERNEL_MAPPING.get()["SiluAndMul"]["cuda"].repos[Mode.FALLBACK].repo_id
        == "kernels-community/non-existing"
     )
 

@@ -396,7 +312,7 @@ def test_mapping_contexts():
         "TestKernel",
     }
     assert (
-        _KERNEL_MAPPING.get()["SiluAndMul"]["cuda"].repos[Mode.FALLBACK]._repo_id
+        _KERNEL_MAPPING.get()["SiluAndMul"]["cuda"].repos[Mode.FALLBACK].repo_id
         == "kernels-community/activation"
     )
 
@@ -435,7 +351,7 @@ def test_validate_kernel_layer():
         _validate_layer(cls=BadLayer4, check_cls=SiluAndMul)
 
 
-@pytest.mark.cuda_only
+@pytest.mark.linux_only
 def test_invalid_mode_for_mapping_rejected():
     linear = TorchLinearWithCounter(32, 32).to("cuda")
 

@@ -455,7 +371,7 @@ def test_invalid_mode_for_mapping_rejected():
         kernelize(linear, mode=Mode.TRAINING)
 
 
-@pytest.mark.cuda_only
+@pytest.mark.linux_only
 def test_kernel_modes():
     linear = TorchLinearWithCounter(32, 32).to("cuda")
 

@@ -484,6 +400,11 @@ def test_kernel_modes():
     linear(X)
     assert linear.n_calls == 0
 
+    # Same as previous, since TRAINING | TORCH_COMPILE is the default.
+    kernelize(linear)
+    linear(X)
+    assert linear.n_calls == 0
+
     # Case 2: register a kernel just for training. If no base kernel
     # layer is registered, we fall back to the original layer.
     with use_kernel_mapping(

@@ -513,6 +434,12 @@ def test_kernel_modes():
         # TRAINING | TORCH_COMPILE cannot fall back to TRAINING kernel, so uses original.
         assert linear.n_calls == 1
 
+        # Same as previous, since TRAINING | TORCH_COMPILE is the default.
+        kernelize(linear)
+        linear(X)
+        # TRAINING | TORCH_COMPILE cannot fall back to TRAINING kernel, so uses original.
+        assert linear.n_calls == 2
+
     # Case 3: register a kernel just for training and one for fallback.
     with use_kernel_mapping(
         {

@@ -534,17 +461,23 @@ def test_kernel_modes():
         X = torch.randn(10, 32, device="cuda")
         linear(X)
         # Falls back to TRAINING.
-        assert linear.n_calls == 1
+        assert linear.n_calls == 2
 
         kernelize(linear, mode=Mode.TRAINING)
         linear(X)
         # Falls back to the TRAINING kernel.
-        assert linear.n_calls == 1
+        assert linear.n_calls == 2
 
         kernelize(linear, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
         linear(X)
         # TRAINING | TORCH_COMPILE falls back to FALLBACK kernel.
-        assert linear.n_calls == 1
+        assert linear.n_calls == 2
+
+        # Same as previous, since TRAINING | TORCH_COMPILE is the default.
+        kernelize(linear)
+        linear(X)
+        # TRAINING | TORCH_COMPILE falls back to FALLBACK kernel.
+        assert linear.n_calls == 2
 
     # Case 4: register a kernel with two preferences.
     with use_kernel_mapping(

@@ -564,20 +497,25 @@ def test_kernel_modes():
         X = torch.randn(10, 32, device="cuda")
         linear(X)
         # Falls back to the TRAINING | TORCH_COMPILE kernel.
-        assert linear.n_calls == 1
+        assert linear.n_calls == 2
 
         kernelize(linear, mode=Mode.TRAINING)
         linear(X)
         # TRAINING can fall back to TRAINING | TORCH_COMPILE kernel.
-        assert linear.n_calls == 1
+        assert linear.n_calls == 2
 
         kernelize(linear, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
         linear(X)
         # Uses TRAINING | TORCH_COMPILE kernel.
-        assert linear.n_calls == 1
+        assert linear.n_calls == 2
+
+        kernelize(linear)
+        linear(X)
+        # Same as previous, since TRAINING | TORCH_COMPILE is the default.
+        assert linear.n_calls == 2
 
 
-@pytest.mark.cuda_only
+@pytest.mark.linux_only
 def test_fallback_used_when_training():
     linear = TorchLinearWithCounter(32, 32).to("cuda")
 
@@ -642,7 +580,7 @@ def test_invalid_mode_rejected():
         kernelize(torch.nn.Linear(32, 32), mode=Mode.TORCH_COMPILE)
 
 
-@pytest.mark.cuda_only
+@pytest.mark.linux_only
 def test_kernel_modes_inference():
     """Test inference-specific fallback scenarios."""
     linear = TorchLinearWithCounter(32, 32).to("cuda")

@@ -739,7 +677,7 @@ def test_kernel_modes_inference():
     assert linear.n_calls == 4
 
 
-@pytest.mark.cuda_only
+@pytest.mark.linux_only
 def test_kernel_modes_mixed():
     """Test mixed training and inference kernel scenarios."""
     linear = TorchLinearWithCounter(32, 32).to("cuda")

@@ -829,7 +767,7 @@ def test_kernel_modes_mixed():
     assert linear.n_calls == 2
 
 
-@pytest.mark.cuda_only
+@pytest.mark.linux_only
 def test_kernel_modes_cross_fallback():
     """Test cross-mode fallback scenarios from inference to training modes."""
     linear = TorchLinearWithCounter(32, 32).to("cuda")

@@ -863,8 +801,7 @@ def test_kernel_modes_cross_fallback():
        {
             "Linear": {
                 "cuda": {
-                    Mode.TRAINING
-                    | Mode.TORCH_COMPILE: LayerRepository(
+                    Mode.TRAINING | Mode.TORCH_COMPILE: LayerRepository(
                         repo_id="kernels-test/backward-marker-test",
                         layer_name="LinearBackward",
                     )

@@ -902,8 +839,7 @@ def test_kernel_modes_cross_fallback():
                         repo_id="kernels-test/backward-marker-test",
                         layer_name="LinearBackward",
                     ),
-                    Mode.INFERENCE
-                    | Mode.TORCH_COMPILE: LayerRepository(
+                    Mode.INFERENCE | Mode.TORCH_COMPILE: LayerRepository(
                         repo_id="kernels-test/backward-marker-test",
                         layer_name="LinearBackward",
                     ),
|
|||||||
linear(X)
|
linear(X)
|
||||||
# TRAINING | TORCH_COMPILE should NOT fall back to inference kernels, use original
|
# TRAINING | TORCH_COMPILE should NOT fall back to inference kernels, use original
|
||||||
assert linear.n_calls == 2
|
assert linear.n_calls == 2
|
||||||
|
|
||||||
|
|
||||||
def test_layer_versions():
|
|
||||||
@use_kernel_forward_from_hub("Version")
|
|
||||||
class Version(nn.Module):
|
|
||||||
def forward(self) -> str:
|
|
||||||
return "0.0.0"
|
|
||||||
|
|
||||||
version = Version()
|
|
||||||
|
|
||||||
with use_kernel_mapping(
|
|
||||||
{
|
|
||||||
"Version": {
|
|
||||||
Device(type="cuda"): LayerRepository(
|
|
||||||
repo_id="kernels-test/versions",
|
|
||||||
layer_name="Version",
|
|
||||||
)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
):
|
|
||||||
version = kernelize(version, device="cuda", mode=Mode.INFERENCE)
|
|
||||||
assert version() == "0.2.0"
|
|
||||||
|
|
||||||
with use_kernel_mapping(
|
|
||||||
{
|
|
||||||
"Version": {
|
|
||||||
Device(type="cuda"): LayerRepository(
|
|
||||||
repo_id="kernels-test/versions",
|
|
||||||
layer_name="Version",
|
|
||||||
version="<1.0.0",
|
|
||||||
)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
):
|
|
||||||
version = kernelize(version, device="cuda", mode=Mode.INFERENCE)
|
|
||||||
assert version() == "0.2.0"
|
|
||||||
|
|
||||||
with use_kernel_mapping(
|
|
||||||
{
|
|
||||||
"Version": {
|
|
||||||
Device(type="cuda"): LayerRepository(
|
|
||||||
repo_id="kernels-test/versions",
|
|
||||||
layer_name="Version",
|
|
||||||
version="<0.2.0",
|
|
||||||
)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
):
|
|
||||||
version = kernelize(version, device="cuda", mode=Mode.INFERENCE)
|
|
||||||
assert version() == "0.1.1"
|
|
||||||
|
|
||||||
with use_kernel_mapping(
|
|
||||||
{
|
|
||||||
"Version": {
|
|
||||||
Device(type="cuda"): LayerRepository(
|
|
||||||
repo_id="kernels-test/versions",
|
|
||||||
layer_name="Version",
|
|
||||||
version=">0.1.0,<0.2.0",
|
|
||||||
)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
):
|
|
||||||
version = kernelize(version, device="cuda", mode=Mode.INFERENCE)
|
|
||||||
assert version() == "0.1.1"
|
|
||||||
|
|
||||||
with use_kernel_mapping(
|
|
||||||
{
|
|
||||||
"Version": {
|
|
||||||
Device(type="cuda"): LayerRepository(
|
|
||||||
repo_id="kernels-test/versions",
|
|
||||||
layer_name="Version",
|
|
||||||
version=">0.2.0",
|
|
||||||
)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
):
|
|
||||||
with pytest.raises(ValueError, match=r"No version.*satisfies requirement"):
|
|
||||||
kernelize(version, device="cuda", mode=Mode.INFERENCE)
|
|
||||||
|
|
||||||
with pytest.raises(ValueError, match=r"Either a revision or a version.*not both"):
|
|
||||||
use_kernel_mapping(
|
|
||||||
{
|
|
||||||
"Version": {
|
|
||||||
Device(type="cuda"): LayerRepository(
|
|
||||||
repo_id="kernels-test/versions",
|
|
||||||
layer_name="Version",
|
|
||||||
revision="v0.1.0",
|
|
||||||
version="<1.0.0",
|
|
||||||
)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
)
|
|
||||||