Compare commits

...

30 Commits

SHA1 Message Date
ed048616fe Set version to 0.10.4.dev0 (#169) 2025-10-16 20:21:35 +02:00
b182cd3458 feat: allow get_kernel to log telemetry. (#167)
* feat: allow get_kernel to log telemetry.

* Apply suggestions from code review

Co-authored-by: Daniël de Kok <me@danieldk.eu>

* doc

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2025-10-16 20:16:41 +02:00
ce77658efc fix: kernels upload to a repo branch (#168)
* fix: kernels upload to a repo branch

* up
2025-10-16 16:01:00 +02:00
b96b154e7f Avoid exception when detecting XPU on Torch <= 2.6 (#165)
torch.version has no xpu field in torch<=2.6

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-10-14 09:01:53 +02:00
b24ef9fa6b Set version to 0.10.3.dev0 (#164) 2025-10-13 17:23:39 +02:00
a7101b2cfd feat: allow kernels to be uploaded to a revision (#161)
* feat: allow kernels to be uploaded to a revision

* revision -> branch
2025-10-13 10:31:11 +02:00
6241afa06e Bump torch version in runner (#162)
* bump torch version

* run kernels lock tests/kernel_locking
2025-10-09 11:04:52 +02:00
34a1932751 Link local kernel and local/locked kernel API docs (#160) 2025-10-02 14:38:47 +02:00
e39eac09c1 up (#159) 2025-09-30 17:42:09 +02:00
b0c431fee4 Add the kernels check subcommand (#158)
* Add the `kernels check` subcommand

This subcommand checks a given kernel. Currently it applies the same ABI
checks as `kernel-abi-check` in `kernel-builder`.

* Print an error when `build` contains files

* Forgot to update has_issues in two places
2025-09-25 19:05:29 +02:00
9a188eadbe up (#157) 2025-09-24 11:39:07 +02:00
457c7c1b8d Only run staging tests in one configuration (#156) 2025-09-23 10:52:47 +02:00
fb8cd99a2c Add support for NPU kernelize/layers (#155)
This change adds support for Huawei Ascend NPUs. This is #146 with some formatting/typing fixes.

Co-authored-by: zheliuyu <15750543867@163.com>
2025-09-23 10:46:41 +02:00
dfee307d54 Set version to 0.10.2.dev0 (#154) 2025-09-22 18:54:09 +02:00
93e5765611 [tests] turn the kernels upload tests to be staging tests (#152) 2025-09-22 18:53:53 +02:00
bf488208be faq: why only replace forward methods? (#153) 2025-09-19 17:38:03 +02:00
2a14472e4c Bump huggingface_hub upper bound <2.0 (#151) 2025-09-19 16:56:30 +02:00
055a953552 Document the to-wheel subcommand (#149)
* Document the `to-wheel` subcommand

* Capitalization
2025-09-17 17:02:41 +02:00
692d5ad458 Fix some spelling errors to check docs CI is working (#120) 2025-09-17 13:44:09 +02:00
2139df57f4 rm link (#148) 2025-09-17 12:46:49 +02:00
8f9a77bb6a Describe the get_kernel/LayerRepository (#147)
This was already in the API documentation, but describe this in the
guides as well (since we want people to use versions).
2025-09-16 16:06:40 +02:00
6c00194680 Improve errors for layer validation (#145)
* Improve errors for layer validation

Include the repo and layer name as well as the name of the class
that is being compared to (when applicable).

* Remove upload xfail

* Only enable tests that require a token with `--token`
2025-09-16 14:40:54 +02:00
d6b51eefb7 [feat] add an uploading utility (#138)
* add an uploading utility.

* format

* remove stale files.

* black format

* sorted imports.

* up

* up

* add a test

* propagate.

* remove duplicate imports.

* Apply suggestions from code review

Co-authored-by: Daniël de Kok <me@danieldk.eu>

* up

* up

* up

* command to format all files at once would be nice.

* up

* up

* up

* Use token for upload test

* assign env better.

* docs

* polish

* up

* xfail the test for now.

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2025-09-16 08:56:54 +02:00
d383fdd4b4 Add support for XPU layer repositories (#142)
This change adds support for XPU layer repositories, e.g.:

```
kernel_mapping = {
    "LigerRMSNorm": {
        "xpu": LayerRepository(
            repo_id="kernels-community/liger_kernels",
            layer_name="LigerRMSNorm",
        )
    },
}
```

Co-authored-by: YangKai0616 <kai.yang@intel.com>
2025-09-11 15:51:02 +02:00
07e5e8481a Set version to 0.10.1.dev0 (#140)
* Set version to 0.10.1.dev0

* Add `__version__` attribute to top-level module

This is needed for doc generation.
2025-09-10 09:08:02 +02:00
88f55d4728 XPU: look up kernel by framework version (#139)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-09-09 13:10:11 +02:00
e801ebf332 Set version to v0.10.0.dev0 (#137) 2025-09-05 10:48:41 +02:00
0ae07f05fc Remove default for mode argument of kernelize (#136) 2025-08-29 17:44:20 +02:00
7611021100 cpu is not (yet) a supported device type (#132)
Fixes #131.
2025-08-25 16:25:58 +02:00
767e7ccf13 fix: add get local tests (#134)
* fix: add tests for get local kernel

* fix: update test and add path example comments

* fix: run black linter
2025-08-21 13:01:48 -04:00
27 changed files with 1009 additions and 181 deletions

View File

@@ -24,7 +24,7 @@ jobs:
       max-parallel: 4
       matrix:
         python-version: ["3.10", "3.12"]
-        torch-version: ["2.6.0", "2.7.0"]
+        torch-version: ["2.7.0", "2.8.0"]
     env:
       UV_PYTHON_PREFERENCE: only-managed
@@ -51,7 +51,15 @@
         run: uv run mypy src/kernels
       - name: Run tests
-        run: uv run pytest tests
+        run: |
+          uv run pytest tests
+      - name: Run staging tests
+        env:
+          HF_TOKEN: ${{ secrets.HF_STAGING_TOKEN }}
+        run: |
+          HUGGINGFACE_CO_STAGING=true uv run pytest --token -m "is_staging_test" tests/
+        if: matrix.python_version == '3.10' && matrix.torch-version == '2.7.0'
       - name: Check kernel conversion
         run: |
@@ -65,6 +73,11 @@
         run: |
           uv run kernels generate-readme kernels-community/triton-layer-norm
+      - name: Check kernel check
+        run: |
+          uv pip install kernel-abi-check
+          kernels check kernels-community/activation
       - name: Import check without torch
         run: |
           uv pip uninstall torch

Makefile (new file, 8 additions)
View File

@ -0,0 +1,8 @@
.PHONY: style
export check_dirs := src examples tests

style:
	black ${check_dirs}
	isort ${check_dirs}
	ruff check ${check_dirs} --fix
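With this target in place, a single invocation runs all three tools over `src`, `examples`, and `tests`:

```bash
make style
```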

View File

@@ -62,7 +62,6 @@ the Hub.
 - [Using layers](docs/source/layers.md)
 - [Locking kernel/layer versions](docs/source/locking.md)
 - [Environment variables](docs/source/env.md)
-- [Using kernels in a Docker container](docs/source/docker.md)
 - [Kernel requirements](docs/source/kernel-requirements.md)
 - [Frequently Asked Questions](docs/source/faq.md)
 - [Writing kernels](https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md) using [kernel-builder](https://github.com/huggingface/kernel-builder/)

View File

@@ -21,6 +21,8 @@
     title: Kernels
   - local: api/layers
     title: Layers
+  - local: cli
+    title: Kernels CLI
   title: API Reference
 - sections:
   - local: kernel-requirements

View File

@@ -6,6 +6,10 @@
 [[autodoc]] kernels.get_kernel

+### get_local_kernel
+
+[[autodoc]] kernels.get_local_kernel
+
 ### has_kernel

 [[autodoc]] kernels.has_kernel

View File

@@ -39,3 +39,11 @@
 ### LayerRepository

 [[autodoc]] kernels.LayerRepository
+
+### LocalLayerRepository
+
+[[autodoc]] kernels.LocalLayerRepository
+
+### LockedLayerRepository
+
+[[autodoc]] kernels.LockedLayerRepository

View File

@@ -21,6 +21,22 @@ activation.gelu_fast(y, x)
 print(y)
 ```

+### Using version bounds
+
+Kernels are versioned using tags of the form `v<major>.<minor>.<patch>`.
+You can specify which version to download using Python version specifiers:
+
+```python
+import torch
+from kernels import get_kernel
+
+activation = get_kernel("kernels-community/activation", version=">=0.0.4,<0.1.0")
+```
+
+This will get the latest kernel tagged `v0.0.z` where `z` is at least 4. It
+is strongly recommended to specify a version bound, since a kernel author
+might push incompatible changes to the `main` branch.
+
 ## Checking Kernel Availability

 You can check if a specific kernel is available for your environment:

docs/source/cli.md (new file, 58 additions)
View File

@ -0,0 +1,58 @@
# Kernels CLI Reference
## Main Functions
### kernels check
You can use `kernels check` to test compliance of a kernel on the Hub.
This currently checks that the kernel:
- Supports the currently-required Python ABI version.
- Works on supported operating system versions.
For example:
```bash
$ kernels check kernels-community/flash-attn3
Checking variant: torch28-cxx11-cu128-aarch64-linux
🐍 Python ABI 3.9 compatible
🐧 manylinux_2_28 compatible
[...]
```
### kernels to-wheel
We strongly recommend downloading kernels from the Hub using the `kernels`
package, since this has significant [benefits](index.md) over using Python
wheels. That said, some projects may require deploying kernels as wheels.
The `kernels` utility provides a simple solution: you can convert any Hub
kernel into a set of wheels with the `to-wheel` command:
```bash
$ kernels to-wheel drbh/img2grey 1.1.2
☸ img2grey-1.1.2+torch27cu128cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch26cu124cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch26cu126cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch27cu126cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch26cu126cxx98-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch27cu128cxx11-cp39-abi3-manylinux_2_28_aarch64.whl
☸ img2grey-1.1.2+torch26cu126cxx98-cp39-abi3-manylinux_2_28_aarch64.whl
☸ img2grey-1.1.2+torch27cu126cxx11-cp39-abi3-manylinux_2_28_aarch64.whl
☸ img2grey-1.1.2+torch26cu126cxx11-cp39-abi3-manylinux_2_28_aarch64.whl
☸ img2grey-1.1.2+torch26cu118cxx98-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch26cu124cxx98-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch26cu118cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch27cu118cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
```
### kernels upload

Use `kernels upload <dir_containing_build> --repo_id="hub-username/kernel"` to upload
your kernel builds to the Hub. Run `kernels upload -h` to see the supported arguments.

**Notes**:

- This will take care of creating a repository on the Hub with the provided `repo_id`.
- If a repo with the given `repo_id` already exists and contains a `build` directory with the
  build variant being uploaded, the existing files under that variant will be deleted first.
- Make sure you are authenticated (run `hf auth login` if needed) to upload to the Hub.
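As an illustration (the directory and repo names below are hypothetical), a typical invocation looks like:

```bash
# Upload every build variant under result/build to the Hub.
kernels upload result --repo_id="my-username/my-kernel"

# Upload to a branch of a private repository instead of `main`.
kernels upload result --repo_id="my-username/my-kernel" --branch="v0.1.0" --private
```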

View File

@@ -1,6 +1,8 @@
 # FAQ

-## Why is the kernelization step needed?
+## Kernel layers
+
+### Why is the kernelization step needed as a separate step?

 In earlier versions of `kernels`, a layer's `forward` method was replaced
 by `use_kernel_forward_from_hub` and `replace_kernel_forward_from_hub`.
@@ -11,3 +13,39 @@ on data-dependent branching.
 To avoid branching, we have to make dispatch decisions ahead of time,
 which is what the `kernelize` function does.
+
+### Why does kernelization only replace `forward` methods?
+
+There are some other possible approaches. The first is to completely
+replace existing layers by kernel layers. However, since this would
+permit free-form layer classes, it would be much harder to validate
+that layers are fully compatible with the layers that they are
+replacing. For instance, they could have completely different member
+variables. Besides that, we would also need to hold on to the original
+layers, in case we need to revert to the base layers when the model
+is `kernelize`d again with different options.
+
+A second approach would be to make an auxiliary layer that wraps the
+original layer and the kernel layer and dispatches to the kernel layer.
+This wouldn't have the issues of the first approach, because kernel layers
+could remain as strict as they are now, and we would still have access
+to the original layers when `kernelize`-ing the model again. However,
+this would change the graph structure of the model and would break use
+cases where programs access the model internals (e.g.
+`model.layers[0].attention.query_weight`) or rely on the graph structure
+in other ways.
+
+The approach of `forward`-replacement is the least invasive, because
+it preserves the original model graph. It is also reversible, since
+even though the `forward` of a layer _instance_ might be replaced,
+the corresponding class still has the original `forward`.
+
+## Misc
+
+### How can I disable kernel reporting in the user-agent?
+
+By default, we collect telemetry when a call to `get_kernel()` is made.
+This only includes the `kernels` version, `torch` version, and the build
+information for the kernel being requested.
+
+You can disable this by setting `export DISABLE_TELEMETRY=yes`.

View File

@@ -34,6 +34,8 @@ Kernels are versioned on the Hub using Git tags. Version tags must be of
 the form `v<major>.<minor>.<patch>`. Versions are used by [locking](./locking.md)
 to resolve the version constraints.

+We recommend using [semver](https://semver.org/) to version kernels.
+
 ## Native Python module

 Kernels will typically contain a native Python module with precompiled
@@ -44,19 +46,28 @@ have dynamic library dependencies outside:

 - Torch;
 - CUDA/ROCm libraries installed as dependencies of Torch.

+## Compatibility with torch.compile
+
+The Kernel Hub also encourages writing kernels in a `torch.compile`-compliant
+way. This helps ensure that kernels are compatible with `torch.compile`
+without introducing graph breaks or triggering recompilation, which can limit
+the benefits of compilation.
+
+[Here](https://github.com/huggingface/kernel-builder/blob/d1ee9bf9301ac8c5199099d90ee1c9d5c789d5ba/examples/relu-backprop-compile/tests/test_relu.py#L162) is a simple test example which checks for graph breaks and
+recompilation triggers during `torch.compile`.
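As a minimal sketch of such a check (independent of the linked test; `model` and `x` are assumed to be a kernelized module and a sample input), `fullgraph=True` turns any graph break into a hard error:

```python
import torch

def assert_no_graph_breaks(model: torch.nn.Module, x: torch.Tensor) -> None:
    # With fullgraph=True, torch.compile raises instead of silently
    # falling back to eager mode when a kernel causes a graph break.
    compiled = torch.compile(model, fullgraph=True)
    compiled(x)
```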
 ### Linux

 - Use [ABI3/Limited API](https://docs.python.org/3/c-api/stable.html#stable-application-binary-interface)
   for compatibility with Python 3.9 and later.
 - Compatible with [`manylinux_2_28`](https://github.com/pypa/manylinux?tab=readme-ov-file#manylinux_2_28-almalinux-8-based).
   This means that the extension **must not** use symbols versions higher than:
   - GLIBC 2.28
   - GLIBCXX 3.4.24
   - CXXABI 1.3.11
   - GCC 7.0.0
-These requirement can be checked with the ABI checker (see below).
+These requirements can be checked with the ABI checker (see below).

 ### macOS

View File

@@ -5,7 +5,7 @@ the Hub can replace the `forward` method of an existing layer for a certain
 device type. This makes it possible to provide more performant kernels for
 existing layers.

-See [Kernel requirements](kernel-requirements.md) for more information the
+See [Kernel requirements](kernel-requirements.md) for more information on the
 requirements of Hub layers.

 ## Making a layer extensible with kernels from the hub
@@ -84,12 +84,6 @@ model = kernelize(model, mode=Mode.INFERENCE | Mode.TORCH_COMPILE)
 model = kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
 ```

-When the `mode` argument is not specified,
-`Mode.TRAINING | Mode.TORCH_COMPILE` is used as the default. This mode
-aligns most closely with pure PyTorch layers which also support training
-and `torch.compile`. However, to select the most performant kernels, it
-is often good to make the mode specific as possible.
-
 ### Kernel device

 Kernels can be registered per device type. For instance, separate `cuda` and
@@ -117,7 +111,7 @@ model = kernelize(model, mode=Mode.INFERENCE | Mode.TORCH_COMPILE, use_fallback=
 This can be useful if you want to guarantee that Hub kernels are used.

-### Inspecting kernels which kernels are used
+### Inspecting which kernels are used

 The kernels that are used are logged at the `INFO` level by `kernelize`.
 See the [Python logging](https://docs.python.org/3/library/logging.html)
@@ -157,12 +151,39 @@ used with the `use_kernel_mapping` context manager:
 ```python
 with use_kernel_mapping(kernel_layer_mapping):
     # Use the layer for which the mapping is applied.
-    model = kernelize(model)
+    model = kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
 ```

 This ensures that the mapping is not active anymore outside the
 `with`-scope.

+### Using version bounds
+
+Kernels are versioned using tags of the form `v<major>.<minor>.<patch>`.
+You can specify which version of the kernel to download using Python version
+specifiers:
+
+```python
+kernel_layer_mapping = {
+    "SiluAndMul": {
+        "cuda": LayerRepository(
+            repo_id="kernels-community/activation",
+            layer_name="SiluAndMul",
+            version=">=0.0.4,<0.1.0",
+        ),
+        "rocm": LayerRepository(
+            repo_id="kernels-community/activation",
+            layer_name="SiluAndMul",
+            version=">=0.0.4,<0.1.0",
+        )
+    }
+}
+```
+
+This will get the layer from the latest kernel tagged `v0.0.z` where `z` is at
+least 4. It is strongly recommended to specify a version bound, since a
+kernel author might push incompatible changes to the `main` branch.
+
 ### Registering kernels for specific modes

 You might want to register two different kernels for a particular layer,
@@ -265,7 +286,6 @@ Capabilities behave as follows:
   an existing kernel, the new kernel will replace the old kernel.
 - When there are multiple kernels that support a capability, the kernel
   with the smaller capability interval will be used. E.g. given:
   - `KernelA` with `min_capability=80` and `max_capability=89`;
   - `KernelB` with `min_capability=75` and `max_capability=89`;
   - `kernelize` runs on a system with capability 8.6.

View File

@@ -20,11 +20,11 @@ activation.gelu_fast(y, x)
 print("Kernel successfully executed")

 # Check results
-expected = torch.tensor([
-    [0.8408, 1.9551, 2.9961],
-    [4.0000, 5.0000, 6.0000],
-    [7.0000, 8.0000, 9.0000]
-], device='cuda:0', dtype=torch.float16)
+expected = torch.tensor(
+    [[0.8408, 1.9551, 2.9961], [4.0000, 5.0000, 6.0000], [7.0000, 8.0000, 9.0000]],
+    device="cuda:0",
+    dtype=torch.float16,
+)
 assert torch.allclose(y, expected)

 print("Calculated values are exact")

View File

@@ -24,6 +24,7 @@
     in
     {
       formatter = pkgs.nixfmt-tree;
+      packages.kernel-abi-check = pkgs.python3.pkgs.callPackage ./nix/kernel-abi-check.nix {};
       devShells = with pkgs; rec {
         default = mkShell {
           nativeBuildInputs = [
@@ -40,6 +41,7 @@
           ++ (with python3.pkgs; [
             docutils
             huggingface-hub
+            (callPackage ./nix/kernel-abi-check.nix {})
             mktestdocs
             pytest
             pytest-benchmark
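Assuming the flake evaluates these outputs per system (as the surrounding `devShells` suggest), the checker can then be built directly from the flake:

```bash
nix build .#kernel-abi-check
```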

nix/kernel-abi-check.nix (new file, 27 additions)
View File

@ -0,0 +1,27 @@
{
  buildPythonPackage,
  fetchPypi,
  rustPlatform,
}:

buildPythonPackage rec {
  pname = "kernel-abi-check";
  version = "0.6.2";

  src = fetchPypi {
    inherit version;
    pname = "kernel_abi_check";
    hash = "sha256-goWC7SK79FVNEvkp3bISBwbOqdSrmobANtrWIve9/Ys=";
  };

  cargoDeps = rustPlatform.fetchCargoVendor {
    inherit pname version src sourceRoot;
    hash = "sha256-+1jdbKsDKmG+bf0NEVYMv8t7Meuge1z2cgYfbdB9q8A=";
  };

  sourceRoot = "kernel_abi_check-${version}/bindings/python";

  pyproject = true;

  nativeBuildInputs = with rustPlatform; [ cargoSetupHook maturinBuildHook ];
}

View File

@@ -1,6 +1,6 @@
 [project]
 name = "kernels"
-version = "0.9.0.dev0"
+version = "0.10.4.dev0"
 description = "Download compute kernels"
 authors = [
     { name = "OlivierDehaene", email = "olivier@huggingface.co" },
@@ -12,7 +12,7 @@ license = { text = "Apache-2.0" }
 readme = "README.md"
 requires-python = ">= 3.9"
 dependencies = [
-    "huggingface_hub>=0.26.0,<1.0",
+    "huggingface_hub>=0.26.0,<2.0",
     "packaging>=20.0",
     "pyyaml>=6",
     "tomli>=2.0; python_version<'3.11'",
@@ -34,6 +34,7 @@ dev = [
 ]

 [project.optional-dependencies]
+abi-check = ["kernel-abi-check>=0.6.2,<0.7.0"]
 torch = ["torch"]
 docs = [
     "hf-doc-builder",
@@ -45,6 +46,9 @@ kernels = "kernels.cli:main"
 [project.entry-points."egg_info.writers"]
 "kernels.lock" = "kernels.lockfile:write_egg_lockfile"

+[tool.isort]
+profile = "black"
+line_length = 119
+
 [tool.ruff]
 exclude = [
@@ -71,4 +75,4 @@ line-length = 119
 # Ignored rules:
 # "E501" -> line length violation
 lint.ignore = ["E501"]
-lint.select = ["E", "F", "I", "W"]
+lint.select = ["E", "F", "W"]

View File

@@ -3,3 +3,7 @@ markers =
     cuda_only: marks tests that should only run on hosts with CUDA GPUs
     rocm_only: marks tests that should only run on hosts with ROCm GPUs
     darwin_only: marks tests that should only run on macOS
+    xpu_only: marks tests that should only run on hosts with Intel XPUs
+    npu_only: marks tests that should only run on Ascend NPUs
+    token: enable tests that require a write token
+    is_staging_test: marks tests that should only run in a staging environment

View File

@@ -1,3 +1,7 @@
+import importlib.metadata
+
+__version__ = importlib.metadata.version("kernels")
+
 from kernels.layer import (
     CUDAProperties,
     Device,
@@ -21,6 +25,7 @@ from kernels.utils import (
 )

 __all__ = [
+    "__version__",
     "CUDAProperties",
     "Device",
     "LayerRepository",

src/kernels/check.py (new file, 142 additions)
View File

@ -0,0 +1,142 @@
import sys
from pathlib import Path

from huggingface_hub import snapshot_download
from kernel_abi_check import (
    BinaryFormat,
    IncompatibleAbi3Symbol,
    IncompatibleMacOSVersion,
    IncompatibleManylinuxSymbol,
    MissingMacOSVersion,
    NonAbi3Symbol,
    ObjectFile,
)

from kernels.utils import CACHE_DIR


def check_kernel(
    *, macos: str, manylinux: str, python_abi: str, repo_id: str, revision: str
):
    variants_path = (
        Path(
            snapshot_download(
                repo_id,
                allow_patterns=["build/*"],
                cache_dir=CACHE_DIR,
                revision=revision,
            )
        )
        / "build"
    )

    has_issues = False
    for variant_path in variants_path.iterdir():
        if not variant_path.is_dir():
            print(
                f"⛔ `build/` must only contain directories, found: {variant_path.name}",
                file=sys.stderr,
            )
            has_issues = True
            continue

        print(f"Checking variant: {variant_path.name}", file=sys.stderr)

        indent = 2
        for dylib_path in variant_path.rglob("*.so"):
            print_with_indent(
                indent,
                f"Dynamic library {dylib_path.relative_to(variant_path)}:",
            )

            o = ObjectFile(dylib_path)
            has_issues |= check_abi3(o, python_abi, indent + 2)

            # TODO: also check operating system
            if o.format() == BinaryFormat.ELF:
                has_issues |= check_manylinux(o, manylinux, indent + 2)
            elif o.format() == BinaryFormat.MACH_O:
                has_issues |= check_macos(o, macos, indent + 2)

    if has_issues:
        sys.exit(1)


def check_abi3(object_file: ObjectFile, python_abi: str, indent: int) -> bool:
    has_issues = False
    violations = object_file.check_python_abi(python_abi)
    if violations != []:
        has_issues = True
        print_with_indent(
            indent,
            f"⛔ Found symbols that are incompatible with Python ABI {python_abi}:",
        )
        for violation in violations:
            if isinstance(violation, IncompatibleAbi3Symbol):
                print_with_indent(
                    indent + 3,
                    f"{violation.name}: {violation.version_added}",
                )
            elif isinstance(violation, NonAbi3Symbol):
                print_with_indent(
                    indent + 3,
                    f"{violation.name}",
                )
    else:
        print_with_indent(indent, f"🐍 Python ABI {python_abi} compatible")

    return has_issues


def check_macos(object_file: ObjectFile, macos: str, indent: int) -> bool:
    has_issues = False
    violations = object_file.check_macos(macos)
    if violations != []:
        has_issues = True
        print_with_indent(
            indent,
            f"⛔ Found incompatibility with macOS {macos}:",
        )
        for violation in violations:
            if isinstance(violation, MissingMacOSVersion):
                print_with_indent(
                    indent + 3,
                    "shared library does not contain macOS version",
                )
            elif isinstance(violation, IncompatibleMacOSVersion):
                print_with_indent(
                    indent + 3,
                    f"shared library requires macOS {violation.version}",
                )
    else:
        print_with_indent(indent, f"🍏 compatible with macOS {macos}")

    return has_issues


def check_manylinux(object_file: ObjectFile, manylinux: str, indent: int) -> bool:
    has_issues = False
    violations = object_file.check_manylinux(manylinux)
    if violations != []:
        has_issues = True
        print_with_indent(
            indent,
            f"⛔ Found symbols that are incompatible with {manylinux}:",
        )
        for violation in violations:
            if isinstance(violation, IncompatibleManylinuxSymbol):
                print_with_indent(
                    indent + 3,
                    f"{violation.name}_{violation.dep}: {violation.version}",
                )
    else:
        print_with_indent(indent, f"🐧 {manylinux} compatible")

    return has_issues


def print_with_indent(indent: int, message: str):
    print(f"{' ' * indent}{message}", file=sys.stderr)

View File

@@ -4,6 +4,8 @@ import json
 import sys
 from pathlib import Path

+from huggingface_hub import create_repo, upload_folder, create_branch
+
 from kernels.compat import tomllib
 from kernels.lockfile import KernelLock, get_kernel_locks
 from kernels.utils import install_kernel, install_kernel_all_variants
@@ -18,6 +20,31 @@ def main():
     )
     subparsers = parser.add_subparsers(required=True)

+    check_parser = subparsers.add_parser("check", help="Check a kernel for compliance")
+    check_parser.add_argument("repo_id", type=str, help="The kernel repo ID")
+    check_parser.add_argument(
+        "--revision",
+        type=str,
+        default="main",
+        help="The kernel revision (branch, tag, or commit SHA, defaults to 'main')",
+    )
+    check_parser.add_argument("--macos", type=str, help="macOS version", default="15.0")
+    check_parser.add_argument(
+        "--manylinux", type=str, help="Manylinux version", default="manylinux_2_28"
+    )
+    check_parser.add_argument(
+        "--python-abi", type=str, help="Python ABI version", default="3.9"
+    )
+    check_parser.set_defaults(
+        func=lambda args: check_kernel(
+            macos=args.macos,
+            manylinux=args.manylinux,
+            python_abi=args.python_abi,
+            repo_id=args.repo_id,
+            revision=args.revision,
+        )
+    )
+
     download_parser = subparsers.add_parser("download", help="Download locked kernels")
     download_parser.add_argument(
         "project_dir",
@@ -31,6 +58,29 @@ def main():
     )
     download_parser.set_defaults(func=download_kernels)

+    upload_parser = subparsers.add_parser("upload", help="Upload kernels to the Hub")
+    upload_parser.add_argument(
+        "kernel_dir",
+        type=Path,
+        help="Directory of the kernel build",
+    )
+    upload_parser.add_argument(
+        "--repo_id",
+        type=str,
+        help="Repository ID to use to upload to the Hugging Face Hub",
+    )
+    upload_parser.add_argument(
+        "--branch",
+        type=None,
+        help="If set, the upload will be made to a particular branch of the provided `repo_id`.",
+    )
+    upload_parser.add_argument(
+        "--private",
+        action="store_true",
+        help="If the repository should be private.",
+    )
+    upload_parser.set_defaults(func=upload_kernels)
+
     lock_parser = subparsers.add_parser("lock", help="Lock kernel revisions")
     lock_parser.add_argument(
         "project_dir",
@@ -153,8 +203,61 @@ def lock_kernels(args):
         json.dump(all_locks, f, cls=_JSONEncoder, indent=2)


+def upload_kernels(args):
+    # Resolve `kernel_dir` to be uploaded.
+    kernel_dir = Path(args.kernel_dir).resolve()
+    build_dir = kernel_dir / "build"
+    if not kernel_dir.is_dir():
+        raise ValueError(f"{kernel_dir} is not a directory")
+    if not build_dir.is_dir():
+        raise ValueError("Couldn't find `build` directory inside `kernel_dir`")
+
+    repo_id = create_repo(
+        repo_id=args.repo_id, private=args.private, exist_ok=True
+    ).repo_id
+    if args.branch is not None:
+        create_branch(repo_id=repo_id, branch=args.branch, exist_ok=True)
+
+    delete_patterns: set[str] = set()
+    for build_variant in build_dir.iterdir():
+        if build_variant.is_dir():
+            delete_patterns.add(f"{build_variant.name}/**")
+
+    upload_folder(
+        repo_id=repo_id,
+        folder_path=build_dir,
+        revision=args.branch,
+        path_in_repo="build",
+        delete_patterns=list(delete_patterns),
+        commit_message="Build uploaded using `kernels`.",
+    )
+    print(f"✅ Kernel upload successful. Find the kernel in https://hf.co/{repo_id}.")
+
+
 class _JSONEncoder(json.JSONEncoder):
     def default(self, o):
         if dataclasses.is_dataclass(o):
             return dataclasses.asdict(o)
         return super().default(o)
+
+
+def check_kernel(
+    *, macos: str, manylinux: str, python_abi: str, repo_id: str, revision: str
+):
+    try:
+        import kernels.check
+    except ImportError:
+        print(
+            "`kernels check` requires the `kernel-abi-check` package: pip install kernel-abi-check",
+            file=sys.stderr,
+        )
+        sys.exit(1)
+
+    kernels.check.check_kernel(
+        macos=macos,
+        manylinux=manylinux,
+        python_abi=python_abi,
+        repo_id=repo_id,
+        revision=revision,
+    )
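Taken together, the subcommands registered above can be exercised like this (repo IDs and paths are illustrative):

```bash
# Check a kernel against the default baselines
# (Python ABI 3.9, manylinux_2_28, macOS 15.0).
kernels check kernels-community/activation

# Check a specific revision with explicit baselines.
kernels check kernels-community/activation --revision main --python-abi 3.9

# Upload a build directory to a branch of a private repo.
kernels upload ./my-kernel --repo_id="my-username/my-kernel" --branch="staging" --private
```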

View File

@@ -87,7 +87,7 @@ class Device:
     Args:
         type (`str`):
-            The device type (e.g., "cuda", "mps", "cpu").
+            The device type (e.g., "cuda", "mps", "npu", "rocm", "xpu").
         properties ([`CUDAProperties`], *optional*):
             Device-specific properties. Currently only [`CUDAProperties`] is supported for CUDA devices.
@@ -106,6 +106,12 @@ class Device:
         # MPS device for Apple Silicon
         mps_device = Device(type="mps")
+
+        # XPU device (e.g., Intel(R) Data Center GPU Max 1550)
+        xpu_device = Device(type="xpu")
+
+        # NPU device (Huawei Ascend)
+        npu_device = Device(type="npu")
         ```
     """
@@ -125,6 +131,10 @@ class Device:
             return _ROCMRepos()
         elif self.type == "mps":
             return _MPSRepos()
+        elif self.type == "xpu":
+            return _XPURepos()
+        elif self.type == "npu":
+            return _NPURepos()
         else:
             raise ValueError(f"Unknown device type: {self.type}")
@@ -311,7 +321,7 @@ class LayerRepository:
         return hash((self.layer_name, self._repo_id, self._revision, self._version))

     def __str__(self) -> str:
-        return f"`{self._repo_id}` (revision: {self._resolve_revision()}) for layer `{self.layer_name}`"
+        return f"`{self._repo_id}` (revision: {self._resolve_revision()}), layer `{self.layer_name}`"


 class LocalLayerRepository:
@@ -367,7 +377,7 @@ class LocalLayerRepository:
         return hash((self.layer_name, self._repo_path, self._package_name))

     def __str__(self) -> str:
-        return f"`{self._repo_path}` (package: {self._package_name}) for layer `{self.layer_name}`"
+        return f"`{self._repo_path}` (package: {self._package_name}), layer `{self.layer_name}`"


 class LockedLayerRepository:
@@ -422,7 +432,7 @@ class LockedLayerRepository:
         return hash((self.layer_name, self._repo_id))

     def __str__(self) -> str:
-        return f"`{self._repo_id}` (revision: {self._resolve_revision()}) for layer `{self.layer_name}`"
+        return f"`{self._repo_id}` (revision: {self._resolve_revision()}), layer `{self.layer_name}`"


 _CACHED_LAYER: Dict[LayerRepositoryProtocol, Type["nn.Module"]] = {}
@@ -447,6 +457,46 @@ class _DeviceRepos(ABC):
         ...


+class _XPURepos(_DeviceRepos):
+    _repos: Dict[Mode, LayerRepositoryProtocol]
+
+    def __init__(self):
+        super().__init__()
+        self._repos = {}
+
+    @property
+    def repos(
+        self,
+    ) -> Optional[Dict[Mode, LayerRepositoryProtocol]]:
+        return self._repos
+
+    def insert(self, device: Device, repos: Dict[Mode, LayerRepositoryProtocol]):
+        if device.type != "xpu":
+            raise ValueError(f"Device type must be 'xpu', got {device.type}")
+
+        self._repos = repos
+
+
+class _NPURepos(_DeviceRepos):
+    _repos: Dict[Mode, LayerRepositoryProtocol]
+
+    def __init__(self):
+        super().__init__()
+        self._repos = {}
+
+    @property
+    def repos(
+        self,
+    ) -> Optional[Dict[Mode, LayerRepositoryProtocol]]:
+        return self._repos
+
+    def insert(self, device: Device, repos: Dict[Mode, LayerRepositoryProtocol]):
+        if device.type != "npu":
+            raise ValueError(f"Device type must be 'npu', got {device.type}")
+
+        self._repos = repos
+
+
 class _MPSRepos(_DeviceRepos):
     _repos: Dict[Mode, LayerRepositoryProtocol]
@@ -531,7 +581,7 @@ class _ROCMRepos(_DeviceRepos):

 def _validate_device_type(device_type: str) -> None:
     """Validate that the device type is supported."""
-    supported_devices = {"cuda", "rocm", "mps", "cpu"}
+    supported_devices = {"cuda", "mps", "npu", "rocm", "xpu"}
     if device_type not in supported_devices:
         raise ValueError(
             f"Unsupported device type '{device_type}'. Supported device types are: {', '.join(sorted(supported_devices))}"
@@ -578,7 +628,7 @@ def use_kernel_mapping(
     from kernels import use_kernel_forward_from_hub
     from kernels import use_kernel_mapping, LayerRepository, Device
-    from kernels import kernelize
+    from kernels import Mode, kernelize

     # Define a mapping
     mapping = {
@@ -601,7 +651,7 @@ def use_kernel_mapping(
     # Use the mapping for the duration of the context.
     with use_kernel_mapping(mapping):
         # kernelize uses the temporary mapping
-        model = kernelize(model, device="cuda")
+        model = kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE, device="cuda")

     # Outside the context, original mappings are restored
     ```
@@ -772,7 +822,7 @@ def _select_repository(

 def kernelize(
     model: "nn.Module",
     *,
-    mode: Mode = Mode.TRAINING | Mode.TORCH_COMPILE,
+    mode: Mode,
     device: Optional[Union[str, "torch.device"]] = None,
     use_fallback: bool = True,
 ):
@@ -785,11 +835,11 @@ def kernelize(
     Args:
         model (`nn.Module`):
             The PyTorch model to kernelize.
-        mode ([`Mode`], *optional*, defaults to `Mode.TRAINING | Mode.TORCH_COMPILE`):
-            The mode that the kernel is going to be used in. For example, `Mode.TRAINING | Mode.TORCH_COMPILE`
-            kernelizes the model for training with `torch.compile`.
+        mode ([`Mode`]): The mode that the kernel is going to be used in. For example,
+            `Mode.TRAINING | Mode.TORCH_COMPILE` kernelizes the model for training with
+            `torch.compile`.
         device (`Union[str, torch.device]`, *optional*):
-            The device type to load kernels for. Supported device types are: "cuda", "rocm", "mps", "cpu".
+            The device type to load kernels for. Supported device types are: "cuda", "mps", "npu", "rocm", "xpu".
             The device type will be inferred from the model parameters when not provided.
         use_fallback (`bool`, *optional*, defaults to `True`):
             Whether to use the original forward method of modules when no compatible kernel could be found.
@@ -813,7 +863,7 @@ def kernelize(
             return F.silu(x[..., :d]) * x[..., d:]

     mapping = {
-        "LayerNorm": {
+        "SiluAndMul": {
             "cuda": LayerRepository(
                 repo_id="kernels-community/activation",
                 layer_name="SiluAndMul",
@@ -829,7 +879,7 @@ def kernelize(
     )

     # Kernelize for inference
-    kernelized_model = kernelize(model)
+    kernelized_model = kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
     ```
     """
@@ -954,7 +1004,7 @@ def use_kernel_forward_from_hub(layer_name: str):
     import torch
     import torch.nn as nn
-    from kernels import use_kernel_forward_from_hub, kernelize
+    from kernels import use_kernel_forward_from_hub
+    from kernels import Mode, kernelize

     @use_kernel_forward_from_hub("MyCustomLayer")
     class MyCustomLayer(nn.Module):
@@ -969,7 +1020,7 @@ def use_kernel_forward_from_hub(layer_name: str):
     model = MyCustomLayer(768)

     # The layer can now be kernelized:
-    # model = kernelize(model, device="cuda")
+    # model = kernelize(model, mode=Mode.TRAINING | Mode.TORCH_COMPILE, device="cuda")
     ```
     """
@@ -994,7 +1045,7 @@ def _get_kernel_layer(repo: LayerRepositoryProtocol) -> Type["nn.Module"]:
     return layer


-def _validate_layer(*, check_cls, cls):
+def _validate_layer(*, check_cls, cls, repo: LayerRepositoryProtocol):
     import torch.nn as nn

     # The layer must at least have the following properties: (1) it
@@ -1003,12 +1054,12 @@
     # methods.

     if not issubclass(cls, nn.Module):
-        raise TypeError(f"Layer `{cls}` is not a Torch layer.")
+        raise TypeError(f"Layer `{cls.__name__}` is not a Torch layer.")

     # We verify statelessness by checking that the class does not have its own
     # constructor (since the constructor could add member variables)...
     if cls.__init__ is not nn.Module.__init__:
-        raise TypeError("Layer must not override nn.Module constructor.")
+        raise TypeError(f"{repo} must not override nn.Module constructor.")

     # ... or predefined member variables.
     torch_module_members = {name for name, _ in inspect.getmembers(nn.Module)}
@@ -1016,7 +1067,9 @@
     difference = cls_members - torch_module_members

     # verify if: difference ⊄ {"can_torch_compile", "has_backward"}
     if not difference <= {"can_torch_compile", "has_backward"}:
-        raise TypeError("Layer must not contain additional members.")
+        raise TypeError(
+            f"{repo} must not contain additional members compared to `{check_cls.__name__}`."
+        )

     # Check whether the forward signatures are similar.
     params = inspect.signature(cls.forward).parameters
@@ -1024,13 +1077,13 @@
     if len(params) != len(ref_params):
         raise TypeError(
-            "Forward signature does not match: different number of arguments."
+            f"Forward signature of {repo} does not match `{check_cls.__name__}`: different number of arguments."
         )

     for param, ref_param in zip(params.values(), ref_params.values()):
         if param.kind != ref_param.kind:
             raise TypeError(
-                f"Forward signature does not match: different kind of arguments ({param} ({param.kind}) and {ref_param} ({ref_param.kind})"
+                f"Forward signature of {repo} does not match `{check_cls.__name__}`: different kind of arguments ({param} ({param.kind}) and {ref_param} ({ref_param.kind})"
             )
@@ -1147,7 +1200,7 @@ def _get_layer_memoize(
         return layer

     layer = _get_kernel_layer(repo)
-    _validate_layer(check_cls=module_class, cls=layer)
+    _validate_layer(check_cls=module_class, cls=layer, repo=repo)
     _CACHED_LAYER[repo] = layer

     return layer

View File

@@ -11,7 +11,7 @@ import sys
 from importlib.metadata import Distribution
 from pathlib import Path
 from types import ModuleType
-from typing import Dict, List, Optional, Tuple
+from typing import Dict, List, Optional, Tuple, Union

 from huggingface_hub import file_exists, snapshot_download
 from packaging.version import parse
@@ -19,6 +19,8 @@ from packaging.version import parse
 from kernels._versions import select_revision_or_version
 from kernels.lockfile import KernelLock, VariantLock

+ENV_VARS_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}
+

 def _get_cache_dir() -> Optional[str]:
     """Returns the kernels cache directory."""
@@ -35,6 +37,14 @@ def _get_cache_dir() -> Optional[str]:
 CACHE_DIR: Optional[str] = _get_cache_dir()


+def _get_privateuse_backend_name() -> Optional[str]:
+    import torch
+
+    if hasattr(torch._C, "_get_privateuse1_backend_name"):
+        return torch._C._get_privateuse1_backend_name()
+    return None
+
+
 def build_variant() -> str:
     import torch

@@ -46,11 +56,17 @@ def build_variant() -> str:
         compute_framework = f"rocm{rocm_version.major}{rocm_version.minor}"
     elif torch.backends.mps.is_available():
         compute_framework = "metal"
-    elif hasattr(torch, "xpu") and torch.xpu.is_available():
-        compute_framework = "xpu"
+    elif hasattr(torch.version, "xpu") and torch.version.xpu is not None:
+        version = torch.version.xpu
+        compute_framework = f"xpu{version[0:4]}{version[5:6]}"
+    elif _get_privateuse_backend_name() == "npu":
+        from torch_npu.utils.collect_env import get_cann_version  # type: ignore[import-not-found]
+
+        cann_major, cann_minor = get_cann_version()[0], get_cann_version()[2]
+        compute_framework = f"cann{cann_major}{cann_minor}"
     else:
         raise AssertionError(
-            "Torch was not compiled with CUDA, Metal, XPU, or ROCm enabled."
+            "Torch was not compiled with CUDA, Metal, XPU, NPU, or ROCm enabled."
         )

     torch_version = parse(torch.__version__)
@@ -94,6 +110,7 @@
     revision: str,
     local_files_only: bool = False,
     variant_locks: Optional[Dict[str, VariantLock]] = None,
+    user_agent: Optional[Union[str, dict]] = None,
 ) -> Tuple[str, Path]:
     """
     Download a kernel for the current environment to the cache.
@@ -109,6 +126,8 @@
             Whether to only use local files and not download from the Hub.
         variant_locks (`Dict[str, VariantLock]`, *optional*):
             Optional dictionary of variant locks for validation.
+        user_agent (`Union[str, dict]`, *optional*):
+            The `user_agent` info to pass to `snapshot_download()` for internal telemetry.

     Returns:
         `Tuple[str, Path]`: A tuple containing the package name and the path to the variant directory.
@@ -116,6 +135,7 @@
     package_name = package_name_from_repo_id(repo_id)
     variant = build_variant()
     universal_variant = universal_build_variant()
+    user_agent = _get_user_agent(user_agent=user_agent)
     repo_path = Path(
         snapshot_download(
             repo_id,
@@ -123,6 +143,7 @@
             cache_dir=CACHE_DIR,
             revision=revision,
             local_files_only=local_files_only,
+            user_agent=user_agent,
         )
     )
@@ -199,7 +220,10 @@

 def get_kernel(
-    repo_id: str, revision: Optional[str] = None, version: Optional[str] = None
+    repo_id: str,
+    revision: Optional[str] = None,
+    version: Optional[str] = None,
+    user_agent: Optional[Union[str, dict]] = None,
 ) -> ModuleType:
     """
     Load a kernel from the kernel hub.
@@ -215,6 +239,8 @@
         version (`str`, *optional*):
             The kernel version to download. This can be a Python version specifier, such as `">=1.0.0,<2.0.0"`.
             Cannot be used together with `revision`.
+        user_agent (`Union[str, dict]`, *optional*):
+            The `user_agent` info to pass to `snapshot_download()` for internal telemetry.

     Returns:
         `ModuleType`: The imported kernel module.
@@ -231,7 +257,9 @@
     ```
     """
     revision = select_revision_or_version(repo_id, revision, version)
-    package_name, package_path = install_kernel(repo_id, revision=revision)
+    package_name, package_path = install_kernel(
+        repo_id, revision=revision, user_agent=user_agent
+    )
     return import_from_path(package_name, package_path / package_name / "__init__.py")
@@ -487,3 +515,24 @@ def git_hash_object(data: bytes, object_type: str = "blob"):

 def package_name_from_repo_id(repo_id: str) -> str:
     return repo_id.split("/")[-1].replace("-", "_")
+
+
+def _get_user_agent(
+    user_agent: Optional[Union[dict, str]] = None,
+) -> Union[None, dict, str]:
+    import torch
+
+    from . import __version__
+
+    if os.getenv("DISABLE_TELEMETRY", "false").upper() in ENV_VARS_TRUE_VALUES:
+        return None
+
+    if user_agent is None:
+        user_agent = {
+            "kernels": __version__,
+            "torch": torch.__version__,
+            "build_variant": build_variant(),
+            "file_type": "kernel",
+        }
+    return user_agent
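From the caller's side, the effect of these changes can be sketched as follows (the `user_agent` contents are illustrative; setting `DISABLE_TELEMETRY=yes` suppresses reporting entirely):

```python
from kernels import get_kernel

# user_agent is forwarded through install_kernel() to snapshot_download();
# when omitted, _get_user_agent() fills in the kernels/torch versions and
# the build variant.
activation = get_kernel(
    "kernels-community/activation",
    version=">=0.0.4,<0.1.0",
    user_agent={"my-app": "1.0.0"},
)
```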

View File

@@ -3,6 +3,8 @@ import sys
 import pytest
 import torch

+from kernels.utils import _get_privateuse_backend_name
+
 has_cuda = (
     hasattr(torch.version, "cuda")
     and torch.version.cuda is not None
@@ -13,6 +15,20 @@ has_rocm = (
     and torch.version.hip is not None
     and torch.cuda.device_count() > 0
 )
+has_xpu = (
+    hasattr(torch.version, "xpu")
+    and torch.version.xpu is not None
+    and torch.xpu.device_count() > 0
+)
+has_npu = _get_privateuse_backend_name() == "npu"
+
+
+def pytest_addoption(parser):
+    parser.addoption(
+        "--token",
+        action="store_true",
+        help="run tests that require a token with write permissions",
+    )


 def pytest_runtest_setup(item):
@@ -22,3 +38,9 @@ def pytest_runtest_setup(item):
         pytest.skip("skipping ROCm-only test on host without ROCm")
     if "darwin_only" in item.keywords and not sys.platform.startswith("darwin"):
         pytest.skip("skipping macOS-only test on non-macOS platform")
+    if "xpu_only" in item.keywords and not has_xpu:
+        pytest.skip("skipping XPU-only test on host without XPU")
+    if "npu_only" in item.keywords and not has_npu:
+        pytest.skip("skipping NPU-only test on host without NPU")
+    if "token" in item.keywords and not item.config.getoption("--token"):
+        pytest.skip("need --token option to run this test")

View File

@ -1,82 +1,70 @@
[ [
{ {
"repo_id": "kernels-community/activation", "repo_id": "kernels-community/activation",
"sha": "fd6842e88f1f23f198551d78a4541b8eb07e0538", "sha": "83046852be158d525114f68513cd79fd88911b37",
"variants": { "variants": {
"torch25-cxx11-cu118-x86_64-linux": {
"hash": "sha256-61e3e51b5b59b30d4a6ba943a5e6e4ef5a9c8260cc4bca40b9fb462c0777842b",
"hash_type": "git_lfs_concat"
},
"torch25-cxx11-cu121-x86_64-linux": {
"hash": "sha256-baa6b872040730bd1d676c011381f6f626fb96189837b828f587c806af8994fa",
"hash_type": "git_lfs_concat"
},
"torch25-cxx11-cu124-x86_64-linux": {
"hash": "sha256-c1ec7457847fa1f0e4ab43234dfc3cd0959977e03dc2ffe89b4f6b90970c7965",
"hash_type": "git_lfs_concat"
},
"torch25-cxx98-cu118-x86_64-linux": {
"hash": "sha256-412f9c841f20741e42f2c6cdb8c7da0e33ab436b219975acffe18b62b97ecd7c",
"hash_type": "git_lfs_concat"
},
"torch25-cxx98-cu121-x86_64-linux": {
"hash": "sha256-2fde7f97859506e000c1072b3916c0a75bc8cee750a9853ea8b68199e7b57bcd",
"hash_type": "git_lfs_concat"
},
"torch25-cxx98-cu124-x86_64-linux": {
"hash": "sha256-93309986f39a64a5630378108154866f0545178fa8dfef9b8f8ccfef9a78608e",
"hash_type": "git_lfs_concat"
},
"torch26-cxx11-cu118-x86_64-linux": {
"hash": "sha256-3284d3c64b76d92c1ee930bce8013aff307f16eefb16c2d5dea9f2ca70e71e1f",
"hash_type": "git_lfs_concat"
},
"torch26-cxx11-cu124-x86_64-linux": {
"hash": "sha256-36a8c93773c08ddf8ef624a8a6b2866be26d1861450dfe1ecac0bed59f9ffa47",
"hash_type": "git_lfs_concat"
},
"torch26-cxx11-cu126-aarch64-linux": {
"hash": "sha256-f5afb734520f587717665659798ff738a69e5ae1e34d4bd95624edd18fb165cd",
"hash_type": "git_lfs_concat"
},
"torch26-cxx11-cu126-x86_64-linux": {
"hash": "sha256-940841a7cb44f76c9a896d8b39f5bc0e0420f1c4c05ae9423da96778de4d1f2c",
"hash_type": "git_lfs_concat"
},
"torch26-cxx98-cu118-x86_64-linux": {
"hash": "sha256-8e0f907830c3acc8c6bebfc162c744012ff6973e8110d7bf8ecd74b492418204",
"hash_type": "git_lfs_concat"
},
"torch26-cxx98-cu124-x86_64-linux": {
"hash": "sha256-0833414cbe658baec55b7ff63537cddccc973fe99e3c03008cced5e66e38b6c1",
"hash_type": "git_lfs_concat"
},
"torch26-cxx98-cu126-aarch64-linux": {
"hash": "sha256-d94fa59a13a5b623b2071aadcd1e6c8477c4d557fd06ad144f15b46b1fc71aab",
"hash_type": "git_lfs_concat"
},
"torch26-cxx98-cu126-x86_64-linux": {
"hash": "sha256-64784f5f2f9e232d0f2fd824fbc47eadde505e3c232f351bead5b04c429c65c2",
"hash_type": "git_lfs_concat"
},
"torch27-cxx11-cu118-x86_64-linux": { "torch27-cxx11-cu118-x86_64-linux": {
"hash": "sha256-bcba3765f061649bac0e5a9159bea8349ced4780e24a2330aa62ce0f8d3a9d78", "hash": "sha256-e34965c814c4c092fcb634ebadefe82ea9a05b98343f8ebdefa7305dcc05359e",
"hash_type": "git_lfs_concat"
},
"torch27-cxx11-cu126-aarch64-linux": {
"hash": "sha256-e4625df5706af025c70bd824d952b928d9a2965eeaefda72fc47be0fae680c5e",
"hash_type": "git_lfs_concat" "hash_type": "git_lfs_concat"
}, },
"torch27-cxx11-cu126-x86_64-linux": { "torch27-cxx11-cu126-x86_64-linux": {
"hash": "sha256-7d7d3e655f34a7b03d5603d7c1ab723ef3efc823291762421a8b3a4aa51bd405", "hash": "sha256-5f92b35922b37224a416398a39a29b7e5f1aca1df17d5c69f1b9e9cdb7033561",
"hash_type": "git_lfs_concat" "hash_type": "git_lfs_concat"
}, },
"torch27-cxx11-cu128-aarch64-linux": { "torch27-cxx11-cu128-aarch64-linux": {
"hash": "sha256-60e076194dcd55b32c5aca72f09816cba0fff52f340c8a063b17ff0577154d99", "hash": "sha256-125967cb23bacd2cec443799f184ac08247dfff33f5027e54ee16d3779ca5986",
"hash_type": "git_lfs_concat" "hash_type": "git_lfs_concat"
}, },
"torch27-cxx11-cu128-x86_64-linux": { "torch27-cxx11-cu128-x86_64-linux": {
"hash": "sha256-f0a3802382efdcd78b40601187a9c416579a24ef2ed5a60d2296ef0951a89597", "hash": "sha256-496a84c99d7035a1b6f0ea1c026b751c3a2677956f4c1be546d3cc1505a5fdbb",
"hash_type": "git_lfs_concat"
},
"torch28-cxx11-cu126-aarch64-linux": {
"hash": "sha256-f0775a30ffa290c90aba3a41037e3ca91edb15b4a9367561fafd5f25455e117a",
"hash_type": "git_lfs_concat"
},
"torch28-cxx11-cu126-x86_64-linux": {
"hash": "sha256-081995e6230f306bdf6111186618794f2411cf0ffd9b4800330df60b4ebe1927",
"hash_type": "git_lfs_concat"
},
"torch28-cxx11-cu128-aarch64-linux": {
"hash": "sha256-b937fef62a0c1cd71ab98490b651c473577af209b9a3e2a6b452350283d8812c",
"hash_type": "git_lfs_concat"
},
"torch28-cxx11-cu128-x86_64-linux": {
"hash": "sha256-a3915686cc58641a3361ece63ab77b33e9d30315dea12547e4bda008d8810a01",
"hash_type": "git_lfs_concat"
},
"torch28-cxx11-cu129-aarch64-linux": {
"hash": "sha256-a24dca8e998f88be42491921c9df89d88a6112ca630acd2efc2dd34a64b91fcb",
"hash_type": "git_lfs_concat"
},
"torch28-cxx11-cu129-x86_64-linux": {
"hash": "sha256-df6c70a70f425db2f68b86561c6f93c5675c1d5e5d058766d88ab17472229907",
"hash_type": "git_lfs_concat"
},
"torch29-cxx11-cu126-aarch64-linux": {
"hash": "sha256-c120011c201072b4cfd70c2ba2d45c2f05337feaf604ddec3c6c4987def33ab3",
"hash_type": "git_lfs_concat"
},
"torch29-cxx11-cu126-x86_64-linux": {
"hash": "sha256-765a7f3279009979be4001a23c5c70e5e6ab9553098d67886731a5275a6d4b32",
"hash_type": "git_lfs_concat"
},
"torch29-cxx11-cu128-aarch64-linux": {
"hash": "sha256-266d057a9cd82b872a0e02f09ac5e2660fcffcf9a7b7fa1fa8ff33dc19c0f5c2",
"hash_type": "git_lfs_concat"
},
"torch29-cxx11-cu128-x86_64-linux": {
"hash": "sha256-6850e594ba4588f289b5904eb88eda5a41870ee20a3bf1586f3268307caf4b53",
"hash_type": "git_lfs_concat"
},
"torch29-cxx11-cu130-aarch64-linux": {
"hash": "sha256-23741b935462b53bdf868f8d1c9c8cff5f02f71ea3b0550df41dc8b030b0b474",
"hash_type": "git_lfs_concat"
},
"torch29-cxx11-cu130-x86_64-linux": {
"hash": "sha256-b884ae792dc1eada071f31645add0c2c76d479864f25aebcdd8318b675aaaf29",
"hash_type": "git_lfs_concat" "hash_type": "git_lfs_concat"
} }
} }
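Each key in the lock file above names a build variant: Torch major/minor version, C++ ABI, CUDA toolkit, CPU architecture, and OS. As a rough sketch (the helper below is hypothetical and not part of the kernels API; it assumes a CUDA build of Torch on Linux), the variant key for the running interpreter could be derived like this:

import platform

import torch

def current_variant_name() -> str:
    # Hypothetical reconstruction of the lock-file key format:
    # torch<major><minor>-cxx11-cu<cuda version>-<machine>-linux
    major, minor = torch.__version__.split(".")[:2]
    cuda = torch.version.cuda.replace(".", "")  # e.g. "12.8" -> "128"; None on CPU-only builds
    return f"torch{major}{minor}-cxx11-cu{cuda}-{platform.machine()}-linux"

print(current_variant_name())  # e.g. "torch28-cxx11-cu128-x86_64-linux"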

View File

@@ -10,10 +10,16 @@ def kernel():

 @pytest.fixture
-def local_kernel():
+def local_kernel_path():
     package_name, path = install_kernel("kernels-community/activation", "main")
     # Path is the build variant path (build/torch-<...>), so the grandparent
     # is the kernel repository path.
+    return package_name, path
+
+
+@pytest.fixture
+def local_kernel(local_kernel_path):
+    package_name, path = local_kernel_path
     return get_local_kernel(path.parent.parent, package_name)
@@ -66,6 +72,39 @@ def test_local_kernel(local_kernel, device):
     assert torch.allclose(y, expected)


+@pytest.mark.cuda_only
+def test_local_kernel_path_types(local_kernel_path, device):
+    package_name, path = local_kernel_path
+
+    # Top-level repo path
+    # ie: /home/ubuntu/.cache/huggingface/hub/models--kernels-community--activation/snapshots/2fafa6a3a38ccb57a1a98419047cf7816ecbc071
+    kernel = get_local_kernel(path.parent.parent, package_name)
+
+    x = torch.arange(1, 10, dtype=torch.float16, device=device).view(3, 3)
+    y = torch.empty_like(x)
+    kernel.gelu_fast(y, x)
+    expected = torch.tensor(
+        [[0.8408, 1.9551, 2.9961], [4.0000, 5.0000, 6.0000], [7.0000, 8.0000, 9.0000]],
+        device=device,
+        dtype=torch.float16,
+    )
+    assert torch.allclose(y, expected)
+
+    # Build directory path
+    # ie: /home/ubuntu/.cache/huggingface/hub/models--kernels-community--activation/snapshots/2fafa6a3a38ccb57a1a98419047cf7816ecbc071/build
+    kernel = get_local_kernel(path.parent.parent / "build", package_name)
+    y = torch.empty_like(x)
+    kernel.gelu_fast(y, x)
+    assert torch.allclose(y, expected)
+
+    # Explicit package path
+    # ie: /home/ubuntu/.cache/huggingface/hub/models--kernels-community--activation/snapshots/2fafa6a3a38ccb57a1a98419047cf7816ecbc071/build/torch28-cxx11-cu128-x86_64-linux
+    kernel = get_local_kernel(path, package_name)
+    y = torch.empty_like(x)
+    kernel.gelu_fast(y, x)
+    assert torch.allclose(y, expected)
+
+
 @pytest.mark.darwin_only
 @pytest.mark.parametrize("dtype", [torch.float16, torch.float32])
 def test_relu_metal(metal_kernel, dtype):
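The new test exercises the three path forms that get_local_kernel accepts. Condensed, under the assumption that get_local_kernel is exported from the top-level kernels package and that the package name (here "activation") comes from install_kernel as in the fixtures above:

from pathlib import Path

from kernels import get_local_kernel

# Illustrative snapshot path; in the test it is returned by install_kernel.
repo = Path("/path/to/models--kernels-community--activation/snapshots/<rev>")

kernel = get_local_kernel(repo, "activation")            # top-level repo path
kernel = get_local_kernel(repo / "build", "activation")  # build directory
kernel = get_local_kernel(                               # explicit variant path
    repo / "build" / "torch28-cxx11-cu128-x86_64-linux", "activation"
)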

View File

@@ -35,6 +35,7 @@ def test_load_locked():
     load_kernel("kernels-community/activation", lockfile=project_dir / "kernels.lock")


+@pytest.mark.cuda_only
 def test_layer_locked():
     project_dir = Path(__file__).parent / "layer_locking"
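For reference, loading a locked kernel needs only the lock file's location. A minimal sketch, assuming load_kernel is importable from the top-level kernels package and that kernels.lock was generated beforehand with the kernels lock subcommand:

from pathlib import Path

from kernels import load_kernel

# Resolves kernels-community/activation to the revision pinned in kernels.lock.
project_dir = Path(__file__).parent
activation = load_kernel(
    "kernels-community/activation",
    lockfile=project_dir / "kernels.lock",
)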

tests/test_kernel_upload.py (new file, 122 lines)

View File

@@ -0,0 +1,122 @@
import logging
import os
import re
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional

import pytest
from huggingface_hub import delete_repo, model_info, list_repo_refs

from kernels.cli import upload_kernels

REPO_ID = "valid_org/kernels-upload-test"

PY_CONTENT = """\
#!/usr/bin/env python3

def main():
    print("Hello from torch-universal!")

if __name__ == "__main__":
    main()
"""


@dataclass
class UploadArgs:
    kernel_dir: str
    repo_id: str
    private: bool
    branch: Optional[str]


def next_filename(path: Path) -> Path:
    """
    Given a path like foo_2050.py, return foo_2051.py.
    """
    m = re.match(r"^(.*?)(\d+)(\.py)$", path.name)
    if not m:
        raise ValueError(
            f"Filename {path.name!r} does not match pattern <prefix>_<number>.py"
        )
    prefix, number, suffix = m.groups()
    new_number = str(int(number) + 1).zfill(len(number))
    return path.with_name(f"{prefix}{new_number}{suffix}")


def get_filename_to_change(repo_filenames):
    filename_to_change = None
    for f in repo_filenames:
        if "foo" in f and f.endswith(".py"):
            filename_to_change = os.path.basename(f)
            break
    assert filename_to_change
    return filename_to_change


def get_filenames_from_a_repo(repo_id: str) -> List[str]:
    try:
        repo_info = model_info(repo_id=repo_id, files_metadata=True)
        repo_siblings = repo_info.siblings
        if repo_siblings is not None:
            return [f.rfilename for f in repo_siblings]
        else:
            raise ValueError("No repo siblings found.")
    except Exception as e:
        logging.error(f"Error connecting to the Hub: {e}.")
        raise


@pytest.mark.token
@pytest.mark.is_staging_test
@pytest.mark.parametrize("branch", (None, "foo"))
def test_kernel_upload_works_as_expected(branch):
    with tempfile.TemporaryDirectory() as tmpdir:
        path = f"{tmpdir}/build/torch-universal/upload_test"
        build_dir = Path(path)
        build_dir.mkdir(parents=True, exist_ok=True)

        script_path = build_dir / "foo.py"
        script_path.write_text(PY_CONTENT)
        upload_kernels(UploadArgs(tmpdir, REPO_ID, False, branch))

    repo_filenames = get_filenames_from_a_repo(REPO_ID)
    assert any(str(script_path.name) in f for f in repo_filenames)
    if branch is not None:
        refs = list_repo_refs(repo_id=REPO_ID)
        assert any(ref_branch.name == branch for ref_branch in refs.branches)
    delete_repo(repo_id=REPO_ID)


@pytest.mark.token
@pytest.mark.is_staging_test
def test_kernel_upload_deletes_as_expected():
    with tempfile.TemporaryDirectory() as tmpdir:
        path = f"{tmpdir}/build/torch-universal/upload_test"
        build_dir = Path(path)
        build_dir.mkdir(parents=True, exist_ok=True)

        script_path = build_dir / "foo_2025.py"
        script_path.write_text(PY_CONTENT)
        upload_kernels(UploadArgs(tmpdir, REPO_ID, False, None))

    repo_filenames = get_filenames_from_a_repo(REPO_ID)
    filename_to_change = get_filename_to_change(repo_filenames)

    with tempfile.TemporaryDirectory() as tmpdir:
        path = f"{tmpdir}/build/torch-universal/upload_test"
        build_dir = Path(path)
        build_dir.mkdir(parents=True, exist_ok=True)

        changed_filename = next_filename(Path(filename_to_change))
        script_path = build_dir / changed_filename
        script_path.write_text(PY_CONTENT)
        upload_kernels(UploadArgs(tmpdir, REPO_ID, False, None))

    repo_filenames = get_filenames_from_a_repo(REPO_ID)
    assert any(str(changed_filename) in k for k in repo_filenames), f"{repo_filenames=}"
    assert not any(
        str(filename_to_change) in k for k in repo_filenames
    ), f"{repo_filenames=}"
    delete_repo(repo_id=REPO_ID)
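The tests call upload_kernels directly, with UploadArgs standing in for the parsed CLI namespace. The same call outside pytest, as a sketch (the kernel directory and repo id below are hypothetical; the directory layout mirrors what the tests create):

from kernels.cli import upload_kernels

# Reuses the UploadArgs stand-in defined in the test file above.
args = UploadArgs(
    kernel_dir="my-kernel",      # contains build/torch-universal/... as in the tests
    repo_id="my-org/my-kernel",  # hypothetical target repository
    private=False,
    branch=None,                 # or a branch name to upload to a non-default ref
)
upload_kernels(args)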

View File

@@ -21,14 +21,21 @@ from kernels.layer import (
     _KERNEL_MAPPING,
     _validate_layer,
 )
-from kernels.utils import install_kernel
+from kernels.utils import (
+    _get_privateuse_backend_name,
+    install_kernel,
+)

 kernel_layer_mapping = {
     "SiluAndMul": {
         Device(type="cuda"): LayerRepository(
             repo_id="kernels-community/activation",
             layer_name="SiluAndMul",
-        )
+        ),
+        "npu": LayerRepository(
+            repo_id="kernels-ext-npu/SwiGlu",
+            layer_name="SwiGlu",
+        ),
     },
     "SiluAndMulNoCompile": {
         "cuda": LayerRepository(
@@ -46,11 +53,37 @@ kernel_layer_mapping = {
             layer_name="SiluAndMul",
         )
     },
+    "LigerRMSNorm": {
+        "xpu": LayerRepository(
+            repo_id="kernels-community/liger_kernels",
+            layer_name="LigerRMSNorm",  # Triton
+        )
+    },
 }

 register_kernel_mapping(kernel_layer_mapping)


+class RMSNorm(nn.Module):
+    def __init__(self, weight: torch.Tensor, eps: float = 1e-6):
+        super().__init__()
+        # Used to check that we called hub kernel.
+        self.n_calls = 0
+        self.weight = nn.Parameter(weight)
+        self.variance_epsilon = eps
+
+    def forward(self, x: torch.Tensor):
+        self.n_calls += 1
+        var = x.pow(2).mean(-1, keepdim=True)
+        x_norm = x * torch.rsqrt(var + self.variance_epsilon)
+        return x_norm * self.weight
+
+
+@use_kernel_forward_from_hub("LigerRMSNorm")
+class RMSNormWithKernel(RMSNorm):
+    pass
+
+
 class SiluAndMul(nn.Module):
     def __init__(self):
         super().__init__()
@@ -90,6 +123,18 @@ class TorchLinearWithCounter(nn.Linear):
         return super().forward(input)


+@pytest.fixture
+def device():
+    if torch.cuda.is_available():
+        return "cuda"
+    elif hasattr(torch, "xpu") and torch.xpu.is_available():
+        return "xpu"
+    elif _get_privateuse_backend_name() == "npu":
+        return "npu"
+
+    pytest.skip("No CUDA, NPU or XPU")
+
+
 def test_arg_kinds():
     @use_kernel_forward_from_hub("ArgKind")
     class ArgKind(nn.Module):
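Any test that takes device as a parameter now picks up this fixture: it runs on the first available accelerator and is skipped on machines without one. A hypothetical minimal consumer:

import torch

# pytest injects "cuda", "xpu", or "npu" via the fixture above,
# or skips the test when no accelerator is present.
def test_device_fixture_roundtrip(device):
    x = torch.ones(4, device=device)
    assert x.device.type == device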
@@ -110,24 +155,20 @@

 @pytest.mark.cuda_only
 @pytest.mark.parametrize("cls", [SiluAndMulWithKernel, SiluAndMulStringDevice])
-@pytest.mark.parametrize("device", ["cuda", "cpu"])
-def test_hub_forward(cls, device):
+def test_hub_forward(cls):
     torch.random.manual_seed(0)

     silu_and_mul = SiluAndMul()
-    X = torch.randn((32, 64), device=device)
+    X = torch.randn((32, 64), device="cuda")
     Y = silu_and_mul(X)

-    silu_and_mul_with_kernel = kernelize(cls(), device=device, mode=Mode.INFERENCE)
+    silu_and_mul_with_kernel = kernelize(cls(), device="cuda", mode=Mode.INFERENCE)
     Y_kernel = silu_and_mul_with_kernel(X)

     torch.testing.assert_close(Y_kernel, Y)

     assert silu_and_mul.n_calls == 1
-    if device == "cuda":
-        assert silu_and_mul_with_kernel.n_calls == 0
-    else:
-        assert silu_and_mul_with_kernel.n_calls == 1
+    assert silu_and_mul_with_kernel.n_calls == 0


 @pytest.mark.rocm_only
@@ -151,6 +192,54 @@ def test_hub_forward_rocm():
     assert silu_and_mul_with_kernel.n_calls in [0, 1]


+@pytest.mark.xpu_only
+def test_hub_forward_xpu():
+    torch.manual_seed(0)
+
+    hidden_size = 1024
+    weight = torch.ones(hidden_size, device="xpu")
+    rms_norm = RMSNorm(weight).to("xpu")
+    X = torch.randn(4, 16, hidden_size, device="xpu", dtype=torch.float32)
+    Y = rms_norm(X)
+
+    rms_norm_with_kernel = kernelize(
+        RMSNormWithKernel(weight), mode=Mode.INFERENCE, device="xpu"
+    )
+    Y_kernel = rms_norm_with_kernel(X)
+
+    torch.testing.assert_close(Y_kernel, Y)
+
+    assert rms_norm.n_calls == 1
+    assert rms_norm_with_kernel.n_calls == 0
+
+
+@pytest.mark.npu_only
+def test_hub_forward_npu():
+    torch.manual_seed(0)
+
+    silu_and_mul = SiluAndMul()
+    X = torch.randn((32, 64), device="npu")
+    Y = silu_and_mul(X)
+
+    silu_and_mul_with_kernel = kernelize(
+        SiluAndMulWithKernel(), device="npu", mode=Mode.INFERENCE
+    )
+    Y_kernel = silu_and_mul_with_kernel(X)
+
+    torch.testing.assert_close(Y_kernel, Y)
+
+    assert silu_and_mul.n_calls == 1
+    assert silu_and_mul_with_kernel.n_calls == 0
+
+
+@pytest.mark.skipif(
+    hasattr(torch, "xpu") and getattr(torch.xpu, "is_available", lambda: False)(),
+    reason="Skip on xpu devices",
+)
+@pytest.mark.skipif(
+    _get_privateuse_backend_name() == "npu",
+    reason="Skip on npu devices",
+)
 def test_rocm_kernel_mapping():
     """Test that ROCm shorthand device mapping works correctly."""
     kernel_layer_mapping = {
@@ -238,16 +327,16 @@ def test_layer_fallback_works():
     kernelize(silu_and_mul, device="cuda", mode=Mode.INFERENCE)


-def test_local_layer_repo():
+def test_local_layer_repo(device):
     # Fetch a kernel to the local cache.
     package_name, path = install_kernel("kernels-test/backward-marker-test", "main")

-    linear = TorchLinearWithCounter(32, 32).to("cuda")
+    linear = TorchLinearWithCounter(32, 32).to(device)

     with use_kernel_mapping(
         {
             "Linear": {
-                "cuda": LocalLayerRepository(
+                device: LocalLayerRepository(
                     # install_kernel will give the fully-resolved path.
                     repo_path=path.parent.parent,
                     package_name=package_name,
@@ -259,7 +348,7 @@ def test_local_layer_repo():
     ):
         kernelize(linear, mode=Mode.INFERENCE)

-    X = torch.randn(10, 32, device="cuda")
+    X = torch.randn(10, 32, device=device)
     linear(X)
     assert linear.n_calls == 0
@@ -327,6 +416,7 @@ def test_mapping_contexts():
         "SiluAndMul",
         "SiluAndMulStringDevice",
         "SiluAndMulNoCompile",
+        "LigerRMSNorm",
     }

     extra_mapping1 = {
@@ -344,6 +434,7 @@ def test_mapping_contexts():
         "SiluAndMul",
         "SiluAndMulStringDevice",
         "SiluAndMulNoCompile",
+        "LigerRMSNorm",
         "TestKernel",
     }
@@ -362,6 +453,7 @@ def test_mapping_contexts():
         "SiluAndMul",
         "SiluAndMulStringDevice",
         "SiluAndMulNoCompile",
+        "LigerRMSNorm",
         "TestKernel",
     }
     assert (
@@ -375,6 +467,7 @@ def test_mapping_contexts():
         "SiluAndMul",
         "SiluAndMulStringDevice",
         "SiluAndMulNoCompile",
+        "LigerRMSNorm",
         "TestKernel",
     }
     assert (
@@ -397,6 +490,7 @@ def test_mapping_contexts():
         "SiluAndMul",
         "SiluAndMulStringDevice",
         "SiluAndMulNoCompile",
+        "LigerRMSNorm",
         "TestKernel",
     }
     assert (
@@ -408,6 +502,7 @@ def test_mapping_contexts():
         "SiluAndMul",
         "SiluAndMulStringDevice",
         "SiluAndMulNoCompile",
+        "LigerRMSNorm",
     }
@@ -417,26 +512,43 @@ def test_validate_kernel_layer():
             super().__init__(*args, **kwargs)
             self.foo = 42

-    with pytest.raises(TypeError, match="not override"):
-        _validate_layer(cls=BadLayer, check_cls=SiluAndMul)
+    def stub_repo(layer):
+        return LayerRepository(
+            repo_id="kernels-test/nonexisting", layer_name=layer.__name__
+        )
+
+    with pytest.raises(
+        TypeError,
+        match="`kernels-test/nonexisting`.*layer `BadLayer` must not override",
+    ):
+        _validate_layer(cls=BadLayer, check_cls=SiluAndMul, repo=stub_repo(BadLayer))

     class BadLayer2(nn.Module):
         foo: int = 42

-    with pytest.raises(TypeError, match="not contain additional members"):
-        _validate_layer(cls=BadLayer2, check_cls=SiluAndMul)
+    with pytest.raises(
+        TypeError,
+        match="`kernels-test/nonexisting`.*layer `BadLayer2` must not contain.*SiluAndMul",
+    ):
+        _validate_layer(cls=BadLayer2, check_cls=SiluAndMul, repo=stub_repo(BadLayer2))

     class BadLayer3(nn.Module):
         def forward(self, x: torch.Tensor, foo: int) -> torch.Tensor: ...

-    with pytest.raises(TypeError, match="different number of arguments"):
-        _validate_layer(cls=BadLayer3, check_cls=SiluAndMul)
+    with pytest.raises(
+        TypeError,
+        match="Forward.*`kernels-test/nonexisting`.*layer `BadLayer3` does not match `SiluAndMul`: different number of arguments",
+    ):
+        _validate_layer(cls=BadLayer3, check_cls=SiluAndMul, repo=stub_repo(BadLayer3))

     class BadLayer4(nn.Module):
         def forward(self, *, x: torch.Tensor) -> torch.Tensor: ...

-    with pytest.raises(TypeError, match="different kind of arguments"):
-        _validate_layer(cls=BadLayer4, check_cls=SiluAndMul)
+    with pytest.raises(
+        TypeError,
+        match="Forward.*`kernels-test/nonexisting`.*layer `BadLayer4` does not match `SiluAndMul`: different kind of arguments",
+    ):
+        _validate_layer(cls=BadLayer4, check_cls=SiluAndMul, repo=stub_repo(BadLayer4))


 @pytest.mark.cuda_only
@@ -488,11 +600,6 @@ def test_kernel_modes():
         linear(X)
         assert linear.n_calls == 0

-        # Same as previous, since TRAINING | TORCH_COMPILE is the default.
-        kernelize(linear)
-        linear(X)
-        assert linear.n_calls == 0
-
     # Case 2: register a kernel just for training. If no base kernel
     # layer is registered, we fall back to the original layer.
     with use_kernel_mapping(
@@ -522,12 +629,6 @@ def test_kernel_modes():
         # TRAINING | TORCH_COMPILE cannot fall back to TRAINING kernel, so uses original.
         assert linear.n_calls == 1

-        # Same as previous, since TRAINING | TORCH_COMPILE is the default.
-        kernelize(linear)
-        linear(X)
-        # TRAINING | TORCH_COMPILE cannot fall back to TRAINING kernel, so uses original.
-        assert linear.n_calls == 2
-
     # Case 3: register a kernel just for training and one for fallback.
     with use_kernel_mapping(
         {
@@ -549,23 +650,17 @@ def test_kernel_modes():
         X = torch.randn(10, 32, device="cuda")
         linear(X)
         # Falls back to TRAINING.
-        assert linear.n_calls == 2
+        assert linear.n_calls == 1

         kernelize(linear, mode=Mode.TRAINING)
         linear(X)
         # Falls back to the TRAINING kernel.
-        assert linear.n_calls == 2
+        assert linear.n_calls == 1

         kernelize(linear, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
         linear(X)
         # TRAINING | TORCH_COMPILE falls back to FALLBACK kernel.
-        assert linear.n_calls == 2
-
-        # Same as previous, since TRAINING | TORCH_COMPILE is the default.
-        kernelize(linear)
-        linear(X)
-        # TRAINING | TORCH_COMPILE falls back to FALLBACK kernel.
-        assert linear.n_calls == 2
+        assert linear.n_calls == 1

     # Case 4: register a kernel with two preferences.
     with use_kernel_mapping(
@@ -585,22 +680,17 @@ def test_kernel_modes():
         X = torch.randn(10, 32, device="cuda")
         linear(X)
         # Falls back to the TRAINING | TORCH_COMPILE kernel.
-        assert linear.n_calls == 2
+        assert linear.n_calls == 1

         kernelize(linear, mode=Mode.TRAINING)
         linear(X)
         # TRAINING can fall back to TRAINING | TORCH_COMPILE kernel.
-        assert linear.n_calls == 2
+        assert linear.n_calls == 1

         kernelize(linear, mode=Mode.TRAINING | Mode.TORCH_COMPILE)
         linear(X)
         # Uses TRAINING | TORCH_COMPILE kernel.
-        assert linear.n_calls == 2
-
-        kernelize(linear)
-        linear(X)
-        # Same as previous, since TRAINING | TORCH_COMPILE is the default.
-        assert linear.n_calls == 2
+        assert linear.n_calls == 1


 @pytest.mark.cuda_only
@@ -949,7 +1039,7 @@ def test_kernel_modes_cross_fallback():
     assert linear.n_calls == 2


-def test_layer_versions():
+def test_layer_versions(device):
     @use_kernel_forward_from_hub("Version")
     class Version(nn.Module):
         def forward(self) -> str:
@@ -960,20 +1050,20 @@ def test_layer_versions():
     with use_kernel_mapping(
         {
             "Version": {
-                Device(type="cuda"): LayerRepository(
+                Device(type=device): LayerRepository(
                     repo_id="kernels-test/versions",
                     layer_name="Version",
                 )
             }
         }
     ):
-        version = kernelize(version, device="cuda", mode=Mode.INFERENCE)
+        version = kernelize(version, device=device, mode=Mode.INFERENCE)
         assert version() == "0.2.0"

     with use_kernel_mapping(
         {
             "Version": {
-                Device(type="cuda"): LayerRepository(
+                Device(type=device): LayerRepository(
                     repo_id="kernels-test/versions",
                     layer_name="Version",
                     version="<1.0.0",
@@ -981,13 +1071,13 @@ def test_layer_versions():
             }
         }
     ):
-        version = kernelize(version, device="cuda", mode=Mode.INFERENCE)
+        version = kernelize(version, device=device, mode=Mode.INFERENCE)
         assert version() == "0.2.0"

     with use_kernel_mapping(
         {
             "Version": {
-                Device(type="cuda"): LayerRepository(
+                Device(type=device): LayerRepository(
                     repo_id="kernels-test/versions",
                     layer_name="Version",
                     version="<0.2.0",
@@ -995,13 +1085,13 @@ def test_layer_versions():
             }
         }
     ):
-        version = kernelize(version, device="cuda", mode=Mode.INFERENCE)
+        version = kernelize(version, device=device, mode=Mode.INFERENCE)
         assert version() == "0.1.1"

     with use_kernel_mapping(
         {
             "Version": {
-                Device(type="cuda"): LayerRepository(
+                Device(type=device): LayerRepository(
                     repo_id="kernels-test/versions",
                     layer_name="Version",
                     version=">0.1.0,<0.2.0",
@@ -1009,13 +1099,13 @@ def test_layer_versions():
             }
         }
     ):
-        version = kernelize(version, device="cuda", mode=Mode.INFERENCE)
+        version = kernelize(version, device=device, mode=Mode.INFERENCE)
         assert version() == "0.1.1"

     with use_kernel_mapping(
         {
             "Version": {
-                Device(type="cuda"): LayerRepository(
+                Device(type=device): LayerRepository(
                     repo_id="kernels-test/versions",
                     layer_name="Version",
                     version=">0.2.0",
@@ -1024,13 +1114,13 @@ def test_layer_versions():
             }
         }
     ):
         with pytest.raises(ValueError, match=r"No version.*satisfies requirement"):
-            kernelize(version, device="cuda", mode=Mode.INFERENCE)
+            kernelize(version, device=device, mode=Mode.INFERENCE)

     with pytest.raises(ValueError, match=r"Either a revision or a version.*not both"):
         use_kernel_mapping(
             {
                 "Version": {
-                    Device(type="cuda"): LayerRepository(
+                    Device(type=device): LayerRepository(
                         repo_id="kernels-test/versions",
                         layer_name="Version",
                         revision="v0.1.0",