--8<-- [start:installation]
vLLM supports basic model inference and serving on the x86 CPU platform, with data types FP32, FP16, and BF16.
--8<-- [end:installation]
--8<-- [start:requirements]
- OS: Linux
- CPU flags: `avx512f` (Recommended), `avx512_bf16` (Optional), `avx512_vnni` (Optional)

!!! tip
    Use `lscpu` to check the CPU flags.
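
For example, either of the following (a minimal sketch; the exact flag names appear in the `Flags:` line of `lscpu` output) lists the AVX-512 flags your CPU reports:

```bash
# List only the AVX-512 related flags reported by lscpu.
lscpu | grep -o 'avx512[a-z0-9_]*' | sort -u

# Or test for one specific flag via /proc/cpuinfo.
grep -q avx512f /proc/cpuinfo && echo "avx512f supported"
```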
--8<-- [end:requirements]
--8<-- [start:set-up-using-python]
--8<-- [end:set-up-using-python]
--8<-- [start:pre-built-wheels]
--8<-- [end:pre-built-wheels]
--8<-- [start:build-wheel-from-source]
--8<-- "docs/getting_started/installation/cpu/build.inc.md"
--8<-- [end:build-wheel-from-source]
--8<-- [start:pre-built-images]
Pre-built vLLM CPU images are available at [https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo](https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo).
!!! warning
    If deploying the pre-built images on machines without `avx512f`, `avx512_bf16`, or `avx512_vnni` support, an `Illegal instruction` error may be raised. It is recommended to build images for such machines with the appropriate build arguments (e.g., `--build-arg VLLM_CPU_DISABLE_AVX512=true`, `--build-arg VLLM_CPU_AVX512BF16=false`, or `--build-arg VLLM_CPU_AVX512VNNI=false`) to disable unsupported features. Note that without `avx512f`, AVX2 will be used instead; this fallback is not recommended, as it offers only basic feature support.
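
To pull one of these images, the gallery above should map to the `public.ecr.aws` pull endpoint (an assumption based on standard ECR Public naming; check the gallery page for the available tags):

```bash
# Pull a pre-built vLLM CPU image; <tag> is a placeholder for a release tag
# listed in the gallery (endpoint assumed from standard ECR Public naming).
docker pull public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:<tag>
```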
--8<-- [end:pre-built-images]
--8<-- [start:build-image-from-source]
```bash
# Build the CPU image; each build arg accepts false (default) or true.
docker build -f docker/Dockerfile.cpu \
        --build-arg VLLM_CPU_AVX512BF16=false (default)|true \
        --build-arg VLLM_CPU_AVX512VNNI=false (default)|true \
        --build-arg VLLM_CPU_DISABLE_AVX512=false (default)|true \
        --tag vllm-cpu-env \
        --target vllm-openai .

# Launching OpenAI server
docker run --rm \
           --security-opt seccomp=unconfined \
           --shm-size=4g \
           -p 8000:8000 \
           -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
           -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
           vllm-cpu-env \
           --model=meta-llama/Llama-3.2-1B-Instruct \
           --dtype=bfloat16 \
           other vLLM OpenAI server arguments
```
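
As a concrete illustration, the placeholders might be filled in as follows (the values below are assumptions for a 32-core host, not tuned recommendations):

```bash
# Illustrative values: 40 GiB reserved for the KV cache, OpenMP threads
# bound to cores 0-29, leaving the remaining cores for the serving framework.
docker run --rm \
           --security-opt seccomp=unconfined \
           --shm-size=4g \
           -p 8000:8000 \
           -e VLLM_CPU_KVCACHE_SPACE=40 \
           -e VLLM_CPU_OMP_THREADS_BIND=0-29 \
           vllm-cpu-env \
           --model=meta-llama/Llama-3.2-1B-Instruct \
           --dtype=bfloat16

# Once the server is up, a quick smoke test against the OpenAI-compatible API:
curl http://localhost:8000/v1/models
```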