[Feature] Fix guided decoding blocking bitmask memcpy (#12563)
**[Guided decoding performance optimization]** Sending the guided decoding bitmask in xgrammar to the GPU (`self.token_bitmask.to(scores.device)`) is a blocking operation that prevents the CPU from pre-launching the sampler kernels. The CPU waits until decode is complete, then copies the bitmask over. This PR changes the operation to async by setting `non_blocking=True`.

(Current) The CPU is blocked on a `cudaStreamSynchronize` and only launches the sampling kernels after the bitmask application. Below is the Nsys profile for one decode phase from Llama 3.1 8B.

[Nsys profile image]

With the optimization, this is no longer the case:

[Nsys profile image]

---------

Signed-off-by: Ryan N <ryan.nguyen@centml.ai>
```diff
@@ -307,8 +307,8 @@ class XGrammarLogitsProcessor:
         # Note: In this method, if the tensors have different dimensions
         # on CPU device fails, but on GPU it runs without error. Hence the
         # unsqueeze above for scores, to match the token bitmask shape
-        xgr.apply_token_bitmask_inplace(scores,
-                                        self.token_bitmask.to(scores.device))
+        xgr.apply_token_bitmask_inplace(
+            scores, self.token_bitmask.to(scores.device, non_blocking=True))
         if device_type != "cuda":
             scores = scores.to(dtype).to(device_type).squeeze()

```
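For context, a minimal standalone sketch of the pattern the diff applies (not vLLM's actual code; `vocab_size`, the bitmask layout, and the pinned-memory setup are illustrative assumptions):

```python
import torch

# Sketch only: assumes a CUDA device is available. `vocab_size` and the
# bitmask packing below are illustrative, not vLLM's actual values.
vocab_size = 128_256
scores = torch.randn(1, vocab_size, device="cuda")

# xgrammar builds the token bitmask on the CPU (bits packed into int32
# words). Pinning the memory is what lets `non_blocking=True` actually
# overlap the host-to-device copy with CPU work; from pageable memory the
# copy still synchronizes the host.
token_bitmask = torch.zeros(
    1, (vocab_size + 31) // 32, dtype=torch.int32).pin_memory()

# Blocking variant: the host stalls until the copy has landed on the GPU.
mask_blocking = token_bitmask.to(scores.device)

# Non-blocking variant: the copy is merely enqueued on the current CUDA
# stream, so the CPU can immediately go on to launch the sampling kernels.
mask_async = token_bitmask.to(scores.device, non_blocking=True)

# Correctness is preserved because later kernels on the same stream (the
# bitmask application, sampling, etc.) are ordered after the enqueued copy,
# so no explicit synchronization is needed on this path.
```

The key design point is stream ordering: because the bitmask-application kernel is enqueued on the same stream as the copy, the async copy changes only when the CPU is released, not the result.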