mirror of
https://github.com/vllm-project/vllm.git
synced 2025-10-20 14:53:52 +08:00
Fix figures in design doc (#18612)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
This commit is contained in:
@ -140,22 +140,18 @@ title: vLLM Paged Attention
|
||||
const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
|
||||
```
|
||||
|
||||
<figure markdown="span">
|
||||
{ align="center" alt="query" width="70%" }
|
||||
<figcaption>
|
||||
</figcaption>
|
||||
</figure>
|
||||
<figure markdown="span">
|
||||
{ align="center" alt="query" width="70%" }
|
||||
</figure>
|
||||
|
||||
- Each thread defines its own `q_ptr` which points to the assigned
|
||||
query token data on global memory. For example, if `VEC_SIZE` is 4
|
||||
and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
|
||||
total of 128 elements divided into 128 / 4 = 32 vecs.
|
||||
|
||||
<figure markdown="span">
|
||||
{ align="center" alt="q_vecs" width="70%" }
|
||||
<figcaption>
|
||||
</figcaption>
|
||||
</figure>
|
||||
<figure markdown="span">
|
||||
{ align="center" alt="q_vecs" width="70%" }
|
||||
</figure>
|
||||
|
||||
```cpp
|
||||
__shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];
|
||||
@ -192,11 +188,9 @@ title: vLLM Paged Attention
|
||||
points to key token data based on `k_cache` at assigned block,
|
||||
assigned head and assigned token.
|
||||
|
||||
<figure markdown="span">
|
||||
{ align="center" alt="key" width="70%" }
|
||||
<figcaption>
|
||||
</figcaption>
|
||||
</figure>
|
||||
<figure markdown="span">
|
||||
{ align="center" alt="key" width="70%" }
|
||||
</figure>
|
||||
|
||||
- The diagram above illustrates the memory layout for key data. It
|
||||
assumes that the `BLOCK_SIZE` is 16, `HEAD_SIZE` is 128, `x` is
|
||||
@ -209,11 +203,9 @@ title: vLLM Paged Attention
|
||||
elements for one token) that will be processed by 2 threads (one
|
||||
thread group) separately.
|
||||
|
||||
<figure markdown="span">
|
||||
{ align="center" alt="k_vecs" width="70%" }
|
||||
<figcaption>
|
||||
</figcaption>
|
||||
</figure>
|
||||
<figure markdown="span">
|
||||
{ align="center" alt="k_vecs" width="70%" }
|
||||
</figure>
|
||||
|
||||
```cpp
|
||||
K_vec k_vecs[NUM_VECS_PER_THREAD]
|
||||
@ -372,20 +364,14 @@ title: vLLM Paged Attention
|
||||
|
||||
<figure markdown="span">
|
||||
{ align="center" alt="value" width="70%" }
|
||||
<figcaption>
|
||||
</figcaption>
|
||||
</figure>
|
||||
|
||||
<figure markdown="span">
|
||||
{ align="center" alt="logits_vec" width="50%" }
|
||||
<figcaption>
|
||||
</figcaption>
|
||||
</figure>
|
||||
|
||||
<figure markdown="span">
|
||||
{ align="center" alt="v_vec" width="70%" }
|
||||
<figcaption>
|
||||
</figcaption>
|
||||
</figure>
|
||||
|
||||
- Now we need to retrieve the value data and perform dot multiplication
|
||||
|
Reference in New Issue
Block a user