## Why?

Anthropic recently released an article, [Code execution with MCP: Building more efficient agents](https://www.anthropic.com/engineering/code-execution-with-mcp) and it focused heavily on **token conservation** and how agents can use **sandboxed environments** to execute code to fetch results.

Idea of token/context management stuck with me and then I went down the rabbit hole of token management and optimization where I started studying **vLLM inference runtimes**. vLLM employs a ton of methods to optimize inference but I found PagedAttention very interesting, so **I decided to simulate it.**

## The Problems

Nowadays LLM serving is one the most crucial components of applications. But still it faces a lot of challenges related to cost, performance and resource management.

**Common challenges:**
- **High computational load**
- **GPU utilization**
- **Latency issues**
- **Inefficient memory management:**
	1. Internal fragmentation
	2. Reservation waste
	3. External fragmentation
- **Complexity of distributed inference**

vLLM inference runtime aims to resolve these challenges and one of the core components it uses is PagedAttention

## PagedAttention

In LLMs, the **KV cache** plays a important role during inference. As tokens are generated one by one, because **LLMs are autoregressive**, the model stores the key and value tensors from previous steps instead of recomputing them. This makes future **token prediction faster because the model can reuse past computations**.

Early implementations followed a memory allocation strategy borrowed from traditional deep learning workloads. Each inference request **received a large, contiguous block of GPU memory reserved for its KV cache**. The problem was that sequence lengths during inference are unpredictable. Some inputs end quickly, others expand due to long responses or continued user interaction. Since **the system could not know the exact length ahead of time, it had to reserve more memory than necessary**.

**This led to several inefficiencies;**

- Most of the allocated memory remained unused, contributing to internal fragmentation.
    
- Gaps appeared between allocated memory regions, creating external fragmentation.
    
- Memory was often blocked for potential future tokens but never consumed.
    

Despite reserving large chunks of GPU memory, real usage often hovered between 20 and 40 percent. As a result, throughput suffered because the model could serve fewer simultaneous requests before exhausting memory. The hardware was capable of more, but the memory layout became the bottleneck.

PagedAttention redesigns KV cache storage using a paging mechanism inspired by operating systems. Instead of one large allocation, the cache is broken into smaller fixed-size chunks called KV blocks.

Key ideas introduced by PagedAttention:

- **Non-contiguous storage**
    

```markdown
Logical sequence:
[Block 1][Block 2][Block 3][Block 4]

Physical GPU layout:
[Block 1]       [Block 3] [Block 4]     [Block 2]
    ^ mapping handled by page table ^
```

- **On-demand allocation**
    

```markdown
Initial:
[Block 1]

As sequence grows:
[Block 1][Block 2][Block 3] ... allocated only when needed
```

- **Near-zero waste**  
    Only the final incomplete block may have unused memory.
    
- **Page-table-style lookup**
    

```markdown
+-------------+-------------------------+
| Token Range | Physical Block Address  |
+-------------+-------------------------+
|   0 - 255   | Block_12                |
| 256 - 511   | Block_7                 |
| 512 - 767   | Block_19                |
+-------------+-------------------------+
```

- **Copy-on-write sharing for generation branches**
    

During beam search or sampling, multiple candidates often share early history:

```markdown
Before divergence:
Beam A → [B1][B2][B3]
Beam B → [B1][B2][B3]   (shared memory)

After divergence:
Beam A → [B1][B2][B3][B4]
Beam B → [B1][B2][B3][B4’]  (copy only when needed)
```

- **Higher throughput and scaling**  
    Since memory is used efficiently, more requests fit into GPU memory, increasing batch size and token throughput.
    

## Implementation
### Data Structures

#### KVCacheBlock
- Represents a physical block in KV cache
- Properties:
  * block_id: Unique identifier for the block
  * reference_count: Number of requests sharing this block
  * block_hash: SHA256 hash of block tokens for deduplication

```python
class KVCacheBlock(BaseModel):
    block_id: int
    reference_count: int = 0
    block_hash: Optional[str] = None
```

#### Request
- Represents a user request with tokens and allocated blocks
- Properties:
  * tokens: List of token IDs
  * block_size: Size of each block
  * block_table: List of physical block IDs allocated to this request

```python
class Request(BaseModel):
    tokens: List[int]
    block_size: int
    block_table: List[int] = []
```

### PagedAttention Class

#### Initialization
- total_blocks: Maximum number of blocks available
- block_size: Number of tokens per block
- Maintains free blocks queue, physical block mappings, and hash table

```python
def __init__(self, total_blocks, block_size):
    self.total_blocks = total_blocks
    self.free_blocks = deque(range(total_blocks))
    self.physical_blocks = {}
    self.hash_to_block = {}
    self.cache_hits = 0
    self.cache_misses = 0
```

#### Block Hashing (hash_block_tokens)
- Creates SHA256 hash for token sequences
- Incorporates parent hash for sequential block identification
- Enables block-level deduplication

```python
def hash_block_tokens(self, tokens, parent_hash):
    hasher = hashlib.sha256()
    if parent_hash:
        hasher.update(parent_hash.encode())
    for token in tokens:
        hasher.update(str(token).encode())
    return hasher.hexdigest()[:16]
```

#### Block Allocation (allocate_block)
- Allocates free blocks from the pool
- Returns block_id if available, None if out of memory

```python
def allocate_block(self):
    if not self.free_block:
        return None
    block_id = self.free_blocks.popleft()
    return block_id
```

#### Block Deallocation (free_block)
- Removes block from physical memory
- Cleans up hash mappings
- Returns block to free pool

```python
def free_block(self, block_id: int):
    if block_id in self.physical_blocks:
        block = self.physical_blocks[block_id]
        if block.block_hash and block.block_hash in self.hash_to_block:
            del self.hash_to_block[block.block_hash]
        del self.physical_blocks[block_id]
    self.free_blocks.append(block_id)
```

#### Token Allocation (allocate)
- Divides tokens into blocks (block_size tokens per block)
- Implements cache hit logic: reuses existing blocks if hash matches
- Implements cache miss logic: allocates new blocks for unique token sequences
- Increments reference count for shared blocks

```python
def allocate(self, tokens):
    num_tokens = len(tokens)
    num_blocks = (num_tokens + self.block_size - 1) // self.block_size
    request = Request(tokens=tokens, block_size=self.block_size)
    parent_hash = None
    
    for block_idx in range(num_blocks):
        block_tokens = tokens[start_idx:end_idx]
        block_hash = self.hash_block_tokens(block_tokens, parent_hash)
        
        if block_hash in self.hash_to_block:  # Cache hit
            cached_block = self.hash_to_block[block_hash]
            cached_block.reference_count += 1
            request.block_table.append(cached_block.block_id)
            self.cache_hits += 1
        else:  # Cache miss
            new_block_id = self.allocate_block()
            new_block = KVCacheBlock(block_id=new_block_id, reference_count=1, block_hash=block_hash)
            self.physical_blocks[new_block_id] = new_block
            self.hash_to_block[block_hash] = new_block
            request.block_table.append(new_block_id)
            self.cache_misses += 1
        parent_hash = block_hash
    
    return request
```

#### Token Generation (generate_token)
- Appends new token to request
- Allocates new block when current block reaches capacity (token overflow)
- Implements Copy-on-Write (CoW): creates new block if shared block has reference_count > 1
- Writes in-place to exclusive blocks

```python
def generate_token(self, request, new_token):
    request.tokens.append(new_token)
    last_block = self.physical_blocks[request.block_table[-1]]
    
    if len(request.tokens) % self.block_size == 1 and len(request.tokens) > 1:
        # Token overflow: allocate new block
        new_block_id = self.allocate_block()
        new_block = KVCacheBlock(block_id=new_block_id, reference_count=1)
        self.physical_blocks[new_block_id] = new_block
        request.block_table.append(new_block_id)
    
    elif last_block.reference_count > 1:
        # Copy-on-Write: create new block for exclusive modification
        new_block_id = self.allocate_block()
        last_block.reference_count -= 1
        new_block = KVCacheBlock(block_id=new_block_id, reference_count=1)
        self.physical_blocks[new_block_id] = new_block
        request.block_table[-1] = new_block_id
```

#### Request Cleanup (free_request)
- Decrements reference count for all blocks in request
- Deallocates blocks when reference count reaches 0

```python
def free_request(self, request):
    for physical_id in request.block_table:
        if physical_id in self.physical_blocks:
            block = self.physical_blocks[physical_id]
            block.reference_count -= 1
            if block.reference_count == 0:
                self.free_block(physical_id)
```

#### Statistics (print_stats)
- Cache hit rate calculation
- Free and allocated block counts

```python
def print_stats(self):
    total = self.cache_hits + self.cache_misses
    hit_rate = (self.cache_hits / total * 100) if total > 0 else 0
    print(f"Hits: {self.cache_hits}")
    print(f"Misses: {self.cache_misses}")
    print(f"Hit Rate: {hit_rate:.1f}%")
    print(f"Free Blocks: {len(self.free_blocks)}/{self.total_blocks}")
    print(f"Allocated Blocks: {len(self.physical_blocks)}")
```

---
## Conclusion
#### Implementation: [GitHub](https://github.com/capybara-brain346/PagedAttention-Simulation)

Overall it was a fun project to implement and dive deep into PagedAttention.