Making Agents Play Pictionary


I had a free Saturday and a question I couldn’t shake: what happens if you make AI agents play Pictionary against each other?

One agent draws. Another looks at the canvas and guesses. It’s silly, but it’s also a surprisingly natural way to probe two things at once: can a model express an idea visually, and can another model interpret what it sees? No static dataset, no canned prompts. Just two agents going back and forth.

So I built Agents Pictionary, a turn-based game where Claude agents compete over HTTP, with a browser UI streaming the match in real time.

1. How It’s Put Together

Three pieces:

A game server: Python + FastAPI, in-memory state, a single room called default.
A spectator UI: React + TypeScript; streams events over WebSocket and renders the canvas live in the browser.
An MCP server: runs on the user’s machine and wraps the HTTP API into tools that any Claude agent can call.

Figure: High-level architecture in three tiers. AI agents at the top connect via MCP servers to a central FastAPI game server, which in turn streams events over WebSocket to a React spectator UI at the bottom.

The MCP server is just a thin adapter: it knows nothing about game logic and only translates tool calls into REST requests. It runs as a separate process per agent, so each agent gets its own isolated session. Two agents in the same room can’t accidentally share tokens or state, because the isolation is process-level.

Agents speak plain HTTP REST and never touch the WebSocket. The WebSocket is exclusively for the spectator UI, which subscribes to an append-only event stream at WS /rooms/{id}/events?since_seq=N. On reconnect, the server replays all events from the given sequence number. The server never prunes this log, so the UI can crash and catch up cleanly.

The session token (tok_<urlsafe base64>) is issued on POST /rooms/{id}/join and lives only in the MCP process’s RAM: it is never written to disk, and it does not survive a server restart. Each agent’s MCP process holds its own token, which is why process-level isolation matters. There’s no shared cookie jar.
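To make the adapter idea concrete, here is a minimal sketch of a per-process session object. It is not the code from the repo: the class name PictionarySession, the Bearer authorization header, and the join payload shape are assumptions for illustration. Only the endpoint paths and the token field in the join response come from the description above.

```python
import json
import urllib.request


class PictionarySession:
    """Hypothetical per-process session: the token lives only in this object."""

    def __init__(self, base_url: str, room_id: str = "default"):
        self.base_url = base_url.rstrip("/")
        self.room_id = room_id
        self.token = None  # set after join(); never written to disk

    def build_request(self, method: str, action: str, payload=None):
        url = f"{self.base_url}/rooms/{self.room_id}/{action}"
        headers = {"Content-Type": "application/json"}
        if self.token:
            # Assumption: the server accepts the token as a Bearer header.
            headers["Authorization"] = f"Bearer {self.token}"
        data = json.dumps(payload).encode() if payload is not None else None
        return urllib.request.Request(url, data=data, headers=headers, method=method)

    def call(self, method: str, action: str, payload=None):
        # Plain REST round trip; no game logic lives on this side.
        with urllib.request.urlopen(self.build_request(method, action, payload)) as resp:
            return json.load(resp)

    def join(self, name: str):
        # Assumption: the join body carries a display name.
        body = self.call("POST", "join", {"name": name})
        self.token = body["token"]
        return body
```

Because each MCP process would own exactly one such object, two agents can only share a token if they share a process.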

2. The HTTP API

The full surface is small: eight HTTP endpoints (seven for game actions and state, plus the PNG export) and the WebSocket:

Method	Path	What it does
`POST`	`/rooms/{id}/join`	Get `player_id`, `token`, initial `room_state`
`POST`	`/rooms/{id}/start`	Begin game (requires ≥2 players in lobby)
`GET`	`/rooms/{id}/turn`	Long-poll for turn context (the interesting one)
`POST`	`/rooms/{id}/strokes`	Submit stroke batch (drawer only, atomic)
`POST`	`/rooms/{id}/guess`	Submit guess → `{correct, score_delta}`
`POST`	`/rooms/{id}/give_up`	End turn early (drawer only)
`GET`	`/rooms/{id}/state`	Full room snapshot, no secret word
`GET`	`/rooms/{id}/canvas.png`	Raw PNG; spectator only, agents never call this

The MCP server exposes eight tools that map directly onto these: pictionary_join, pictionary_start, pictionary_wait_for_turn, pictionary_draw, pictionary_guess, pictionary_give_up, pictionary_get_canvas, pictionary_get_state. The tool layer is deliberately thin: it validates inputs and translates the response into a format Claude can consume, but all game logic stays server-side.

One thing worth noting about pictionary_wait_for_turn: it returns a list of MCP content items. The first is a text item with the full JSON context (role, time left, round number, recent guesses; everything except the canvas). If there’s an active canvas, there’s a second item: a proper ImageContent block built from the base64 PNG. Agents see a real image, not a raw base64 string embedded in text. That distinction matters for how multimodal models handle it.
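A rough sketch of that split, using plain dicts that mirror the shape of MCP text and image content items (type/text, and type/data/mimeType). The function name to_mcp_content and the canvas_png_base64 field name are hypothetical; the real server response may name things differently.

```python
import json


def to_mcp_content(turn: dict) -> list:
    """Split a turn response into a text item and an optional image item."""
    # Everything except the canvas goes into the JSON text item.
    context = {k: v for k, v in turn.items() if k != "canvas_png_base64"}
    items = [{"type": "text", "text": json.dumps(context)}]

    canvas = turn.get("canvas_png_base64")  # hypothetical field name
    if canvas:
        # The image travels as its own content item, not as text.
        items.append({"type": "image", "data": canvas, "mimeType": "image/png"})
    return items
```

A drawer's first context with an empty canvas would come back as a single text item; a guesser's update with strokes on the canvas would come back as two items.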

3. How a Turn Works

The turn lifecycle is a small state machine:

lobby → in_round → between_rounds → in_round → ... → ended → lobby

One player is the drawer and gets the secret word. Everyone else is a guesser. The drawer submits strokes; guessers poll for canvas updates and submit text guesses. First correct guess ends the turn.
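The lifecycle can be written down as a small transition table. This is a sketch, not the server's code: the RoomState enum and can_transition helper are made-up names, and placing the move to ended after between_rounds is an assumption (the chain above leaves the exact source state open).

```python
from enum import Enum


class RoomState(Enum):
    LOBBY = "lobby"
    IN_ROUND = "in_round"
    BETWEEN_ROUNDS = "between_rounds"
    ENDED = "ended"


# Allowed moves, following lobby -> in_round -> between_rounds -> ... -> ended -> lobby.
TRANSITIONS = {
    RoomState.LOBBY: {RoomState.IN_ROUND},
    RoomState.IN_ROUND: {RoomState.BETWEEN_ROUNDS},
    # Assumption: the game ends from between_rounds when no next turn can start.
    RoomState.BETWEEN_ROUNDS: {RoomState.IN_ROUND, RoomState.ENDED},
    RoomState.ENDED: {RoomState.LOBBY},
}


def can_transition(current: RoomState, target: RoomState) -> bool:
    """Return True if the state machine allows moving from current to target."""
    return target in TRANSITIONS[current]
```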

The canvas is 1000×600 pixels, white background, black strokes only. Agents never receive raw stroke data, only base64 PNG snapshots delivered inline in JSON. This is deliberate. Whether an agent is GPT-4o or Claude Opus, it sees the same pixels. No model gets to cheat by parsing vector paths.

The between-rounds phase is server-driven. When a turn ends, an asyncio background task fires after 10 seconds, advances drawer_index, and starts the next turn. Agents don’t need to poll for it; they just call pictionary_wait_for_turn again and block until the server is ready. If the word list runs dry during begin_turn(), the server calls end_game() automatically. No awkward half-states.
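Here is roughly what that background task could look like. The Room class is a toy stand-in written for this example, and wrapping drawer_index around with a modulo is an assumption; only the 10-second delay, drawer_index, begin_turn() and end_game() come from the description above.

```python
import asyncio

BETWEEN_ROUNDS_SECONDS = 10  # delay described in the post


class Room:
    """Toy stand-in for the server's room object (hypothetical)."""

    def __init__(self, players, words):
        self.players = list(players)
        self.words = list(words)
        self.drawer_index = 0
        self.state = "between_rounds"
        self.current_word = None

    def begin_turn(self):
        if not self.words:
            # Word list ran dry: end the game instead of a half-started turn.
            self.end_game()
            return
        self.current_word = self.words.pop(0)
        self.state = "in_round"

    def end_game(self):
        self.state = "ended"


async def advance_after_delay(room: Room, delay: float = BETWEEN_ROUNDS_SECONDS):
    """Wait out the pause, rotate the drawer, then start the next turn."""
    await asyncio.sleep(delay)
    # Assumption: drawers rotate round-robin.
    room.drawer_index = (room.drawer_index + 1) % len(room.players)
    room.begin_turn()
```

On the server this would be scheduled with something like asyncio.create_task(advance_after_delay(room)) when a turn ends, so no agent request has to drive the transition.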

4. Canvas Rendering

Every call to /turn re-renders the canvas from scratch using Pillow. The stroke list is replayed in order:

A single-point stroke becomes a filled ellipse with radius = width / 2.
A multi-point stroke becomes a polyline with round joints, capped at both endpoints with circles.

The result is encoded as base64 and embedded inline in the JSON response. Agents never get a URL to fetch: the image arrives with the turn context or not at all. GET /rooms/{id}/canvas.png exists, but it’s for the spectator UI, not agents.

Stroke batches are atomic. If any stroke in a batch fails validation, the entire batch is rejected. Out-of-bounds points are clamped to the canvas edges rather than rejected (a point at x=1050 becomes x=999). This is lenient by design, since an agent drawing near the edge shouldn’t lose a whole batch over a few pixels. But a stroke with zero points or an illegal color fails hard, and takes the rest of the batch with it.

Canvas versioning is a simple integer counter that increments on every accepted stroke batch. Guessers track this as since_version in their long-poll requests.
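The clamp-but-validate rule and the version counter fit in a few lines. This is a sketch under assumptions: strokes are modeled as dicts with points and color keys, "black" is the only legal color value, and the Canvas and InvalidStroke names are invented for the example.

```python
CANVAS_WIDTH, CANVAS_HEIGHT = 1000, 600
ALLOWED_COLORS = {"black"}  # assumption: how the legal color is spelled


class InvalidStroke(ValueError):
    pass


def clamp_point(x, y):
    """Pull an out-of-bounds point back onto the canvas (x=1050 -> x=999)."""
    return (min(max(x, 0), CANVAS_WIDTH - 1), min(max(y, 0), CANVAS_HEIGHT - 1))


def validate_batch(strokes):
    """Return cleaned strokes, or raise if any single stroke is invalid."""
    cleaned = []
    for index, stroke in enumerate(strokes):
        points = stroke.get("points") or []
        if not points:
            raise InvalidStroke(f"stroke {index} has no points")
        if stroke.get("color", "black") not in ALLOWED_COLORS:
            raise InvalidStroke(f"stroke {index} has an illegal color")
        cleaned.append({**stroke, "points": [clamp_point(x, y) for x, y in points]})
    return cleaned


class Canvas:
    def __init__(self):
        self.strokes = []
        self.version = 0

    def add_batch(self, strokes):
        # Validation runs to completion before any mutation, so a bad
        # stroke rejects the whole batch and the version stays put.
        cleaned = validate_batch(strokes)
        self.strokes.extend(cleaned)
        self.version += 1
        return self.version
```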

5. The Long-Poll Delivery Contract

This was the hardest part to get right. The endpoint is GET /rooms/{id}/turn?wait=true&since_version=N&max_wait_seconds=60. On the surface it’s a standard long-poll: block until something changes, return, repeat. But “something changed” is where it gets nuanced.

The server tracks per-player, per-turn delivery state in a _last_delivery dict keyed by (player_id, turn_id). For each entry it records the last delivered canvas version, the last delivered guess count, and a monotonic timestamp. A guesser gets a new response when either:

The canvas version advanced and at least CANVAS_DEBOUNCE_SECONDS = 2 seconds have elapsed since the last delivery, or
New guesses have arrived since the last delivery.

The 2-second debounce is the important part. A drawer making rapid small strokes could otherwise flood guessers with a response per batch. With the debounce, the server accumulates strokes and delivers one snapshot every 2 seconds at most. Guessers see the canvas advance in meaningful steps rather than getting re-woken every 200ms.
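The wake-up rule for guessers boils down to one predicate. The sketch below is a guess at its shape, not the actual implementation: the Delivery record and should_deliver function are hypothetical names, and treating a player with no delivery record as "deliver immediately" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

CANVAS_DEBOUNCE_SECONDS = 2.0


@dataclass
class Delivery:
    """What was last sent to one (player_id, turn_id) pair."""
    canvas_version: int
    guess_count: int
    delivered_at: float  # a time.monotonic() reading


def should_deliver(last: Optional[Delivery], canvas_version: int,
                   guess_count: int, now: float) -> bool:
    if last is None:
        return True  # assumption: the first poll of a turn is answered at once
    if guess_count > last.guess_count:
        return True  # new guesses are never debounced
    canvas_advanced = canvas_version > last.canvas_version
    debounce_elapsed = now - last.delivered_at >= CANVAS_DEBOUNCE_SECONDS
    return canvas_advanced and debounce_elapsed
```

Passing now in explicitly, rather than reading the clock inside the function, keeps the rule easy to unit test.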

The drawer’s flow is different. The first call to /turn after a round starts returns immediately with the secret word and canvas. Subsequent calls return None: the drawer doesn’t re-poll while drawing; it just submits strokes, and the server delivers updates to guessers automatically. If max_wait_seconds elapses with no update, the server returns TurnContextWaiting and the client reissues the request immediately. That isn’t a busy-loop: the retry can be instant because the server already waited.
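On the client side, that contract means the retry loop needs no sleep. A minimal sketch, assuming the waiting response is a dict with status set to "waiting" (the real TurnContextWaiting shape may differ) and that fetch_turn is whatever callable performs the GET:

```python
def wait_for_update(fetch_turn, since_version: int, max_wait_seconds: int = 60) -> dict:
    """Reissue the long-poll until the server returns real turn context."""
    while True:
        # Each call blocks server-side for up to max_wait_seconds,
        # so looping without a sleep does not spin.
        context = fetch_turn(
            wait=True,
            since_version=since_version,
            max_wait_seconds=max_wait_seconds,
        )
        # Assumption: a timeout is signalled as {"status": "waiting"}.
        if context.get("status") != "waiting":
            return context
```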

6. The Interesting Design Decisions

Guess matching. Guesses are normalized (lowercased, trimmed, whitespace collapsed), then accepted on an exact match or when rapidfuzz’s fuzz.ratio, rescaled from 0-100 to 0-1, is at least 0.9:

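import re

from rapidfuzz import fuzz

def is_correct(guess: str, target: str, threshold: float = 0.9) -> bool:
    g, t = normalize(guess), normalize(target)
    if g == t:
        return True
    # fuzz.ratio returns a score in 0-100; rescale it to compare with the threshold.
    return fuzz.ratio(g, t) / 100 >= threshold

def normalize(text: str) -> str:
    # Lowercase, trim the ends, and collapse whitespace runs to one space.
    return re.sub(r"\s+", " ", text.strip().lower())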
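For intuition about the 0.9 threshold: rapidfuzz documents fuzz.ratio as a normalized Indel similarity, which works out to 2 × LCS(a, b) / (len(a) + len(b)), where LCS is the longest common subsequence. The stdlib-only sketch below reimplements that formula so the scores can be checked by hand. The function names are mine, and it returns a value in 0-1 rather than rapidfuzz's 0-100.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def indel_ratio(a: str, b: str) -> float:
    """Normalized Indel similarity in [0, 1]."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty strings are identical
    return 2 * lcs_length(a, b) / total
```

A dropped letter barely moves the ratio, while extra words inflate the denominator quickly, which is why prose answers fall short.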

"elphant" matches "elephant" (ratio ~0.93). "i think it’s an elephant" never matches (~0.5 against "elephant", well below the threshold). Sentences deliberately don’t score. This forces agents to output a single word rather than hedging with prose. An agent that doesn’t read the protocol contract simply never scores. The threshold is tunable per room config if you want to experiment with looser or stricter matching.

Scoring. Guesser score: round(max(50, 200 * time_left / time_total)). Drawer score: round(100 * time_left / time_total). Everyone else: 0. On timeout or give-up: all 0.

The floor at 50 for guessers is load-bearing. Without it, an agent might rationally stop guessing late in a turn because the expected value approaches zero. The floor keeps guessing rational all the way to the buzzer. The drawer’s formula has no floor: their reward scales purely with how fast the guesser figured it out. With an 80-second turn, every second costs a guesser 2.5 points (200 / 80) until the floor takes over with 20 seconds left. Time pressure is real.

Equal epistemic footing. Agents get PNG, not vector strokes. The raw stroke data (coordinates, widths, ordering) lives on the server and is never sent to guessers. Every model sees the same rendered pixels. This prevents any future model from gaining an edge by parsing stroke geometry instead of actually reading the image, which would defeat the point of the experiment.
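The two formulas, written out as functions so the numbers are easy to play with. The function names are illustrative; the arithmetic is exactly the formulas quoted above.

```python
def guesser_score(time_left: float, time_total: float) -> int:
    """Correct guesser: linear in remaining time, but never below 50."""
    return round(max(50, 200 * time_left / time_total))


def drawer_score(time_left: float, time_total: float) -> int:
    """Drawer: linear in remaining time, with no floor."""
    return round(100 * time_left / time_total)


# An 80-second turn, guessed with 40 seconds left.
print(guesser_score(40, 80), drawer_score(40, 80))  # 100 50
# The same turn guessed with 5 seconds left: the guesser floor applies.
print(guesser_score(5, 80), drawer_score(5, 80))    # 50 6
```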

Turn delivery. Guessers call GET /rooms/{id}/turn with a since_version cursor. The server holds the connection open until the canvas advances meaningfully or the turn ends. There’s a 2-second gate so a drawer making lots of small strokes doesn’t flood guessers with a response per stroke.

7. The Event Stream

The WebSocket at /rooms/{id}/events?since_seq=N gives the spectator UI a live feed of everything that happens. Events are typed and sequenced:

player_joined: someone connected
turn_start: {turn_id, drawer_id, time_limit_seconds, round}
strokes_added: {canvas_version, strokes: [...]}
guess: {player_id, text, correct}
turn_end: {end_reason, winner_id, word, score_deltas, duration_seconds}
game_end: {final_scores}

The turn_end event includes the secret word. That’s the moment the spectator UI reveals it to the audience, which is exactly the kind of thing that makes Pictionary feel like Pictionary, even for an AI game. The log is append-only and never pruned, so the UI can reconnect at any sequence number and replay from there. Useful for debugging too: a full event replay is a complete record of the game.
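An append-only log with replay is only a few lines. In this sketch the EventLog class is hypothetical, sequence numbers start at 1, and since_seq is treated as "the last seq I already have", so replay returns events with a strictly greater seq. Whether the real server treats since_seq as inclusive or exclusive is an assumption here.

```python
class EventLog:
    """In-memory, append-only, never pruned."""

    def __init__(self):
        self._events = []

    def append(self, event_type: str, data: dict) -> dict:
        # Sequence numbers are dense and start at 1.
        event = {"seq": len(self._events) + 1, "type": event_type, "data": data}
        self._events.append(event)
        return event

    def replay(self, since_seq: int = 0) -> list:
        # Event with seq n sits at index n - 1, so slicing from since_seq
        # yields every event with seq > since_seq.
        return self._events[since_seq:]
```

A UI that crashed after processing seq 2 reconnects with since_seq=2 and receives everything from seq 3 onward; since_seq=0 replays the whole game.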

8. What I Learned

The game logic was the easy part. The hard part was the turn delivery contract: figuring out what counts as “new enough” to wake a waiting guesser, how to avoid noise on stroke-heavy turns, and what to return when a turn ends mid-wait. That _last_delivery dict and the 2-second debounce were the last things I added and the things that made the system actually playable.

The code is at github.com/capybara-brain346/agents-pictionary