How We Built an AI Dungeon Master with Claude: Architecture Deep-Dive
How Scrollbook uses Claude's 200K context window, prompt caching (90% cost savings), and tool use to run live D&D sessions inside Discord. Full technical breakdown.
Building an AI Dungeon Master isn't just about hooking up an LLM to a Discord bot. It requires careful engineering around context management, prompt caching, real-time responsiveness, and cost optimization. Here's how we did it.
Why Claude Sonnet 4.5?
When we started Cipher, we evaluated multiple AI providers:
- OpenAI GPT-4: Excellent, but expensive for long-context campaigns
- Google Gemini: Good context window, but less consistent personalities
- AWS Bedrock: Convenient for deployment, but limited features
- Anthropic Claude: 200K context, prompt caching, and tool use
We chose Claude Sonnet 4.5 for three critical features:
1. 200K Token Context Window
D&D campaigns are long. A single session can generate 10K-20K tokens. Over 10 sessions, that's 100K-200K tokens of context. Claude's massive context window means we can include:
- Entire campaign history
- All NPC interactions and personalities
- Character backstories and progression
- World state and lore
- House rules and homebrew content
Without truncating or summarizing. The AI genuinely remembers everything.
2. Prompt Caching (90% Cost Savings)
This feature is a game-changer. Here's how it works:
Traditional AI calls:
Every request = Full context + new prompt
Cost = $3 per million input tokens
With prompt caching:
First request = Full context + new prompt (cached)
Subsequent requests = Cache reference + new prompt
Cost = $0.30 per million cached tokens (10x cheaper!)
For a 50K token campaign context, caching saves us roughly $0.135 per request. Multiply that across thousands of interactions per session, and it's the difference between sustainable pricing and bankruptcy.
3. Tool Use (Function Calling)
Claude can invoke functions to:
- Roll dice and calculate modifiers
- Look up spells and monster stat blocks
- Update character sheet values
- Track initiative and combat state
- Query campaign database
This hybrid approach (AI + deterministic tools) gives us the best of both worlds: creative storytelling with mechanical accuracy.
Architecture Overview
Here's our high-level architecture:
┌─────────────┐
│ Discord │
│ Players │
└──────┬──────┘
│
↓
┌─────────────────────┐
│ Discord Bot │
│ (Python/Discord.py)│
└──────┬──────────────┘
│
↓
┌────────────────────────────────┐
│ Cipher Context Service │
│ - Assembles campaign context │
│ - Manages prompt caching │
│ - Handles tool calls │
└──────┬─────────────────────────┘
│
↓
┌──────────────────┐ ┌──────────────┐
│ Claude API │◄────►│ PostgreSQL │
│ (Sonnet 4.5) │ │ + pgvector │
└──────────────────┘ └──────────────┘
│
↓
┌──────────────────────┐
│ Response Handler │
│ - Formats output │
│ - Updates game state│
│ - Logs interactions │
└──────────────────────┘
Context Building: The Heart of Cipher
The hardest problem isn't calling the API - it's what context to send. Here's our approach:
Layer 1: System Prompt (Cached)
system_prompt = f"""
You are an expert Dungeon Master for D&D 5e.
CAMPAIGN: {campaign.name}
ERA: {campaign.era}
TONE: {campaign.tone}
HOUSE RULES:
{campaign.house_rules}
Your responsibilities:
- Narrate scenes with vivid descriptions
- Voice NPCs with distinct personalities
- Apply D&D 5e rules accurately
- Track combat state and initiative
- Adapt to player choices
- Use tools for mechanical tasks
"""
Caching: This changes rarely (only on campaign settings updates), so it stays cached for days.
Layer 2: World State (Cached)
world_context = f"""
LOCATIONS:
{serialize_locations(campaign.locations)}
NPCS:
{serialize_npcs(campaign.npcs)}
FACTIONS:
{serialize_factions(campaign.factions)}
ACTIVE QUESTS:
{serialize_quests(campaign.quests)}
"""
Caching: Updated between sessions, cached during sessions.
Layer 3: Character Sheets (Partially Cached)
party_context = f"""
PARTY COMPOSITION:
{serialize_characters(session.characters)}
"""
Caching: Character basics cached, current HP/resources updated each turn.
Layer 4: Session History (Cached)
history_context = get_recent_messages(
    session_id=session.id,
    limit=50,                # Last 50 messages
    include_summaries=True,  # Summaries of older sessions
)
Caching: Recent messages cached, only latest player input is new.
Layer 5: Current Turn (Not Cached)
current_input = f"""
CURRENT SITUATION:
{combat_state if in_combat else world_state}
PLAYER ACTION:
{player_message}
"""
Not cached: This is the new content that changes every request.
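Concretely, these layers map onto the Messages API's `cache_control` breakpoints. Here's a minimal sketch of how we assemble them; the `build_request` helper and exact layering are illustrative, and the returned payload is what would be passed to `client.messages.create`:

```python
def build_request(system_prompt, world_context, party_context, history, player_input):
    """Assemble the five context layers into an Anthropic Messages API payload."""
    # Layers 1-3 live in the system array. Each cache_control marks a cache
    # breakpoint, so editing the world state only invalidates layers 2 and 3,
    # not the system prompt above them.
    system = [
        {"type": "text", "text": system_prompt,
         "cache_control": {"type": "ephemeral"}},  # Layer 1: system prompt
        {"type": "text", "text": world_context,
         "cache_control": {"type": "ephemeral"}},  # Layer 2: world state
        {"type": "text", "text": party_context,
         "cache_control": {"type": "ephemeral"}},  # Layer 3: character sheets
    ]
    # Layer 4 (prior turns) plus Layer 5 (the new player action, never cached).
    messages = history + [{"role": "user", "content": player_input}]
    return {"system": system, "messages": messages}
```

The API allows a handful of such breakpoints per request, which is exactly why the layers are ordered from least- to most-frequently changing.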
Total Context Size
Typical breakdown:
- System prompt: 2K tokens (cached)
- World state: 10K tokens (cached)
- Character sheets: 5K tokens (cached)
- Session history: 30K tokens (cached)
- Current turn: 1K tokens (new)
Result: 48K cached tokens + 1K new tokens = 49K total
Cost per request:
- Without caching: $0.147
- With caching: $0.0174
- Savings: 88%
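Those numbers fall straight out of the per-token prices. A quick sketch (cache writes actually bill at a small premium over base input, which this ignores):

```python
PRICE_INPUT_PER_M = 3.00   # $ per million regular input tokens
PRICE_CACHED_PER_M = 0.30  # $ per million cache-read tokens

def request_cost(cached_tokens, new_tokens, cache_hit):
    """Input-token cost of one request, with or without a warm cache."""
    if not cache_hit:
        return (cached_tokens + new_tokens) * PRICE_INPUT_PER_M / 1e6
    return (cached_tokens * PRICE_CACHED_PER_M
            + new_tokens * PRICE_INPUT_PER_M) / 1e6

cold = request_cost(48_000, 1_000, cache_hit=False)  # $0.147
warm = request_cost(48_000, 1_000, cache_hit=True)   # $0.0174
savings = 1 - warm / cold                            # ~88%
```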
Real-Time Responsiveness
Discord users expect fast responses. Our optimization strategy:
1. Streaming Responses
We use Claude's streaming API to start sending responses before completion:
async with claude_client.messages.stream(...) as stream:
    async for text in stream.text_stream:
        await discord_channel.send(text)  # in practice: buffer and edit one message
Players see Cipher "thinking" in real-time, just like reading a DM's narration.
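One wrinkle: Discord rate-limits message edits, so you can't push every token the moment it arrives. The sketch below isolates the throttling idea; the class name and thresholds are illustrative, not our production code:

```python
import time

class StreamBuffer:
    """Accumulate streamed text; flush at most once per `interval` seconds
    so the Discord message can be edited without tripping rate limits."""

    def __init__(self, interval=1.0, min_chars=50):
        self.interval = interval
        self.min_chars = min_chars
        self.pending = ""           # chunks not yet shown to players
        self.text = ""              # everything flushed so far
        self.last_flush = float("-inf")

    def feed(self, chunk, now=None):
        """Add a chunk; return the full text when it's time to edit, else None."""
        now = time.monotonic() if now is None else now
        self.pending += chunk
        if len(self.pending) >= self.min_chars and now - self.last_flush >= self.interval:
            return self.flush(now)
        return None

    def flush(self, now=None):
        self.last_flush = time.monotonic() if now is None else now
        self.text += self.pending
        self.pending = ""
        return self.text
```

Each non-None return corresponds to one `message.edit(content=...)` call, and a final `flush()` after the stream closes delivers any remainder.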
2. Parallel Processing
For multi-part responses (narration + dice rolls + state updates), we parallelize:
async def process_turn(player_action):
    # Run these concurrently
    narration, dice_results, state_updates = await asyncio.gather(
        claude_generate_narration(player_action),
        roll_dice_tools(player_action),
        update_game_state(player_action),
    )
    return combine_responses(narration, dice_results, state_updates)
3. Prefetch Context
When a session starts, we build and cache the context BEFORE players begin:
@bot.event
async def on_session_start(session_id):
    # Warm up the cache
    await build_session_context(session_id)
    # This takes 2-3 seconds, but happens BEFORE gameplay
Result: First player action responds in ~2 seconds instead of ~5 seconds.
Tool Use: Hybrid AI + Deterministic Functions
Claude can call tools, but we carefully designed which operations stay deterministic:
AI Handles:
- Creative narration
- NPC dialogue and reactions
- Plot adaptation
- Rule interpretation (ambiguous cases)
Tools Handle:
- Dice rolling (RNG must be provably fair)
- Stat calculations (must be mathematically correct)
- Database queries (direct DB access faster than AI)
- Combat math (accuracy critical)
Example tool definition:
tools = [
    {
        "name": "roll_dice",
        "description": "Roll dice using standard notation",
        "input_schema": {
            "type": "object",
            "properties": {
                "notation": {"type": "string", "description": "e.g., '2d20+5'"},
                "advantage": {"type": "boolean", "default": False},
                "disadvantage": {"type": "boolean", "default": False},
            },
            "required": ["notation"],
        },
    },
    # ... more tools
]
When Claude says:
"You swing your sword at the goblin. Let me roll your attack..."
Claude invokes:
tool_use = {
    "tool": "roll_dice",
    "input": {
        "notation": "1d20+5",
        "advantage": False,
    },
}
We execute the tool, return results, and Claude continues:
"...you rolled a 17! That hits the goblin's AC."
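The tool itself is plain deterministic code. Here's a minimal sketch of what a `roll_dice` implementation can look like (our production version also logs every roll for the fairness audit trail; the `rng` parameter here just makes the sketch testable):

```python
import random
import re

def roll_dice(notation, advantage=False, disadvantage=False, rng=None):
    """Roll dice in standard notation ('2d20+5'); return (total, rolls)."""
    rng = rng or random.Random()
    m = re.fullmatch(r"(\d*)d(\d+)([+-]\d+)?", notation.replace(" ", ""))
    if not m:
        raise ValueError(f"bad dice notation: {notation!r}")
    count = int(m.group(1) or 1)
    sides = int(m.group(2))
    modifier = int(m.group(3) or 0)
    rolls = [rng.randint(1, sides) for _ in range(count)]
    if advantage or disadvantage:
        # 5e-style: roll the pool twice, keep the better (or worse) set
        alt = [rng.randint(1, sides) for _ in range(count)]
        rolls = (max if advantage else min)(rolls, alt, key=sum)
    return sum(rolls) + modifier, rolls
```

Because the math lives outside the model, a "17 to hit" is always a real 17, never a hallucinated one.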
Cost Optimization Strategies
Running AI for every session requires careful cost management:
1. Prompt Caching (88% savings)
Already covered above.
2. Context Pruning
We summarize old sessions instead of including full transcripts:
if session_age > 5:  # measured in sessions
    # Replace the full transcript with an AI-generated summary
    context = get_session_summary(session_id)
else:
    # Include the full conversation
    context = get_full_history(session_id)
3. Batching Operations
Instead of calling Claude for every dice roll narration, we batch:
# Bad: 5 API calls
for attack in attacks:
    narrate(attack)

# Good: 1 API call
narrate_all(attacks)
4. Smart Context Invalidation
We only regenerate context when necessary:
@cache(ttl=300)  # Cache for 5 minutes (our decorator; functools.cache has no TTL)
async def get_campaign_context(campaign_id):
    # Expensive DB queries
    return build_context(campaign_id)
During combat (many rapid turns), context stays cached.
5. Usage Tracking
We track AI usage down to the token level and expose it to users:
async def log_ai_usage(session_id, tokens_used, cost):
    usage = AIUsage(
        session_id=session_id,
        tokens_input=tokens_used['input'],
        tokens_output=tokens_used['output'],
        tokens_cached=tokens_used['cached'],
        cost_usd=cost,
        timestamp=datetime.now(timezone.utc),  # utcnow() is deprecated
    )
    await db.save(usage)
Users can see exactly how many "AI hours" they've used and when.
Challenges We Solved
Challenge #1: Context Ordering
Order matters! We learned (through trial and error) that:
# Bad: AI focuses too much on recent history
[system_prompt, history, world_state, current_input]
# Good: AI balances all context
[system_prompt, world_state, history, current_input]
Putting world state before history helps Claude remember NPC names and locations.
Challenge #2: Combat Latency
Combat requires many rapid-fire calls. Solution: pre-generate common responses:
# Pre-cache common combat narrations
combat_templates = {
"hit_narration": [...],
"miss_narration": [...],
"critical_hit": [...]
}
For repetitive actions (basic attack), we template instead of regenerating.
Challenge #3: Hallucinated Rules
Claude occasionally invents D&D rules. Solution: grounding with tool use:
async def verify_rule(rule_claim):
    # Check the claim against the SRD database
    srd_rule = await db.query_srd(rule_claim)
    if not srd_rule:
        # No SRD match: Claude hallucinated - ask it to retract the claim
        await claude_send_correction(rule_claim)
Challenge #4: Character Voice Consistency
NPCs should sound the same across sessions. Solution: detailed NPC profiles:
npc_context = f"""
NPC: Mayor Elara Brightwood
Voice: Warm, maternal, speaks in complete sentences
Quirks: Ends statements with "wouldn't you say?"
Mood: Currently worried about bandit attacks
History with party: Grateful for rescue of village
"""
What's Next?
We're constantly improving the AI integration:
Q1 2025
- Voice channel integration (speech-to-text + text-to-speech)
- Multi-lingual support (non-English campaigns)
- Image generation (character portraits, maps)
Q2 2025
- Emotion analysis (detect player frustration/excitement)
- Dynamic difficulty (AI adjusts encounter CR based on engagement)
- Cross-campaign learning (AI learns DM preferences over time)
Try It Yourself
Want to see Claude in action? Start a free campaign and run a test session. The first 3 AI hours are free.
Interested in the technical details? Join our Discord where we discuss architecture, AI techniques, and infrastructure.
Technical FAQs
Q: What's your average Claude API latency? A: ~2 seconds for streaming first token, ~5-8 seconds for full response (depending on context size).
Q: How do you handle Claude API outages? A: We have AWS Bedrock as a fallback (though it lacks prompt caching and tool use).
Q: What's your average cost per 2-hour session? A: ~$0.20-0.40 depending on combat frequency (we charge $2.50-3.00, so ~85-90% margins).
Q: Do you fine-tune Claude? A: No, we use prompt engineering + retrieval augmented generation (RAG) for customization.
Q: How do you prevent prompt injection? A: We sanitize user input, use strict tool schemas, and have content moderation filters.
Q: Is the code open source? A: Selected components are open source on GitHub. Core IP (context building, caching strategy) is proprietary.
About the Author
Cipher Team
Part of the Cipher team building AI-powered tools for D&D campaign management.