How We Built an AI Dungeon Master with Claude: Architecture Deep-Dive
How Scrollbook uses Claude's 200K context window, prompt caching (90% cost savings), and tool use to run live D&D sessions inside Discord. Full technical breakdown.
Building an AI Dungeon Master isn't just about hooking up an LLM to a Discord bot. It requires careful engineering around context management, prompt caching, real-time responsiveness, and cost optimization. Here's how we did it.
Why Claude Sonnet 4.5?
When we started Cipher, we evaluated multiple AI providers:
- OpenAI GPT-4: Excellent, but expensive for long-context campaigns
- Google Gemini: Good context window, but less consistent personalities
- AWS Bedrock: Convenient for deployment, but limited features
- Anthropic Claude: 200K context, prompt caching, and tool use
We chose Claude Sonnet 4.5 for three critical features:
1. 200K Token Context Window
D&D campaigns are long. A single session can generate 10K-20K tokens. Over 10 sessions, that's 100K-200K tokens of context. Claude's massive context window means we can include:
- Entire campaign history
- All NPC interactions and personalities
- Character backstories and progression
- World state and lore
- House rules and homebrew content
Without truncating or summarizing. The AI genuinely remembers everything.
2. Prompt Caching (90% Cost Savings)
This feature is a game-changer. Here's how it works:
Traditional AI calls:
Every request = Full context + new prompt
Cost = $3 per million input tokens
With prompt caching:
First request = Full context + new prompt (cached)
Subsequent requests = Cache reference + new prompt
Cost = $0.30 per million cached tokens (10x cheaper!)
For a 50K token campaign context, caching saves us roughly $0.135 per request. Multiply that across thousands of interactions per session, and it's the difference between sustainable pricing and bankruptcy.
3. Tool Use (Function Calling)
Claude can invoke functions to:
- Roll dice and calculate modifiers
- Look up spells and monster stat blocks
- Update character sheet values
- Track initiative and combat state
- Query campaign database
This hybrid approach (AI + deterministic tools) gives us the best of both worlds: creative storytelling with mechanical accuracy.
Architecture Overview
Here's our high-level architecture:
┌─────────────┐
│ Discord │
│ Players │
└──────┬──────┘
│
↓
┌─────────────────────┐
│ Discord Bot │
│ (Python/Discord.py)│
└──────┬──────────────┘
│
↓
┌────────────────────────────────┐
│ Cipher Context Service │
│ - Assembles campaign context │
│ - Manages prompt caching │
│ - Handles tool calls │
└──────┬─────────────────────────┘
│
↓
┌──────────────────┐ ┌──────────────┐
│ Claude API │◄────►│ PostgreSQL │
│ (Sonnet 4.5) │ │ + pgvector │
└──────────────────┘ └──────────────┘
│
↓
┌──────────────────────┐
│ Response Handler │
│ - Formats output │
│ - Updates game state│
│ - Logs interactions │
└──────────────────────┘
Context Building: The Heart of Cipher
The hardest problem isn't calling the API - it's what context to send. Here's our approach:
Layer 1: System Prompt (Cached)
system_prompt = f"""
You are an expert Dungeon Master for D&D 5e.
CAMPAIGN: {campaign.name}
ERA: {campaign.era}
TONE: {campaign.tone}
HOUSE RULES:
{campaign.house_rules}
Your responsibilities:
- Narrate scenes with vivid descriptions
- Voice NPCs with distinct personalities
- Apply D&D 5e rules accurately
- Track combat state and initiative
- Adapt to player choices
- Use tools for mechanical tasks
"""
Caching: This changes rarely (only on campaign settings updates), so it stays cached for days.
Layer 2: World State (Cached)
world_context = f"""
LOCATIONS:
{serialize_locations(campaign.locations)}
NPCS:
{serialize_npcs(campaign.npcs)}
FACTIONS:
{serialize_factions(campaign.factions)}
ACTIVE QUESTS:
{serialize_quests(campaign.quests)}
"""
Caching: Updated between sessions, cached during sessions.
Layer 3: Character Sheets (Partially Cached)
party_context = f"""
PARTY COMPOSITION:
{serialize_characters(session.characters)}
"""
Caching: Character basics cached, current HP/resources updated each turn.
Layer 4: Session History (Cached)
history_context = get_recent_messages(
    session_id=session.id,
    limit=50,                # Last 50 messages
    include_summaries=True,  # Summaries of older sessions
)
Caching: Recent messages cached, only latest player input is new.
Layer 5: Current Turn (Not Cached)
current_input = f"""
CURRENT SITUATION:
{combat_state if in_combat else world_state}
PLAYER ACTION:
{player_message}
"""
Not cached: This is the new content that changes every request.
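Concretely, these layers map onto the Messages API's `cache_control` breakpoints. Here's a minimal sketch of how we assemble them; the `build_request` helper and exact layering are illustrative, and the returned payload is what would be passed to `client.messages.create`:

```python
def build_request(system_prompt, world_context, party_context, history, player_input):
    """Assemble the five context layers into an Anthropic Messages API payload."""
    # Layers 1-3 live in the system array. Each cache_control marks a cache
    # breakpoint, so editing the world state only invalidates layers 2 and 3,
    # not the system prompt above them.
    system = [
        {"type": "text", "text": system_prompt,
         "cache_control": {"type": "ephemeral"}},  # Layer 1: system prompt
        {"type": "text", "text": world_context,
         "cache_control": {"type": "ephemeral"}},  # Layer 2: world state
        {"type": "text", "text": party_context,
         "cache_control": {"type": "ephemeral"}},  # Layer 3: character sheets
    ]
    # Layer 4 (prior turns) plus Layer 5 (the new player action, never cached).
    messages = history + [{"role": "user", "content": player_input}]
    return {"system": system, "messages": messages}
```

The API allows a handful of such breakpoints per request, which is exactly why the layers are ordered from least- to most-frequently changing.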
Total Context Size
Typical breakdown:
- System prompt: 2K tokens (cached)
- World state: 10K tokens (cached)
- Character sheets: 5K tokens (cached)
- Session history: 30K tokens (cached)
- Current turn: 1K tokens (new)
Result: 48K cached tokens + 1K new tokens = 49K total
Cost per request:
- Without caching: $0.147
- With caching: $0.0174
- Savings: 88%
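Those numbers fall straight out of the per-token prices. A quick sketch (cache writes actually bill at a small premium over base input, which this ignores):

```python
PRICE_INPUT_PER_M = 3.00   # $ per million regular input tokens
PRICE_CACHED_PER_M = 0.30  # $ per million cache-read tokens

def request_cost(cached_tokens, new_tokens, cache_hit):
    """Input-token cost of one request, with or without a warm cache."""
    if not cache_hit:
        return (cached_tokens + new_tokens) * PRICE_INPUT_PER_M / 1e6
    return (cached_tokens * PRICE_CACHED_PER_M
            + new_tokens * PRICE_INPUT_PER_M) / 1e6

cold = request_cost(48_000, 1_000, cache_hit=False)  # $0.147
warm = request_cost(48_000, 1_000, cache_hit=True)   # $0.0174
savings = 1 - warm / cold                            # ~88%
```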
Real-Time Responsiveness
Discord users expect fast responses. Our optimization strategy:
1. Streaming Responses
We use Claude's streaming API to start sending responses before completion:
async with claude_client.messages.stream(...) as stream:
    async for text in stream.text_stream:
        await discord_channel.send(text)  # in practice: buffer and edit one message
Players see Cipher "thinking" in real-time, just like reading a DM's narration.
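One wrinkle: Discord rate-limits message edits, so you can't push every token the moment it arrives. The sketch below isolates the throttling idea; the class name and thresholds are illustrative, not our production code:

```python
import time

class StreamBuffer:
    """Accumulate streamed text; flush at most once per `interval` seconds
    so the Discord message can be edited without tripping rate limits."""

    def __init__(self, interval=1.0, min_chars=50):
        self.interval = interval
        self.min_chars = min_chars
        self.pending = ""           # chunks not yet shown to players
        self.text = ""              # everything flushed so far
        self.last_flush = float("-inf")

    def feed(self, chunk, now=None):
        """Add a chunk; return the full text when it's time to edit, else None."""
        now = time.monotonic() if now is None else now
        self.pending += chunk
        if len(self.pending) >= self.min_chars and now - self.last_flush >= self.interval:
            return self.flush(now)
        return None

    def flush(self, now=None):
        self.last_flush = time.monotonic() if now is None else now
        self.text += self.pending
        self.pending = ""
        return self.text
```

Each non-None return corresponds to one `message.edit(content=...)` call, and a final `flush()` after the stream closes delivers any remainder.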
2. Parallel Processing
For multi-part responses (narration + dice rolls + state updates), we parallelize:
async def process_turn(player_action):
    # Run these concurrently
    narration, dice_results, state_updates = await asyncio.gather(
        claude_generate_narration(player_action),
        roll_dice_tools(player_action),
        update_game_state(player_action),
    )
    return combine_responses(narration, dice_results, state_updates)
3. Prefetch Context
When a session starts, we build and cache the context BEFORE players begin:
@bot.event
async def on_session_start(session_id):
    # Warm up the cache
    await build_session_context(session_id)
    # This takes 2-3 seconds, but happens BEFORE gameplay
Result: First player action responds in ~2 seconds instead of ~5 seconds.
Tool Use: Hybrid AI + Deterministic Functions
Claude can call tools, but we carefully designed which operations stay deterministic:
AI Handles:
- Creative narration
- NPC dialogue and reactions
- Plot adaptation
- Rule interpretation (ambiguous cases)
Tools Handle:
- Dice rolling (RNG must be provably fair)
- Stat calculations (must be mathematically correct)
- Database queries (direct DB access faster than AI)
- Combat math (accuracy critical)
Example tool definition:
tools = [
    {
        "name": "roll_dice",
        "description": "Roll dice using standard notation",
        "input_schema": {
            "type": "object",
            "properties": {
                "notation": {"type": "string", "description": "e.g., '2d20+5'"},
                "advantage": {"type": "boolean", "default": False},
                "disadvantage": {"type": "boolean", "default": False},
            },
            "required": ["notation"],
        },
    },
    # ... more tools
]
When Claude says:
"You swing your sword at the goblin. Let me roll your attack..."
Claude invokes:
tool_use = {
    "tool": "roll_dice",
    "input": {
        "notation": "1d20+5",
        "advantage": False,
    },
}
We execute the tool, return results, and Claude continues:
"...you rolled a 17! That hits the goblin's AC."
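The tool itself is plain deterministic code. Here's a minimal sketch of what a `roll_dice` implementation can look like (our production version also logs every roll for the fairness audit trail; the `rng` parameter here just makes the sketch testable):

```python
import random
import re

def roll_dice(notation, advantage=False, disadvantage=False, rng=None):
    """Roll dice in standard notation ('2d20+5'); return (total, rolls)."""
    rng = rng or random.Random()
    m = re.fullmatch(r"(\d*)d(\d+)([+-]\d+)?", notation.replace(" ", ""))
    if not m:
        raise ValueError(f"bad dice notation: {notation!r}")
    count = int(m.group(1) or 1)
    sides = int(m.group(2))
    modifier = int(m.group(3) or 0)
    rolls = [rng.randint(1, sides) for _ in range(count)]
    if advantage or disadvantage:
        # 5e-style: roll the pool twice, keep the better (or worse) set
        alt = [rng.randint(1, sides) for _ in range(count)]
        rolls = (max if advantage else min)(rolls, alt, key=sum)
    return sum(rolls) + modifier, rolls
```

Because the math lives outside the model, a "17 to hit" is always a real 17, never a hallucinated one.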
Cost Optimization Strategies
Running AI for every session requires careful cost management:
1. Prompt Caching (88% savings)
Already covered above.
2. Context Pruning
We summarize old sessions instead of including full transcripts:
if session_age > 5:  # measured in sessions
    # Replace the full transcript with an AI-generated summary
    context = get_session_summary(session_id)
else:
    # Include the full conversation
    context = get_full_history(session_id)
3. Batching Operations
Instead of calling Claude for every dice roll narration, we batch:
# Bad: 5 API calls
for attack in attacks:
    narrate(attack)

# Good: 1 API call
narrate_all(attacks)
4. Smart Context Invalidation
We only regenerate context when necessary:
@cache(ttl=300)  # Cache for 5 minutes (our decorator; functools.cache has no TTL)
async def get_campaign_context(campaign_id):
    # Expensive DB queries
    return build_context(campaign_id)
During combat (many rapid turns), context stays cached.
5. Usage Tracking
We track AI usage down to the token level and expose it to users:
async def log_ai_usage(session_id, tokens_used, cost):
    usage = AIUsage(
        session_id=session_id,
        tokens_input=tokens_used['input'],
        tokens_output=tokens_used['output'],
        tokens_cached=tokens_used['cached'],
        cost_usd=cost,
        timestamp=datetime.now(timezone.utc),  # utcnow() is deprecated
    )
    await db.save(usage)
Users can see exactly how many "AI hours" they've used and when.
Challenges We Solved
Challenge #1: Context Ordering
Order matters! We learned (through trial and error) that:
# Bad: AI focuses too much on recent history
[system_prompt, history, world_state, current_input]
# Good: AI balances all context
[system_prompt, world_state, history, current_input]
Putting world state before history helps Claude remember NPC names and locations.
Challenge #2: Combat Latency
Combat requires many rapid-fire calls. Solution: pre-generate common responses:
# Pre-cache common combat narrations
combat_templates = {
"hit_narration": [...],
"miss_narration": [...],
"critical_hit": [...]
}
For repetitive actions (basic attack), we template instead of regenerating.
Challenge #3: Hallucinated Rules
Claude occasionally invents D&D rules. Solution: grounding with tool use:
async def verify_rule(rule_claim):
    # Check the claim against the SRD database
    srd_rule = await db.query_srd(rule_claim)
    if not srd_rule:
        # No SRD match: Claude hallucinated - ask it to retract the claim
        await claude_send_correction(rule_claim)
Challenge #4: Character Voice Consistency
NPCs should sound the same across sessions. Solution: detailed NPC profiles:
npc_context = f"""
NPC: Mayor Elara Brightwood
Voice: Warm, maternal, speaks in complete sentences
Quirks: Ends statements with "wouldn't you say?"
Mood: Currently worried about bandit attacks
History with party: Grateful for rescue of village
"""
What's Next?
We're constantly improving the AI integration:
Q1 2025
- Voice channel integration (speech-to-text + text-to-speech)
- Multi-lingual support (non-English campaigns)
- Image generation (character portraits, maps)
Q2 2025
- Emotion analysis (detect player frustration/excitement)
- Dynamic difficulty (AI adjusts encounter CR based on engagement)
- Cross-campaign learning (AI learns DM preferences over time)
Try It Yourself
Want to see Claude in action? Start a free campaign and run a test session. The first 3 AI hours are free.
Interested in the technical details? Join our Discord where we discuss architecture, AI techniques, and infrastructure.
Technical FAQs
Q: What's your average Claude API latency? A: ~2 seconds for streaming first token, ~5-8 seconds for full response (depending on context size).
Q: How do you handle Claude API outages? A: We have AWS Bedrock as a fallback (though it lacks prompt caching and tool use).
Q: What's your average cost per 2-hour session? A: ~$0.20-0.40 depending on combat frequency (we charge $2.50-3.00, so ~85-90% margins).
Q: Do you fine-tune Claude? A: No, we use prompt engineering + retrieval augmented generation (RAG) for customization.
Q: How do you prevent prompt injection? A: We sanitize user input, use strict tool schemas, and have content moderation filters.
Q: Is the code open source? A: Selected components are open source on GitHub. Core IP (context building, caching strategy) is proprietary.
About the Author
Cipher Team
Part of the Cipher team building AI-powered tools for D&D campaign management.