Architecture Design Prompt Templates

AI prompt templates for software architecture. Design scalable, maintainable system architectures.

Overview

Good architecture makes systems easier to build, deploy, and maintain; bad architecture creates compounding problems. These prompts help you design systems that handle current requirements while remaining adaptable to change. The key is understanding tradeoffs: every architectural decision has costs and benefits.

Best Practices

1. Start with requirements and constraints before jumping to solutions; context drives design.
2. Specify scale expectations (users, requests, data volume) so the AI can suggest appropriate patterns.
3. Describe your team's expertise; an architecture your team can't maintain is wrong regardless of elegance.
4. Include operational concerns (deployment, monitoring, on-call burden), not just the happy path.
5. Ask about tradeoffs, not just recommendations, so you understand what you're giving up.

Prompt Templates

1. System Architecture Design

Design a system architecture for this application:

**Application:** [WHAT YOU'RE BUILDING]
**Core features:**
- [FEATURE 1]
- [FEATURE 2]
- [FEATURE 3]

**Scale requirements:**
- Users: [EXPECTED USERS - NOW AND IN 2 YEARS]
- Traffic: [REQUESTS PER SECOND / CONCURRENT USERS]
- Data: [EXPECTED DATA VOLUME]

**Constraints:**
- Team size: [NUMBER OF DEVELOPERS]
- Budget: [INFRASTRUCTURE BUDGET IF RELEVANT]
- Timeline: [WHEN IT NEEDS TO LAUNCH]
- Compliance: [ANY REGULATORY REQUIREMENTS]

**Existing infrastructure:** [WHAT YOU ALREADY HAVE]

**Provide:**
1. High-level architecture diagram (describe it)
2. Component breakdown with responsibilities
3. Data flow for key operations
4. Technology recommendations with justification
5. Scalability path (what to add as you grow)
6. Key tradeoffs made
Application: Real-time collaborative document editor (like Google Docs)
Features: Multiple cursors, live typing, comments, document sharing, offline support
Scale: 10K users now, 1M in 2 years, 50 concurrent editors per document
Constraints: Team of 5, $5K/month infrastructure, 6 months to MVP
Existing: AWS account, PostgreSQL experience
## System Architecture: Collaborative Document Editor

### High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                             Clients                             │
│            (React SPA + Service Worker for offline)             │
└────────────────┬───────────────────────────────────┬────────────┘
                 │                                   │
                 │ HTTP/REST                         │ WebSocket
                 ▼                                   ▼
┌────────────────────────┐           ┌────────────────────────────┐
│      API Gateway       │           │     WebSocket Gateway      │
│   (AWS ALB + Lambda)   │           │    (AWS API Gateway WS)    │
└───────────┬────────────┘           └────────────┬───────────────┘
            │                                     │
            ▼                                     ▼
┌────────────────────────┐           ┌────────────────────────────┐
│    REST API Service    │           │   Collaboration Service    │
│    (Node.js on ECS)    │           │    (Node.js + Yjs CRDT)    │
│ - Auth, user mgmt      │           │ - Real-time sync           │
│ - Document CRUD        │           │ - Presence (cursors)       │
│ - Sharing/permissions  │           │ - Conflict resolution      │
└───────────┬────────────┘           └────────────┬───────────────┘
            │                                     │
            ▼                                     ▼
┌────────────────────────────────────────────────────────────────┐
│                           Data Layer                           │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────────┐    │
│  │  PostgreSQL  │  │    Redis     │  │   S3 (documents)   │    │
│  │    (RDS)     │  │ (ElastiCache)│  │                    │    │
│  │ - Users      │  │ - Sessions   │  │ - Doc snapshots    │    │
│  │ - Metadata   │  │ - Presence   │  │ - Attachments      │    │
│  │ - Sharing    │  │ - Pub/sub    │  │                    │    │
│  └──────────────┘  └──────────────┘  └────────────────────┘    │
└────────────────────────────────────────────────────────────────┘
```

### Component Responsibilities

| Component | Responsibility | Technology |
|-----------|---------------|------------|
| Web Client | UI, offline mode, local CRDT state | React + Yjs + IndexedDB |
| API Gateway | Auth, rate limiting, routing | AWS ALB |
| REST API | CRUD, auth, permissions | Node.js/Express on ECS |
| Collaboration Service | Real-time sync, presence | Node.js + Yjs + ws |
| PostgreSQL | Users, permissions, doc metadata | AWS RDS |
| Redis | Sessions, pub/sub for horizontal scale | AWS ElastiCache |
| S3 | Document storage, snapshots | AWS S3 |

### Data Flow: Collaborative Editing

```
1. User A types "Hello"
2. Yjs (client) creates CRDT operation
3. Operation sent via WebSocket to Collaboration Service
4. Service broadcasts to all connected clients on document
5. Service periodically snapshots to S3 (every 30s or 100 ops)
6. User B's Yjs merges operation (conflict-free)
7. Both users see same result
```
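
Step 5 of the flow above ("every 30s or 100 ops") can be sketched as a small policy object. This is an illustrative sketch only: `SnapshotPolicy` and its field names are hypothetical, and skipping the snapshot when there are no new operations is an added assumption, not something the flow specifies.

```python
from dataclasses import dataclass


@dataclass
class SnapshotPolicy:
    """Hypothetical helper: decide when to persist a document snapshot."""

    max_ops: int = 100          # snapshot after this many operations...
    max_seconds: float = 30.0   # ...or this many seconds, whichever comes first

    def should_snapshot(self, ops_since_last: int, seconds_since_last: float) -> bool:
        # Assumption: with no new operations there is nothing to persist,
        # so skip the write even if the timer has elapsed.
        if ops_since_last == 0:
            return False
        return (
            ops_since_last >= self.max_ops
            or seconds_since_last >= self.max_seconds
        )
```

The collaboration service would call `should_snapshot` after each applied operation (or on a timer) and write to S3 when it returns `True`.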

### Key Technology Choices

**Why Yjs (CRDT) over OT:**
- Simpler to implement correctly
- Works offline (operations queue locally)
- No central authority needed
- Team of 5 can ship in 6 months
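
To make the "conflict-free" and "no central authority" points concrete, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter. It is not the algorithm Yjs uses for text (that is considerably more involved); `GCounter` is a hypothetical name chosen for illustration. It shows the property that matters: replicas can merge in any order and still converge.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merged with max()."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> highest count seen from that replica

    def increment(self, amount=1):
        # A replica only ever advances its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so merge order and repeated merges do not change the result.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    @property
    def value(self):
        return sum(self.counts.values())
```

Two replicas that each increment offline and later exchange state end up with the same `value`, without a server deciding whose edit wins.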

**Why WebSocket over polling:**
- Sub-100ms latency for real-time feel
- Efficient for frequent small updates
- API Gateway WebSocket scales to millions

### Scalability Path

**Now (10K users):**
- Single collaboration service instance
- Single PostgreSQL instance
- Cost: ~$500/month

**Growth (100K users):**
- Add Redis pub/sub for multi-instance collaboration service
- PostgreSQL read replicas
- Cost: ~$2K/month

**Scale (1M users):**
- Collaboration service horizontally scaled (sticky sessions per document)
- PostgreSQL sharding by organization
- CDN for document assets
- Cost: ~$10K/month
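
"Sticky sessions per document" means every editor of a given document lands on the same collaboration instance. One way to sketch that, assuming a fixed list of instance names (the names and the `route_document` function are hypothetical), is to hash the document ID:

```python
import hashlib
from typing import Sequence


def route_document(doc_id: str, instances: Sequence[str]) -> str:
    """Map a document ID to one instance, deterministically."""
    if not instances:
        raise ValueError("at least one instance is required")
    # Use a stable hash; Python's built-in hash() is randomized per process.
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]
```

Plain modulo hashing remaps most documents whenever the instance count changes; a consistent-hashing ring would reduce that churn and may be worth considering at the 1M-user stage.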

### Key Tradeoffs

| Decision | Benefit | Cost |
|----------|---------|------|
| CRDT over OT | Simpler, offline works | Slightly larger payloads |
| Managed services | Less ops burden | Higher $ than self-hosted |
| S3 snapshots | Cheap storage | Not real-time persistent |
| Single region | Simple, low latency | No geo-redundancy |

When to use: starting new projects, major feature additions, or re-architecting existing systems

  • Design for your current scale, not the scale you hope for; premature optimization creates complexity
  • Include the scalability PATH, not just the end state, so you know how you'll evolve

2. Microservices Design

Help me decompose this system into services:

**Current state:**
- [MONOLITH / EXISTING SERVICES / GREENFIELD]

**Domain areas:**
- [DOMAIN 1 - e.g. User management]
- [DOMAIN 2 - e.g. Orders]
- [DOMAIN 3 - e.g. Payments]

**Team structure:**
- [HOW TEAMS ARE ORGANIZED]
- [WHO OWNS WHAT]

**Dependencies:**
- [WHAT DEPENDS ON WHAT]

**Pain points (if existing system):**
- [CURRENT PROBLEMS TO SOLVE]

**Provide:**
1. Service boundaries with clear responsibilities
2. Data ownership (which service owns which data)
3. Communication patterns (sync vs async)
4. Shared concerns (auth, logging)
5. Migration strategy (if from monolith)
Current: Django monolith with 200K lines, 50 models, 6 developers
Domains: Users/Auth, Products/Inventory, Orders, Payments, Shipping, Notifications
Teams: 2 squads, want to work independently
Pain points: Deployments affect everything, scaling payments differently from catalog, different release cadences needed
## Microservices Decomposition

### Service Boundaries

```
┌─────────────────────────────────────────────────────────────────┐
│                        API Gateway / BFF                        │
│            (Auth validation, routing, rate limiting)            │
└───────┬─────────────┬─────────────┬─────────────┬───────────────┘
        │             │             │             │
        ▼             ▼             ▼             ▼
   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
   │  User   │   │ Catalog │   │  Order  │   │ Payment │
   │ Service │   │ Service │   │ Service │   │ Service │
   └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘
        │             │             │             │
        │             │             │             │
   ┌────┴────┐   ┌────┴────┐   ┌────┴────┐   ┌────┴────┐
   │ User DB │   │ Catalog │   │Order DB │   │ Payment │
   │         │   │   DB    │   │         │   │   DB    │
   └─────────┘   └─────────┘   └─────────┘   └─────────┘

        └──────────────────┬──────────────────────┘
                           │
                    ┌──────┴──────┐
                    │ Message Bus │
                    │  (Events)   │
                    └─────────────┘
                           │
              ┌────────────┼────────────┐
              ▼            ▼            ▼
         ┌─────────┐  ┌─────────┐  ┌─────────┐
         │Shipping │  │Notific- │  │Analytics│
         │ Service │  │ ation   │  │ Service │
         └─────────┘  └─────────┘  └─────────┘
```

### Service Responsibilities & Data Ownership

| Service | Owns | Responsibilities |
|---------|------|------------------|
| User Service | users, profiles, preferences | Registration, auth tokens, profile CRUD |
| Catalog Service | products, categories, inventory | Product info, stock levels, search |
| Order Service | orders, line items, order history | Cart, checkout, order status |
| Payment Service | payment methods, transactions | Payment processing, refunds, PCI compliance |
| Shipping Service | shipments, tracking | Carrier integration, tracking updates |
| Notification Service | notification preferences | Email, SMS, push notifications |

### Communication Patterns

**Synchronous (REST/gRPC):**
- Order → User: Validate user exists, check permissions
- Order → Catalog: Check inventory, get prices
- Order → Payment: Process payment

**Asynchronous (Events):**
```
OrderCreated → Notification (send confirmation)
OrderCreated → Analytics (track conversion)
PaymentCompleted → Order (update status)
PaymentCompleted → Shipping (initiate shipment)
ShipmentUpdated → Notification (tracking update)
InventoryLow → Notification (alert to team)
```

**Why async for these:**
- Order doesn't need to WAIT for email to send
- Payment completion triggers multiple actions
- Decouples services: Notification can be down without breaking checkout
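
The decoupling argument can be illustrated with a tiny in-process publish/subscribe sketch. It is only a stand-in for a real message broker (it runs handlers synchronously in one process), and `EventBus` and its methods are hypothetical names. The point it shows is that a failing subscriber does not stop the others or the publisher.

```python
from collections import defaultdict


class EventBus:
    """Toy in-process event bus; a real system would use a message broker."""

    def __init__(self):
        self._handlers = defaultdict(list)  # event type -> list of handlers
        self.failures = []                  # (event type, handler name, exception)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._handlers[event_type]:
            try:
                handler(payload)
            except Exception as exc:
                # Record the failure and keep going: one broken consumer
                # must not block the publisher or the other consumers.
                self.failures.append((event_type, handler.__name__, exc))
```

With a real broker the failed delivery would typically be retried or routed to a dead-letter queue rather than just recorded.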

### Shared Concerns

```
┌─────────────────────────────────────────────────┐
│                 Shared Platform                 │
│                                                 │
│  ┌────────────┐ ┌──────────┐ ┌───────────────┐  │
│  │Auth Library│ │ Logging  │ │    Tracing    │  │
│  │(JWT valid.)│ │(ELK/Loki)│ │(Jaeger/Zipkin)│  │
│  └────────────┘ └──────────┘ └───────────────┘  │
│                                                 │
│  ┌────────────┐ ┌──────────┐                    │
│  │Config Mgmt │ │ Service  │                    │
│  │  (Consul)  │ │Mesh(opt.)│                    │
│  └────────────┘ └──────────┘                    │
└─────────────────────────────────────────────────┘
```

### Migration Strategy (Strangler Fig)

**Phase 1: Extract Payments (highest value)**
1. Create Payment Service with same DB connection
2. Monolith calls Payment Service instead of internal code
3. Once stable, migrate payment tables to Payment DB
4. Duration: 6-8 weeks

**Phase 2: Extract Notifications**
1. Event-driven, low coupling
2. Monolith emits events, new service consumes
3. Duration: 4 weeks

**Phase 3: Extract Catalog**
1. Read-heavy, can improve independently
2. Add caching layer during extraction
3. Duration: 6 weeks

**Phase 4: Extract Order + User**
1. These are coupled; extract them together, or User first
2. Most complex, highest risk
3. Duration: 8-10 weeks

**Total migration: 6-9 months** with parallel feature work
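
Step 2 of Phase 1 ("Monolith calls Payment Service instead of internal code") is usually done behind a thin facade so traffic can be shifted gradually. The sketch below is one possible shape, not a prescribed implementation: `PaymentFacade`, `rollout_percent`, and the bucketing by order ID are all illustrative assumptions.

```python
class PaymentFacade:
    """Route payment calls to the legacy code path or the new service."""

    def __init__(self, legacy, service, rollout_percent=0):
        if not 0 <= rollout_percent <= 100:
            raise ValueError("rollout_percent must be between 0 and 100")
        self._legacy = legacy        # callable wrapping the monolith's code
        self._service = service      # callable wrapping the new Payment Service
        self.rollout_percent = rollout_percent

    def charge(self, order_id, amount_cents):
        # Bucket by order ID so a given order always takes the same path,
        # which keeps retries and refunds consistent during the rollout.
        use_new = (order_id % 100) < self.rollout_percent
        impl = self._service if use_new else self._legacy
        return impl(order_id, amount_cents)
```

Raising `rollout_percent` from 0 to 100 over the phase moves traffic to the new service; setting it back to 0 is the rollback.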

### Team Alignment

| Squad | Owns | Services |
|-------|------|----------|
| Squad A (Growth) | User acquisition, conversion | User, Catalog, Analytics |
| Squad B (Commerce) | Revenue, fulfillment | Order, Payment, Shipping, Notification |

When to use: when monoliths become painful, when teams need to deploy independently, or when different components have different scaling needs

  • Align service boundaries with team boundaries; Conway's Law is real
  • Start by extracting the service with the clearest boundaries and highest value

3. Technology Selection

Help me choose technologies for this project:

**Project:** [WHAT YOU'RE BUILDING]

**Requirements:**
- [FUNCTIONAL REQ 1]
- [FUNCTIONAL REQ 2]
- [NON-FUNCTIONAL REQ - performance, scale, etc.]

**Constraints:**
- Team expertise: [WHAT LANGUAGES/FRAMEWORKS TEAM KNOWS]
- Timeline: [WHEN IT NEEDS TO SHIP]
- Hiring plans: [WILL YOU HIRE? FOR WHAT SKILLS?]
- Budget: [INFRASTRUCTURE / LICENSING CONSTRAINTS]

**Decisions needed:**
- [AREA 1 - e.g. Backend language/framework]
- [AREA 2 - e.g. Database]
- [AREA 3 - e.g. Infrastructure/hosting]

**Provide:**
1. Recommendation for each area with alternatives
2. Pros/cons comparison
3. Risks and mitigations
4. Long-term considerations
Project: B2B SaaS analytics dashboard
Requirements: Real-time data updates, complex visualizations, multi-tenant, SOC2 compliance
Constraints: Team knows Python and React, 4 months to MVP, planning to hire 2 more engineers
Decisions needed: Backend framework, database, real-time updates approach, hosting
## Technology Recommendations

### Backend Framework: FastAPI (Python)

**Recommendation:** FastAPI

| Option | Pros | Cons |
|--------|------|------|
| **FastAPI** ⭐ | Team knows Python, async support, auto OpenAPI docs, type hints | Smaller ecosystem than Django |
| Django | Batteries included, mature | Async story weaker, heavier |
| Node.js | Great async, large ecosystem | Team would need to learn |

**Why FastAPI:**
- Team's Python expertise = faster MVP
- Native async for real-time features
- Type hints catch bugs early, help with hiring
- Pydantic for data validation (SOC2-friendly audit trails)

**Risk:** Smaller community than Django
**Mitigation:** Most issues are solvable; the core framework is solid

---

### Database: PostgreSQL + TimescaleDB

**Primary:** PostgreSQL (RDS)
**Time-series extension:** TimescaleDB

| Option | Pros | Cons |
|--------|------|------|
| **PostgreSQL + TimescaleDB** ⭐ | One database, SQL familiarity, hypertables for analytics | Need to learn TimescaleDB specifics |
| PostgreSQL + ClickHouse | ClickHouse is fastest for analytics | Two systems to manage, different query languages |
| MongoDB | Flexible schema | Weaker for complex analytics queries |

**Why this choice:**
- Single database reduces ops complexity
- PostgreSQL handles multi-tenant well (row-level security)
- TimescaleDB: automatic partitioning, continuous aggregates for dashboards
- Team likely knows SQL already

**Schema approach:**
```sql
-- Multi-tenant with schema per tenant (SOC2 data isolation)
CREATE SCHEMA tenant_acme;
CREATE TABLE tenant_acme.events (...); -- then convert to a hypertable with create_hypertable()
```
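
With schema-per-tenant, the schema name ends up interpolated into SQL, and identifiers cannot be passed as ordinary query parameters. A small validation helper is one way to keep that safe. This is a sketch under assumptions: `tenant_schema` is a hypothetical function, and the `tenant_` prefix simply follows the `tenant_acme` example above.

```python
import re

# PostgreSQL truncates identifiers beyond 63 bytes; "tenant_" uses 7 of them,
# leaving at most 56 characters for the slug.
_SLUG = re.compile(r"[a-z][a-z0-9_]{0,55}")


def tenant_schema(slug: str) -> str:
    """Derive a safe schema name from a tenant slug, or raise ValueError."""
    normalized = slug.strip().lower().replace("-", "_")
    if not _SLUG.fullmatch(normalized):
        raise ValueError(f"invalid tenant slug: {slug!r}")
    return f"tenant_{normalized}"
```

For example, `tenant_schema("Acme-Corp")` yields `tenant_acme_corp`, while a slug containing spaces or semicolons is rejected before it reaches any SQL string.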

---

### Real-Time Updates: Server-Sent Events (SSE)

| Option | Pros | Cons |
|--------|------|------|
| **SSE** ⭐ | Simple, HTTP-based, auto-reconnect | One-way only |
| WebSocket | Bi-directional | More complex, harder to scale |
| Polling | Simplest | Wasteful, higher latency |

**Why SSE:**
- Dashboard is mostly server → client (one-way)
- Works through proxies/load balancers easily
- FastAPI has good SSE support
- Simpler than WebSocket for your use case

**Implementation:**
```python
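from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/stream/{dashboard_id}")
async def stream_updates(dashboard_id: str):
    async def event_generator():
        # get_dashboard_updates is an app-specific async generator (not shown)
        async for update in get_dashboard_updates(dashboard_id):
            yield f"data: {update.json()}\n\n"
    return StreamingResponse(event_generator(), media_type="text/event-stream")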
```
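
The `data: ...\n\n` string in the handler above is the SSE wire format: each event is a set of `field: value` lines terminated by a blank line. A small formatter, sketched here with a hypothetical `format_sse` helper, makes the optional `event` and `id` fields and multi-line payloads explicit:

```python
from typing import Optional


def format_sse(data: str, event: Optional[str] = None,
               event_id: Optional[str] = None) -> str:
    """Serialize one Server-Sent Event frame."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")    # lets clients resume after reconnect
    if event is not None:
        lines.append(f"event: {event}")    # named event type for the listener
    # A payload with newlines must be sent as one "data:" line per line.
    for chunk in data.split("\n"):
        lines.append(f"data: {chunk}")
    # The blank line (double newline) marks the end of the event.
    return "\n".join(lines) + "\n\n"
```

Inside the generator, `yield format_sse(update.json(), event="update")` would replace the hand-built f-string.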

---

### Hosting: AWS (ECS + RDS)

| Option | Pros | Cons |
|--------|------|------|
| **AWS ECS** ⭐ | Simpler than K8s, SOC2 compliant, team can manage | Vendor lock-in |
| Kubernetes | Portable, powerful | Overkill for team of 4-6, ops burden |
| Heroku | Simplest | Expensive at scale, less control |
| Vercel/Railway | Great DX | Less suitable for complex backend |

**Why AWS ECS:**
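- SOC2-friendly: the underlying infrastructure is covered by AWS compliance programs, though your application still needs its own controls and audit
- Managed container orchestration without K8s complexity
- RDS for managed PostgreSQL (verify that the TimescaleDB extension is available on the managed service you choose; if it is not, plan for a self-managed or vendor-hosted TimescaleDB instance)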
- Can scale as you grow without re-platforming

**Cost estimate:** ~$800-1500/month at MVP scale

---

### Long-Term Considerations

**In 6-12 months, consider:**
- Redis for caching dashboard queries (when you have repeat visitors)
- CDN for static dashboard assets
- Read replicas if analytics queries compete with writes

**Hiring alignment:**
- Python + FastAPI: Large candidate pool
- React: Huge candidate pool
- PostgreSQL: Everyone knows it
- TimescaleDB: Niche, but SQL-based so learnable

**Migration paths if needed:**
- FastAPI → Django: Possible, same language
- PostgreSQL → other: Hard, but unlikely to need
- ECS → Kubernetes: Container-based, migration feasible

When to use: at project kickoff, when evaluating major technology changes, or when the current stack isn't meeting needs

  • Optimize for team productivity over theoretical perfection; a technology your team knows beats a 'better' one they don't
  • Consider the hiring market; choosing obscure tech limits your talent pool

4. Architecture Decision Record

Help me write an Architecture Decision Record (ADR):

**Decision title:** [WHAT DECISION]

**Context:**
- [CURRENT SITUATION]
- [PROBLEM TO SOLVE]
- [CONSTRAINTS]

**Options considered:**
1. [OPTION 1]
2. [OPTION 2]
3. [OPTION 3]

**Evaluation criteria:**
- [CRITERION 1 - e.g. performance]
- [CRITERION 2 - e.g. maintainability]
- [CRITERION 3 - e.g. cost]

**Decision:** [WHICH OPTION AND WHY]

**Format:** Standard ADR format with Status, Context, Decision, Consequences
Decision: Choose message broker for event-driven architecture
Context: Moving from sync REST calls to async events, need reliability, team has no Kafka experience
Options: RabbitMQ, Kafka, AWS SQS+SNS, Redis Streams
Criteria: Reliability (must not lose messages), operational complexity, cost, learning curve
# ADR-007: Message Broker Selection for Event-Driven Architecture

## Status

**Accepted** - 2024-01-15

## Context

We are transitioning from synchronous REST calls between services to an event-driven architecture to improve reliability and decoupling. Key requirements:

- **Reliability**: Events must not be lost (order payments, inventory updates)
- **Scale**: Currently ~1000 events/minute, expect 10x growth in 18 months
- **Team**: 6 engineers, no Kafka experience, some AWS experience
- **Existing infra**: Running on AWS, PostgreSQL, Redis already in use

## Decision

We will use **AWS SQS + SNS** for our event-driven messaging.

## Options Considered

### Option 1: Apache Kafka

| Criterion | Score | Notes |
|-----------|-------|-------|
| Reliability | ⭐⭐⭐⭐⭐ | Excellent, designed for durability |
| Ops complexity | ⭐⭐ | Requires dedicated ops knowledge |
| Cost | ⭐⭐⭐ | Infrastructure + ops time |
| Learning curve | ⭐⭐ | Steep, team has no experience |
| Scale ceiling | ⭐⭐⭐⭐⭐ | Unlimited |

**Rejected because:** Operational complexity too high for current team size. Would need to hire or dedicate engineer to Kafka ops.

### Option 2: RabbitMQ

| Criterion | Score | Notes |
|-----------|-------|-------|
| Reliability | ⭐⭐⭐⭐ | Good with proper config |
| Ops complexity | ⭐⭐⭐ | Manageable, some learning needed |
| Cost | ⭐⭐⭐⭐ | Lower infra cost |
| Learning curve | ⭐⭐⭐ | Moderate |
| Scale ceiling | ⭐⭐⭐ | Good, but clustering complex |

**Rejected because:** Still requires infrastructure management. Clustering for HA adds complexity.

### Option 3: AWS SQS + SNS ⭐ Selected

| Criterion | Score | Notes |
|-----------|-------|-------|
| Reliability | ⭐⭐⭐⭐⭐ | AWS SLA, built-in redundancy |
| Ops complexity | ⭐⭐⭐⭐⭐ | Fully managed, zero ops |
| Cost | ⭐⭐⭐⭐ | Pay per message, free tier |
| Learning curve | ⭐⭐⭐⭐ | Team knows AWS |
| Scale ceiling | ⭐⭐⭐⭐ | Millions of messages/day |

**Selected because:** Best fit for team capabilities and operational constraints.

### Option 4: Redis Streams

| Criterion | Score | Notes |
|-----------|-------|-------|
| Reliability | ⭐⭐⭐ | Depends on persistence config |
| Ops complexity | ⭐⭐⭐⭐ | Already running Redis |
| Cost | ⭐⭐⭐⭐⭐ | Already paying for Redis |
| Learning curve | ⭐⭐⭐⭐ | Team knows Redis |
| Scale ceiling | ⭐⭐⭐ | Memory-bound |

**Rejected because:** Reliability concerns for critical events. Not designed as primary message broker.
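
The star ratings in the four option tables can be turned into a weighted score to check that the decision follows from the stated criteria. The scores below are copied from the tables; the weights are an assumption made for illustration (reliability weighted highest because events must not be lost), and the ADR itself does not specify any.

```python
# Star ratings from the option tables above (1-5).
SCORES = {
    "Apache Kafka":  {"reliability": 5, "ops": 2, "cost": 3, "learning": 2, "scale": 5},
    "RabbitMQ":      {"reliability": 4, "ops": 3, "cost": 4, "learning": 3, "scale": 3},
    "AWS SQS + SNS": {"reliability": 5, "ops": 5, "cost": 4, "learning": 4, "scale": 4},
    "Redis Streams": {"reliability": 3, "ops": 4, "cost": 5, "learning": 4, "scale": 3},
}

# Assumed weights, not part of the ADR: adjust to your own priorities.
WEIGHTS = {"reliability": 3, "ops": 2, "cost": 1, "learning": 1, "scale": 1}


def weighted_total(scores, weights):
    """Sum of score * weight over every criterion."""
    return sum(scores[criterion] * weight for criterion, weight in weights.items())


def rank(options, weights):
    """Return (name, total) pairs, best first."""
    totals = {name: weighted_total(s, weights) for name, s in options.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Under these weights, and also with equal weights, AWS SQS + SNS ranks first, which matches the decision recorded above.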

## Consequences

### Positive

- **Zero operational burden**: AWS manages availability, scaling, patching
- **Immediate start**: Team can begin implementing today
- **Cost-effective**: ~$50/month at current scale
- **AWS integration**: Works well with Lambda, CloudWatch, IAM

### Negative

- **Vendor lock-in**: Harder to migrate away from AWS
- **Feature limitations**: No message replay (unlike Kafka), no complex routing (unlike RabbitMQ)
- **Latency**: Higher than self-hosted solutions (~20-50ms vs ~5ms)

### Risks and Mitigations

| Risk | Mitigation |
|------|------------|
| Vendor lock-in | Abstract behind interface, could swap later |
| No message replay | Store events in database for replay capability |
| Cost at scale | Review when volume approaches the projected 10x growth (~10,000 events/minute); consider alternatives |
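
The "abstract behind interface" mitigation for vendor lock-in might look like the sketch below. All names here (`EventPublisher`, `InMemoryPublisher`, `complete_payment`) are hypothetical; a production implementation of the same interface would wrap the AWS SDK, and another could wrap a different broker later.

```python
from typing import Dict, List, Protocol, Tuple


class EventPublisher(Protocol):
    """What application code depends on, instead of a specific broker SDK."""

    def publish(self, topic: str, payload: Dict) -> None: ...


class InMemoryPublisher:
    """Test double; an SNS-backed class would expose the same method."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, Dict]] = []

    def publish(self, topic: str, payload: Dict) -> None:
        self.sent.append((topic, payload))


def complete_payment(publisher: EventPublisher, order_id: int) -> None:
    # Business logic only knows the interface, so swapping brokers
    # does not require touching this function.
    publisher.publish("PaymentCompleted", {"order_id": order_id})
```

The same seam is also a natural place to write each event to a database table, which covers the "no message replay" mitigation.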

## Review Date

Revisit this decision in **12 months** or when event volume reaches the projected 10x growth (~10,000 events/minute), whichever comes first.

## References

- [AWS SQS Documentation](https://docs.aws.amazon.com/sqs/)
- [Comparison: Kafka vs SQS](internal-wiki/kafka-vs-sqs)
- Related: ADR-005 (Event Schema Standards)

When to use: when making significant architectural decisions that should be documented for future reference

  • ADRs are for your future self and teammates; explain the 'why', not just the 'what'
  • Include what you rejected and why; this prevents revisiting the same debates

Common Mistakes to Avoid

Over-architecting for scale you don't have; complexity has real costs today

Choosing technologies based on hype rather than team capabilities and actual requirements

Designing without considering operations: who's on call at 3am when this breaks?
