Data Modeling Prompt Templates
AI prompt templates for data modeling. Design schemas, relationships, and data architectures.
Overview
Good data models make analysis easy. Bad ones make every query painful. These prompts help you design schemas that balance normalization with query performance, plan for growth, and avoid common pitfalls. Whether you're building a transactional database or an analytics warehouse, the principles here will save you from redesigns later.
Best Practices
Start with the questions you need to answer, then design the schema to support them
Consider query patterns before optimizing for storage
Document your assumptions and business rules in the schema design
Plan for how the data will grow over time (volume, new attributes, new relationships)
Distinguish between transactional needs (OLTP) and analytical needs (OLAP)
Prompt Templates
1. Schema Design from Requirements
Design a database schema for [APPLICATION/DOMAIN]. Business requirements: [KEY_REQUIREMENTS] Entities to track: [LIST_OF_THINGS] Key relationships: [HOW_THINGS_RELATE] Query patterns: [COMMON_QUERIES] Database: [POSTGRESQL/MYSQL/SNOWFLAKE/ETC] Scale: [EXPECTED_VOLUME]
Design a schema for an e-commerce order management system. Business requirements: Track orders from placement to delivery, support multiple shipping addresses per customer, handle partial shipments (one order, multiple packages), track inventory levels. Entities: Customers, Orders, Order Items, Products, Shipments, Addresses, Inventory. Key relationships: Customers have many orders, orders have many items and shipments, products can be in multiple orders. Query patterns: Get all orders for a customer, get order status with shipment tracking, find products low on inventory. Database: PostgreSQL. Scale: 10K orders/day, 500K products.
**Schema Design: E-commerce Order Management**
```sql
-- Customers
CREATE TABLE customers (
customer_id SERIAL PRIMARY KEY,
email VARCHAR(255) UNIQUE NOT NULL,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
-- Addresses (separate table for multiple addresses per customer)
CREATE TABLE addresses (
address_id SERIAL PRIMARY KEY,
customer_id INT REFERENCES customers(customer_id),
address_type VARCHAR(20), -- 'shipping', 'billing'
is_default BOOLEAN DEFAULT FALSE,
street_1 VARCHAR(255) NOT NULL,
street_2 VARCHAR(255),
city VARCHAR(100) NOT NULL,
state VARCHAR(50),
postal_code VARCHAR(20) NOT NULL,
country VARCHAR(50) NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_addresses_customer ON addresses(customer_id);
-- Products
CREATE TABLE products (
product_id SERIAL PRIMARY KEY,
sku VARCHAR(50) UNIQUE NOT NULL,
name VARCHAR(255) NOT NULL,
price DECIMAL(10,2) NOT NULL,
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMP DEFAULT NOW()
);
-- Inventory (separate from products for easier updates)
CREATE TABLE inventory (
product_id INT PRIMARY KEY REFERENCES products(product_id),
quantity_on_hand INT NOT NULL DEFAULT 0,
reorder_point INT NOT NULL DEFAULT 10,
updated_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_inventory_low_stock ON inventory(quantity_on_hand)
WHERE quantity_on_hand <= reorder_point;
-- Orders
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INT REFERENCES customers(customer_id),
shipping_address_id INT REFERENCES addresses(address_id),
order_status VARCHAR(20) NOT NULL DEFAULT 'pending',
-- pending, processing, partially_shipped, shipped, delivered, cancelled
order_total DECIMAL(10,2) NOT NULL,
placed_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_orders_customer ON orders(customer_id);
CREATE INDEX idx_orders_status ON orders(order_status) WHERE order_status NOT IN ('delivered', 'cancelled');
-- Order Items
CREATE TABLE order_items (
order_item_id SERIAL PRIMARY KEY,
order_id INT REFERENCES orders(order_id),
product_id INT REFERENCES products(product_id),
quantity INT NOT NULL,
unit_price DECIMAL(10,2) NOT NULL, -- Price at time of order (denormalized)
shipment_id INT, -- NULL until assigned to shipment
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_order_items_order ON order_items(order_id);
-- Shipments (supports partial shipments)
CREATE TABLE shipments (
shipment_id SERIAL PRIMARY KEY,
order_id INT REFERENCES orders(order_id),
carrier VARCHAR(50),
tracking_number VARCHAR(100),
status VARCHAR(20) DEFAULT 'pending',
-- pending, in_transit, delivered
shipped_at TIMESTAMP,
delivered_at TIMESTAMP,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_shipments_order ON shipments(order_id);
ALTER TABLE order_items ADD CONSTRAINT fk_shipment
FOREIGN KEY (shipment_id) REFERENCES shipments(shipment_id);
```
**Design decisions explained**:
1. **Addresses as separate table**: Supports multiple addresses per customer, preserves address history even if customer updates their default.
2. **Inventory separate from products**: Inventory updates frequently, products don't. Separating them reduces lock contention.
3. **unit_price in order_items**: Denormalized intentionally. Product price can change, but order should reflect price at time of purchase.
4. **Partial index on inventory**: `idx_inventory_low_stock` only indexes products below reorder point, making low-stock queries fast.
5. **shipment_id in order_items**: Links items to shipments for partial shipment support. NULL until item ships.
**Useful queries this supports**:
```sql
-- All orders for a customer with status
SELECT o.*, s.tracking_number, s.status as shipment_status
FROM orders o
LEFT JOIN shipments s ON o.order_id = s.order_id
WHERE o.customer_id = ?;
-- Low inventory products
SELECT p.sku, p.name, i.quantity_on_hand, i.reorder_point
FROM inventory i
JOIN products p ON i.product_id = p.product_id
WHERE i.quantity_on_hand <= i.reorder_point;
```
Starting a new database or major feature that needs new tables
- Denormalization isn't bad. Sometimes storing calculated or historical values makes queries simpler and faster
- Add indexes based on query patterns, not just foreign keys. Think about WHERE and ORDER BY clauses
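The partial-shipment pattern above can be exercised end-to-end before committing to the design. Below is a minimal runnable sketch using Python's built-in sqlite3 module; columns are trimmed and PostgreSQL-specific syntax (SERIAL, partial indexes) is removed, so treat it as an illustration of the pattern rather than the exact DDL:

```python
import sqlite3

# Simplified version of the orders / order_items / shipments tables above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_status TEXT DEFAULT 'pending');
CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INT, status TEXT DEFAULT 'pending');
CREATE TABLE order_items (
    order_item_id INTEGER PRIMARY KEY,
    order_id INT, product_id INT, quantity INT,
    shipment_id INT  -- NULL until the item is assigned to a shipment
);
""")
conn.execute("INSERT INTO orders (order_id) VALUES (1)")
conn.execute("INSERT INTO shipments (shipment_id, order_id) VALUES (10, 1)")
# Two items on the order; only the first has been assigned to a shipment.
conn.execute("INSERT INTO order_items VALUES (100, 1, 7, 2, 10)")
conn.execute("INSERT INTO order_items VALUES (101, 1, 8, 1, NULL)")

# COUNT(shipment_id) skips NULLs, so shipped vs. total tells us the order state.
shipped, total = conn.execute(
    "SELECT COUNT(shipment_id), COUNT(*) FROM order_items WHERE order_id = 1"
).fetchone()
status = "shipped" if shipped == total else ("partially_shipped" if shipped else "pending")
print(status)  # partially_shipped
```

The NULL-able `shipment_id` is what makes partial shipments cheap to derive: no extra status table is needed to answer "has everything shipped?".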
2. Star Schema for Analytics
Design a star schema for analyzing [BUSINESS_PROCESS]. Fact table should capture: [WHAT_EVENTS_OR_TRANSACTIONS] Dimensions needed: [HOW_USERS_WILL_SLICE_DATA] Grain: [ONE_ROW_REPRESENTS_WHAT] Typical queries: [ANALYSIS_QUESTIONS] Data warehouse: [SNOWFLAKE/REDSHIFT/BIGQUERY]
Design a star schema for analyzing sales performance. Fact table should capture: Individual line items from orders. Dimensions: Date, Product, Customer, Store/Channel, Promotion. Grain: One row per order line item. Typical queries: Revenue by month, product category, and region; YoY growth comparison; promotion effectiveness. Data warehouse: Snowflake.
**Star Schema: Sales Analytics**

```sql
-- DIMENSION TABLES

-- Date dimension (pre-populated calendar table)
CREATE TABLE dim_date (
    date_key INT PRIMARY KEY, -- YYYYMMDD format
    full_date DATE NOT NULL,
    day_of_week INT,
    day_name VARCHAR(10),
    day_of_month INT,
    day_of_year INT,
    week_of_year INT,
    month_num INT,
    month_name VARCHAR(10),
    quarter INT,
    year INT,
    is_weekend BOOLEAN,
    is_holiday BOOLEAN,
    fiscal_quarter INT,
    fiscal_year INT
);

-- Product dimension
CREATE TABLE dim_product (
    product_key INT PRIMARY KEY, -- Surrogate key
    product_id VARCHAR(50), -- Natural key from source
    sku VARCHAR(50),
    product_name VARCHAR(255),
    category VARCHAR(100),
    subcategory VARCHAR(100),
    brand VARCHAR(100),
    supplier VARCHAR(100),
    unit_cost DECIMAL(10,2),
    is_active BOOLEAN,
    -- SCD Type 2 fields
    effective_date DATE,
    expiration_date DATE,
    is_current BOOLEAN
);

-- Customer dimension
CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR(50),
    email VARCHAR(255),
    customer_segment VARCHAR(50), -- New, Active, VIP, Churned
    acquisition_channel VARCHAR(50),
    acquisition_date DATE,
    city VARCHAR(100),
    state VARCHAR(50),
    country VARCHAR(50),
    region VARCHAR(50), -- For regional analysis
    is_current BOOLEAN
);

-- Channel dimension
CREATE TABLE dim_channel (
    channel_key INT PRIMARY KEY,
    channel_id VARCHAR(50),
    channel_name VARCHAR(50), -- Web, Mobile App, Retail Store, Phone
    channel_type VARCHAR(50), -- Digital, Physical
    store_id VARCHAR(50),
    store_name VARCHAR(100),
    store_city VARCHAR(100),
    store_state VARCHAR(50)
);

-- Promotion dimension
CREATE TABLE dim_promotion (
    promotion_key INT PRIMARY KEY,
    promotion_id VARCHAR(50),
    promotion_name VARCHAR(100),
    promotion_type VARCHAR(50), -- Discount, BOGO, Free Shipping
    discount_percent DECIMAL(5,2),
    start_date DATE,
    end_date DATE,
    is_active BOOLEAN
);

-- FACT TABLE
CREATE TABLE fact_sales (
    -- Surrogate keys to dimensions
    date_key INT REFERENCES dim_date(date_key),
    product_key INT REFERENCES dim_product(product_key),
    customer_key INT REFERENCES dim_customer(customer_key),
    channel_key INT REFERENCES dim_channel(channel_key),
    promotion_key INT REFERENCES dim_promotion(promotion_key),
    -- Degenerate dimension (from transaction, no separate table)
    order_id VARCHAR(50),
    order_line_number INT,
    -- Measures
    quantity INT,
    unit_price DECIMAL(10,2),
    discount_amount DECIMAL(10,2),
    gross_revenue DECIMAL(10,2), -- quantity * unit_price
    net_revenue DECIMAL(10,2), -- gross - discount
    cost DECIMAL(10,2), -- quantity * unit_cost at time of sale
    profit DECIMAL(10,2), -- net_revenue - cost
    PRIMARY KEY (date_key, order_id, order_line_number)
);

-- Cluster on date for time-series queries (Snowflake-specific)
ALTER TABLE fact_sales CLUSTER BY (date_key);
```

**Sample analytical queries this supports**:

```sql
-- Revenue by month and category
SELECT d.year, d.month_name, p.category,
       SUM(f.net_revenue) AS revenue,
       SUM(f.profit) AS profit
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.year, d.month_num, d.month_name, p.category
ORDER BY d.year, d.month_num;

-- YoY growth by region
WITH current_year AS (
    SELECT c.region, SUM(f.net_revenue) AS revenue
    FROM fact_sales f
    JOIN dim_customer c ON f.customer_key = c.customer_key
    JOIN dim_date d ON f.date_key = d.date_key
    WHERE d.year = 2024
    GROUP BY c.region
),
prior_year AS (
    SELECT c.region, SUM(f.net_revenue) AS revenue
    FROM fact_sales f
    JOIN dim_customer c ON f.customer_key = c.customer_key
    JOIN dim_date d ON f.date_key = d.date_key
    WHERE d.year = 2023
    GROUP BY c.region
)
SELECT cy.region,
       cy.revenue AS current_revenue,
       py.revenue AS prior_revenue,
       (cy.revenue - py.revenue) / py.revenue * 100 AS yoy_growth
FROM current_year cy
JOIN prior_year py ON cy.region = py.region;

-- Promotion effectiveness
SELECT pr.promotion_name, pr.promotion_type,
       COUNT(DISTINCT f.order_id) AS orders,
       SUM(f.net_revenue) AS revenue,
       SUM(f.discount_amount) AS discount_given,
       SUM(f.profit) AS profit
FROM fact_sales f
JOIN dim_promotion pr ON f.promotion_key = pr.promotion_key
WHERE pr.promotion_key > 0 -- Exclude no-promotion baseline
GROUP BY pr.promotion_name, pr.promotion_type
ORDER BY profit DESC;
```

**Key design decisions**:
1. **Pre-calculated measures**: gross_revenue, net_revenue, and profit live in the fact table. This avoids repeating the calculation in every query.
2. **SCD Type 2 for products**: Product attributes can change. effective_date/expiration_date let you analyze with the attribute values as they were at time of sale.
3. **Date dimension**: A pre-populated calendar table makes time-based analysis trivial. Include the fiscal calendar if your business uses one.
4. **Promotion dimension with key 0**: Use promotion_key = 0 for sales without a promotion. This simplifies joins compared to NULLs.
Building a data warehouse or analytics layer
- Pre-calculate and store common measures (revenue, profit) in the fact table. Storage is cheap, compute at query time isn't
- A date dimension table is worth creating even for small projects. It makes time-based queries much simpler
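Pre-populating the date dimension is usually a one-off script. Here is a minimal sketch of generating dim_date rows in Python; field names follow the DDL above, with fiscal and holiday fields omitted since those depend on your business calendar:

```python
from datetime import date, timedelta

def date_dim_row(d: date) -> dict:
    """Build one dim_date row; date_key is YYYYMMDD, matching the DDL above."""
    return {
        "date_key": d.year * 10000 + d.month * 100 + d.day,
        "full_date": d.isoformat(),
        "day_of_week": d.isoweekday(),   # 1 = Monday .. 7 = Sunday
        "day_name": d.strftime("%A"),
        "month_num": d.month,
        "month_name": d.strftime("%B"),
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
        "is_weekend": d.isoweekday() >= 6,
    }

# Generate a full calendar year of rows, ready to bulk-load into dim_date.
start = date(2024, 1, 1)
rows = [date_dim_row(start + timedelta(days=i)) for i in range(366)]  # 2024 is a leap year
print(rows[0]["date_key"], rows[-1]["full_date"])  # 20240101 2024-12-31
```

Generating many years up front is fine; the table stays tiny (a decade is under 4,000 rows).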
3. Schema Review and Optimization
Review this database schema and suggest improvements. Current schema: [DDL_OR_DESCRIPTION] Problems we're experiencing: [ISSUES] Query patterns: [HOW_DATA_IS_USED] Growth expectations: [SCALE]
Current schema: Orders table has columns: id, customer_email, customer_name, customer_phone, shipping_street, shipping_city, shipping_state, shipping_zip, billing_street, billing_city, billing_state, billing_zip, product_1_name, product_1_qty, product_1_price, product_2_name, product_2_qty, product_2_price (up to product_5), order_total, status, created_at. Problems: Can't easily query customers across orders, adding more products per order requires schema change, lots of null columns. Query patterns: Customer order history, revenue by product, order status reports. Growth: Currently 50K orders, expecting 500K/year.
**Schema Review: Orders Table**

**Critical issues identified**:
1. **Denormalized customer data**: Customer info is repeated on every order. Email could differ across orders for the same person, and there is no way to update customer info without touching all orders.
2. **Horizontal product columns (product_1, product_2, etc.)**: A classic anti-pattern. You can't have more than 5 products, can't easily aggregate by product, and waste space on nulls.
3. **Mixed concerns**: One table doing the job of at least 4 tables.

**Recommended redesign**:

```sql
-- Customers (normalized)
CREATE TABLE customers (
    customer_id SERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    name VARCHAR(255),
    phone VARCHAR(50),
    created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_customers_email ON customers(email);

-- Customer addresses
CREATE TABLE customer_addresses (
    address_id SERIAL PRIMARY KEY,
    customer_id INT REFERENCES customers(customer_id),
    address_type VARCHAR(20), -- 'shipping' or 'billing'
    street VARCHAR(255),
    city VARCHAR(100),
    state VARCHAR(50),
    zip VARCHAR(20),
    is_default BOOLEAN DEFAULT FALSE
);
CREATE INDEX idx_addresses_customer ON customer_addresses(customer_id);

-- Products (if not already existing)
CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    current_price DECIMAL(10,2) NOT NULL,
    is_active BOOLEAN DEFAULT TRUE
);

-- Orders (just order-level data)
CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INT REFERENCES customers(customer_id),
    shipping_address_id INT REFERENCES customer_addresses(address_id),
    billing_address_id INT REFERENCES customer_addresses(address_id),
    order_total DECIMAL(10,2) NOT NULL,
    status VARCHAR(20) NOT NULL DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_orders_customer ON orders(customer_id);
CREATE INDEX idx_orders_created ON orders(created_at);

-- Order items (unlimited products per order)
CREATE TABLE order_items (
    order_item_id SERIAL PRIMARY KEY,
    order_id INT REFERENCES orders(order_id),
    product_id INT REFERENCES products(product_id),
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL, -- Price at time of order
    line_total DECIMAL(10,2) NOT NULL -- quantity * unit_price
);
CREATE INDEX idx_order_items_order ON order_items(order_id);
CREATE INDEX idx_order_items_product ON order_items(product_id);
```

**Migration approach**:

```sql
-- 1. Create new tables (above)

-- 2. Migrate customers (dedupe by email; MIN() picks one value per duplicate)
INSERT INTO customers (email, name, phone, created_at)
SELECT customer_email, MIN(customer_name), MIN(customer_phone), MIN(created_at)
FROM old_orders
GROUP BY customer_email;

-- 3. Migrate addresses
INSERT INTO customer_addresses (customer_id, address_type, street, city, state, zip)
SELECT DISTINCT c.customer_id, 'shipping',
       o.shipping_street, o.shipping_city, o.shipping_state, o.shipping_zip
FROM old_orders o
JOIN customers c ON o.customer_email = c.email
WHERE o.shipping_street IS NOT NULL;

-- 4. Migrate orders (simplified, would need address lookups)

-- 5. Unpivot and migrate order items
INSERT INTO order_items (order_id, product_id, quantity, unit_price, line_total)
SELECT o.order_id, p.product_id, o.product_1_qty, o.product_1_price,
       o.product_1_qty * o.product_1_price
FROM old_orders o
JOIN products p ON o.product_1_name = p.name
WHERE o.product_1_name IS NOT NULL
UNION ALL
-- Repeat for product_2 through product_5
```

**Benefits of new schema**:
1. **Customer order history**: Simple join on customer_id
2. **Revenue by product**: GROUP BY on order_items.product_id
3. **Unlimited products per order**: Just add rows to order_items
4. **No nulls**: Clean data model
5. **Customer updates**: Change once, reflects everywhere
When an existing schema is causing problems or won't scale
- Horizontal columns (thing_1, thing_2, thing_3) are almost always a sign you need a separate table
- If the same data appears in multiple places, it will eventually get out of sync. Normalize it
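The unpivot step is often easier to validate in a scripting language before running the SQL migration. Here is a sketch of the same transformation in Python; the `old_order` row is a hypothetical example of the wide schema described above:

```python
def unpivot_order(row: dict, max_slots: int = 5) -> list[dict]:
    """Turn wide product_1..product_N columns into one dict per order item,
    mirroring the UNION ALL migration in SQL."""
    items = []
    for i in range(1, max_slots + 1):
        name = row.get(f"product_{i}_name")
        if name is None:
            continue  # empty slot in the wide table
        qty, price = row[f"product_{i}_qty"], row[f"product_{i}_price"]
        items.append({
            "order_id": row["id"],
            "product_name": name,
            "quantity": qty,
            "unit_price": price,
            "line_total": qty * price,
        })
    return items

# Hypothetical wide-schema row: two products filled in, slot 3 empty.
old_order = {"id": 42,
             "product_1_name": "Mug", "product_1_qty": 2, "product_1_price": 9.50,
             "product_2_name": "Pen", "product_2_qty": 10, "product_2_price": 1.25,
             "product_3_name": None}
items = unpivot_order(old_order)
print(len(items), items[0]["line_total"])  # 2 19.0
```

Running this over a sample of real rows is a cheap way to catch surprises (unexpected NULL patterns, qty without price) before the one-shot SQL migration.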
4. Entity Relationship Design
Help me figure out the relationships between these entities. Entities: [LIST_OF_THINGS] Business rules: [RULES_THAT_AFFECT_RELATIONSHIPS] Questions I'm unsure about: [SPECIFIC_QUESTIONS] What relationships should I model and how?
Entities: Students, Courses, Instructors, Classrooms, Semesters. Business rules: A course can have multiple sections per semester, each section has one instructor, classrooms can host multiple sections but not at the same time, students enroll in sections (not courses directly), instructors can teach multiple sections. Questions: How do I model the time constraint on classrooms? Should Course and Section be one table or two?
**Entity Relationship Analysis: University Scheduling**

**Relationship breakdown**:
1. **Courses → Sections**: One-to-Many
   - Course is the catalog entry (CS101: Intro to Programming)
   - Section is a specific offering (CS101-001, Fall 2024, MWF 9am)
   - Keep them separate: courses change rarely, sections change every semester
2. **Sections → Instructors**: Many-to-One
   - Each section has exactly one instructor
   - An instructor can teach many sections
   - Foreign key: sections.instructor_id
3. **Sections → Classrooms**: Many-to-One (with a time constraint)
   - Each section is assigned to one classroom
   - A classroom hosts many sections (but not simultaneously)
   - Foreign key: sections.classroom_id
   - Time handled by: sections.day_of_week + start_time + end_time
4. **Sections → Semesters**: Many-to-One
   - Each section belongs to one semester
   - A semester has many sections
   - Foreign key: sections.semester_id
5. **Students → Sections**: Many-to-Many
   - Students enroll in multiple sections
   - Sections have multiple students
   - Requires a junction table: enrollments

**Proposed schema**:

```sql
-- Courses (catalog)
CREATE TABLE courses (
    course_id SERIAL PRIMARY KEY,
    course_code VARCHAR(20) UNIQUE NOT NULL, -- CS101
    title VARCHAR(255) NOT NULL,
    credits INT NOT NULL,
    description TEXT
);

-- Semesters
CREATE TABLE semesters (
    semester_id SERIAL PRIMARY KEY,
    name VARCHAR(50) NOT NULL, -- Fall 2024
    start_date DATE NOT NULL,
    end_date DATE NOT NULL
);

-- Instructors
CREATE TABLE instructors (
    instructor_id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    email VARCHAR(255) UNIQUE,
    department VARCHAR(100)
);

-- Classrooms
CREATE TABLE classrooms (
    classroom_id SERIAL PRIMARY KEY,
    building VARCHAR(100) NOT NULL,
    room_number VARCHAR(20) NOT NULL,
    capacity INT NOT NULL,
    UNIQUE(building, room_number)
);

-- Students
CREATE TABLE students (
    student_id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    email VARCHAR(255) UNIQUE,
    enrollment_year INT
);

-- Sections (the heart of scheduling)
CREATE TABLE sections (
    section_id SERIAL PRIMARY KEY,
    course_id INT REFERENCES courses(course_id),
    semester_id INT REFERENCES semesters(semester_id),
    instructor_id INT REFERENCES instructors(instructor_id),
    classroom_id INT REFERENCES classrooms(classroom_id),
    section_number VARCHAR(10) NOT NULL, -- 001, 002, etc.
    day_of_week VARCHAR(10) NOT NULL, -- MWF, TR, etc.
    start_time TIME NOT NULL,
    end_time TIME NOT NULL,
    max_enrollment INT NOT NULL,
    UNIQUE(course_id, semester_id, section_number)
);

-- Enrollments (junction table)
CREATE TABLE enrollments (
    enrollment_id SERIAL PRIMARY KEY,
    student_id INT REFERENCES students(student_id),
    section_id INT REFERENCES sections(section_id),
    enrolled_at TIMESTAMP DEFAULT NOW(),
    grade VARCHAR(2), -- A, B+, etc. (NULL until graded)
    UNIQUE(student_id, section_id) -- Can't enroll twice
);
```

**Handling the classroom time constraint**:

This is a business rule, not a schema relationship. You can enforce it two ways:

**Option 1: Application-level check** (simpler)
```sql
-- Run this before inserting/updating a section
SELECT COUNT(*)
FROM sections
WHERE classroom_id = [proposed_classroom]
  AND semester_id = [proposed_semester]
  AND day_of_week = [proposed_days]
  AND start_time < [proposed_end]
  AND end_time > [proposed_start];
-- If count > 0, a conflict exists
```

**Option 2: Database constraint** (PostgreSQL exclusion constraint)
```sql
-- Requires the btree_gist extension
CREATE EXTENSION IF NOT EXISTS btree_gist;
ALTER TABLE sections ADD CONSTRAINT no_classroom_overlap
EXCLUDE USING gist (
    classroom_id WITH =,
    semester_id WITH =,
    day_of_week WITH =,
    -- TIME can't be cast to TIMESTAMP directly, so anchor to a dummy date
    tsrange('2000-01-01'::date + start_time, '2000-01-01'::date + end_time) WITH &&
);
```

One caveat with both options: matching day_of_week by equality won't catch a 'WF' section overlapping an 'MWF' one. If you need that, store one row per meeting day (a section_meetings table) instead of a pattern string.

**Answers to your questions**:
1. **Course vs Section as one table**: Definitely two tables. Courses are stable reference data; sections are transactional data that changes every semester. Combining them would mean duplicating course info for every section.
2. **Classroom time constraint**: Use application logic or a database exclusion constraint (shown above). This isn't a foreign key problem, it's a scheduling conflict problem.
When you're unsure how entities relate or how to model complex constraints
- Draw the diagram. Boxes for entities, lines for relationships. Many-to-many relationships always need a junction table
- When in doubt, err toward more tables. It's easier to denormalize later than to normalize a messy table
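The application-level classroom check reduces to the standard interval-overlap predicate: two ranges conflict exactly when each starts before the other ends. A tiny sketch:

```python
def overlaps(start_a: str, end_a: str, start_b: str, end_b: str) -> bool:
    """Two time ranges conflict iff each starts before the other ends --
    the same predicate as the application-level SQL check.
    Zero-padded HH:MM strings compare correctly as text."""
    return start_a < end_b and end_a > start_b

# A proposed 09:30-10:30 section vs. an existing 09:00-09:50 booking: conflict.
print(overlaps("09:30", "10:30", "09:00", "09:50"))  # True
# Back-to-back sections do not conflict (one ends exactly when the next starts).
print(overlaps("10:00", "11:00", "09:00", "10:00"))  # False
```

Using strict inequalities means back-to-back bookings are allowed; switch to `<=`/`>=` if you want a buffer between sections to count as a conflict.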
Common Mistakes to Avoid
Creating wide tables with repeated columns (product_1, product_2, product_3) instead of using a related table
Storing derived data (like totals) without a strategy to keep it in sync with source data
Designing for current needs only without considering how data will grow and change
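On the derived-data point: if you do store totals, pick an explicit sync strategy up front. One option is a database trigger that recomputes the stored value whenever the source rows change; a minimal sqlite3 sketch (recomputing in application code or at query time are equally valid alternatives):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_total REAL NOT NULL DEFAULT 0);
CREATE TABLE order_items (order_id INT, line_total REAL);

-- Sync strategy: a trigger recomputes the stored total on every item insert,
-- so orders.order_total can never silently drift from its source rows.
CREATE TRIGGER sync_total AFTER INSERT ON order_items
BEGIN
    UPDATE orders SET order_total =
        (SELECT SUM(line_total) FROM order_items WHERE order_id = NEW.order_id)
    WHERE order_id = NEW.order_id;
END;
""")
conn.execute("INSERT INTO orders (order_id) VALUES (1)")
conn.execute("INSERT INTO order_items VALUES (1, 19.0)")
conn.execute("INSERT INTO order_items VALUES (1, 12.5)")
total = conn.execute("SELECT order_total FROM orders WHERE order_id = 1").fetchone()[0]
print(total)  # 31.5
```

A production version would also need UPDATE and DELETE triggers; the point is that the sync rule lives in one declared place instead of being scattered across application code paths.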
Related Templates
SQL Query Prompt Templates
AI prompt templates for writing SQL queries. Create SELECT, JOIN, aggregate, and complex queries.
Data Analysis Prompt Templates
AI prompt templates for data analysis. Extract insights, identify patterns, and interpret results.
Data Visualization Prompt Templates
AI prompt templates for data visualization. Create effective charts, dashboards, and visual reports.