Schema Design: Building the Foundation of Your Database
Learn how to design effective database schemas with proper data types, constraints, and relationships that scale with your application.
Schema Design: Building the Foundation of Your Database
A well-designed schema makes everything easier. Queries are fast, your application grows without constant reworking, and you don’t spend weekends fixing data that got into a weird state because nobody enforced the rules at the database level. The payoff for thinking carefully upfront is real.
This guide covers the core concepts: tables, columns, data types, constraints, and how relationships between tables work. I’ll assume you know what a database is and have written basic SQL before.
Tables: The Core Building Blocks
Tables are where your data lives. Each one represents a type of thing—users, orders, products—and every row is one actual instance of that thing. Columns hold attributes.
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL UNIQUE,
full_name VARCHAR(100),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
When you’re deciding what goes in a table, think about real entities and what describes them. Orders don’t belong in a users table. They go in their own table, linked by a relationship.
Naming Conventions
Consistent names matter more than clever ones:
- Table names should be singular (
usernotusers) — the table describes a type - Use snake_case (
first_namenotfirstName) - Be specific (
subscription_statusnotstatus) - Foreign keys should make the reference obvious (
orders.customer_id)
Choosing Data Types
The type you pick affects storage, speed, and what errors you can catch. It’s worth thinking about this.
Common Data Types
| Type | Use Case | Considerations |
|---|---|---|
| INTEGER / BIGINT | Whole numbers | BIGINT for values over 2 billion |
| DECIMAL / NUMERIC | Exact decimals (money) | Specify precision and scale |
| VARCHAR(n) | Variable text | Set a max length |
| TEXT | Long-form text | Use when length varies significantly |
| BOOLEAN | True/false | More storage-efficient than INT |
| DATE / TIMESTAMP | Dates and times | TIMESTAMP includes time; DATE is just the day |
| JSONB | Semi-structured data | PostgreSQL’s binary JSON, supports indexing |
| UUID | Unique IDs | Works across distributed systems |
VARCHAR Without Length: Specify It
Always set a length for VARCHAR unless you have a reason not to. VARCHAR(255) works for most names. VARCHAR(1000) suits longer descriptions. Without a length, some databases allocate maximum space per row, which wastes storage.
-- Good: length is specified
email VARCHAR(255)
-- Avoid: no length constraint unless you specifically want TEXT
description TEXT -- just use TEXT directly instead
Constraints: Rules That Enforce Themselves
Constraints are how you make the database enforce your rules. Without them, bad data gets in and causes problems later—sometimes silently, which is worse.
Primary Keys
Every table needs one. A primary key uniquely identifies each row. It’s how you reliably reference a specific record from other tables.
CREATE TABLE products (
id BIGINT PRIMARY KEY,
sku VARCHAR(50) UNIQUE NOT NULL,
name VARCHAR(200) NOT NULL,
price DECIMAL(10, 2) NOT NULL
);
NOT NULL
If a column must have a value, say so with NOT NULL. If you forget, NULLs sneak in, and your application has to deal with missing data everywhere.
-- Required
email VARCHAR(255) NOT NULL,
-- Optional
middle_name VARCHAR(100) NULL
UNIQUE
Unique constraints prevent duplicates. A table can have multiple UNIQUE constraints, unlike primary keys.
-- One email per user
email VARCHAR(255) NOT NULL UNIQUE,
-- One badge per employee
badge_number VARCHAR(20) UNIQUE
CHECK
CHECK constraints validate conditions before allowing an insert. This catches business logic errors at the database gate.
price DECIMAL(10, 2) NOT NULL CHECK (price >= 0),
status VARCHAR(20) NOT NULL CHECK (status IN ('active', 'inactive', 'suspended')),
percentage DECIMAL(5, 2) NOT NULL CHECK (percentage >= 0 AND percentage <= 100)
DEFAULT Values
Defaults handle cases where no value is provided. They work well for timestamps, booleans, and status fields.
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
is_active BOOLEAN DEFAULT true,
status VARCHAR(20) DEFAULT 'pending'
Relationships Between Tables
Relationships let you connect data across tables. The three types you’ll encounter most are one-to-one, one-to-many, and many-to-many.
erDiagram
CUSTOMERS ||--o{ ORDERS : places
ORDERS ||--|{ ORDER_ITEMS : contains
PRODUCTS ||--o{ ORDER_ITEMS : "ordered in"
CUSTOMERS {
int customer_id PK
varchar email
varchar name
}
ORDERS {
int order_id PK
int customer_id FK
decimal total
timestamp created_at
}
ORDER_ITEMS {
int order_id FK
int product_id FK
int quantity
decimal unit_price
}
PRODUCTS {
int product_id PK
varchar sku
varchar name
decimal price
}
When to Use One-to-One
One-to-one suits optional data that only some rows need. User accounts with optional billing addresses, employees with optional parking assignments. If the optional data is large (long text, many columns) and most rows do not need it, keeping it separate avoids wasting space. If you always query the data together, merging into one table is simpler.
When Not to Use One-to-One
If you find yourself joining the two tables in almost every query, merge them. One-to-one adds query complexity without benefit when the data is always needed together. If the relationship is actually “zero or one” on one side, consider whether that side just needs a nullable foreign key instead.
When to Use One-to-Many
One-to-many fits any parent-child relationship where children belong to exactly one parent. Customers and their orders. Categories and products. Authors and blog posts. The pattern is correct when a child can only have one parent and you frequently query children by parent or parent with children.
When Not to Use One-to-Many
If you need to query children from multiple parents together (like tags on posts where posts can have many tags and tags apply to many posts), you have a many-to-many, not one-to-many. Forgetting this distinction leads to data modeling bugs.
When to Use Many-to-Many
Many-to-many fits when entities associate in arbitrary combinations. Posts and tags. Students and courses. Products and categories. The junction table is not a failure of design — it accurately models the reality that these relationships exist independently of the entities themselves.
When Not to Use Many-to-Many
If the “many” side is actually bounded and small (an order always has exactly 3 line items), consider whether a fixed-array column or separate columns serve you better. Many-to-many queries through junction tables add join overhead. When query performance matters and the combination space is small, denormalizing into arrays or fixed columns eliminates joins.
One-to-One
One row corresponds to exactly one row in another table. A foreign key with a UNIQUE constraint makes this work. Sometimes it makes more sense to just merge the tables—it depends on whether you always use the data together.
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL UNIQUE,
password_hash VARCHAR(255) NOT NULL
);
CREATE TABLE user_profiles (
id SERIAL PRIMARY KEY,
user_id INTEGER NOT NULL UNIQUE REFERENCES users(id),
bio TEXT,
avatar_url VARCHAR(500),
preferred_language VARCHAR(10) DEFAULT 'en'
);
The UNIQUE on user_id prevents a user from having more than one profile.
One-to-Many
The most common relationship. One user can have many orders, but each order has exactly one user.
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL UNIQUE,
name VARCHAR(200) NOT NULL
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL REFERENCES customers(id),
order_number VARCHAR(50) NOT NULL UNIQUE,
total DECIMAL(12, 2) NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Many-to-Many
Requires a junction table. An order contains many products, and a product can appear in many orders.
CREATE TABLE products (
id SERIAL PRIMARY KEY,
sku VARCHAR(50) NOT NULL UNIQUE,
name VARCHAR(200) NOT NULL,
price DECIMAL(10, 2) NOT NULL
);
CREATE TABLE order_items (
order_id INTEGER NOT NULL REFERENCES orders(id),
product_id INTEGER NOT NULL REFERENCES products(id),
quantity INTEGER NOT NULL DEFAULT 1,
unit_price DECIMAL(10, 2) NOT NULL,
PRIMARY KEY (order_id, product_id)
);
The composite primary key (order_id, product_id) means each product shows up once per order—prevents duplicates too.
Practical Scenarios
Addresses
Embed all fields in one table if you’ll always query them together. Separate them out if different parts of your app need different addresses (shipping vs billing, for example).
-- Simple but inflexible
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
shipping_street VARCHAR(255),
shipping_city VARCHAR(100),
shipping_state VARCHAR(100),
shipping_postal_code VARCHAR(20),
shipping_country VARCHAR(100)
);
-- Reusable and flexible
CREATE TABLE addresses (
id SERIAL PRIMARY KEY,
customer_id INTEGER REFERENCES customers(id),
address_type VARCHAR(20) CHECK (address_type IN ('shipping', 'billing')),
street VARCHAR(255) NOT NULL,
city VARCHAR(100) NOT NULL,
state VARCHAR(100),
postal_code VARCHAR(20) NOT NULL,
country VARCHAR(100) NOT NULL
);
Categories and Tags
Single category? A simple reference works. Product in multiple categories? You need a junction table.
CREATE TABLE categories (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL UNIQUE,
parent_id INTEGER REFERENCES categories(id)
);
CREATE TABLE product_categories (
product_id INTEGER REFERENCES products(id),
category_id INTEGER REFERENCES categories(id),
PRIMARY KEY (product_id, category_id)
);
Audit Fields
Most tables should track who created and modified records.
CREATE TABLE sensitive_data (
id SERIAL PRIMARY KEY,
data_value VARCHAR(500),
-- Audit trail
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
created_by INTEGER REFERENCES users(id),
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_by INTEGER REFERENCES users(id),
-- Soft delete
deleted_at TIMESTAMP,
deleted_by INTEGER REFERENCES users(id)
);
Normalization Trade-offs at a Glance
| Aspect | Normalized (3NF+) | Denormalized |
|---|---|---|
| Write complexity | Higher (more joins on insert/update) | Lower (single table writes) |
| Read performance | Slower (multi-table joins) | Faster (fewer joins) |
| Storage efficiency | Better (no redundant data) | Wasted space from duplication |
| Update anomaly risk | Lower (data lives in one place) | Higher (same data in multiple rows) |
| Query flexibility | Higher (flexible filtering) | Lower (schema baked into table) |
| Scalability | Better horizontal scaling | harder with very large tables |
Common Production Failures
Referential integrity breaks: Teams sometimes disable foreign key constraints to speed up bulk loads, then forget to re-enable them. Or they re-enable but skip the validation step. Orphaned rows pile up quietly. After any bulk load operation, run a query that checks for orphaned foreign keys before calling it done.
Constraint violation cascades: CASCADE DELETE on a table with millions of child rows will lock the database long enough to cause an incident. I have seen this happen. Test cascade behavior with production-scale data before it reaches production. Batched deletes or soft deletes are safer for large child tables.
Index bloat on foreign keys: Foreign key columns that get frequent updates but lack indexes accumulate dead tuples fast. This is especially painful on high-throughput tables. Keep an eye on index size and bloat on FK columns.
Wrong data type choices: VARCHAR(255) for a country code is lazy. Every row wastes space. VARCHAR(2) for country_code stores the same two characters correctly in a tiny fraction of the space. Guess the actual maximum length, do not just copy-paste 255.
Common Pitfalls and Anti-Patterns
Using VARCHAR without a length specification: VARCHAR without a length limit behaves like TEXT and may bypass application-level length validation. Fix: always specify VARCHAR(n) with an explicit maximum based on the actual data domain.
Naming tables and columns inconsistently: Mixing snake_case, camelCase, and PascalCase across tables makes queries fragile and ORM mapping error-prone. Fix: adopt snake_case for PostgreSQL and enforce it through a schema linting tool.
Ignoring the created_at / updated_at audit fields: Tables without these fields make it impossible to audit when data was created or changed. Fix: add created_at TIMESTAMP DEFAULT NOW() and updated_at TIMESTAMP to every table that tracks persistent data.
Using TEXT for everything to avoid type decisions: TEXT columns waste storage and prevent the database from enforcing meaningful length constraints. Fix: use VARCHAR(n) for short strings, TEXT only for truly unbounded content, and appropriate numeric types (SMALLINT, INTEGER, BIGINT) for numbers.
Dropping tables without verifying foreign key dependencies: A DROP TABLE on a parent table without CASCADE silently fails if child tables have FK references, leaving the schema in an inconsistent state.
One-to-one means each row in Table A maps to exactly one row in Table B and vice versa. One-to-many means a row in Table A can map to multiple rows in Table B, but each row in Table B maps back to only one row in Table A. Choose one-to-one when you have optional data that only some rows need (like billing addresses), when the data is large and most rows do not need it, or when you want to split a wide table for organizational reasons. Choose one-to-many for standard parent-child relationships like customers and their orders. One-to-one adds join complexity when you always query the data together, so merge tables if they are always queried jointly.
The 47-column table with repeated supplier fields is a denormalized design that causes update anomalies, storage waste, and data inconsistency. When a supplier changes their address, you must update every product row that references that supplier. Missing updates create inconsistent data. The fix is normalization: extract suppliers into their own table with a surrogate key, then replace the denormalized supplier fields in the products table with a foreign key reference. This eliminates redundancy, prevents update anomalies, and ensures address changes happen in one place. The risk is that 47-column tables often have other problems lurking, so a full schema review is advisable.
Always use a junction table for many-to-many relationships. Storing comma-separated IDs or JSON arrays violates first normal form (atomic values), prevents efficient joins, makes indexing impossible on the stored relationships, and complicates updates. A junction table with foreign keys to both tables enables proper joins, enforces referential integrity, allows indexing on either side of the relationship, and accurately models the true cardinality. The only case where denormalized storage might make sense is when the relationship is truly optional and almost never queried, but even then the tradeoff is usually not worth it.
Embedding addresses in orders works when you always need the address with the order and the address never changes after the order is placed. It simplifies queries by eliminating joins and performs well for read-heavy order reporting. The separate addresses table is correct when addresses are shared across multiple entities (customers, suppliers), when addresses can be updated and those updates should propagate, or when you need to query "all orders shipped to California" across all customers. Embedding creates redundancy and update anomaly risk. The right choice depends on whether the address is an attribute of the order or a shared entity that happens to be referenced by the order.
Use VARCHAR(2) or VARCHAR(3) for country codes when you only need to store and display the code without additional context. Use a lookup table when you need to store additional data about each country (name, region, population, calling code) or when you want the database to enforce referential integrity preventing invalid codes. VARCHAR(255) for a two-letter country code wastes storage and bypasses the ability to catch typos at the database level. A countries table with a foreign key is the right choice when the country is a domain entity you care about beyond just the code.
Audit fields track when records were created and modified. created_at should default to CURRENT_TIMESTAMP and never change. updated_at should also default to CURRENT_TIMESTAMP but update on every row change using a trigger or application-level logic. deleted_at supports soft deletes, allowing records to appear deleted without physical removal. Implementation choices: application-level (your ORM or code sets the values), database triggers (guaranteed regardless of access path), or generated columns (in PostgreSQL, GENERATED ALWAYS AS). The key decision is whether you need the database to enforce these automatically or whether application-level implementation is sufficient. Database-level enforcement is more robust since it covers direct SQL access.
Excessive NULLs often indicate a schema that has merged two entities that should be separate. NULL means "unknown" or "not applicable," but if 80% of rows have a value in one column and only 20% in another, those columns might belong in a separate table with a one-to-one relationship. NULLable foreign keys are appropriate when the relationship is genuinely optional (a user may or may not have a spouse). NULLable columns that represent different categories of the same entity suggest subtype modeling is needed. Review whether the table is trying to represent multiple concepts or whether normalization would clarify the model.
CASCADE DELETE can silently remove large numbers of rows when a parent record is deleted. If you delete a customer who has 50,000 orders, the database locks every related row during deletion, blocking other writes to those tables for seconds or minutes. On busy production systems this causes incidents. CASCADE DELETE is dangerous when the child tables have many rows or when the delete happens without application-level awareness. Safer alternatives: SET NULL or SET DEFAULT (requires nullable FK), soft delete (set deleted_at instead), or batched hard deletes that chunk the deletion and let the system breathe between batches. Always test CASCADE behavior with production-scale data before deploying.
Big table schema changes require a multi-phase approach to avoid locking. Add new columns as nullable with no defaults first, then backfill in batches (UPDATE LIMIT 1000 with sleep between batches to reduce load), then add NOT NULL constraints once backfill is complete, then add defaults. For adding NOT NULL with a default on PostgreSQL 11+, the default is stored in the catalog and applied on read, so it does not require a table rewrite. For older versions or more complex changes, use the expand-contract pattern: add new column, migrate data, deprecate old column, drop old column. Never alter a multi-billion row table in a single statement in production.
Hard delete permanently removes the row from the database. Soft delete sets a deleted_at timestamp or is_active flag, keeping the row in the table but invisible to normal queries. Hard delete is appropriate when the data has no regulatory retention requirement, when deleted data serves no analytical purpose, and when storage is a concern. Soft delete is appropriate when you need an audit trail, when deleted records are referenced by other tables that must remain consistent, when regulatory requirements mandate data retention, or when you need the ability to restore accidentally deleted records. Soft delete complicates queries (every query must filter deleted rows) so only use it when the benefit justifies the overhead.
CHECK constraints validate that column values meet a condition before allowing insert or update. They enforce business logic at the database level that would otherwise require application-layer validation. Use CHECK when you need to validate domain-specific rules like price >= 0, status IN ('active', 'inactive'), or percentage BETWEEN 0 AND 100. Unlike NOT NULL, CHECK can validate complex conditions across multiple columns. Unlike triggers, CHECK is declarative and more performant. The tradeoff is that complex CHECK constraints can be harder to maintain than application validation, and some teams prefer keeping business logic in the application layer for testability.
SERIAL and BIGSERIAL are auto-incrementing integers suitable for single-instance databases with moderate row counts. BIGSERIAL handles values beyond 2 billion. UUIDs are 16-byte identifiers suitable for distributed systems where multiple databases might generate IDs independently. Choose SERIAL for small to medium tables where you never need to expose the ID externally. Choose BIGSERIAL for high-volume tables that might exceed 2 billion rows. Choose UUID when you need distributed ID generation, when IDs cross database boundaries, or when you want to obscure the count of records. UUIDs have storage and index performance costs so only use them when the distributed generation benefit justifies the overhead.
PRIMARY KEY enforces uniqueness, cannot be NULL, and creates a clustered index by default (except PostgreSQL). A table can only have one PRIMARY KEY. UNIQUE enforces uniqueness but allows NULL (multiple NULLs depending on database), and creates a non-clustered index by default. A table can have multiple UNIQUE constraints. PRIMARY KEY is appropriate for the row identifier. UNIQUE is appropriate for columns that should be unique but are not the row identifier, like an email address or badge number. The practical difference is that PRIMARY KEY implies NOT NULL and is indexed differently.
Three common approaches: Adjacency list (parent_id column referencing the same table) is simple and supports quick immediate parent lookups but requires recursive queries for tree traversals. Nested sets (lt/rgt columns) enable fast subtree queries but inserts are expensive since all right values must shift. Materialized path (storing full path like "/1/5/12/") enables prefix matching for subtree queries with reasonable inserts. For PostgreSQL, also consider ltree extension for hierarchical data. Choose adjacency list for simple hierarchies with frequent updates. Choose nested sets or materialized path for read-heavy trees where you frequently query entire subtrees.
JSONB stores semi-structured data as binary JSON, allowing flexible schemas without schema migrations. It suits data where fields vary per row or where the structure evolves rapidly. Normalized tables suit data with fixed schemas, where you need referential integrity, and where you query specific fields frequently. JSONB tradeoffs: you lose type safety at the column level, querying nested fields is slower than indexed column access, and you cannot enforce relationships with foreign keys. Normalization tradeoffs: schema changes require ALTER TABLE, and joins multiply for highly related data. Use JSONB for configuration, user preferences, or when the data structure genuinely varies. Use normalized tables for core business data with fixed schemas.
Estimate per-row storage by summing column sizes: INTEGER is 4 bytes, BIGINT is 8 bytes, VARCHAR(n) stores n + 1 to n + 4 bytes depending on content length, BOOLEAN is 1 byte, TIMESTAMP is 8 bytes, JSONB is variable but roughly content size plus overhead. Add row overhead (PostgreSQL row header is 23 bytes plus padding). Multiply by expected row count for table size, then multiply by 1.3 to 1.5 for index overhead. For columns with wide distributions, sample actual data sizes before committing to VARCHAR lengths. The goal is to catch storage mistakes before they become production problems.
Table names are singular (user not users) since the table represents a type. Columns use snake_case (first_name not firstName). Foreign keys follow the pattern fk_table_referencedtable (fk_orders_customers). Primary keys follow pk_table (pk_users). Unique constraints follow uq_table_column (uq_users_email). Check constraints follow ck_table_condition (ck_orders_price_positive). This convention makes the purpose of each object immediately clear from its name. Consistency matters more than the specific convention chosen.
A surrogate key is a database-generated identifier with no meaning outside the database (auto-increment INT, UUID). A natural key is a value that exists in the real world and means something (ISBN, Social Security Number, email address). Use surrogate keys for most tables because they do not change when real-world identifiers change, they have no business logic attached, and they are compact. Use natural keys when the business already has a stable identifier that genuinely identifies the entity and is unlikely to change, like an ISBN for books or a country code that appears in external systems. Natural keys that appear in URLs or APIs should be treated as immutable contracts.
The expand-contract pattern safely migrates existing columns in phases: Phase 1 (expand) add the new column alongside the old one, deploy code that writes to both, backfill new column from old. Phase 2 (migrate) switch reads to new column, deploy code that reads from new column only. Phase 3 (contract) remove old column in a separate deployment. This pattern is necessary when you cannot safely backfill a column in a single operation on large tables, when the migration might require changing data types that cannot be done in place, or when you need to maintain zero-downtime operation throughout. The pattern adds complexity so only use it when the table is too large for direct ALTER TABLE.
Use a separate lookup table (statuses) when you need to store additional attributes about each status (status_name, display_order, color_code, is_active), when you want referential integrity to prevent invalid status values, or when the set of values changes frequently. Use a column with CHECK constraint (status VARCHAR(20) CHECK (status IN ('active', 'inactive'))) when the set of values is fixed and rarely changes, when you do not need additional attributes per status, and when simplicity outweighs the flexibility benefit. The lookup table approach is correct for domain entities. The CHECK constraint approach is correct for simple enumerated values that are never queried independently.
Further Reading
- PostgreSQL Documentation: The SQL Language - Data Definition — Comprehensive guide to table creation, constraints, and schema design in PostgreSQL
- MySQL 8.0 Reference Manual: Using SQL to Modify Tables — MySQL-specific schema modification patterns
- “Choosing a Wrong Primary Key” by_DATABAiley — Analysis of real-world problems from bad key choices
- “The Great Primary Key Debate” - Natural vs Surrogate Keys — Comprehensive comparison with production case studies
Conclusion
A well-designed schema is the foundation everything else builds on. The time you invest in getting tables, data types, constraints, and relationships right pays back with every query you write and every time your application needs to scale. Start with clear entities, enforce rules at the database level, model relationships accurately, and revisit decisions as your domain knowledge grows. The fundamentals covered here—tables as entity representations, thoughtful data type selection, constraint-driven data integrity, and correct relationship cardinality—apply regardless of which database you use. Master these and everything more advanced becomes easier.
Category
Related Posts
Database Normalization: From 1NF to BCNF
Learn database normalization from 1NF through BCNF. Understand how normalization eliminates redundancy, prevents update anomalies, and when denormalization makes sense for performance.
Constraint Enforcement: Database vs Application Level
A guide to CHECK, UNIQUE, NOT NULL, and exclusion constraints. Learn database vs application-level enforcement and performance implications.
Understanding SQL JOINs and Database Relationships
Master SQL JOINs with this practical guide covering INNER, LEFT, RIGHT, FULL OUTER, and CROSS joins. Learn how relationship types between tables shape your queries.