SQL and Database Performance: When 'Works on My Machine' Stops Working
The classic trap: you write a SQL query on your laptop with 500 test users. It runs instantly. You ship to production with 10 million users. It times out.
This gap—between SQL that works in development and SQL that works at scale—is where most database problems live.
The Query Performance Problem
Here's a common requirement: "Find all users who placed an order in the last 30 days."
First attempt:
SELECT DISTINCT u.*
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.created_at >= NOW() - INTERVAL '30 days';
Works perfectly with test data. Crawls in production. Why?
SELECT DISTINCT forces the database to load ALL matching rows into memory, sort them, and remove duplicates. A user with 1,000 orders means 1,000 identical user records loaded, sorted, and deduplicated. Multiply that across millions of users.
A better approach:
SELECT u.*
FROM users u
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.user_id = u.id
AND o.created_at >= NOW() - INTERVAL '30 days'
);
With the right index on orders(user_id, created_at), this is dramatically faster. The EXISTS check short-circuits: it stops looking once it finds one matching order, so it never loads thousands of duplicate rows.
This is the difference between knowing SQL syntax and understanding database performance.
Index Strategy: Not All Indexes Help
Indexes make queries faster—except when they don't. The problem isn't knowing that indexes help; it's knowing which indexes to create.
Take this query:
SELECT * FROM orders
WHERE user_id = 123
AND status = 'pending'
AND created_at > '2025-01-01'
ORDER BY created_at DESC;
A naive approach might create three separate indexes:
CREATE INDEX idx_user ON orders(user_id);
CREATE INDEX idx_status ON orders(status);
CREATE INDEX idx_created ON orders(created_at);
The problem: for a query like this, the database will generally pick just ONE of these indexes and filter the remaining conditions row by row (Postgres can sometimes combine indexes via a bitmap scan, but it's rarely as effective as one well-chosen composite index). Meanwhile, the three indexes waste space and slow down writes: every INSERT and UPDATE must maintain all three.
Better approach—a composite index:
CREATE INDEX idx_orders_lookup
ON orders(user_id, status, created_at DESC);
This single index handles the entire WHERE clause AND the ORDER BY in one pass. But column order matters critically:
This index helps with:
- WHERE user_id = ?
- WHERE user_id = ? AND status = ?
- WHERE user_id = ? AND status = ? AND created_at > ?

It's useless for:
- WHERE status = ? (doesn't start with user_id)
- WHERE created_at > ? (skips the leading user_id and status columns)
The order follows the "leftmost prefix" rule—the index can be used as long as you're filtering on a continuous prefix of columns from the left.
Understanding this comes from debugging slow queries in production, not from tutorials.
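You can watch the leftmost-prefix rule in action without a production incident. This sketch uses SQLite's EXPLAIN QUERY PLAN (its counterpart to Postgres's EXPLAIN) against the composite index above:

```python
import sqlite3

# Illustrative orders table with the composite index from the article.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER,
                         status TEXT, created_at TEXT);
    CREATE INDEX idx_orders_lookup
        ON orders(user_id, status, created_at DESC);
""")

def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN output describes each step
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[3] for r in rows)

# Filtering on the leftmost column: the composite index is used
prefix_plan = plan("SELECT * FROM orders WHERE user_id = 1")

# Skipping user_id and filtering on status alone: full table scan
no_prefix_plan = plan("SELECT * FROM orders WHERE status = 'pending'")

print(prefix_plan)     # mentions idx_orders_lookup
print(no_prefix_plan)  # a plain scan of orders
```

The exact wording of the plan text is SQLite-specific, but the shape of the result (index search vs. full scan) is the same lesson you'd read out of a Postgres plan.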
JOIN Behavior: Where Filters Matter
Consider this requirement: "Show all products and how many times each was ordered. Products with zero orders should show 0, not be excluded."
Standard approach:
SELECT p.name, COUNT(o.id) as order_count
FROM products p
LEFT JOIN orders o ON p.id = o.product_id
GROUP BY p.id, p.name;
Works fine. But what if the requirement changes: "Only count orders that weren't cancelled"?
Adding a WHERE clause seems logical:
SELECT p.name, COUNT(o.id) as order_count
FROM products p
LEFT JOIN orders o ON p.id = o.product_id
WHERE o.status != 'cancelled' -- WRONG!
GROUP BY p.id, p.name;
This breaks everything. The WHERE clause filters AFTER the JOIN, and for products without orders the NULL-extended row has o.status NULL, so o.status != 'cancelled' evaluates to NULL rather than true and the row is discarded. The LEFT JOIN effectively becomes an INNER JOIN.
The fix is moving the filter into the JOIN condition:
SELECT p.name, COUNT(o.id) as order_count
FROM products p
LEFT JOIN orders o
ON p.id = o.product_id
AND o.status != 'cancelled'
GROUP BY p.id, p.name;
Now the filter applies DURING the join. Products without orders remain in the results with a count of 0.
The difference between WHERE and ON in a LEFT JOIN is subtle in syntax but massive in behavior.
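A tiny sqlite3 sketch makes the behavioral difference visible. The schema and data here are illustrative; 'gadget' is the product with zero orders:

```python
import sqlite3

# Demonstrating WHERE vs ON in a LEFT JOIN.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, product_id INTEGER,
                         status TEXT);
    INSERT INTO products VALUES (1, 'widget'), (2, 'gadget');
    -- 'gadget' has no orders at all
    INSERT INTO orders (product_id, status) VALUES
        (1, 'shipped'), (1, 'cancelled');
""")

# Filter in WHERE: the NULL-extended row for 'gadget' is discarded
where_version = conn.execute("""
    SELECT p.name, COUNT(o.id)
    FROM products p LEFT JOIN orders o ON p.id = o.product_id
    WHERE o.status != 'cancelled'
    GROUP BY p.id, p.name
""").fetchall()

# Filter in ON: 'gadget' survives with a count of 0
on_version = conn.execute("""
    SELECT p.name, COUNT(o.id)
    FROM products p LEFT JOIN orders o
        ON p.id = o.product_id AND o.status != 'cancelled'
    GROUP BY p.id, p.name
""").fetchall()

print(where_version)  # 'gadget' has vanished from the results
print(on_version)     # 'gadget' appears with a count of 0
```

Same tables, same filter condition; only its position changes the result set.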
Schema Design: Many-to-Many Relationships
A blog needs posts with tags. How do you model this?
Wrong approach (seen in production):
CREATE TABLE posts (
id SERIAL PRIMARY KEY,
title VARCHAR(200),
content TEXT,
tags VARCHAR(500) -- "python,sql,tutorial"
);
Storing comma-separated tags in a string creates multiple problems:
- Can't efficiently query "find all posts with tag X"
- Can't index individual tags
- Can't maintain referential integrity
- Can't prevent duplicate tags like "Python" vs "python"
Proper approach—a many-to-many relationship with a junction table:
CREATE TABLE tags (
id SERIAL PRIMARY KEY,
name VARCHAR(50) UNIQUE NOT NULL
);
CREATE TABLE post_tags (
post_id INTEGER REFERENCES posts(id) ON DELETE CASCADE,
tag_id INTEGER REFERENCES tags(id) ON DELETE CASCADE,
PRIMARY KEY (post_id, tag_id)
);
-- Index for finding all posts with a given tag
CREATE INDEX idx_post_tags_tag ON post_tags(tag_id);
Now you can efficiently query:
-- All posts with tag "python"
SELECT p.*
FROM posts p
JOIN post_tags pt ON p.id = pt.post_id
JOIN tags t ON pt.tag_id = t.id
WHERE t.name = 'python';
-- Tag counts
SELECT t.name, COUNT(pt.post_id) as post_count
FROM tags t
LEFT JOIN post_tags pt ON t.id = pt.tag_id
GROUP BY t.id, t.name;
The ON DELETE CASCADE ensures that deleting a post automatically removes its tag associations: referential integrity handled by the database, not application code.
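The whole pattern fits in a short sqlite3 sketch (note that SQLite needs foreign keys switched on per connection before ON DELETE CASCADE does anything; the post titles are made up for illustration):

```python
import sqlite3

# Junction-table schema from the article, sketched in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for CASCADE in SQLite
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags  (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE post_tags (
        post_id INTEGER REFERENCES posts(id) ON DELETE CASCADE,
        tag_id  INTEGER REFERENCES tags(id)  ON DELETE CASCADE,
        PRIMARY KEY (post_id, tag_id)
    );
    INSERT INTO posts VALUES (1, 'Intro to SQL'), (2, 'Indexing 101');
    INSERT INTO tags  VALUES (1, 'python'), (2, 'sql');
    INSERT INTO post_tags VALUES (1, 1), (1, 2), (2, 2);
""")

# Find all posts tagged 'sql' via the junction table
tagged = conn.execute("""
    SELECT p.title
    FROM posts p
    JOIN post_tags pt ON p.id = pt.post_id
    JOIN tags t ON pt.tag_id = t.id
    WHERE t.name = 'sql'
""").fetchall()

# Deleting a post cascades to its tag associations automatically
conn.execute("DELETE FROM posts WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM post_tags").fetchone()[0]
print(tagged, remaining)
```

No application code ever touches post_tags during the delete; the database keeps it consistent on its own.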
When Normalization Becomes a Problem
Database normalization is taught as a virtue. But in production, strict normalization sometimes causes problems.
Example: an e-commerce order system.
Fully normalized approach:
CREATE TABLE orders (
id INTEGER PRIMARY KEY,
user_id INTEGER REFERENCES users(id),
created_at TIMESTAMP
);
CREATE TABLE order_items (
order_id INTEGER REFERENCES orders(id),
product_id INTEGER REFERENCES products(id),
quantity INTEGER
-- No price stored here - reference products table
);
The problem: what happens when you change a product's price?
Suddenly, old orders display the NEW price. Revenue reports become wrong. Customers see discrepancies between their receipts and what the system shows.
The fix is intentional denormalization:
CREATE TABLE order_items (
order_id INTEGER REFERENCES orders(id),
product_id INTEGER REFERENCES products(id),
quantity INTEGER,
price_at_purchase DECIMAL(10,2), -- Snapshot the price
product_name VARCHAR(200) -- Snapshot the name too
);
Now order history is immutable. Changing product prices or names doesn't affect past orders. This violates Third Normal Form, but it solves a real business problem.
The decision to denormalize should be conscious, based on requirements—not accidental.
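A small sqlite3 sketch shows the snapshot doing its job (prices and IDs here are invented for illustration):

```python
import sqlite3

# Sketch of the price-snapshot idea: copy the product's current price
# into the order line at purchase time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE order_items (order_id INTEGER, product_id INTEGER,
                              quantity INTEGER, price_at_purchase REAL);
    INSERT INTO products VALUES (1, 'widget', 9.99);
""")

# At purchase time, snapshot the current price into the line item
conn.execute("""
    INSERT INTO order_items
    SELECT 100, id, 2, price FROM products WHERE id = 1
""")

# Later, the product's price changes...
conn.execute("UPDATE products SET price = 14.99 WHERE id = 1")

# ...but the historical order still reports what was actually paid
paid = conn.execute(
    "SELECT price_at_purchase FROM order_items WHERE order_id = 100"
).fetchone()[0]
print(paid)  # 9.99
```

Had the query joined back to products for the price, the same SELECT would now report 14.99 and every historical revenue report would silently drift.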
Transaction Safety and Race Conditions
Building a money transfer feature. User A sends $100 to User B.
First attempt:
UPDATE accounts SET balance = balance - 100 WHERE user_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE user_id = 2;
Dangerous. If the first UPDATE succeeds but the second fails (network issue, server crash), $100 disappears.
Better—use transactions for atomicity:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE user_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE user_id = 2;
COMMIT;
Both succeed or both fail. But there's still a problem: race conditions.
What if User A has $50, and two concurrent transfers both try to send $50? Both might check the balance, see $50, and proceed. Now the account is overdrawn.
The solution is row-level locking:
BEGIN;
-- Lock the row for this transaction
SELECT balance FROM accounts WHERE user_id = 1 FOR UPDATE;
-- Check sufficient funds (in application code)
IF balance >= 100 THEN
UPDATE accounts SET balance = balance - 100 WHERE user_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE user_id = 2;
COMMIT;
ELSE
ROLLBACK;
END IF;
FOR UPDATE locks the row: any other transaction that tries to lock or modify it (including another SELECT ... FOR UPDATE) must wait until this transaction commits or rolls back, while ordinary reads are unaffected. The second concurrent transfer therefore blocks, and when it finally acquires the lock it sees the already-reduced balance. This eliminates the race condition.
Knowing about locking separates developers who've worked with concurrent writes from those who've only built single-user applications.
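The atomicity half of this is easy to demonstrate with sqlite3. (SQLite locks the whole database rather than single rows, so there is no FOR UPDATE here; this sketch shows only the all-or-nothing rollback behavior.)

```python
import sqlite3

# Simulated transfer that fails between the two UPDATEs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (user_id INTEGER PRIMARY KEY, balance REAL);
    INSERT INTO accounts VALUES (1, 500.0), (2, 0.0);
""")

try:
    conn.execute(
        "UPDATE accounts SET balance = balance - 100 WHERE user_id = 1")
    # Simulate a crash between the two UPDATEs (the credit never runs)
    raise RuntimeError("server died mid-transfer")
    conn.execute(
        "UPDATE accounts SET balance = balance + 100 WHERE user_id = 2")
    conn.commit()
except RuntimeError:
    conn.rollback()  # the debit is undone too

balances = dict(conn.execute("SELECT user_id, balance FROM accounts"))
print(balances)  # no money vanished: both balances are untouched
```

Without the rollback, user 1 would be down $100 with user 2 no richer, which is exactly the failure mode the unwrapped pair of UPDATEs allows.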
Query Performance: The Cartesian Explosion
This query looks reasonable:
SELECT u.username,
COUNT(p.id) as post_count,
COUNT(c.id) as comment_count
FROM users u
LEFT JOIN posts p ON u.id = p.user_id
LEFT JOIN comments c ON u.id = c.user_id
GROUP BY u.id, u.username;
But on a real database, it's slow. Why?
Cartesian explosion. If a user has 10 posts and 100 comments, the JOIN creates 1,000 intermediate rows (10 × 100) before aggregation. For users with hundreds of posts and thousands of comments, this becomes millions of intermediate rows. Worse, the counts come out wrong: COUNT(p.id) counts each post once per comment, so both numbers are inflated to 1,000 unless you pay extra for COUNT(DISTINCT ...).
The fix is to aggregate separately and then join:
SELECT
u.username,
COALESCE(p.post_count, 0) as post_count,
COALESCE(c.comment_count, 0) as comment_count
FROM users u
LEFT JOIN (
SELECT user_id, COUNT(*) as post_count
FROM posts
GROUP BY user_id
) p ON u.id = p.user_id
LEFT JOIN (
SELECT user_id, COUNT(*) as comment_count
FROM comments
GROUP BY user_id
) c ON u.id = c.user_id;
Each subquery aggregates independently—no Cartesian product. A query that took 30 seconds might now run in 200ms.
This pattern appears whenever you're joining multiple one-to-many relationships.
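The explosion is visible even at toy scale. In this sqlite3 sketch, one user with 3 posts and 4 comments produces 12 intermediate rows, so the naive query reports both counts as 12 while the subquery version gets them right:

```python
import sqlite3

# One user, 3 posts, 4 comments: the naive double LEFT JOIN builds
# 3 x 4 = 12 intermediate rows before aggregating.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users    (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE posts    (id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users VALUES (1, 'alice');
    INSERT INTO posts    (user_id) VALUES (1), (1), (1);
    INSERT INTO comments (user_id) VALUES (1), (1), (1), (1);
""")

naive = conn.execute("""
    SELECT u.username, COUNT(p.id), COUNT(c.id)
    FROM users u
    LEFT JOIN posts p ON u.id = p.user_id
    LEFT JOIN comments c ON u.id = c.user_id
    GROUP BY u.id, u.username
""").fetchone()

fixed = conn.execute("""
    SELECT u.username,
           COALESCE(p.post_count, 0), COALESCE(c.comment_count, 0)
    FROM users u
    LEFT JOIN (SELECT user_id, COUNT(*) AS post_count
               FROM posts GROUP BY user_id) p ON u.id = p.user_id
    LEFT JOIN (SELECT user_id, COUNT(*) AS comment_count
               FROM comments GROUP BY user_id) c ON u.id = c.user_id
""").fetchone()

print(naive)  # ('alice', 12, 12) -- both counts inflated
print(fixed)  # ('alice', 3, 4)
```

At 3 × 4 rows the bug is merely wrong; at 100 × 1,000 rows per user it is wrong and slow.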
Common Mistakes That Indicate Inexperience
Storing dates as strings:
created_at VARCHAR(50) -- "2025-01-15"
Can't use date functions, can't sort properly, different formats cause chaos.
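A two-line Python illustration of the sorting problem (the sample values are invented; nothing in a VARCHAR column stops them from coexisting):

```python
from datetime import date

# Lexicographic order only matches chronological order if every value
# uses one zero-padded format -- which a string column doesn't enforce.
dates_as_strings = ["2025-1-5", "2025-01-15", "01/20/2025"]
print(sorted(dates_as_strings))  # Jan 20 sorts first, Jan 5 sorts last

dates_as_dates = [date(2025, 1, 5), date(2025, 1, 15), date(2025, 1, 20)]
print(sorted(dates_as_dates))  # chronological, always
```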
No foreign key constraints: The logic: "We'll handle referential integrity in application code." Reality: someone writes a bug, orphaned records accumulate, data integrity degrades.
SELECT * everywhere: Fetching all columns when you only need two wastes network bandwidth, prevents covering index optimization, and breaks queries if someone adds a large BLOB column later.
Premature optimization: Adding 15 indexes to a table with 1,000 rows because "indexes make things fast." The indexes consume more space than the data and slow down writes.
These aren't syntax errors—they're judgment errors.
How to Build Real Understanding
Work with real data volumes: Load millions of rows into a local database. Write queries. Watch them slow down. Add indexes. Measure the difference. This teaches you what actually matters.
Study execution plans: Run EXPLAIN ANALYZE on every query. Learn to read the output. Understand the difference between Index Scan, Sequential Scan, Bitmap Heap Scan, and Nested Loop.
Break things intentionally: Remove an index from a production-like dataset and measure impact. Create Cartesian products. Then optimize. You'll remember lessons from debugging more than from tutorials.
Read production schemas: Look at open-source projects' database migrations. See how they handle versioning, constraints, and indexes. Projects like Discourse and GitLab have well-designed schemas to study.
What Separates Levels
Junior developers know SQL syntax. Senior developers understand database behavior.
The difference shows when debugging production slowness. Faced with timeouts, do you:
- Immediately suggest caching? (Treating symptoms)
- Start rewriting without understanding? (Guessing)
- Check slow query logs and execution plans? (Systematic)
The best developers explain trade-offs: "I'd use JSONB for user preferences instead of a normalized table. It's denormalized, but preferences are always fetched with the user, so a JOIN would add latency for no benefit."
They think about data volume: "How many rows are we expecting? That changes the index strategy."
They've debugged real problems: "I'd check if the slowness is consistent or spike-based. Consistent suggests missing indexes. Spikes might indicate locking or connection pool exhaustion."
This experience comes from building systems, watching them slow down under load, and fixing the problems—not from reading documentation.
Practice realistic scenarios on Vibe Interviews, but remember: the real learning happens when you load a million rows and your query takes 30 seconds.
Vibe Interviews Team
Part of the Vibe Interviews team, dedicated to helping job seekers ace their interviews and land their dream roles.