Chapter 4

Databases — Where Everything Is Remembered

Close your eyes and imagine the moment you tap "Place Order" on Swiggy.

In that instant, the system needs to remember: what you ordered, which restaurant it's from, your delivery address, your payment method, the price you agreed to, and the timestamp. It needs to remember all of this permanently — not for a few seconds, not until the app closes, but forever. Because next week, when you look at "Order History," every detail must still be there. And if there's a billing dispute six months later, the data must match to the last rupee.

That's what a database does. It remembers things that matter, reliably, for as long as you need them.

Why Databases Exist

Let's start with why we need them at all, because your computer already has ways to store things.

Analogy

Think of your computer's memory (RAM — Random Access Memory) as a whiteboard in a meeting room. During the meeting, people write ideas on it — it's fast, visible, everyone can access it. But when the meeting ends and someone flips the power switch off? Everything on that whiteboard is gone. RAM is like that: it's where your computer keeps things it's actively working on, and it's blazingly fast. But it's volatile — turn off the power, and everything disappears.

A database is the filing cabinet in the back office. It's slower to walk there and pull out a folder, but once a document goes into that cabinet, it stays there. Through power outages, through restarts, through years of operation. The filing cabinet is persistent.

Every running application uses RAM to hold the data it's working with right now. The Swiggy app on your phone keeps your current cart in RAM — that's why it loads instantly when you switch back to the app. But the moment you place that order, the data moves from the whiteboard (RAM) to the filing cabinet (database). It has to. Because if Swiggy's server restarted and your order vanished, you'd never use Swiggy again.

Real-Life Example

When Instagram crashed for approximately six hours on October 4, 2021 (along with Facebook and WhatsApp), users were terrified their photos were gone forever. They weren't. Every photo, every comment, every DM was sitting safely in Meta's databases. The crash was a network routing issue — the servers couldn't be reached. But the filing cabinets were intact. The distinction between "the app is down" and "the data is lost" is the distinction between RAM and database.

Why This Exists

Databases exist because human memory is imperfect and machines restart. Every business, from a chai stall keeping a ledger of who owes money, to a bank tracking billions of transactions, solves the same problem: how do I remember things reliably? Software databases are digital ledgers. The sophistication varies — a corner shop uses a notebook, HDFC Bank uses Oracle — but the core need is identical.

Here's the technical reality: a database is a specialized piece of software that runs on a server (or many servers) and does four things exceptionally well:

  1. Store data reliably on disk (permanent storage)
  2. Retrieve specific data quickly, even from billions of records
  3. Update data without corrupting what's already there
  4. Delete data cleanly, without leaving orphaned references

These four operations are so fundamental that they have a name: CRUD — Create, Read, Update, Delete. Every app you've ever used, from Zomato to Zerodha to WhatsApp, is a CRUD application at its core with a nice interface on top.

SQL Databases: The Spreadsheet with Strict Rules

The most common type of database you'll encounter is a relational database, and the language used to talk to it is called SQL (Structured Query Language — pronounced "sequel" by most engineers, "S-Q-L" by the rest).

Analogy

Imagine a perfectly organized Google Sheet. Each tab is a table. Each row is a record (one person, one order, one transaction). Each column is a field (name, email, amount). And there are strict rules: you can't put a name in the "email" column, you can't leave the "amount" column blank, and every row must have a unique ID number. A SQL database is this — a collection of organized tables with enforced rules.

Let's make this concrete. Imagine you're building Instagram. You need to track who follows whom. Here's how that looks as a database table:

┌─────────────────────────────────────────────────────────────────────┐
│                        TABLE: follows                               │
├──────────┬──────────────┬──────────────┬───────────────────────────┤
│  id      │ follower_id  │ following_id │ created_at                │
├──────────┼──────────────┼──────────────┼───────────────────────────┤
│  1       │ 4821 (Rahul) │ 1092 (Virat) │ 2025-01-15 09:32:11      │
│  2       │ 4821 (Rahul) │ 7734 (Priya) │ 2025-01-15 09:33:45      │
│  3       │ 7734 (Priya) │ 1092 (Virat) │ 2025-02-03 14:12:08      │
│  4       │ 1092 (Virat) │ 4821 (Rahul) │ 2025-03-21 18:45:33      │
│  5       │ 7734 (Priya) │ 4821 (Rahul) │ 2025-04-10 11:09:22      │
└──────────┴──────────────┴──────────────┴───────────────────────────┘

Reading this table:
  → Row 1: Rahul follows Virat
  → Row 4: Virat follows Rahul back (mutual follow)
  → Row 5: Priya follows Rahul
  → Rahul follows 2 people. Virat follows 1. Priya follows 2.

Now, a separate table stores the actual user information:

┌──────────────────────────────────────────────────────────────────┐
│                        TABLE: users                              │
├──────────┬────────────┬──────────────────────┬───────────────────┤
│  id      │ username   │ email                │ created_at        │
├──────────┼────────────┼──────────────────────┼───────────────────┤
│  1092    │ virat.k    │ virat@email.com      │ 2024-06-01        │
│  4821    │ rahul_s    │ rahul@email.com      │ 2024-09-12        │
│  7734    │ priya.m    │ priya@email.com      │ 2024-11-28        │
└──────────┴────────────┴──────────────────────┴───────────────────┘

Notice something: the follows table doesn't store usernames — it stores follower_id and following_id, which are numbers that correspond to the id column in the users table. This connection between tables is called a relationship (hence "relational database"), and the connecting column (id) is called a key.

Why not store the username directly in the follows table? Because if Rahul changes his username from rahul_s to rahul_official, you'd have to update every single row in the follows table. With IDs, you update exactly one row in the users table, and every reference stays correct automatically.

JOINs: Connecting the Pieces

To get something useful — like "show me the usernames of everyone Rahul follows" — you need to combine data from both tables. In SQL, this is called a JOIN (exactly what it sounds like: joining two tables together based on a shared value).

-- "Give me the usernames of everyone that user 4821 (Rahul) follows"
SELECT users.username
FROM follows
JOIN users ON follows.following_id = users.id
WHERE follows.follower_id = 4821;

Let's read this in plain English:

  1. SELECT users.username — "I want the username column from the users table"
  2. FROM follows — "Start by looking at the follows table"
  3. JOIN users ON follows.following_id = users.id — "Connect each row to the users table by matching following_id to id"
  4. WHERE follows.follower_id = 4821 — "But only for rows where Rahul is the follower"

The result: virat.k, priya.m. Rahul follows Virat and Priya.

Why This Exists

SQL reads almost like English once you get the pattern. SELECT is "give me." FROM is "look in." WHERE is "but only when." JOIN is "also bring in." Every SQL query is a question asked in a structured way. "Give me the username from the follows+users tables but only when the follower is Rahul." That's what the query above says.

Real Instagram has over 2 billion users and hundreds of billions of follow relationships. The query structure is identical — the database handles the scale.

The Major SQL Databases

You'll encounter these names repeatedly:

  • PostgreSQL (often called "Postgres") — the most popular open-source relational database. Powers Supabase, used by Apple, Instagram, Spotify. Known for reliability and rich features.
  • MySQL — another major open-source option. Powers much of the web, including Wikipedia and (historically) Facebook.
  • SQLite — a lightweight database that runs inside your application without a separate server. Every iPhone and Android phone has dozens of SQLite databases running right now.
  • Oracle — enterprise database used by banks, governments, and large corporations. Expensive, powerful, and the source of many enterprise IT memes.

For this book, we'll use PostgreSQL through Supabase — a platform that gives you a production-grade Postgres database with a visual interface, API access, and a generous free tier.

NoSQL: The Messy Notebook

SQL databases are excellent when your data is structured and predictable — users have names and emails, orders have items and prices, transactions have amounts and dates. But what happens when your data is... complicated?

Analogy

A SQL database is like a library with rigid organization: every book has a catalog number, a shelf position, a genre classification. A NoSQL database is more like a creative person's notebook — some pages have text, some have sketches, some have both, some have Post-it notes stuck to them. There's no fixed format, and that's the point. Some information doesn't fit neatly into rows and columns.

Consider a food delivery app. A restaurant's menu looks different from restaurant to restaurant. A pizza place has sizes and toppings. A biryani joint has serving sizes. A juice bar has add-ons. If you tried to force all of this into one rigid SQL table, you'd end up with columns like topping_1, topping_2, topping_3... topping_15 — most of them empty for non-pizza restaurants. It's wasteful and ugly.

A NoSQL database (which confusingly stands for "Not Only SQL," not "No SQL") lets you store each restaurant's menu in its own format. Here's what a MongoDB document looks like — it's structured, but flexible:

{
  "restaurant": "Pizza Paradise",
  "menu": [
    {
      "item": "Margherita",
      "sizes": ["small", "medium", "large"],
      "toppings": ["mushroom", "olive", "jalapeño", "extra cheese"],
      "base_price": 299
    }
  ]
}

And a completely different structure for another restaurant:

{
  "restaurant": "Biryani Blues",
  "menu": [
    {
      "item": "Hyderabadi Chicken Biryani",
      "serves": "1-2 people",
      "spice_level": ["mild", "medium", "hot"],
      "price": 349,
      "add_ons": ["raita", "mirchi ka salan"]
    }
  ]
}

Same database, same collection, different shapes. NoSQL is designed for this flexibility.

Types of NoSQL Databases

NoSQL isn't one thing — it's a family of approaches, each optimized for different problems:

Document databases (MongoDB) — Store data as JSON-like documents. Each document can have a different structure. Best for: catalogs, content management, user profiles with varying fields. MongoDB is the most well-known, used by companies like Forbes, Toyota, and the UK Government.

Key-Value stores (Redis) — The conceptual equivalent of a dictionary: look up a key, get a value. Blazingly fast because they keep data in RAM. Best for: caching, session storage, leaderboards. Redis can handle millions of operations per second. When you see "online" status updates in WhatsApp or Discord, that's often backed by Redis.

Wide-column stores (Cassandra) — Designed for massive scale across many servers. Best for: time-series data, IoT sensor data, activity logs. Netflix uses Cassandra to track everything you watch — hundreds of billions of events across tens of thousands of servers.

Graph databases (Neo4j) — Store data as nodes and relationships, optimized for traversing connections. Best for: social networks, recommendation engines, fraud detection. LinkedIn uses a graph database to power "People You May Know."

SQL vs NoSQL: When to Use What

This is one of the most debated topics in software engineering, and the honest answer is: it depends on what you're building.

┌─────────────────────────────────────────────────────────────────────────┐
│                    SQL vs NoSQL — Decision Guide                        │
├─────────────────────┬────────────────────┬──────────────────────────────┤
│                     │ SQL (Relational)   │ NoSQL (Non-Relational)       │
├─────────────────────┼────────────────────┼──────────────────────────────┤
│ Data structure      │ Fixed schema       │ Flexible schema              │
│                     │ (columns defined   │ (each record can have        │
│                     │ upfront)           │ different fields)            │
├─────────────────────┼────────────────────┼──────────────────────────────┤
│ Relationships       │ Excellent — built  │ Possible but awkward —       │
│ between data        │ for JOINs across   │ often requires duplicating   │
│                     │ tables             │ data                         │
├─────────────────────┼────────────────────┼──────────────────────────────┤
│ Consistency         │ Strong — data is   │ Eventual — data may take     │
│                     │ always accurate    │ milliseconds to sync         │
│                     │ after a write      │ across servers               │
├─────────────────────┼────────────────────┼──────────────────────────────┤
│ Scale               │ Vertical (bigger   │ Horizontal (more servers)    │
│                     │ server)            │                              │
├─────────────────────┼────────────────────┼──────────────────────────────┤
│ Best for            │ Financial data,    │ Real-time feeds, IoT,        │
│                     │ user accounts,     │ catalogs, chat messages,     │
│                     │ e-commerce orders  │ gaming leaderboards          │
├─────────────────────┼────────────────────┼──────────────────────────────┤
│ Example products    │ PostgreSQL, MySQL  │ MongoDB, Redis, Cassandra,   │
│                     │ SQLite, Oracle     │ DynamoDB, Neo4j              │
├─────────────────────┼────────────────────┼──────────────────────────────┤
│ Real-world user     │ Zerodha, Stripe,   │ Netflix, Uber (driver       │
│                     │ Instagram,         │ locations), Discord,         │
│                     │ most banks         │ Swiggy (catalogs)            │
└─────────────────────┴────────────────────┴──────────────────────────────┘
Real-Life Example

Banks use SQL databases because when you transfer ₹10,000 from your account to someone else's, that ₹10,000 must leave your account and arrive in theirs — atomically, precisely, with zero ambiguity. SQL databases are built for this. Chat applications like Discord lean toward NoSQL because speed and flexibility matter more than strict consistency — if your message takes 50ms to appear on a friend's screen instead of being perfectly synchronized, nobody notices.

The real-world truth? Most production systems use both. Uber stores ride records in a SQL database (because billing must be exact) but tracks live driver locations in a NoSQL store (because location updates happen every second and need to be fast, not perfectly consistent).

Vector Databases: The New Paradigm

Everything we've discussed so far deals with exact data — a number, a string, a date. You ask for user #4821 and you get user #4821. But what if you want to search by meaning?

Analogy

Imagine a library. Traditional databases are like the Dewey Decimal System — every book has an exact category number. You look up 636.7 and you find books about dogs. Precise, organized, and completely useless if your question is "I want a book that feels like the one I read last summer — you know, that cozy story about finding yourself in a small town."

A vector database is a library organized by vibes. Books that feel similar sit on the same shelf, even if they're technically different genres. A cozy small-town romance sits near a heartwarming family drama, which sits near a gentle comedy about community — because they share an emotional texture, even though their Dewey Decimal numbers would place them in completely different sections.

How does this work technically? Through embeddings — a way to represent any piece of content (a sentence, a paragraph, an image) as a list of numbers that captures its meaning. An embedding model (a type of AI) converts text into these number lists, called vectors.

The sentence "I love playing cricket on weekends" might become a vector like [0.23, -0.41, 0.87, 0.12, ...] — hundreds or thousands of numbers. The sentence "Weekend cricket matches are my favorite hobby" would produce a very similar vector, even though the words are different, because the meaning is similar.

A vector database stores these vectors and can answer questions like: "Find me the 10 documents most similar to this query." This is what powers:

  • Semantic search — Google understands "best laptop for students" and "affordable computer for college" mean roughly the same thing
  • RAG (Retrieval-Augmented Generation) — AI chatbots that search your company's documents to answer questions accurately (covered in depth in Chapter 13)
  • Recommendation engines — Spotify finding songs that "feel like" what you've been listening to
  • Image search — Google Photos letting you search "sunset at beach" and finding your actual sunset photos

Major vector databases include Pinecone, Weaviate, Qdrant, Chroma (lightweight, good for prototyping), and pgvector (a PostgreSQL extension that adds vector capabilities to your existing Postgres database — increasingly popular because you don't need a separate database).

Why This Exists

Vector databases became essential because of the AI revolution. Before large language models (LLMs), search was about keywords — does the document contain the word "cricket"? With embeddings, search is about meaning — does the document relate to cricket, even if it never uses that word? This shift from keyword matching to semantic understanding is one of the most significant changes in how software works, and vector databases are the infrastructure that makes it possible.

ACID: Why Your Bank Transfer Doesn't Lose Money

Let's say you transfer ₹5,000 from your HDFC account to your friend's ICICI account. This requires two operations:

  1. Subtract ₹5,000 from your account
  2. Add ₹5,000 to your friend's account

What if the server crashes after step 1 but before step 2? Your money is gone from your account but never arrived in your friend's. ₹5,000 vanished into thin air.

This cannot happen. And the reason it cannot happen is because of ACID — a set of four guarantees that relational databases provide:

Analogy

Think of a bank transfer as passing a cricket ball between two players. ACID says: either the ball leaves Player A's hand AND arrives in Player B's hand, or it stays in Player A's hand. It never disappears mid-air. It never duplicates so both players have one. It never corrupts into a tennis ball mid-flight. The throw is a single atomic event — it either completes fully or doesn't happen at all.

Atomicity — "All or nothing." Either the entire transaction succeeds (money leaves your account AND arrives in your friend's) or the entire transaction fails (nothing changes). There is no in-between state. If anything goes wrong at any step, the database rolls back everything — restoring the exact state before the transaction started.

Consistency — "The rules are always enforced." If your account has ₹3,000 and you try to transfer ₹5,000, the database rejects the entire transaction because it would violate the "balance cannot be negative" rule. The database moves from one valid state to another valid state — never through an invalid one.

Isolation — "Transactions don't interfere with each other." If you and your friend both transfer money to a third person at the same time, each transaction sees a consistent view of the data. They don't step on each other. It's as if each transaction has the database to itself, even though thousands of transactions are happening simultaneously.

Durability — "Once confirmed, it's permanent." When the bank says "Transfer successful," that data is written to permanent storage. Even if the server catches fire one second later, your transfer is recorded. This is why databases write to disk (the filing cabinet), not to RAM (the whiteboard).

Real-Life Example

ACID compliance is why the Reserve Bank of India mandates that all banking transactions must use ACID-compliant databases. It's not a preference — it's a regulatory requirement. When your UPI payment shows "Success" on PhonePe, ACID guarantees that the money has durably, atomically, consistently, and in isolation been moved. That green checkmark is backed by decades of database engineering.

NoSQL databases traditionally relaxed some ACID properties in exchange for speed and scalability — a tradeoff formalized as BASE (Basically Available, Soft state, Eventually consistent). However, this is changing: MongoDB added multi-document ACID transactions in version 4.0, and many modern NoSQL databases offer configurable consistency levels.

Indexing: The Textbook Index

Your database now holds millions of records. A user searches for their order. How does the database find it?

Analogy

Imagine a 600-page textbook with no index at the back. Someone asks you, "Where does it talk about photosynthesis?" Your only option: start at page 1, read every page, and check if photosynthesis is mentioned. That's called a full table scan — the database equivalent of reading every page.

Now imagine the textbook has an index at the back: "Photosynthesis: pages 142, 287, 401." You go straight there. That's a database index — a separate data structure that tells the database exactly where to find things.

Here's what the difference looks like in practice:

WITHOUT INDEX (Full Table Scan):
┌─────────────────────────────────────────────┐
│ Looking for order #78432...                  │
│                                              │
│ Check row 1......... not it                  │
│ Check row 2......... not it                  │
│ Check row 3......... not it                  │
│ ...                                          │
│ Check row 78,431.... not it                  │
│ Check row 78,432.... FOUND IT!               │
│ ...but keeps checking all remaining rows     │
│ Check row 78,433.... not it                  │
│ ...                                          │
│ Check row 2,000,000.. not it. Done.          │
│                                              │
│ Scanned: 2,000,000 rows                     │
│ Time: ~850ms                                 │
└─────────────────────────────────────────────┘

WITH INDEX (B-Tree Lookup):
┌─────────────────────────────────────────────┐
│ Looking for order #78432...                  │
│                                              │
│ Index says: "78432 is at disk position 4471" │
│ Jump directly to row → FOUND IT!             │
│                                              │
│ Scanned: 3 index levels + 1 row = 4 lookups │
│ Time: ~2ms                                   │
└─────────────────────────────────────────────┘

The most common index type is a B-tree (balanced tree), which works like a phone book: instead of reading every entry, you narrow down by halving the search space at each level. Looking for "Shravan" in a phone book? You open to the middle — is S before or after M? After. Open to the middle of the second half — before or after T? Before. Keep halving until you find it. A B-tree does the same thing, and it can find any record in a table of one billion rows in roughly 30 comparisons.

Why not index everything? Because indexes have costs:

  • Storage: Each index takes up disk space. A table with 10 indexes essentially stores parts of that data 10 extra times.
  • Write speed: Every time you insert, update, or delete a row, the database must update every affected index. More indexes = slower writes.
  • Maintenance: Indexes can become fragmented over time and need periodic maintenance.

The rule of thumb: index columns that you frequently search by (WHERE clauses) or sort by (ORDER BY). Don't index columns that rarely appear in queries. In production systems, identifying which indexes to add (and which to remove) is one of the highest-impact performance optimizations a team can make.

Real-Life Example

Amazon's product search indexes over 350 million products. Without indexes, each search would take minutes. With carefully designed indexes, results appear in under 100 milliseconds. The engineering team behind Amazon's search has said that adding the right index has sometimes improved query performance by 10,000x — the difference between a 10-second page load and a 1-millisecond response.

Caching: The Sticky Note on Your Fridge

Even with indexes, databases are relatively slow compared to RAM. A database query hits disk storage — fast by human standards, but slow by computer standards. What if you could avoid the database entirely for frequently requested data?

Analogy

Imagine you keep your grocery list in a filing cabinet in the basement. Every time you want to check what you need, you walk downstairs, open the cabinet, find the list, read it, walk back upstairs. It works, but it's slow.

Now imagine you write the most important items on a sticky note and put it on the fridge. Need to check if you need milk? Glance at the fridge. Don't need to go to the basement at all. The sticky note is a cache — a fast, temporary copy of frequently needed data, stored closer to where it's used.

In software, the most popular caching tool is Redis (Remote Dictionary Server). Redis stores data in RAM, which is roughly 100x faster than reading from a database on disk. When an application needs data it fetches frequently, the flow looks like this:

User Request: "Show me the trending products"

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│             │     │             │     │             │
│   App       │────▶│   Redis     │     │  Database   │
│   Server    │     │   (Cache)   │     │ (PostgreSQL)│
│             │◀────│             │     │             │
│             │     └─────────────┘     └─────────────┘
│             │                                │
│             │   Cache MISS (data not found)   │
│             │────────────────────────────────▶│
│             │◀────────────────────────────────│
│             │     Store result in Redis        │
│             │────▶┌─────────────┐              │
│             │     │   Redis     │              │
│             │     │  (updated)  │              │
└─────────────┘     └─────────────┘     └─────────────┘

Step 1: App checks Redis → "Do I have 'trending products' cached?"
Step 2a: Cache HIT → Data is in Redis → Return it instantly (~1ms)
Step 2b: Cache MISS → Data is NOT in Redis → Query the database (~50ms)
Step 3: Store the database result in Redis for next time
Step 4: Set an expiry (e.g., "forget this in 5 minutes")

Cache HIT: The data was in Redis. Response time: 1-2 milliseconds.

Cache MISS: The data wasn't in Redis (first request, or the cache expired). The app queries the database, gets the result, stores it in Redis for next time, and returns it. Response time: 50-200 milliseconds.

For frequently accessed data — trending posts, product pages, user profiles — the cache hit rate can exceed 95%, meaning the database only gets queried 5% of the time. This has a massive impact on performance and cost.

The tricky part: cache invalidation — knowing when the sticky note is out of date. If the trending products change but the cache still holds the old list, users see stale data. This is such a notoriously difficult problem that Phil Karlton (a Netscape engineer) famously said: "There are only two hard things in Computer Science: cache invalidation and naming things."

Common strategies:

  • TTL (Time To Live): "Expire this cache entry after 5 minutes." Simple and effective for data that doesn't need to be perfectly real-time.
  • Write-through: When the database is updated, immediately update the cache too.
  • Cache-aside: The application manages the cache manually — checking it first, falling back to the database, and populating the cache on miss (this is the flow shown above).
Real-Life Example

Twitter (X) uses Redis to cache user timelines. When you open the app, you see tweets within milliseconds — they're coming from Redis, not from a database query that JOINs across millions of follow relationships and billions of tweets. The database is the source of truth; Redis is the fast-access copy that makes the experience feel instant.

Real-World Database Systems

Let's look at how real companies solve database challenges at scale.

Zerodha: 15M+ Orders Per Day on PostgreSQL

Zerodha, India's largest stockbroker, processes over 15 million orders on peak trading days — all running on PostgreSQL. Not Oracle. Not a proprietary database. Open-source PostgreSQL. Their CTO, Kailash Nadh, has spoken extensively about their architecture: they use PostgreSQL for trade data, Redis for caching real-time market data, and keep the system remarkably simple for the scale they handle.

Why does this matter? Because it proves that PostgreSQL can handle extraordinary scale when used well. Zerodha's team focuses on correct indexing, efficient queries, and smart caching rather than throwing expensive hardware or exotic databases at the problem. The lesson: the database you choose matters far less than how well you use it.

Google Spanner: Globally Distributed SQL

Google's Spanner is one of the most technically ambitious databases ever built. It's a globally distributed SQL database — meaning it stores data across data centers on multiple continents while maintaining ACID compliance. This is extremely hard because of the speed of light: if a user in Mumbai writes data and a user in New York reads it 5 milliseconds later, has the write propagated yet?

Spanner solves this with atomic clocks and GPS receivers in every data center, achieving a property called external consistency — every transaction appears to happen at a specific point in time, globally. Google uses Spanner for Google Ads (which handles over 10 million queries per second) and Google Play.

Most of us will never need Spanner. But understanding that it exists reveals the outer boundary of what databases can do: planetary-scale, fully ACID-compliant, with single-digit-millisecond latency. Google Cloud offers Spanner as a service, though at significant cost (starting around $0.90/node/hour).

Supabase: PostgreSQL for the Rest of Us

Supabase is an open-source platform built on top of PostgreSQL that makes database management accessible to anyone who can use a web interface. It provides:

  • A visual table editor (create tables by clicking, like a spreadsheet)
  • Auto-generated REST and GraphQL APIs (your frontend can read/write to the database without you writing backend code)
  • Built-in authentication (user signup/login)
  • Row Level Security (RLS — rules that control which users can see which rows)
  • Real-time subscriptions (get notified instantly when data changes)
  • A generous free tier (two projects, 500MB database, 50,000 monthly active users)

Supabase has become the default choice for indie hackers, startups, and increasingly larger companies. It's what we use for projects throughout this book, including the dashboard you'll build in Part III.

Real-Life Example

Supabase serves over 1 million databases as of 2025. Companies like Mozilla, PwC, and Humata use it in production. Its popularity exploded because it removed the biggest barrier to using PostgreSQL — the operational complexity of setting up, securing, and managing a database server. With Supabase, you get a production-grade Postgres database in under 60 seconds, accessible through a dashboard that feels like editing a spreadsheet.

Martin Kleppmann, in Designing Data-Intensive Applications (widely considered the best book on database systems for practitioners), makes a point that applies here: the choice of database is not about which technology is "best" in the abstract. It's about which tradeoffs align with your application's specific needs. "There is no single tool that can serve all use cases well," he writes. "The task of software engineering is to understand the tradeoffs and choose wisely."[1]

Andy Pavlo, a database researcher at Carnegie Mellon University who maintains the influential CMU Database Group YouTube channel, adds a practical dimension: "The best database is the one your team already knows how to operate. An expertly tuned PostgreSQL will outperform a poorly configured specialized database every time."[2]

Putting It All Together

Here's how a real application uses everything we've covered in this chapter:

Imagine a user opens Zerodha at 9:15 AM on a Monday. The opening bell has rung.

  1. The user's session (who they are, their portfolio) is fetched from Redis (cache) — sub-millisecond.
  2. Live stock prices stream from a time-series data store — these are updated hundreds of times per second and don't need ACID compliance.
  3. The user places a buy order for 10 shares of Reliance. This hits the PostgreSQL database — ACID-compliant, indexed on order ID and user ID, with a transaction that atomically creates the order record and updates the user's available margin.
  4. The portfolio view runs a SQL JOIN across the orders table, holdings table, and current prices to show the user their total portfolio value.
  5. The "most traded stocks" section on the dashboard is served from Redis — computed every 30 seconds and cached, because showing data from 20 seconds ago is acceptable here.

Every concept in this chapter — SQL, NoSQL, caching, indexing, ACID — is at play in a single screen of one application. Understanding these concepts doesn't mean you'll build Zerodha tomorrow. It means that when your engineering team says "we need to add a Redis cache in front of this endpoint" or "this query needs an index on the created_at column," you'll know exactly what they mean and why it matters.

Try It Yourself — Build a Database in 10 Minutes

Let's move from theory to practice. You're going to create a real database, add data to it, and query it.

Step 1: Create a Supabase account

Go to supabase.com and sign up with your GitHub account (you created one in Chapter 0.2). Click "New Project." Give it a name like builders-bible. Choose a region close to you. Set a database password (save it somewhere — you'll need it later). Click "Create new project."

Wait about 60 seconds for your database to provision. You now have a production-grade PostgreSQL database running on the cloud.

Step 2: Create a table

Click "Table Editor" in the left sidebar. Click "Create a new table." Name it bookmarks. Add these columns:

  • id — type: int8, primary key, is identity (these are set by default)
  • created_at — type: timestamptz, default value: now() (set by default)
  • title — type: text
  • url — type: text
  • notes — type: text, nullable (allow empty)

Click "Save." You've created a database table.

Step 3: Insert data using the UI

Click "Insert row" and add 3-4 bookmarks — articles, videos, tools you find useful. Fill in the title, URL, and optionally notes for each.

Step 4: Query with Claude Code

Open your terminal, navigate to your exercises folder, and start Claude Code:

cd ~/Desktop/builders-bible-exercises
mkdir database-exercise && cd database-exercise
claude

Ask Claude Code:

Connect to my Supabase project and query all bookmarks.
My Supabase URL is [your-project-url] and my anon key is [your-anon-key].
(Find both in Supabase → Settings → API)
Use the Supabase JavaScript client.

Claude Code will create a script that connects to your database and fetches your bookmarks. Run it and see your data come back.

Step 5: Try a filtered query

Ask Claude Code:

Now query only bookmarks where the title contains the word "AI".
Also sort them by created_at, newest first.

You've now used a real database with a WHERE clause and ORDER BY — the same fundamentals used by every application on the internet.

Bonus: Ask Claude Code to add a new bookmark by inserting a row programmatically, then query again to see it appear.

What You Now Know

Databases are where software remembers. In this chapter, you learned:

  1. Why databases exist — RAM is volatile (the whiteboard); databases are persistent (the filing cabinet)
  2. SQL databases — structured tables with strict rules, connected by relationships and queried with JOINs
  3. NoSQL databases — flexible, schema-less stores in four flavors: document, key-value, wide-column, and graph
  4. When to use each — SQL for consistency-critical data (money), NoSQL for speed and flexibility (real-time feeds)
  5. Vector databases — organize data by meaning, not by exact values, powering AI search and RAG
  6. ACID properties — the guarantees that prevent your bank transfer from losing money
  7. Indexing — the textbook index that turns a 2-million-row scan into 4 lookups
  8. Caching — the sticky note on your fridge that prevents unnecessary trips to the filing cabinet
  9. Real systems — Zerodha runs 15M+ orders on PostgreSQL, Google Spanner spans the globe, Supabase makes it all accessible

In Chapter 5, we'll learn about the system that tracks every change you (or your AI) make to your code — Git, the version control system that made collaboration at scale possible.

Chapter endnotes

[1] Martin Kleppmann, Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems (O'Reilly Media, 2017). Chapter 2, "Data Models and Query Languages," provides the most accessible treatment of relational vs. document databases available anywhere. If you read one technical book from this entire bibliography, make it this one.

[2] Andy Pavlo's CMU Database Group (db.cs.cmu.edu) offers the "Intro to Database Systems" course for free on YouTube — 25 lectures covering everything from B-trees to query optimization. His practical advice about choosing databases comes from his 2023 talk "The State of Databases in 2023" at the CMU Database Conference.

[3] Supabase documentation (supabase.com/docs) is unusually well-written for a technical product. Their guides on Row Level Security and Edge Functions are worth reading even if you don't use Supabase — the concepts apply to any database.

[4] Kailash Nadh's talks on Zerodha's architecture are available on YouTube. His 2022 talk at Bangalore tech meetups describes running 15M+ daily orders on what he calls a "boring" stack: PostgreSQL, Redis, Go, and a small engineering team of ~30 people. The lesson: scale doesn't require complexity.

[5] Phil Karlton's quote about cache invalidation has been attributed to various sources. The most reliable citation traces it to the mid-1990s at Netscape Communications, where Karlton worked on SSL and other foundational web technologies.