CI/CD, Testing & Production Operations — The Factory That Ships Your Code
You have written the code, the tests pass on your laptop, and the feature looks great in development. Now what? You copy the files to the server, restart the application, cross your fingers, and hope nothing breaks?
That is how software was shipped in 2005. It is how some companies still ship software in 2026, and it is why those companies have outages every Friday afternoon and spend weekends in war rooms.
This chapter is about the machinery that takes your code from "it works on my machine" to "it works for every user, reliably, at 3 AM, when nobody is watching." The machinery is called CI/CD — Continuous Integration and Continuous Deployment — and it is, without exaggeration, the single most important operational investment a software team can make.
Imagine a car factory. The engineers design a new engine. In a bad factory, someone hand-carries the engine blueprint to the assembly floor, a worker eyeballs the dimensions, bolts it together, and the car rolls off the line untested. Sometimes it works. Sometimes the engine explodes on the highway. In a good factory, the blueprint goes through an automated quality check, a robot assembles the engine with micrometer precision, the engine is tested on a dynamometer before installation, the assembled car goes through a 200-point inspection, and only then does it leave the factory. CI/CD is the difference between the bad factory and the good factory.
Part 1: What CI/CD Actually Means
Continuous Integration (CI)
Continuous Integration is the practice of automatically testing and validating every code change the moment it is pushed to the repository. "Continuous" means it happens every time, without exception. "Integration" means you are integrating your change with the rest of the codebase and verifying that nothing breaks.
Without CI: a developer writes code for three weeks, pushes it, and discovers it conflicts with changes three other developers made during the same period. The merge takes two days. Half the tests are broken. Nobody knows whose change caused which failure.
With CI: every push triggers an automated pipeline that runs within minutes. If your code breaks something, you find out immediately — while the change is still fresh in your mind, while the scope is small, while the fix is straightforward.
Continuous Deployment (CD)
Continuous Deployment is the practice of automatically deploying every code change that passes the CI checks to production. The human does not click a "deploy" button. The machine does it.
Continuous Delivery (a related but distinct concept) means the code is always ready to deploy, but a human makes the final decision to push to production. The distinction matters:
| Term | Automated Testing | Automated Deploy to Production |
|---|---|---|
| Continuous Integration | Yes | No |
| Continuous Delivery | Yes | No (human approval) |
| Continuous Deployment | Yes | Yes (fully automated) |
CI ensures your code works. CD ensures your working code reaches users. Together, they close the gap between "developer finished the feature" and "users can use the feature" from weeks to minutes.
Stripe deploys code to production over 100 times per day. Amazon deploys every 11.7 seconds on average. Netflix deploys continuously across thousands of microservices. These are not reckless companies — they are companies with exceptionally rigorous CI/CD pipelines that give them confidence that each deployment is safe. The pipeline is the safety net that makes high-frequency deployment possible.
Part 2: The CI/CD Pipeline — Step by Step
A pipeline is a sequence of automated steps that your code passes through on its way from commit to production. Here is a standard pipeline for a modern web application:
THE CI/CD PIPELINE
┌─────────────────────────────────────────────────────────────┐
│ DEVELOPER PUSHES CODE │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ STAGE 1: LINT │
│ Check code style, formatting, unused variables │
│ Tool: ESLint, Prettier, Biome │
│ Time: 10-30 seconds │
│ Fails if: Code style violations found │
└──────────────────────────┬───────────────────────────────────┘
│ PASS
▼
┌──────────────────────────────────────────────────────────────┐
│ STAGE 2: TYPE CHECK │
│ Verify TypeScript/type annotations are correct │
│ Tool: tsc (TypeScript compiler) │
│ Time: 15-60 seconds │
│ Fails if: Type errors found │
└──────────────────────────┬───────────────────────────────────┘
│ PASS
▼
┌──────────────────────────────────────────────────────────────┐
│ STAGE 3: TEST │
│ Run unit tests, integration tests, end-to-end tests │
│ Tool: Vitest, Jest, Playwright │
│ Time: 1-10 minutes │
│ Fails if: Any test fails │
└──────────────────────────┬───────────────────────────────────┘
│ PASS
▼
┌──────────────────────────────────────────────────────────────┐
│ STAGE 4: BUILD │
│ Compile and bundle the application │
│ Tool: Next.js build, Vite, webpack │
│ Time: 1-5 minutes │
│ Fails if: Build errors, missing dependencies │
└──────────────────────────┬───────────────────────────────────┘
│ PASS
▼
┌──────────────────────────────────────────────────────────────┐
│ STAGE 5: DEPLOY │
│ Push to production (Vercel, AWS, Railway, etc.) │
│ Time: 1-5 minutes │
│ Fails if: Deployment errors, health check fails │
└──────────────────────────┬───────────────────────────────────┘
│ PASS
▼
┌──────────────────────────────────────────────────────────────┐
│ LIVE IN PRODUCTION │
└──────────────────────────────────────────────────────────────┘
If any stage fails, the pipeline stops. The developer is notified. The code does not reach production. This is the core guarantee: broken code cannot ship.
Why This Order Matters
The stages are ordered from fastest to slowest, cheapest to most expensive. Linting takes 10 seconds. If you have a style violation, you find out in 10 seconds — not after waiting 10 minutes for tests to run. Type checking catches a class of bugs in 30 seconds that would take minutes to discover through test failures. Tests catch logic errors before the build stage. The build catches configuration errors before deployment.
Each stage is a progressively finer filter. By the time code reaches the deploy stage, it has been validated at four levels.
GitHub Actions: The Standard Tool
GitHub Actions is the most widely-used CI/CD platform as of March 2026. It runs your pipeline on GitHub's servers every time code is pushed, a pull request is opened, or on a schedule you define.
A GitHub Actions pipeline is defined in a YAML file (.github/workflows/ci.yml):
name: CI/CD Pipeline
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: 20
- run: npm ci
- run: npm run lint
type-check:
runs-on: ubuntu-latest
needs: lint # Only runs if lint passes
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: 20
- run: npm ci
- run: npm run type-check
test:
runs-on: ubuntu-latest
needs: type-check # Only runs if type-check passes
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: 20
- run: npm ci
- run: npm test
build-and-deploy:
runs-on: ubuntu-latest
needs: test # Only runs if tests pass
if: github.ref == 'refs/heads/main' # Only on main branch
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: 20
- run: npm ci
- run: npm run build
- run: npx vercel --prod --token=${{ secrets.VERCEL_TOKEN }}The needs keyword creates the dependency chain: lint → type-check → test → build-and-deploy. Each step only runs if the previous step succeeded.
Part 3: Infrastructure as Code
The Problem with Manual Infrastructure
You set up your server by clicking through a cloud provider's web interface. You configure the database by running commands in a terminal. You set up the load balancer by adjusting settings in a dashboard. Everything works.
Six months later, you need a second environment for staging. You try to recreate the setup. You forget a firewall rule. You misconfigure the database connection string. The staging environment does not match production. Bugs that exist in staging do not exist in production, and bugs that exist in production do not exist in staging. Your staging environment is useless.
Infrastructure as Code (IaC)
Infrastructure as Code is the practice of defining your entire infrastructure — servers, databases, networks, load balancers, DNS records, everything — in code files that are version-controlled, reviewable, and reproducible.
Imagine building a house by giving verbal instructions to the contractor. "Put the kitchen here. Make the walls this color. The bathroom goes there." Six months later, you want to build an identical house in another city. You cannot — the instructions were verbal, unrepeatable, and incomplete. Now imagine having architectural blueprints. Every measurement, every material, every wire. Hand the blueprints to any contractor anywhere, and they build the same house. Infrastructure as Code is the blueprint for your cloud infrastructure.
The two dominant IaC tools:
| Tool | Language | Best For |
|---|---|---|
| Terraform (by HashiCorp) | HCL (HashiCorp Configuration Language) | Multi-cloud, complex infrastructure |
| Pulumi | TypeScript, Python, Go | Developers who prefer real programming languages |
A Terraform example that creates a database and a server:
# Define a PostgreSQL database
resource "aws_db_instance" "main" {
engine = "postgres"
engine_version = "16.1"
instance_class = "db.t3.medium"
storage = 50
database_name = "myapp"
username = var.db_username
password = var.db_password
}
# Define a server
resource "aws_ecs_service" "api" {
name = "api-service"
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.api.arn
desired_count = 3 # Run 3 instances
}The power: run terraform apply, and Terraform creates exactly this infrastructure. Run it again in another account, and you get an identical copy. Store the file in Git, and you have a versioned history of every infrastructure change. Review changes in pull requests, and your team can catch misconfigurations before they go live.
Netflix manages thousands of microservices across AWS using Terraform and custom IaC tools. When a team needs a new microservice, they do not file a ticket with the infrastructure team — they create a Terraform configuration from a template, submit a pull request, and the infrastructure is provisioned automatically after review. This is how Netflix operates with a ratio of roughly 20 microservices per infrastructure engineer.
Part 4: Monitoring — Knowing Before Your Users Tell You
Your code is in production. The pipeline is green. Everything is working. How do you know it is still working at 3 AM?
Monitoring is the practice of continuously observing your system's health — performance, error rates, resource usage — and alerting you when something goes wrong before your users notice.
The Three Pillars
| Pillar | What It Captures | Example Tools |
|---|---|---|
| Metrics | Numerical measurements over time (CPU usage, response time, error rate) | Grafana, Datadog, Prometheus |
| Logs | Detailed records of individual events ("User 123 hit endpoint /api/orders at 14:23:07, response 500") | Datadog, Loki, ELK Stack |
| Traces | The journey of a single request across multiple services ("This request hit the API, then the auth service, then the database, then the cache") | Jaeger, Datadog APM, Honeycomb |
THE THREE PILLARS OF OBSERVABILITY
METRICS (the dashboard)
┌──────────────────────────────────────────┐
│ CPU: ████████░░ 78% │
│ Memory: ██████░░░░ 62% │
│ Error Rate: 0.3% ← normal │
│ P95 Latency: 340ms ← normal │
│ Active Users: 4,231 │
└──────────────────────────────────────────┘
LOGS (the diary)
┌──────────────────────────────────────────┐
│ 14:23:07 INFO Request /api/orders 200 │
│ 14:23:08 INFO Request /api/users 200 │
│ 14:23:09 ERROR Request /api/pay 500 │ ← investigate
│ 14:23:09 ERROR PaymentGatewayTimeout │
│ 14:23:10 INFO Request /api/orders 200 │
└──────────────────────────────────────────┘
TRACES (the detective)
┌──────────────────────────────────────────┐
│ Request: POST /api/pay │
│ ├── API Gateway 2ms │
│ ├── Auth Service 15ms │
│ ├── Payment Service ← TIMEOUT │
│ │ └── Stripe API 8,500ms !! │
│ └── Total: 8,517ms (expected: 200ms) │
└──────────────────────────────────────────┘
Metrics tell you something is wrong. Logs tell you what happened. Traces tell you where the problem is in a complex system. You need all three.
SLAs, SLOs, and SLIs
Three terms that every production system needs:
SLI (Service Level Indicator): A measurable metric that represents your system's health. "What percentage of requests return in under 500 milliseconds?" "What percentage of requests return a non-error status code?"
SLO (Service Level Objective): Your internal target for the SLI. "99.9% of requests should return in under 500 milliseconds." This is the bar your team holds itself to.
SLA (Service Level Agreement): A contractual commitment to your customers, usually with financial penalties. "We guarantee 99.95% uptime. If we fall below that, you get credits."
SLA vs. SLO vs. SLI
SLI: "Our uptime this month is 99.93%" ← measurement
SLO: "We target 99.95% uptime" ← internal goal
SLA: "We guarantee 99.9% uptime to ← contract
customers, with 10% credit if
breached"
SLO is stricter than SLA intentionally:
you want to catch problems before they
breach the contract.
┌──────────────────────────────────────┐
│ 100% │
│ ───── SLO (99.95%) ───────── │
│ │
│ ───── SLA (99.9%) ───────── │ ← you owe money
│ │ below this
│ Current: 99.93% ← BELOW SLO │
│ but above SLA │
│ (warning zone) │
└──────────────────────────────────────┘
Google's SRE (Site Reliability Engineering) team popularized the SLI/SLO/SLA framework in their 2016 book "Site Reliability Engineering." Google's internal SLOs are famously aggressive — Gmail targets 99.99% availability (roughly 52 minutes of downtime per year). Google's SRE teams have "error budgets" — if a service has used 80% of its allowed downtime for the quarter, feature development slows down and reliability work takes priority.
Part 5: Alerting — Waking the Right Person at the Right Time
Monitoring without alerting is a dashboard that nobody looks at. Alerting turns passive observation into active response.
Alert Design Principles
1. Alert on symptoms, not causes. Alert when "error rate exceeds 5%" (symptom), not when "CPU exceeds 80%" (potential cause). Users experience symptoms. High CPU might not cause any user-facing issues. Low CPU does not mean everything is fine.
2. Every alert must have a clear action. If the alert fires and the on-call engineer's first thought is "what am I supposed to do about this?" the alert is poorly designed. Each alert should link to a runbook — a step-by-step guide for diagnosing and resolving the issue.
3. Avoid alert fatigue. If your team receives 50 alerts per day, they will start ignoring all of them — including the critical ones. This is the "boy who cried wolf" problem. An alert that fires frequently without requiring action should be tuned or removed.
| Alert Severity | Response Time | Example | Channel |
|---|---|---|---|
| Critical (P0) | Immediate (wake up) | Site is down, data loss | PagerDuty phone call |
| High (P1) | Within 1 hour | Payment processing failing | PagerDuty + Slack |
| Medium (P2) | Within business hours | Elevated error rate | Slack channel |
| Low (P3) | Next sprint | Deprecation warning | Email digest |
The goal of alerting is not to inform you of every anomaly. It is to inform you of anomalies that require human intervention. A spike in CPU that resolves itself in 30 seconds does not need an alert. A spike in 500 errors that persists for 5 minutes does. Design alerts for the situations where a human must act, not for the situations where the system self-heals.
Part 6: Feature Flags — Shipping Without Deploying
A feature flag (also called a feature toggle) is a mechanism that lets you turn a feature on or off in production without deploying new code. The code for the feature is already deployed. The flag controls whether users see it.
// With feature flag
if (featureFlags.isEnabled("new-checkout-flow", user)) {
renderNewCheckout();
} else {
renderOldCheckout();
}
Why Feature Flags Change Everything
1. Decouple deployment from release. You can deploy code on Monday and release it to users on Wednesday. The code sits in production, hidden behind a flag, tested internally, before any user sees it.
2. Gradual rollouts. Release the new checkout to 1% of users. Monitor error rates and conversion. If everything looks good, increase to 10%, then 50%, then 100%. If something breaks, set the flag to 0% — instant rollback without touching the codebase.
3. Kill switches. If a feature causes problems in production, disable it with a flag toggle. No deploy, no rollback, no downtime. The flag change propagates in seconds.
GRADUAL ROLLOUT WITH FEATURE FLAGS
Day 1: [█░░░░░░░░░] 1% of users
Monitor: error rate, conversion, latency
Result: looks good
Day 3: [█████░░░░░] 10% of users
Monitor: same metrics at higher volume
Result: conversion +3%, no errors
Day 5: [█████████░] 50% of users
Monitor: load testing under real traffic
Result: stable
Day 7: [██████████] 100% of users
Remove feature flag from code (cleanup)
Tools: LaunchDarkly (the market leader), Statsig, PostHog, Unleash (open-source). These tools let you define flags, target specific user segments (e.g., "only show to users in India" or "only show to beta testers"), and track the impact of each flag on your metrics.
Facebook has over 10,000 feature flags active at any given time. Every new feature, every UI change, every algorithm modification is gated behind a flag. This is how Facebook ships code changes to 3 billion users without breaking the app — the code deploys continuously, but features are revealed gradually, with monitoring at every stage.
Part 7: Deployment Strategies — Safe Ways to Ship
When you deploy new code, you are replacing a working system with an untested (in production) system. Deployment strategies manage this risk.
Blue-Green Deployment
You run two identical production environments: Blue (current) and Green (new). You deploy to Green. You test Green. When Green is verified, you switch traffic from Blue to Green. If Green has problems, switch back to Blue — instant rollback.
BLUE-GREEN DEPLOYMENT
Users
│
▼
┌──────────────┐
│ Load Balancer │
└──┬────────┬──┘
│ │
▼ ▼
┌─────────┐ ┌─────────┐
│ BLUE │ │ GREEN │
│ (v1.0) │ │ (v1.1) │
│ ACTIVE │ │ STANDBY │
└─────────┘ └─────────┘
Step 1: Deploy v1.1 to Green
Step 2: Test Green (health checks, smoke tests)
Step 3: Switch load balancer: Blue→Green
Step 4: Green is now ACTIVE, Blue is STANDBY
Step 5: If problems, switch back: Green→Blue
Pros: Zero-downtime deployment. Instant rollback. Cons: You need two full production environments — double the infrastructure cost during deployment.
Canary Deployment
Named after the canary in a coal mine. You deploy the new version to a small subset of servers (the "canary"). If the canary is healthy, you gradually expand. If the canary fails, you kill it before it affects most users.
CANARY DEPLOYMENT
Users (100%)
│
┌────┴────┐
│ 95% │ 5%
▼ ▼
┌─────────┐ ┌─────────┐
│ v1.0 │ │ v1.1 │ ← canary
│ (stable)│ │ (new) │
└─────────┘ └─────────┘
If canary is healthy → expand to 20%, 50%, 100%
If canary fails → roll back to 0% on v1.1
Pros: Lower infrastructure cost than blue-green. Real user traffic tests the new version. Cons: The 5% of users hitting the canary experience bugs that the other 95% do not. You need monitoring that can distinguish canary metrics from stable metrics.
| Strategy | Rollback Speed | Infrastructure Cost | Risk Exposure |
|---|---|---|---|
| Blue-green | Instant (seconds) | High (2x) | Zero (tested before switch) |
| Canary | Fast (minutes) | Low (small %) | Small (only canary users) |
| Rolling | Moderate | Low | Moderate (gradual exposure) |
Part 8: Post-Mortems — Learning From Failure
Things will break. The question is not whether you will have an outage — it is whether you will learn from it.
A post-mortem (also called an incident review) is a structured analysis conducted after a production incident. The goal is not to assign blame — it is to identify what happened, why it happened, and what changes will prevent it from happening again.
The Blameless Post-Mortem
Google, Netflix, and Etsy all practice "blameless post-mortems" — a principle that separates the person from the action. If an engineer pushed a bad deploy that caused an outage, the post-mortem asks "why did our system allow a bad deploy to reach production?" not "why did this engineer push bad code?" The engineer is a symptom. The system that allowed it is the cause.
Post-Mortem Template
INCIDENT POST-MORTEM
Date: 2026-03-15
Duration: 47 minutes (14:23 - 15:10 UTC)
Severity: P1 (payment processing down)
Author: [On-call engineer]
## Summary
Payment API returned 500 errors for all requests
due to expired database connection pool after
a connection string rotation.
## Timeline
14:23 Monitoring alert: payment_errors > 5%
14:25 On-call acknowledged, began investigation
14:31 Identified: DB connection pool exhausted
14:35 Root cause: connection string rotated at
14:20 but app not restarted
14:38 Applied fix: restarted payment service
14:42 Error rate dropping
15:10 Error rate at 0%, incident resolved
## Root Cause
Database credentials were rotated by an
automated security script. The payment service
cached the old connection string and did not
pick up the new one without a restart.
## What Went Well
- Alert fired within 3 minutes of the incident
- On-call responded within 2 minutes
- Root cause identified within 10 minutes
## What Went Wrong
- No automated restart after credential rotation
- No connection pool health check
- Payment service has no graceful reconnection logic
## Action Items
1. [ ] Add reconnection logic to DB client (P0)
2. [ ] Add connection pool health check to monitoring
3. [ ] Automate service restart after credential rotation
4. [ ] Add this scenario to runbook
The action items are the most important part. A post-mortem without action items is a storytelling exercise. A post-mortem with action items that are tracked and completed is a system that gets more reliable over time.
Take one of your projects and set up a complete CI/CD pipeline using GitHub Actions.
Step 1: Create the workflow file
Create .github/workflows/ci.yml in your repository. Start with lint and type-check stages.
Step 2: Add testing
Add a test stage that runs your test suite. If you do not have tests, write at least 3 tests first (one unit test, one integration test, one happy-path test).
Step 3: Add build verification
Add a build stage that compiles your project. This catches import errors, missing dependencies, and build configuration issues.
Step 4: Add deployment
If your project is hosted on Vercel, Netlify, or Railway, add a deployment step that triggers only on merges to main.
Step 5: Add a feature flag
Install a feature flag library (LaunchDarkly free tier, or PostHog, or a simple environment variable). Gate one feature behind the flag. Practice toggling it on and off in production.
Step 6: Set up monitoring
Add basic error tracking (Sentry free tier) and uptime monitoring (Better Uptime or UptimeRobot free tier). Configure an alert that notifies you via Slack or email when error rates spike.
Step 7: Write a runbook
Create a RUNBOOK.md in your repository with procedures for:
- "The site is down" — what to check, in what order
- "Error rate is elevated" — how to diagnose, common causes
- "A deployment failed" — how to roll back
Deliverable: A working CI/CD pipeline that runs on every push, a feature flag controlling one feature, and monitoring that alerts you to problems.
This exercise should take 2-3 hours. You will never go back to manual deployments.
Chapter endnotes
[1] Stripe's deployment frequency (100+ deploys per day) is documented in their engineering blog and in talks by Stripe engineers at conferences. Stripe's CI/CD pipeline includes automated testing, canary deployments, and feature flags as standard practice.
[2] Amazon's 11.7-second deployment frequency is from Amazon CTO Werner Vogels' keynote at re:Invent 2023, where he described Amazon's internal deployment infrastructure serving thousands of engineering teams.
[3] The Google SRE book ("Site Reliability Engineering: How Google Runs Production Systems," 2016) by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, is the foundational text for production operations practices. The SLI/SLO/SLA framework, error budgets, and blameless post-mortems are all detailed in this book, which is available free online at sre.google/books.
[4] GitHub Actions documentation and pricing are available at docs.github.com/actions. The free tier includes 2,000 minutes per month for private repositories and unlimited minutes for public repositories.
[5] Terraform by HashiCorp is documented at terraform.io. Terraform is the most widely-adopted IaC tool, used by organizations from startups to Fortune 500 companies. HashiCorp's State of Cloud Strategy Survey (2023) reports that 60% of respondents use Terraform for infrastructure management.
[6] Facebook's 10,000+ feature flags and continuous deployment practices are documented in Facebook engineering blog posts and in talks at @Scale conferences. The "Gatekeeper" system for feature flag management is described in Facebook's engineering publications.
[7] The "blameless post-mortem" practice is attributed to John Allspaw's work at Etsy, documented in his blog post "Blameless PostMortems and a Just Culture" (2012). The practice has been adopted by Google, Netflix, Amazon, and most modern engineering organizations.
[8] Netflix's deployment and monitoring practices are documented in the Netflix Technology Blog, particularly posts about Spinnaker (their deployment platform) and Atlas (their metrics system). Netflix's approach to canary deployments is described in "Automated Canary Analysis at Netflix" (2018).
[9] LaunchDarkly, the leading feature flag platform, reports that the average enterprise customer has 300+ active feature flags. Their documentation at docs.launchdarkly.com provides implementation guides for all major programming languages and frameworks.