Website Architecture & Deployment

A Zero-to-Hero Guide for Backend Developers & System Administrators

Chapter 1 — The Big Picture

Before diving into specific technologies, you need a mental model of what happens when someone visits a website — and what your role is in making that happen reliably, securely, and at scale.

What Happens When a User Visits a Website

User types "example.com" in browser
        │
        ▼
┌─────────────────┐
│ DNS Resolution  │  Browser asks: "What IP is example.com?"
│ (Recursive)     │  Checks: browser cache → OS cache → resolver → root → TLD → authoritative
└────────┬────────┘
         │ Returns: 93.184.216.34
         ▼
┌─────────────────┐
│ TCP Connection  │  Three-way handshake (SYN → SYN-ACK → ACK)
│ + TLS Handshake │  Certificate verification, key exchange, cipher negotiation
└────────┬────────┘
         │ Encrypted tunnel established
         ▼
┌─────────────────┐
│ HTTP Request    │  GET / HTTP/2
│                 │  Host: example.com
│                 │  Accept: text/html
└────────┬────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────┐
│                      YOUR DOMAIN                        │
│                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────────────┐   │
│  │ Load     │───▶│ Reverse  │───▶│ Application      │   │
│  │ Balancer │    │ Proxy    │    │ Server           │   │
│  └──────────┘    └──────────┘    │ (your backend)   │   │
│                                  └────────┬─────────┘   │
│                                           │             │
│                                  ┌────────▼─────────┐   │
│                                  │ Database / Cache │   │
│                                  └──────────────────┘   │
└─────────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────┐
│ HTTP Response   │  200 OK + HTML/JSON/assets
│ (via CDN maybe) │  Cached at edge if configured
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Browser Render  │  Parse HTML → fetch CSS/JS → render page
└─────────────────┘

The Roles in Professional Web Operations

| Role | Responsibility | Cares About |
| --- | --- | --- |
| Frontend Developer | HTML, CSS, JavaScript, UI/UX | User experience, browser compatibility, performance |
| Backend Developer | APIs, business logic, databases | Data integrity, API design, scalability, security |
| DevOps / SRE | CI/CD, infrastructure, reliability | Uptime, deployment speed, automation, monitoring |
| System Administrator | Server management, networking, security | Patching, hardening, backups, capacity |
| Platform Engineer | Internal developer platforms, tooling | Developer productivity, self-service infrastructure |
Your role in this guide: You're the backend developer who also handles deployment, operations, and maintenance. In smaller teams, this is extremely common. In larger organizations, these responsibilities are split across dedicated teams — but understanding the full picture makes you far more effective regardless of team size.

Environments: Dev → Staging → Production

Professional deployments never go straight from a developer's laptop to users. There's a pipeline:

| Environment | Purpose | Who Uses It | Data |
| --- | --- | --- | --- |
| Local / Dev | Active development, debugging | Individual developer | Fake/seed data |
| CI | Automated testing on every commit | Machines (automated) | Test fixtures |
| Staging | Pre-production validation, QA | QA team, stakeholders | Production-like (anonymized) |
| Production | Real users, real data | Everyone (end users) | Real data |
Never test in production. This sounds obvious, but the temptation is real when "it works on my machine." Staging exists to catch the things that only break in production-like conditions: different OS versions, network latency, real database sizes, concurrent users.
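In code, the environment split usually comes down to configuration keyed off an environment variable. A minimal Node sketch (the config keys and values here are illustrative, not a prescribed layout):

```javascript
// config.js — select settings based on NODE_ENV (hypothetical values)
const configs = {
  development: { dbUrl: 'postgres://localhost/myapp_dev', logLevel: 'debug' },
  staging:     { dbUrl: process.env.DATABASE_URL, logLevel: 'info' },
  production:  { dbUrl: process.env.DATABASE_URL, logLevel: 'warn' },
};

function loadConfig(env = process.env.NODE_ENV || 'development') {
  const config = configs[env];
  // Fail loudly on an unknown environment rather than silently falling back
  if (!config) throw new Error(`Unknown environment: ${env}`);
  return { env, ...config };
}

module.exports = { loadConfig };
```

The key design choice is failing fast on an unrecognized environment name: a typo in NODE_ENV should crash at startup, not quietly run staging code against production data.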

What "Deploying a Website" Actually Means

Deployment is not just "putting files on a server." It's a repeatable, automated process that includes:

  1. Build — Compile code, bundle assets, run optimizations
  2. Test — Unit tests, integration tests, security scans
  3. Package — Create a deployable artifact (Docker image, binary, archive)
  4. Deploy — Push artifact to target environment
  5. Verify — Health checks, smoke tests, monitoring
  6. Rollback plan — If something breaks, revert instantly
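Step 5 is worth making concrete: verification is typically a handful of repeated health checks followed by a keep-or-rollback decision. A sketch of that decision logic, with the actual HTTP probe injected as a function (the function names and thresholds are assumptions, not a standard):

```javascript
// verify.js — decide whether a fresh deploy is healthy enough to keep.
// `probe` performs one health check and returns true/false; it is injected
// so the decision logic stays testable without a real server.
async function verifyDeploy(probe, { attempts = 5, required = 3 } = {}) {
  let healthy = 0;
  for (let i = 0; i < attempts; i++) {
    if (await probe()) healthy++;
  }
  // Keep the deploy only if enough checks passed; otherwise signal rollback
  return healthy >= required ? 'keep' : 'rollback';
}

module.exports = { verifyDeploy };
```

In a real pipeline `probe` would hit your `/health` endpoint, and "rollback" would trigger redeploying the previous artifact from step 3.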

The Infrastructure Stack

┌─────────────────────────────────────────────────────────────┐
│ YOUR APPLICATION                                            │
├─────────────────────────────────────────────────────────────┤
│ Runtime (Node.js, Python, Go, Java, etc.)                   │
├─────────────────────────────────────────────────────────────┤
│ Container / Process Manager (Docker, systemd, PM2)          │
├─────────────────────────────────────────────────────────────┤
│ Operating System (Ubuntu, Debian, Alpine, RHEL)             │
├─────────────────────────────────────────────────────────────┤
│ Virtualization / Bare Metal (KVM, Xen, physical hardware)   │
├─────────────────────────────────────────────────────────────┤
│ Network (VPC, firewall, load balancer, DNS)                 │
├─────────────────────────────────────────────────────────────┤
│ Physical Infrastructure (data center, power, cooling)       │
└─────────────────────────────────────────────────────────────┘
  ▲  More abstraction = less control, less work
  │
  ▼  Less abstraction = more control, more responsibility
The key insight: every hosting option in this guide simply draws the line at a different layer. Shared hosting gives you only the top layer. Bare metal gives you everything. PaaS gives you the top two. IaaS gives you the top four. Your job is to pick where to draw that line for each project.

Chapter 2 — Website Architecture Patterns

Before choosing where to host, you need to understand what you're hosting. The architecture pattern determines your infrastructure requirements.

Static Sites

Pre-built HTML/CSS/JS files served as-is. No server-side processing per request.

# Example: Build and deploy a Hugo static site
hugo                                # Generates ./public/ directory
aws s3 sync ./public s3://my-bucket --delete
aws cloudfront create-invalidation --distribution-id EXXX --paths "/*"

Server-Side Rendering (SSR)

Server generates HTML on every request. The traditional model (PHP, Rails, Django, Express with templates).

Single-Page Applications (SPA)

One HTML file + JavaScript bundle. All rendering happens in the browser. Backend is a separate API.

JAMstack (JavaScript, APIs, Markup)

Pre-rendered static pages enhanced with JavaScript calling APIs at runtime. Best of both worlds.

Monolithic Architecture

Single deployable unit containing all functionality. Frontend, backend, and data access in one codebase.

Service-Oriented Architecture (SOA) / Microservices

Application split into independent services communicating over network (HTTP/gRPC/message queues).

Don't start with microservices. Start monolithic, split when you have clear bounded contexts and the team/traffic justifies the operational complexity. Premature microservices is the #1 architecture mistake in startups.

Comparison Table

| Pattern | Server Needed | Scalability | Complexity | SEO | Best For |
| --- | --- | --- | --- | --- | --- |
| Static | No (CDN) | ★★★★★ | ★☆☆☆☆ | ★★★★★ | Content sites, docs |
| SSR | Yes | ★★★☆☆ | ★★★☆☆ | ★★★★★ | Dynamic content + SEO |
| SPA | API only | ★★★★☆ | ★★★☆☆ | ★★☆☆☆ | App-like experiences |
| JAMstack | Partial | ★★★★★ | ★★★☆☆ | ★★★★★ | Content + interactivity |
| Monolith | Yes | ★★★☆☆ | ★★☆☆☆ | Varies | MVPs, small-medium apps |
| Microservices | Yes (many) | ★★★★★ | ★★★★★ | Varies | Large teams, high scale |
How to choose:
• Content-heavy, rarely changes? → Static / JAMstack
• Need SEO + dynamic data? → SSR
• Rich interactive app (logged-in users)? → SPA + API
• Small team, getting started? → Monolith
• Large team, proven bounded contexts, high scale? → Microservices

Chapter 3 — Hosting Taxonomy

This is the complete landscape of where your website can live. Each option trades control for convenience at a different point.

The Control vs. Convenience Spectrum

MORE CONTROL                                               MORE CONVENIENCE
MORE WORK                                                  LESS WORK
◄─────────────────────────────────────────────────────────────────────────►
┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬───────────┐
│ Bare     │ Colo-    │ Dedi-    │ VPS      │ IaaS     │ PaaS     │ Static/   │
│ Metal    │ cation   │ cated    │          │          │          │ Serverless│
│ (own HW) │(your HW, │(rented   │(virtual  │(cloud    │(managed  │(fully     │
│          │their DC) │physical) │ server)  │ VMs)     │platform) │ managed)  │
└──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴───────────┘
You manage:
  Hardware      ✓         ✓         ✗         ✗         ✗         ✗         ✗
  Network       ✓         ✓      Partial      ✗      Partial      ✗         ✗
  OS            ✓         ✓         ✓         ✓         ✓         ✗         ✗
  Runtime       ✓         ✓         ✓         ✓         ✓      Partial      ✗
  App           ✓         ✓         ✓         ✓         ✓         ✓         ✓

Complete Hosting Types

| Type | What You Get | Cost Range | Control | Complexity | Best For |
| --- | --- | --- | --- | --- | --- |
| Shared Hosting | Space on a shared server (cPanel) | $3-15/mo | ★☆☆☆☆ | ★☆☆☆☆ | WordPress blogs, tiny sites |
| VPS | Virtual machine, root access | $5-80/mo | ★★★★☆ | ★★★☆☆ | Most web apps, APIs |
| Dedicated Server | Entire physical server rented | $80-500/mo | ★★★★★ | ★★★★☆ | High-performance, compliance |
| Colocation | Your hardware in their data center | $200-2000/mo | ★★★★★ | ★★★★★ | Maximum control, large scale |
| IaaS | Cloud VMs + managed services | Pay-per-use | ★★★★☆ | ★★★★☆ | Variable workloads, scaling |
| PaaS | Managed platform, push code | $5-500/mo | ★★☆☆☆ | ★★☆☆☆ | Startups, rapid deployment |
| Serverless/FaaS | Functions triggered by events | Pay-per-invocation | ★☆☆☆☆ | ★★☆☆☆ | APIs, event processing |
| Static Hosting | CDN-served static files | $0-20/mo | ★☆☆☆☆ | ★☆☆☆☆ | Static sites, SPAs |
| Managed K8s | Kubernetes cluster (managed control plane) | $70-1000+/mo | ★★★★☆ | ★★★★★ | Microservices at scale |

Shared Hosting — The Beginner Trap

Shared hosting (GoDaddy, Bluehost, Hostinger) puts hundreds of sites on one server. You get a cPanel interface, FTP access, and PHP. That's it.

Not suitable for professional work. Shared hosting is fine for a personal blog. For anything with users, SLAs, or custom backend code — skip it entirely. A $5/mo VPS gives you infinitely more capability.

When to Use What — Quick Reference

Static site / SPA frontend? → Static hosting (Netlify, CloudFront, Cloudflare Pages)
Simple web app, small team? → VPS (DigitalOcean, Hetzner) or PaaS (Railway, Render)
Need auto-scaling, variable traffic? → IaaS (AWS, GCP) or containers
Microservices, large team? → Managed Kubernetes (EKS, GKE)
Event-driven, sporadic traffic? → Serverless (Lambda, Workers)
Compliance/performance requirements? → Dedicated or colocation
Learning / side project? → VPS ($5/mo) — best bang for learning

Chapter 4 — Self-Hosting (Bare Metal / Home Server)

Self-hosting means running a web server on hardware you physically control — a spare PC, a Raspberry Pi, or a rack server in your closet. It's the most educational option and gives maximum control.

When Self-Hosting Makes Sense

It makes sense for learning Linux administration hands-on, for internal tools on your own network, and for development or staging environments where downtime costs nothing.

When it does NOT make sense: Production websites serving external users. Your home internet has no SLA, a dynamic IP, limited upload bandwidth, and a single point of failure (power outage = site down). Use self-hosting for learning, internal tools, and development — not for serving customers.

Setting Up Nginx on Linux

# Install Nginx (Ubuntu/Debian)
sudo apt update && sudo apt install -y nginx

# Start and enable
sudo systemctl start nginx
sudo systemctl enable nginx

# Verify it's running
curl http://localhost
# Should return the Nginx welcome page HTML

Virtual Hosts (Serving Multiple Sites)

# /etc/nginx/sites-available/mysite.conf
server {
    listen 80;
    server_name mysite.example.com;
    root /var/www/mysite;
    index index.html;

    location / {
        try_files $uri $uri/ =404;
    }

    # For a backend app (reverse proxy)
    location /api/ {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

# Enable the site
sudo ln -s /etc/nginx/sites-available/mysite.conf /etc/nginx/sites-enabled/
sudo nginx -t          # Test configuration
sudo systemctl reload nginx

Dynamic DNS (Solving the Dynamic IP Problem)

Home internet usually gives you a dynamic IP that changes periodically. Dynamic DNS services map a hostname to your current IP.

# Using ddclient with Cloudflare (install: sudo apt install ddclient)
# /etc/ddclient.conf
protocol=cloudflare
zone=example.com
login=your-email@example.com
password=your-cloudflare-api-token
use=web, web=https://api.ipify.org
mysite.example.com

# Or use a cron job with curl
*/5 * * * * curl -s "https://api.cloudflare.com/client/v4/zones/ZONE_ID/dns_records/RECORD_ID" \
  -X PATCH \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  --data "{\"content\":\"$(curl -s https://api.ipify.org)\"}"

Port Forwarding

Your router's NAT blocks incoming connections. You need to forward ports 80 (HTTP) and 443 (HTTPS) to your server's local IP.

  1. Give your server a static local IP (e.g., 192.168.1.100) via DHCP reservation
  2. In your router admin panel: forward external port 80 → 192.168.1.100:80
  3. Forward external port 443 → 192.168.1.100:443
  4. Test from outside your network (use your phone on mobile data)

Let's Encrypt (Free SSL/TLS)

# Install Certbot
sudo apt install -y certbot python3-certbot-nginx

# Obtain certificate (Nginx plugin auto-configures)
sudo certbot --nginx -d mysite.example.com

# Auto-renewal is set up automatically via systemd timer
sudo systemctl status certbot.timer

# Manual renewal test
sudo certbot renew --dry-run

Running Your App as a systemd Service

# /etc/systemd/system/myapp.service
[Unit]
Description=My Web Application
After=network.target

[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/node /opt/myapp/server.js
Restart=always
RestartSec=5
Environment=NODE_ENV=production
Environment=PORT=3000

# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/myapp/data

[Install]
WantedBy=multi-user.target

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable myapp
sudo systemctl start myapp
sudo systemctl status myapp    # Check it's running
sudo journalctl -u myapp -f   # View logs
systemd is your process manager on Linux. It handles starting your app on boot, restarting on crash, logging, and resource limits. Learn it well — you'll use it on every VPS and server you manage.

Chapter 5 — VPS & Dedicated Servers

A VPS (Virtual Private Server) is the workhorse of professional web hosting. You get a virtual machine with root access, a public IP, and full control — without managing physical hardware.

Provider Comparison

| Provider | Cheapest VPS | Data Centers | Strengths | Best For |
| --- | --- | --- | --- | --- |
| DigitalOcean | $4/mo (512MB) | 14 regions | Simple UI, great docs, managed DBs | Startups, learning |
| Hetzner | €3.79/mo (2GB) | EU + US | Best price/performance ratio | Price-conscious, EU hosting |
| Linode (Akamai) | $5/mo (1GB) | 11 regions | Reliable, good support | General purpose |
| Vultr | $2.50/mo (512MB) | 32 locations | Most locations, bare metal option | Edge deployments |
| OVH | €3.50/mo (2GB) | EU focused | Cheap dedicated servers too | EU, budget dedicated |

Initial Server Setup (The First 10 Minutes)

Every new VPS should go through this hardening process before deploying anything:

# 1. Connect as root (first time only)
ssh root@YOUR_SERVER_IP

# 2. Update the system
apt update && apt upgrade -y

# 3. Create a non-root user
adduser deploy
usermod -aG sudo deploy

# 4. Set up SSH key authentication for the new user
mkdir -p /home/deploy/.ssh
cp ~/.ssh/authorized_keys /home/deploy/.ssh/
chown -R deploy:deploy /home/deploy/.ssh
chmod 700 /home/deploy/.ssh
chmod 600 /home/deploy/.ssh/authorized_keys

# 5. Harden SSH - edit /etc/ssh/sshd_config
sudo sed -i 's/#PermitRootLogin yes/PermitRootLogin no/' /etc/ssh/sshd_config
sudo sed -i 's/#PasswordAuthentication yes/PasswordAuthentication no/' /etc/ssh/sshd_config
sudo sed -i 's/#Port 22/Port 2222/' /etc/ssh/sshd_config
sudo systemctl restart sshd

# 6. Set up firewall
sudo apt install -y ufw
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 2222/tcp    # SSH (custom port)
sudo ufw allow 80/tcp      # HTTP
sudo ufw allow 443/tcp     # HTTPS
sudo ufw enable

# 7. Install fail2ban (brute-force protection)
sudo apt install -y fail2ban
sudo systemctl enable fail2ban

# 8. Set up automatic security updates
sudo apt install -y unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades
Test SSH access with the new user BEFORE closing your root session. Open a new terminal, SSH as the deploy user on the new port. If it works, you're safe to close root. If not, you still have root access to fix it.

Deploying an Application

# On your server (as deploy user):

# Install your runtime (example: Node.js via nvm)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.0/install.sh | bash
source ~/.bashrc
nvm install --lts

# Clone your application
cd /opt
sudo mkdir myapp && sudo chown deploy:deploy myapp
git clone git@github.com:you/myapp.git /opt/myapp
cd /opt/myapp
npm install --production

# Set up environment variables
sudo cp .env.example /etc/myapp.env
sudo chmod 600 /etc/myapp.env
# Edit with your production values

# Create systemd service (as shown in Chapter 4)
# Set up Nginx reverse proxy (as shown in Chapter 4)
# Obtain SSL certificate with Certbot

Deployment Strategies for VPS

Option A: Git Pull (Simple)

# On server:
cd /opt/myapp
git pull origin main
npm install --production
sudo systemctl restart myapp

Option B: rsync (No Git on Server)

# From your local machine:
rsync -avz --delete \
  --exclude='node_modules' \
  --exclude='.env' \
  ./dist/ deploy@server:/opt/myapp/
ssh deploy@server 'cd /opt/myapp && npm install --production && sudo systemctl restart myapp'

Option C: Docker (Recommended for Production)

# Build locally or in CI, push to registry
docker build -t myregistry/myapp:v1.2.3 .
docker push myregistry/myapp:v1.2.3

# On server:
docker pull myregistry/myapp:v1.2.3
docker stop myapp && docker rm myapp
docker run -d --name myapp -p 3000:3000 --env-file /etc/myapp.env myregistry/myapp:v1.2.3
Docker is the professional standard. It ensures your app runs identically everywhere — your laptop, CI, staging, production. Chapter 8 covers this in depth.

Process Managers

| Tool | Language | Features | When to Use |
| --- | --- | --- | --- |
| systemd | Any | Built into Linux, restart policies, logging | Always (it's already there) |
| PM2 | Node.js | Cluster mode, zero-downtime reload, monitoring | Node.js apps without Docker |
| Supervisor | Any | Simple config, process groups | Legacy systems, multiple processes |

Chapter 6 — Platform-as-a-Service (PaaS)

PaaS abstracts away the server entirely. You push code, the platform handles building, deploying, scaling, SSL, and infrastructure. You focus purely on your application.

What PaaS Manages For You

┌─────────────────────────────────────────────────────────┐
│ What YOU do:                                            │
│                                                         │
│   Write code → git push → Done                          │
│                                                         │
├─────────────────────────────────────────────────────────┤
│ What the PLATFORM does:                                 │
│   • Detects language/framework (buildpacks)             │
│   • Installs dependencies                               │
│   • Builds your app                                     │
│   • Deploys to containers                               │
│   • Provisions SSL certificate                          │
│   • Routes traffic (load balancing)                     │
│   • Manages logs                                        │
│   • Handles OS updates and security patches             │
│   • Scales horizontally (if configured)                 │
└─────────────────────────────────────────────────────────┘

Provider Comparison

| Platform | Free Tier | Paid From | Strengths | Limitations |
| --- | --- | --- | --- | --- |
| Railway | $5 credit/mo | $5/mo | Modern, fast deploys, good DX | Newer, smaller community |
| Render | Static free, services spin down | $7/mo | Heroku alternative, auto-deploy | Cold starts on free tier |
| Fly.io | 3 shared VMs free | Pay-per-use | Edge deployment, Docker-native | More complex than others |
| Heroku | None (removed) | $5/mo | Pioneer, huge ecosystem | Expensive at scale, aging |
| Google App Engine | Limited free | Pay-per-use | Google infrastructure, auto-scale | Vendor lock-in |
| Azure App Service | Limited free | ~$13/mo | Enterprise, .NET native | Complex pricing |

Example: Deploying to Railway

# Your project needs:
# 1. A start command (in package.json, Procfile, or Dockerfile)
# 2. Listen on the PORT environment variable

# package.json
{
  "scripts": {
    "start": "node server.js"
  }
}

# server.js — must use process.env.PORT
const port = process.env.PORT || 3000;
app.listen(port, '0.0.0.0');

# Deploy:
# Option A: Connect GitHub repo in Railway dashboard (auto-deploys on push)
# Option B: Railway CLI
npm install -g @railway/cli
railway login
railway init
railway up

Procfile (Heroku-style Process Declaration)

# Procfile — tells the platform what processes to run
web: node server.js
worker: node worker.js
release: node migrate.js    # Runs before each deploy

When PaaS is the Right Choice

Use PaaS when:
• Small team (1-5 devs) that wants to focus on product, not infrastructure
• Predictable, moderate traffic (not massive spikes)
• Standard web app (HTTP server + database)
• Fast iteration speed matters more than cost optimization
• You don't need custom system-level software

Avoid PaaS when:
• Cost-sensitive at scale (PaaS markup is 3-10x vs raw compute)
• Need custom networking, kernel modules, or system packages
• Compliance requires specific infrastructure control
• Traffic is highly variable (serverless may be cheaper)
• You need persistent local storage or specific hardware

The PaaS Cost Trap

PaaS is cheap to start but expensive to scale. A $7/mo Render service running a Node.js app is great. But when you need 4 instances + a managed database + Redis + background workers, you're suddenly paying $200/mo for what a $40/mo VPS could handle.

The professional pattern: Start on PaaS for speed. When monthly costs exceed what a VPS + your time would cost, migrate to containers on a VPS or IaaS. This is called "graduating" from PaaS.
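The "graduating" decision is ultimately arithmetic: PaaS price versus VPS price plus the value of the admin time the VPS costs you. A rough sketch (all figures below are the illustrative numbers from this section, not real quotes):

```javascript
// Compare PaaS spend against VPS spend plus the ops time a VPS demands.
function paasVsVps({ paasMonthly, vpsMonthly, opsHoursPerMonth, hourlyRate }) {
  // A VPS is only cheaper once you price in the hours you spend managing it
  const vpsTotal = vpsMonthly + opsHoursPerMonth * hourlyRate;
  return {
    vpsTotal,
    recommendation: paasMonthly > vpsTotal ? 'graduate to a VPS' : 'stay on PaaS',
  };
}

// Example: $200/mo on PaaS vs a $40/mo VPS that needs ~2h/mo of admin at $50/h
const verdict = paasVsVps({
  paasMonthly: 200,
  vpsMonthly: 40,
  opsHoursPerMonth: 2,
  hourlyRate: 50,
});
// verdict.recommendation === 'graduate to a VPS' (200 > 40 + 100)
```

The ops-hours estimate is the honest variable here: teams routinely underestimate it, which is exactly why PaaS wins early on.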

Chapter 7 — Infrastructure-as-a-Service (IaaS)

IaaS gives you virtual machines in the cloud with pay-per-use pricing, elastic scaling, and a massive ecosystem of managed services around them. This is where most professional production workloads live.

Core Concepts

Instances (Virtual Machines)

A cloud VM with configurable CPU, RAM, storage, and networking. You choose the OS, install what you want, and pay by the hour/second.

AMIs / Images

Pre-configured OS snapshots. You can use official images (Ubuntu 22.04) or create custom ones with your software pre-installed (golden images).

Security Groups

Virtual firewalls controlling inbound/outbound traffic to your instances. Stateful — if you allow inbound on port 443, the response traffic is automatically allowed.

VPC (Virtual Private Cloud)

Your own isolated network in the cloud. You define subnets (public/private), route tables, and internet gateways. Instances in private subnets can't be reached from the internet directly.

┌─────────────────── VPC (10.0.0.0/16) ────────────────────┐
│                                                          │
│   ┌──── Public Subnet (10.0.1.0/24) ────┐                │
│   │                                     │                │
│   │   ┌──────────┐    ┌──────────┐      │                │
│   │   │ Web      │    │ NAT      │      │                │
│   │   │ Server   │    │ Gateway  │      │                │
│   │   └──────────┘    └────┬─────┘      │                │
│   └────────────────────────┼────────────┘                │
│                            │                             │
│   ┌──── Private Subnet (10.0.2.0/24) ───┐                │
│   │                        │            │                │
│   │   ┌──────────┐    ┌────┴─────┐      │                │
│   │   │ App      │    │ Database │      │                │
│   │   │ Server   │    │ (RDS)    │      │                │
│   │   └──────────┘    └──────────┘      │                │
│   └─────────────────────────────────────┘                │
│                                                          │
└─────────────────────────▲────────────────────────────────┘
                          │ Internet Gateway
                          ▼
                    ┌──────────┐
                    │ Internet │
                    └──────────┘
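The subnet boundaries in a VPC are plain CIDR arithmetic: a /24 fixes the first 24 bits of the address. A minimal sketch of checking whether an IPv4 address falls inside a CIDR block (a simplified illustration, not a production parser; it assumes well-formed dotted-quad input):

```javascript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer
function ipToInt(ip) {
  return ip.split('.').reduce((acc, octet) => (acc << 8) + Number(octet), 0) >>> 0;
}

// Is `ip` inside the CIDR block, e.g. inCidr('10.0.2.5', '10.0.2.0/24')?
function inCidr(ip, cidr) {
  const [base, bitsStr] = cidr.split('/');
  const bits = Number(bitsStr);
  // Build a mask with the top `bits` bits set; /0 matches everything
  const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0;
  return (ipToInt(ip) & mask) === (ipToInt(base) & mask);
}
```

This is exactly the test a route table performs: 10.0.2.5 is in the private subnet (10.0.2.0/24) and in the VPC (10.0.0.0/16), but not in the public subnet (10.0.1.0/24).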

Hands-On: Launching an EC2 Instance (AWS CLI)

# Prerequisites: AWS CLI installed, credentials configured
# aws configure (set access key, secret, region)

# 1. Create a key pair for SSH access
aws ec2 create-key-pair --key-name myapp-key --query 'KeyMaterial' \
  --output text > ~/.ssh/myapp-key.pem
chmod 400 ~/.ssh/myapp-key.pem

# 2. Create a security group
aws ec2 create-security-group \
  --group-name myapp-sg \
  --description "Web server security group"

# Allow SSH, HTTP, HTTPS
aws ec2 authorize-security-group-ingress --group-name myapp-sg \
  --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name myapp-sg \
  --protocol tcp --port 80 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name myapp-sg \
  --protocol tcp --port 443 --cidr 0.0.0.0/0

# 3. Launch the instance
#    AMI IDs are region-specific and change over time; the ID below is only an
#    example. Look up a current Ubuntu AMI for your region before running this.
aws ec2 run-instances \
  --image-id ami-0c55b159cbfafe1f0 \
  --instance-type t3.micro \
  --key-name myapp-key \
  --security-groups myapp-sg \
  --count 1 \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=myapp-web}]'

# 4. Get the public IP
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=myapp-web" \
  --query 'Reservations[0].Instances[0].PublicIpAddress' --output text

# 5. SSH in
ssh -i ~/.ssh/myapp-key.pem ubuntu@INSTANCE_IP

Instance Types (AWS Example)

| Family | Optimized For | Example | Use Case |
| --- | --- | --- | --- |
| t3/t4g | Burstable general purpose | t3.micro (2 vCPU, 1GB) | Low-traffic web apps, dev |
| m6i/m7g | Balanced compute/memory | m6i.large (2 vCPU, 8GB) | General web apps |
| c6i/c7g | Compute-intensive | c6i.xlarge (4 vCPU, 8GB) | API servers, batch processing |
| r6i/r7g | Memory-intensive | r6i.large (2 vCPU, 16GB) | Caching, in-memory DBs |
| Graviton (g suffix) | ARM-based, ~20% cheaper | t4g.micro | Everything (if your app supports ARM) |

Auto Scaling

Auto Scaling automatically adjusts the number of instances based on demand:

# Conceptual flow:
# 1. Create a Launch Template (defines instance config)
# 2. Create an Auto Scaling Group (min/max/desired instances)
# 3. Attach scaling policies (CPU > 70% → add instance)
# 4. Attach to a Load Balancer (distributes traffic)

# CloudWatch alarm triggers scaling:
# CPU > 70% for 5 minutes → scale out (add instances)
# CPU < 30% for 10 minutes → scale in (remove instances)
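Stripped of the AWS machinery, a scaling policy is just a threshold check over a window of recent metric samples. A minimal sketch using the thresholds above (the function shape is illustrative, not any SDK's API):

```javascript
// Decide a scaling action from recent CPU samples (percentages, newest last).
function scalingDecision(cpuSamples, { high = 70, low = 30 } = {}) {
  const avg = cpuSamples.reduce((sum, s) => sum + s, 0) / cpuSamples.length;
  if (avg > high) return 'scale-out'; // sustained load: add an instance
  if (avg < low) return 'scale-in';   // sustained idle: remove an instance
  return 'hold';                      // within the comfortable band
}
```

Real policies also add cooldown periods and minimum/maximum instance counts so the group doesn't oscillate, but the core decision is this comparison.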
Start simple. Don't set up auto-scaling on day one. Start with a single instance, monitor its resource usage, and add auto-scaling when you actually need it. Premature scaling adds complexity without benefit.

Chapter 8 — Containers & Orchestration

Containers package your application with all its dependencies into a portable, reproducible unit. This solves "works on my machine" permanently.

Dockerfile Best Practices

# Multi-stage build — keeps final image small
# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci                 # full install: the build step needs devDependencies
COPY . .
RUN npm run build
RUN npm prune --omit=dev   # drop devDependencies before copying into the final image

# Stage 2: Production image
FROM node:20-alpine
WORKDIR /app

# Don't run as root
RUN addgroup -S appgroup && adduser -S appuser -G appgroup

# Copy only what's needed
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./

USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]
Key Dockerfile principles:
• Use specific base image tags (node:20-alpine, not node:latest)
• Multi-stage builds to minimize image size
• Copy package.json first (layer caching for dependencies)
• Run as non-root user
• Add HEALTHCHECK for orchestrators
• Use .dockerignore to exclude node_modules, .git, etc.

Docker Compose (Multi-Service Apps)

# docker-compose.yml — typical web app stack
services:
  app:
    build: .
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgres://user:pass@db:5432/myapp
      - REDIS_URL=redis://cache:6379
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_started
    restart: unless-stopped

  db:
    image: postgres:16-alpine
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: myapp
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user -d myapp"]
      interval: 5s
      timeout: 3s
      retries: 5

  cache:
    image: redis:7-alpine
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      - app

volumes:
  pgdata:
# Commands
docker compose up -d          # Start all services
docker compose logs -f app    # Follow app logs
docker compose ps             # Status of all services
docker compose down           # Stop and remove
docker compose up -d --build  # Rebuild and restart

Container Registries

| Registry | Free Tier | Best For |
| --- | --- | --- |
| Docker Hub | 1 private repo | Public images, open source |
| GitHub Container Registry (ghcr.io) | Generous free | GitHub-based projects |
| AWS ECR | 500MB free | AWS deployments |
| Google Artifact Registry | 500MB free | GCP deployments |

Kubernetes — When You Need Orchestration

Kubernetes (K8s) manages containers at scale: scheduling, scaling, self-healing, service discovery, rolling updates.

You probably don't need Kubernetes. Docker Compose on a single VPS handles most workloads. K8s is for: multiple services, multiple teams, auto-scaling requirements, or when you need zero-downtime deployments with automated rollbacks. The operational overhead is significant.

Core Kubernetes Concepts

# Pod — smallest deployable unit (one or more containers)
# Deployment — manages replica sets, rolling updates
# Service — stable network endpoint for pods
# Ingress — HTTP routing from outside the cluster

# Example: Deployment + Service + Ingress
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: ghcr.io/you/myapp:v1.2.3
        ports:
        - containerPort: 3000
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 10
          periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-svc
spec:
  selector:
    app: myapp
  ports:
  - port: 80
    targetPort: 3000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
  - hosts:
    - myapp.example.com
    secretName: myapp-tls
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-svc
            port:
              number: 80

Managed Kubernetes Services

| Service | Provider | Starting Cost | Notes |
| --- | --- | --- | --- |
| EKS | AWS | $0.10/hr control plane + nodes | Most popular, complex |
| GKE | Google | 1 free zonal cluster + nodes | Best K8s experience (Google made K8s) |
| AKS | Azure | Free control plane + nodes | Good for Microsoft shops |
| DOKS | DigitalOcean | Free control plane + $12/node | Simplest managed K8s |

Chapter 9 — Serverless & Edge

Serverless means you write functions, not servers. The cloud provider handles all infrastructure — you pay only when your code runs.

How Serverless Works

Traditional Server:                    Serverless:
┌─────────────────────┐                ┌─────────────────────┐
│ Server running 24/7 │                │ No server running   │
│ (paying even idle)  │                │ (paying $0 idle)    │
│                     │                │                     │
│ Request → Process   │                │ Request → Cold Start│
│ Request → Process   │                │           → Process │
│ ...idle...          │                │ ...idle...          │
│ Request → Process   │                │ Request → Process   │
│ ...idle...          │                │           (warm)    │
│ ...nothing...       │                │ Request → Cold Start│
└─────────────────────┘                └─────────────────────┘
Cost: $$$$ (always on)                 Cost: $ (per invocation)

AWS Lambda Example

// handler.js — AWS Lambda function
exports.handler = async (event) => {
    const body = JSON.parse(event.body || '{}');

    // Your business logic here
    const result = await processRequest(body);

    return {
        statusCode: 200,
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(result)
    };
};

// Deploy with AWS SAM or Serverless Framework:
# serverless.yml
service: myapi
provider:
  name: aws
  runtime: nodejs20.x
  region: eu-west-1
functions:
  api:
    handler: handler.handler
    events:
      - httpApi:
          path: /api/{proxy+}
          method: ANY
    memorySize: 256
    timeout: 10

Cloudflare Workers (Edge Computing)

// worker.js — runs at 300+ edge locations worldwide
export default {
  async fetch(request, env) {
    const url = new URL(request.url);

    if (url.pathname === '/api/hello') {
      return new Response(JSON.stringify({ message: 'Hello from the edge!' }), {
        headers: { 'Content-Type': 'application/json' }
      });
    }

    // Proxy to origin for other routes
    return fetch(request);
  }
};

Serverless Comparison

Platform | Cold Start | Max Duration | Free Tier | Best For
-------- | ---------- | ------------ | --------- | --------
AWS Lambda | 100-500ms | 15 min | 1M requests/mo | Full backend APIs, event processing
Cloudflare Workers | ~0ms (no cold start) | 30s (free), 15min (paid) | 100K requests/day | Edge logic, fast APIs
Vercel Functions | ~250ms | 10-60s | 100GB-hrs/mo | Next.js apps, frontend teams
Netlify Functions | ~200ms | 10-26s | 125K requests/mo | JAMstack backends
Google Cloud Functions | 100-400ms | 9-60 min | 2M invocations/mo | GCP ecosystem, event-driven

When Serverless Fits

Use serverless when:
• Traffic is sporadic/unpredictable (pay-per-use saves money)
• Individual requests are short-lived (<30s)
• You want zero infrastructure management
• Event-driven workloads (file uploads, webhooks, scheduled tasks)
• API endpoints with variable traffic

Avoid serverless when:
• Consistent high traffic (a server is cheaper)
• Long-running processes (video encoding, ML training)
• WebSocket connections needed
• You need local filesystem or persistent state
• Cold starts are unacceptable (real-time systems)

Edge Computing

Edge computing runs your code at CDN points-of-presence (PoPs) close to users, reducing latency from ~100ms to ~10ms.

The hybrid pattern: Use edge functions for authentication, A/B testing, geolocation routing, and caching logic. Keep heavy business logic in a traditional server or Lambda. This gives you the best of both worlds — fast edge responses with powerful backend processing.
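
As a concrete sketch of that split, a minimal Workers-style handler (route names and the cookie check are illustrative, not from a real deployment) can gate authentication and geolocation at the edge and fall through to the origin for everything else:

```javascript
// Edge half of the hybrid pattern. Return a Response to answer at the
// edge, or null to fall through and proxy to the origin.
function handleAtEdge(request) {
  const url = new URL(request.url);
  const cookie = request.headers.get('Cookie') || '';

  // Authentication gate: reject unauthenticated /account traffic
  // without ever waking the origin.
  if (url.pathname.startsWith('/account') && !cookie.includes('session=')) {
    return new Response('Unauthorized', { status: 401 });
  }

  // Geolocation routing: Cloudflare attaches country info as request.cf.
  if (url.pathname === '/api/region') {
    const country = request.cf?.country || 'unknown';
    return new Response(JSON.stringify({ country }), {
      headers: { 'Content-Type': 'application/json' },
    });
  }

  return null; // heavy business logic stays on the backend
}

// The Worker entry point would wire it up like this:
// export default {
//   async fetch(request) {
//     return handleAtEdge(request) ?? fetch(request); // proxy to origin
//   },
// };
```

The decision function stays pure, which also makes it testable without an edge runtime.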

Chapter 10 — Cloud Providers Deep Dive

The "Big Three" (AWS, GCP, Azure) plus strong alternatives. Understanding their service ecosystems lets you pick the right provider and avoid vendor lock-in traps.

Service Mapping Across Providers

Category | AWS | GCP | Azure | DigitalOcean
-------- | --- | --- | ----- | ------------
Compute (VMs) | EC2 | Compute Engine | Virtual Machines | Droplets
Containers | ECS / EKS | Cloud Run / GKE | ACI / AKS | DOKS / App Platform
Serverless | Lambda | Cloud Functions | Azure Functions | Functions (beta)
Object Storage | S3 | Cloud Storage | Blob Storage | Spaces
SQL Database | RDS / Aurora | Cloud SQL / Spanner | Azure SQL | Managed Databases
NoSQL | DynamoDB | Firestore / Bigtable | Cosmos DB | MongoDB (managed)
CDN | CloudFront | Cloud CDN | Azure CDN | Spaces CDN
DNS | Route 53 | Cloud DNS | Azure DNS | DNS (basic)
Load Balancer | ALB / NLB | Cloud Load Balancing | Azure LB | Load Balancers
Secrets | Secrets Manager | Secret Manager | Key Vault | n/a
Monitoring | CloudWatch | Cloud Monitoring | Azure Monitor | Built-in metrics
IaC | CloudFormation | Deployment Manager | ARM / Bicep | Terraform only

Pricing Models

Model | Description | Savings | Commitment
----- | ----------- | ------- | ----------
On-Demand | Pay by the hour/second, no commitment | 0% (baseline) | None
Reserved / Committed | 1-3 year commitment for lower rate | 30-72% | 1-3 years
Spot / Preemptible | Unused capacity, can be terminated anytime | 60-90% | None (but unreliable)
Savings Plans | Commit to $/hr spend, flexible instance types | 20-50% | 1-3 years
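
The table translates into simple arithmetic. A back-of-envelope sketch (the hourly rate and discount figures below are illustrative placeholders, not real list prices):

```javascript
// Rough monthly cost of one always-on VM under each pricing model.
const HOURS_PER_MONTH = 730; // average hours in a month

function monthlyCost(hourlyRate, discount = 0) {
  return HOURS_PER_MONTH * hourlyRate * (1 - discount);
}

const rate = 0.0416;                      // illustrative on-demand $/hour
const onDemand = monthlyCost(rate);       // baseline
const reserved = monthlyCost(rate, 0.40); // ~40% off for a 1-year commitment
const spot = monthlyCost(rate, 0.70);     // ~70% off, but can be reclaimed

console.log(`on-demand: $${onDemand.toFixed(2)}/mo`);
console.log(`reserved:  $${reserved.toFixed(2)}/mo`);
console.log(`spot:      $${spot.toFixed(2)}/mo`);
```

The break-even question is always the same: will this instance actually run long enough, and predictably enough, to beat on-demand at the committed rate?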

Hands-On: Full Stack on AWS

Deploying a web app with EC2 + RDS + S3 + CloudFront:

# Architecture:
# CloudFront (CDN) → ALB → EC2 (app) → RDS (PostgreSQL)
#                         → S3 (static assets/uploads)

# Step 1: Create VPC with public/private subnets
aws ec2 create-vpc --cidr-block 10.0.0.0/16
# (In practice, use Terraform — shown in Chapter 14)

# Step 2: Launch RDS in private subnet
aws rds create-db-instance \
  --db-instance-identifier myapp-db \
  --db-instance-class db.t3.micro \
  --engine postgres \
  --engine-version 16 \
  --master-username admin \
  --master-user-password "$(openssl rand -base64 24)" \
  --allocated-storage 20 \
  --no-publicly-accessible \
  --vpc-security-group-ids sg-xxxxx

# Step 3: Create S3 bucket for assets
aws s3 mb s3://myapp-assets-prod
# New buckets block public bucket policies by default, so allow them first:
aws s3api delete-public-access-block --bucket myapp-assets-prod
aws s3api put-bucket-policy --bucket myapp-assets-prod \
  --policy '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":"*","Action":"s3:GetObject","Resource":"arn:aws:s3:::myapp-assets-prod/*"}]}'

# Step 4: Create CloudFront distribution
aws cloudfront create-distribution \
  --origin-domain-name myapp-assets-prod.s3.amazonaws.com \
  --default-root-object index.html

# Step 5: Set up ALB + EC2 (use launch template + auto-scaling group)
# Step 6: Configure Route 53 to point domain to CloudFront

Hands-On: Full Stack on DigitalOcean

# Architecture:
# Load Balancer → Droplet(s) → Managed PostgreSQL
#                            → Spaces (S3-compatible storage)

# Step 1: Create a Droplet
doctl compute droplet create myapp-web \
  --image ubuntu-22-04-x64 \
  --size s-1vcpu-2gb \
  --region fra1 \
  --ssh-keys YOUR_KEY_FINGERPRINT

# Step 2: Create managed database
doctl databases create myapp-db \
  --engine pg \
  --version 16 \
  --size db-s-1vcpu-1gb \
  --region fra1 \
  --num-nodes 1

# Step 3: Create Spaces bucket (S3-compatible)
# Done via web console or s3cmd with DO endpoint

# Step 4: Create load balancer
doctl compute load-balancer create \
  --name myapp-lb \
  --region fra1 \
  --forwarding-rules "entry_protocol:https,entry_port:443,target_protocol:http,target_port:3000,certificate_id:YOUR_CERT" \
  --droplet-ids DROPLET_ID

# Step 5: Point domain DNS to load balancer IP

Which Provider to Choose

AWS: Largest ecosystem, most services, best for enterprise. Steep learning curve, complex pricing.
GCP: Best for data/ML, Kubernetes (they invented it), clean APIs. Smaller market share.
Azure: Best for Microsoft/.NET shops, enterprise AD integration. Complex portal.
DigitalOcean: Simplest UX, predictable pricing, great docs. Fewer services, smaller scale.
Hetzner: Best price/performance in EU. Minimal managed services but unbeatable value.

Rule of thumb: Start with DigitalOcean or Hetzner for simplicity. Move to AWS/GCP when you need managed services (ML, analytics, complex networking) that simpler providers don't offer.

Chapter 11 — Domain Names & DNS

DNS (Domain Name System) translates human-readable names to IP addresses. It's the phone book of the internet, and misconfiguring it is one of the most common causes of "my site is down."

How DNS Resolution Works

User types: www.example.com
           │
           ▼
┌─────────────────────┐
│  Browser DNS Cache  │ ← Checked first (TTL-based)
└──────────┬──────────┘
           │ Cache miss
           ▼
┌─────────────────────┐
│    OS DNS Cache     │ ← /etc/hosts, systemd-resolved
└──────────┬──────────┘
           │ Cache miss
           ▼
┌─────────────────────┐
│ Recursive Resolver  │ ← Your ISP's DNS or 1.1.1.1 / 8.8.8.8
│ (e.g., Cloudflare)  │
└──────────┬──────────┘
           │ Queries hierarchy:
           ▼
┌─────────────────────┐
│  Root Name Servers  │ ← "Who handles .com?"
│ (13 clusters, a-m)  │ → "Ask the .com TLD servers"
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│  TLD Name Servers   │ ← "Who handles example.com?"
│  (.com, .org, .io)  │ → "Ask ns1.cloudflare.com"
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│  Authoritative NS   │ ← "What's the A record for www.example.com?"
│ (your DNS provider) │ → "93.184.216.34, TTL 300"
└──────────┬──────────┘
           │
           ▼
Result cached at each level for TTL duration
Browser connects to 93.184.216.34

DNS Record Types

Type | Purpose | Example
---- | ------- | -------
A | Maps name to IPv4 address | example.com → 93.184.216.34
AAAA | Maps name to IPv6 address | example.com → 2606:2800:220:1:...
CNAME | Alias to another name | www.example.com → example.com
MX | Mail server for the domain | example.com → mail.example.com (priority 10)
TXT | Arbitrary text (verification, SPF, DKIM) | example.com → "v=spf1 include:_spf.google.com ~all"
NS | Nameservers for the domain | example.com → ns1.cloudflare.com
SRV | Service location (port + priority) | _sip._tcp.example.com → sipserver.example.com:5060
CAA | Which CAs can issue certificates | example.com → "0 issue letsencrypt.org"

Practical DNS Configuration

# Typical DNS setup for a web app:

# Root domain → your server
example.com.        A       93.184.216.34
example.com.        AAAA    2606:2800:220:1::248

# www subdomain → alias to root
www.example.com.    CNAME   example.com.

# API subdomain → different server or load balancer
api.example.com.    A       10.20.30.40

# Email (Google Workspace example)
example.com.        MX      1  aspmx.l.google.com.
example.com.        MX      5  alt1.aspmx.l.google.com.
example.com.        TXT     "v=spf1 include:_spf.google.com ~all"

# DKIM (email authentication)
google._domainkey.example.com.  TXT  "v=DKIM1; k=rsa; p=MIGfMA0..."

# DMARC (email policy)
_dmarc.example.com. TXT     "v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com"

# Let's Encrypt verification
_acme-challenge.example.com.  TXT  "random-verification-string"

TTL (Time To Live)

TTL tells resolvers how long they may cache a record, in seconds. Common values: 300 (5 minutes) for records you expect to change soon, 3600 (1 hour) as a sensible default, 86400 (1 day) for records that rarely change.

Before a migration: Lower TTL to 300 seconds 24-48 hours before changing the record. This ensures the old high TTL expires, and when you make the change, it propagates within 5 minutes instead of hours.
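
The caching behavior every resolver in the chain applies can be sketched in a few lines (a toy in-memory cache, not a real resolver):

```javascript
// TTL-based cache: entries expire ttlSeconds after being stored,
// exactly like a DNS record aging out of a resolver's cache.
class TtlCache {
  constructor() {
    this.entries = new Map();
  }

  set(name, value, ttlSeconds) {
    this.entries.set(name, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }

  get(name) {
    const entry = this.entries.get(name);
    if (!entry) return null; // miss: would query upstream
    if (Date.now() >= entry.expiresAt) {
      this.entries.delete(name); // TTL elapsed: treat as a miss
      return null;
    }
    return entry.value; // hit: answer without querying upstream
  }
}

const cache = new TtlCache();
cache.set('example.com', '93.184.216.34', 300); // cached for 5 minutes
console.log(cache.get('example.com'));
```

A record cached with a one-hour TTL keeps being served, stale or not, until the hour is up; that lag is exactly why you lower TTLs before a migration.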

DNS Providers

Provider | Free Tier | Best Feature | Best For
-------- | --------- | ------------ | --------
Cloudflare | Unlimited zones | CDN + DDoS + DNS in one | Most websites (recommended default)
AWS Route 53 | None ($0.50/zone) | Latency/geo routing, health checks | AWS-heavy infrastructure
Google Cloud DNS | None | Low latency, DNSSEC | GCP infrastructure
NS1 | Limited free | Advanced traffic management | Complex routing needs

CDN Integration

A CDN (Content Delivery Network) caches your content at edge locations worldwide. DNS is how you route users to the nearest edge:

# Without CDN: Users hit your origin server directly
example.com.  A  YOUR_SERVER_IP

# With Cloudflare (proxy mode): Users hit Cloudflare edge, which proxies to origin
# Just enable the orange cloud icon in Cloudflare dashboard
# DNS resolves to Cloudflare's anycast IPs, not your server

# With AWS CloudFront: Point domain to CloudFront distribution
example.com.  ALIAS  d1234567890.cloudfront.net.
# (ALIAS is AWS-specific; equivalent to CNAME at zone apex)

Chapter 12 — Reverse Proxies & Load Balancing

A reverse proxy sits between the internet and your application servers. It handles SSL termination, load balancing, caching, rate limiting, and request routing — so your app doesn't have to.

Why Use a Reverse Proxy

• SSL termination: certificates live in one place, not in every app
• Load balancing: spread traffic across multiple app instances
• Caching and compression: serve static and repeat content without touching the app
• Rate limiting: absorb abusive traffic before it reaches your code
• Request routing: one public endpoint in front of many internal services

Nginx — Full Production Configuration

# /etc/nginx/sites-available/myapp.conf
upstream app_backend {
    # Load balancing across multiple app instances
    server 127.0.0.1:3000 weight=3;
    server 127.0.0.1:3001 weight=2;
    server 127.0.0.1:3002 backup;    # Only used if others are down

    # Health checks (Nginx Plus) or use passive checks
    keepalive 32;    # Persistent connections to backend
}

# Redirect HTTP to HTTPS
server {
    listen 80;
    server_name example.com www.example.com;
    return 301 https://$host$request_uri;    # $host preserves the requested name (www or apex)
}

# Main HTTPS server
server {
    listen 443 ssl http2;
    server_name example.com www.example.com;

    # SSL Configuration
    ssl_certificate /etc/letsencrypt/live/example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
    ssl_prefer_server_ciphers off;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 1d;

    # Security headers
    add_header Strict-Transport-Security "max-age=63072000; includeSubDomains" always;
    add_header X-Frame-Options DENY always;
    add_header X-Content-Type-Options nosniff always;
    add_header X-XSS-Protection "1; mode=block" always;

    # Gzip compression
    gzip on;
    gzip_types text/plain text/css application/json application/javascript text/xml;
    gzip_min_length 1000;

    # Static files — served directly by Nginx (fast)
    location /static/ {
        alias /var/www/myapp/static/;
        expires 30d;
        add_header Cache-Control "public, immutable";
    }

    # API — proxy to backend
    location /api/ {
        proxy_pass http://app_backend;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Connection "";

        # Timeouts
        proxy_connect_timeout 5s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;

        # Rate limiting
        limit_req zone=api burst=20 nodelay;
    }

    # WebSocket support
    location /ws/ {
        proxy_pass http://app_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_read_timeout 86400;    # Keep WebSocket alive
    }
}

# Rate limiting zone (defined in nginx.conf http block)
# limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

Caddy — The Modern Alternative (Auto-HTTPS)

# Caddyfile — entire config for production site with auto-SSL
example.com {
    # Automatic HTTPS (Let's Encrypt) — zero configuration needed!

    # Reverse proxy to app
    reverse_proxy /api/* localhost:3000

    # Static files
    root * /var/www/myapp/static
    file_server

    # Compression
    encode gzip zstd

    # Security headers
    header {
        Strict-Transport-Security "max-age=63072000; includeSubDomains"
        X-Frame-Options DENY
        X-Content-Type-Options nosniff
    }

    # Rate limiting (requires the caddy-ratelimit plugin)
    rate_limit {
        zone api {
            key    {remote_host}
            events 10
            window 1s
        }
    }

    # Logging
    log {
        output file /var/log/caddy/access.log
        format json
    }
}

Caddy vs Nginx: Caddy automatically obtains and renews SSL certificates with zero configuration. For new projects, Caddy is often the better choice — less config, automatic HTTPS, modern defaults. Nginx is better when you need maximum performance tuning or have complex routing requirements.

Load Balancing Algorithms

Algorithm | How It Works | Best For
--------- | ------------ | --------
Round Robin | Requests distributed sequentially | Identical servers, stateless apps
Weighted Round Robin | More requests to higher-weight servers | Mixed server capacities
Least Connections | Send to server with fewest active connections | Variable request durations
IP Hash | Same client IP always goes to same server | Session affinity (sticky sessions)
Random | Random server selection | Large clusters, simple
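
Three of these algorithms are small enough to sketch directly (an in-memory server list with illustrative host names):

```javascript
const servers = [
  { host: 'app1', active: 0 },
  { host: 'app2', active: 0 },
  { host: 'app3', active: 0 },
];

// Round robin: hand out servers in order, wrapping around.
let rrIndex = 0;
function roundRobin() {
  const server = servers[rrIndex];
  rrIndex = (rrIndex + 1) % servers.length;
  return server;
}

// Least connections: pick the server with the fewest in-flight requests.
function leastConnections() {
  return servers.reduce((best, s) => (s.active < best.active ? s : best));
}

// IP hash: the same client IP always lands on the same server (sticky).
function ipHash(clientIp) {
  let h = 0;
  for (const c of clientIp) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return servers[h % servers.length];
}
```

Real proxies layer health checks on top of the same ideas — the Nginx upstream block above adds weights and a backup server to plain round robin.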

Cloud Load Balancers

Type | AWS | Layer | Use Case
---- | --- | ----- | --------
Application LB | ALB | Layer 7 (HTTP) | Web apps, path-based routing, WebSocket
Network LB | NLB | Layer 4 (TCP/UDP) | High performance, static IP, non-HTTP
Gateway LB | GWLB | Layer 3 | Network appliances (firewalls, IDS)

Chapter 13 — CI/CD Pipelines

CI/CD (Continuous Integration / Continuous Deployment) automates the path from code commit to production. No more manual deployments, no more "I forgot to run the tests."

CI vs CD

Term | What It Does | Triggered By
---- | ------------ | ------------
Continuous Integration (CI) | Automatically build and test every commit/PR | Every push or pull request
Continuous Delivery (CD) | Automatically prepare releases (deploy to staging) | Merge to main branch
Continuous Deployment (CD) | Automatically deploy to production | After all checks pass

Developer pushes code
          │
          ▼
┌────────────────────────── CI Pipeline ──────────────────────────┐
│                                                                 │
│  ┌────────┐  ┌────────┐  ┌────────┐  ┌─────────┐  ┌────────┐   │
│  │  Lint  │─▶│  Test  │─▶│ Build  │─▶│  Scan   │─▶│Artifact│   │
│  │        │  │(unit + │  │(compile│  │(security│  │(Docker │   │
│  │        │  │integr.)│  │ bundle)│  │  vulns) │  │ image) │   │
│  └────────┘  └────────┘  └────────┘  └─────────┘  └────────┘   │
│                                                                 │
└───────────────────────────────┬─────────────────────────────────┘
                                │ All green ✓
                                ▼
┌────────────────────────── CD Pipeline ──────────────────────────┐
│                                                                 │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────────┐  │
│  │  Deploy  │─▶│  Smoke   │─▶│  Deploy  │─▶│    Monitor     │  │
│  │  Staging │  │  Tests   │  │Production│  │+ Auto-rollback │  │
│  └──────────┘  └──────────┘  └──────────┘  └────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

GitHub Actions — Complete Workflow

# .github/workflows/deploy.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  # ─── CI: Test & Build ───────────────────────────────────
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: test
          POSTGRES_PASSWORD: test
        ports: ['5432:5432']
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'

      - run: npm ci
      - run: npm run lint
      - run: npm run test
        env:
          DATABASE_URL: postgres://postgres:test@localhost:5432/test

  # ─── Build & Push Docker Image ─────────────────────────
  build:
    needs: test
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    permissions:
      contents: read
      packages: write
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4

      - uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=
            type=raw,value=latest

      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  # ─── Deploy to Production ──────────────────────────────
  deploy:
    needs: build
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    environment: production    # Requires approval if configured
    steps:
      - name: Deploy to server via SSH
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.SERVER_HOST }}
          username: deploy
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script: |
            docker pull ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
            docker stop myapp || true
            docker rm myapp || true
            docker run -d \
              --name myapp \
              --restart unless-stopped \
              -p 3000:3000 \
              --env-file /etc/myapp.env \
              ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
            # Health check
            sleep 5
            curl -f http://localhost:3000/health || (docker logs myapp && exit 1)

Deployment Strategies

Strategy | How It Works | Downtime | Risk | Rollback Speed
-------- | ------------ | -------- | ---- | --------------
Recreate | Stop old, start new | Yes (seconds-minutes) | Low complexity | Redeploy old version
Rolling | Replace instances one by one | No | Medium (mixed versions briefly) | Continue rolling with old
Blue-Green | Run two identical environments, switch traffic | No | Low (instant switch) | Switch back instantly
Canary | Route small % of traffic to new version | No | Lowest (limited blast radius) | Route 100% back to old

Blue-Green Deployment:

  Before:  [Load Balancer] ──100%──▶ [Blue  v1.0] (active)
                                     [Green v1.1] (idle, being tested)

  Switch:  [Load Balancer] ──100%──▶ [Green v1.1] (now active)
                                     [Blue  v1.0] (idle, rollback ready)

Canary Deployment:

  Step 1:  [Load Balancer] ──95%───▶ [v1.0] (stable)
                           ──5%────▶ [v1.1] (canary)

  Step 2:  [Load Balancer] ──50%───▶ [v1.0]
                           ──50%───▶ [v1.1] (looking good)

  Step 3:  [Load Balancer] ──100%──▶ [v1.1] (fully rolled out)
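
Canary routing hinges on one detail: the split must be deterministic per user, or people flip between versions on every request. A sketch (the hash and version labels are illustrative):

```javascript
// Map a stable user ID to a bucket 0-99, then compare against the
// canary percentage. The same user always gets the same answer.
function bucket(userId) {
  let h = 0;
  for (const c of userId) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h % 100;
}

function chooseVersion(userId, canaryPercent) {
  return bucket(userId) < canaryPercent ? 'v1.1-canary' : 'v1.0-stable';
}

// Ramping the rollout is just raising canaryPercent: 5 → 50 → 100.
console.log(chooseVersion('user-42', 5));
```

In production this logic usually lives in the load balancer or service mesh, but the principle is identical.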

GitLab CI Example

# .gitlab-ci.yml
stages:
  - test
  - build
  - deploy

test:
  stage: test
  image: node:20
  services:
    - postgres:16
  variables:
    DATABASE_URL: postgres://postgres:test@postgres:5432/test
  script:
    - npm ci
    - npm run lint
    - npm run test

build:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  only:
    - main

deploy:
  stage: deploy
  script:
    - ssh deploy@$SERVER "docker pull $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA && docker-compose up -d"
  only:
    - main
  environment:
    name: production

Chapter 14 — Infrastructure as Code

Infrastructure as Code (IaC) means defining your servers, networks, databases, and all cloud resources in version-controlled configuration files — not clicking through web consoles.

Why IaC

• Reproducible: rebuild an identical environment from the same files
• Reviewable: infrastructure changes go through the same pull-request review as code
• Versioned: every change lives in Git history and can be rolled back with a revert
• Self-documenting: the code is an always-current description of what exists

Terraform — The Industry Standard

# main.tf — Deploy a web app on AWS

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  # Store state remotely (never in git!)
  backend "s3" {
    bucket = "myapp-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "eu-west-1"
  }
}

provider "aws" {
  region = var.region
}

# ─── VPC ──────────────────────────────────────────────────
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0"

  name = "${var.app_name}-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["${var.region}a", "${var.region}b"]
  public_subnets  = ["10.0.1.0/24", "10.0.2.0/24"]
  private_subnets = ["10.0.10.0/24", "10.0.11.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = true    # Cost saving for non-prod
}

# ─── Database ─────────────────────────────────────────────
resource "aws_db_instance" "main" {
  identifier     = "${var.app_name}-db"
  engine         = "postgres"
  engine_version = "16"
  instance_class = "db.t3.micro"

  allocated_storage = 20
  storage_encrypted = true

  db_name  = var.app_name
  username = "admin"
  password = var.db_password    # From secrets, never hardcoded

  vpc_security_group_ids = [aws_security_group.db.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name

  skip_final_snapshot = false
  final_snapshot_identifier = "${var.app_name}-final-snapshot"

  backup_retention_period = 7
  multi_az               = var.environment == "prod"
}

# ─── EC2 Instance ─────────────────────────────────────────
resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type

  subnet_id              = module.vpc.public_subnets[0]
  vpc_security_group_ids = [aws_security_group.web.id]
  key_name               = aws_key_pair.deploy.key_name

  user_data = templatefile("${path.module}/userdata.sh", {
    app_name = var.app_name
    db_host  = aws_db_instance.main.endpoint
  })

  tags = {
    Name        = "${var.app_name}-web"
    Environment = var.environment
  }
}

# ─── Variables ────────────────────────────────────────────
# variables.tf
variable "app_name" { default = "myapp" }
variable "region" { default = "eu-west-1" }
variable "environment" { default = "prod" }
variable "instance_type" { default = "t3.small" }
variable "db_password" { sensitive = true }
# Terraform workflow:
terraform init          # Download providers, initialize backend
terraform plan          # Preview changes (ALWAYS review this)
terraform apply         # Apply changes (creates/modifies resources)
terraform destroy       # Tear down everything (careful!)

Ansible — Configuration Management

Terraform creates infrastructure. Ansible configures it (installs software, deploys apps, manages configs).

# playbook.yml — Configure a web server
---
- name: Configure web server
  hosts: webservers
  become: yes
  vars:
    app_name: myapp
    app_port: 3000
    node_version: "20"

  tasks:
    - name: Update apt cache
      apt:
        update_cache: yes
        cache_valid_time: 3600

    - name: Install required packages
      apt:
        name: [nginx, certbot, python3-certbot-nginx, ufw]
        state: present

    - name: Configure UFW firewall
      ufw:
        rule: allow
        port: "{{ item }}"
        proto: tcp
      loop: ['22', '80', '443']

    - name: Enable UFW
      ufw:
        state: enabled
        policy: deny

    - name: Install Node.js
      shell: |
        curl -fsSL https://deb.nodesource.com/setup_{{ node_version }}.x | bash -
        apt-get install -y nodejs
      args:
        creates: /usr/bin/node

    - name: Deploy application
      git:
        repo: "https://github.com/you/{{ app_name }}.git"
        dest: "/opt/{{ app_name }}"
        version: main
      notify: restart app

    - name: Install dependencies
      npm:
        path: "/opt/{{ app_name }}"
        production: yes

    - name: Create systemd service
      template:
        src: templates/app.service.j2
        dest: "/etc/systemd/system/{{ app_name }}.service"
      notify: restart app

    - name: Configure Nginx
      template:
        src: templates/nginx.conf.j2
        dest: "/etc/nginx/sites-available/{{ app_name }}"
      notify: reload nginx

  handlers:
    - name: restart app
      systemd:
        name: "{{ app_name }}"
        state: restarted
        daemon_reload: yes

    - name: reload nginx
      systemd:
        name: nginx
        state: reloaded

# Run Ansible:
ansible-playbook -i inventory.yml playbook.yml

# inventory.yml
webservers:
  hosts:
    web1:
      ansible_host: 93.184.216.34
      ansible_user: deploy

GitOps Principles

GitOps extends IaC with a simple discipline: Git is the single source of truth for both code and infrastructure, every change arrives as a pull request, and an automated process applies whatever is merged. If the live environment drifts from what Git describes, tooling such as Argo CD or Flux reconciles it back.

Terraform vs Ansible vs CloudFormation:
Terraform: Multi-cloud, creates infrastructure (VMs, networks, DBs). Industry standard.
Ansible: Configures existing servers (install software, deploy apps). Agentless, SSH-based.
CloudFormation: AWS-only IaC. Use if you're 100% AWS and want native integration.
Pulumi: IaC in real programming languages (TypeScript, Python, Go). Good for developers who dislike HCL.

Common combo: Terraform to create infrastructure + Ansible to configure it. Or Terraform + Docker (no config management needed — the container IS the config).

Chapter 15 — Monitoring, Logging & Observability

If you can't see what's happening in production, you can't fix it. Observability is the ability to understand your system's internal state from its external outputs.

The Three Pillars

Pillar | What | Tools | Answers
------ | ---- | ----- | -------
Metrics | Numeric measurements over time | Prometheus, CloudWatch, Datadog | "How much?" "How fast?" "How often?"
Logs | Discrete events with context | Loki, ELK, CloudWatch Logs | "What happened?" "Why did it fail?"
Traces | Request flow across services | Jaeger, Zipkin, AWS X-Ray | "Where is the bottleneck?" "Which service is slow?"
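
For the logs pillar, the single highest-leverage habit is emitting one JSON object per event instead of free-form text, so Loki or CloudWatch can filter on fields. A dependency-free sketch (field names are illustrative):

```javascript
// Structured logging: one parseable JSON line per event.
function formatLog(level, message, fields = {}) {
  return JSON.stringify({
    ts: new Date().toISOString(),
    level,
    message,
    ...fields,
  });
}

function log(level, message, fields) {
  console.log(formatLog(level, message, fields));
}

log('info', 'request completed', { method: 'GET', path: '/api/users', status: 200, duration_ms: 42 });
log('error', 'db query failed', { error: 'timeout', retries: 3 });
```

In production you would reach for a library like pino or winston, but the output contract is the same: fields a log pipeline can query, not prose you have to grep.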

Prometheus + Grafana (The Open-Source Standard)

# docker-compose.yml — Monitoring stack
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3001:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: changeme

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro

volumes:
  prometheus_data:
  grafana_data:
# prometheus.yml — Scrape configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'myapp'
    static_configs:
      - targets: ['myapp:3000']
    metrics_path: '/metrics'

Application Metrics (What to Measure)

The RED method for services and USE method for resources:

Method | Metric | What It Tells You
------ | ------ | -----------------
RED (Services) | Rate | Requests per second
 | Errors | Failed requests per second
 | Duration | Response time (p50, p95, p99)
USE (Resources) | Utilization | % of resource capacity used
 | Saturation | Queue depth, waiting work
 | Errors | Error count on the resource

# Example: Exposing metrics in Node.js (using prom-client)
const client = require('prom-client');

// Default metrics (CPU, memory, event loop)
client.collectDefaultMetrics();

// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});

// Middleware to track requests
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    end({ method: req.method, route: req.route?.path || req.path, status_code: res.statusCode });
  });
  next();
});

// Metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

Alerting Rules

# alerts.yml — Prometheus alerting rules
groups:
  - name: webapp
    rules:
      - alert: HighErrorRate
        expr: rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]) / rate(http_request_duration_seconds_count[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate (> 5%)"
          description: "{{ $labels.instance }} has {{ $value | humanizePercentage }} error rate"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency (> 2s)"

      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"

SLIs, SLOs, and SLAs

Term | Definition | Example
---- | ---------- | -------
SLI (Indicator) | A measurable metric of service quality | 99.2% of requests complete in <500ms
SLO (Objective) | Target value for an SLI (internal goal) | "99.9% availability over 30 days"
SLA (Agreement) | Contract with customers (with consequences) | "99.9% uptime or we credit your bill"

Error budgets: If your SLO is 99.9% uptime (43 minutes downtime/month), you have a 0.1% "error budget." Use it for deployments, experiments, and maintenance. When the budget is exhausted, freeze deployments and focus on reliability.
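
The arithmetic behind that 43 minutes:

```javascript
// Error budget: the downtime an SLO permits over a window.
function errorBudgetMinutes(sloPercent, days = 30) {
  const totalMinutes = days * 24 * 60; // 43,200 minutes in 30 days
  return totalMinutes * (1 - sloPercent / 100);
}

console.log(errorBudgetMinutes(99.9));  // ≈ 43.2 minutes per 30 days
console.log(errorBudgetMinutes(99.99)); // ≈ 4.3 minutes: one more nine is ten times stricter
```

Each additional nine cuts the budget by a factor of ten, which is why "five nines" is an engineering commitment, not a checkbox.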

Chapter 16 — Security & Hardening

Security is not a feature you add later — it's a practice woven into every layer. This chapter covers the essential security measures for any production website.

SSL/TLS — Encrypting Traffic

How TLS Works (Simplified)

Client                                           Server
  │                                                │
  │──── ClientHello (supported ciphers) ──────────▶│
  │◀─── ServerHello (chosen cipher) ───────────────│
  │◀─── Certificate (public key) ──────────────────│
  │                                                │
  │  Client verifies certificate chain:            │
  │  Server cert → Intermediate CA → Root CA       │
  │                                                │
  │──── Key Exchange (encrypted) ─────────────────▶│
  │                                                │
  │◀═══ Encrypted communication begins ═══════════▶│
  │     (symmetric encryption, fast)               │

Certificate Options

Provider | Cost | Validation | Best For
-------- | ---- | ---------- | --------
Let's Encrypt | Free | Domain Validation (DV) | Everything (90-day auto-renewal)
Cloudflare | Free (with proxy) | DV | Sites behind Cloudflare
AWS ACM | Free (with AWS services) | DV | AWS ALB/CloudFront
Commercial CAs | $10-1000/yr | OV/EV | Enterprise, legal requirements

HSTS (HTTP Strict Transport Security)

# Force browsers to always use HTTPS (add to Nginx/response headers)
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload

# Once set, browsers will NEVER make HTTP requests to your domain
# Submit to HSTS preload list: https://hstspreload.org/
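
If your Node process terminates TLS itself (no Nginx in front), you can emit the same header from app code; a small helper (the function name and defaults are illustrative):

```javascript
// Build a Strict-Transport-Security header value.
function hstsHeader({ maxAge = 63072000, includeSubDomains = true, preload = false } = {}) {
  const parts = [`max-age=${maxAge}`];
  if (includeSubDomains) parts.push('includeSubDomains');
  if (preload) parts.push('preload'); // preload also requires includeSubDomains
  return parts.join('; ');
}

// In a request handler:
// res.setHeader('Strict-Transport-Security', hstsHeader({ preload: true }));
console.log(hstsHeader({ preload: true }));
// → max-age=63072000; includeSubDomains; preload
```

Start with a short max-age while testing; once a browser has seen a long-lived HSTS header, there is no quick way to undo it.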

Firewall Configuration

# UFW (Uncomplicated Firewall) — Ubuntu/Debian
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp      # SSH (or your custom port)
sudo ufw allow 80/tcp      # HTTP
sudo ufw allow 443/tcp     # HTTPS
sudo ufw enable
sudo ufw status verbose

# iptables (lower level, more control)
# Allow established connections
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# Allow loopback
iptables -A INPUT -i lo -j ACCEPT
# Allow SSH, HTTP, HTTPS
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
# Drop everything else
iptables -A INPUT -j DROP

# Save rules (persist across reboot)
sudo apt install iptables-persistent
sudo netfilter-persistent save

SSH Hardening

# /etc/ssh/sshd_config — Production SSH configuration
Port 2222                          # Non-standard port (reduces noise)
PermitRootLogin no                 # Never allow root SSH
PasswordAuthentication no          # Keys only
PubkeyAuthentication yes
MaxAuthTries 3
ClientAliveInterval 300
ClientAliveCountMax 2
AllowUsers deploy                  # Whitelist specific users
Protocol 2                         # Ignored by modern OpenSSH (protocol 1 was removed); kept for old hosts

# Optional: Restrict to specific IPs (if you have static IP)
# AllowUsers deploy@YOUR_IP
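
Pair the above with Fail2ban to ban IPs that hammer sshd. A minimal jail sketch; the values are examples, and the port must match the `Port` line in sshd_config:

```ini
# /etc/fail2ban/jail.local
[sshd]
enabled  = true
port     = 2222
maxretry = 5
findtime = 10m
bantime  = 1h
```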

Security Headers

# Add to Nginx server block or application responses:
add_header X-Frame-Options "DENY" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;  # legacy header; modern browsers ignore it and rely on CSP
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
add_header Permissions-Policy "camera=(), microphone=(), geolocation=()" always;
add_header Content-Security-Policy "default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline';" always;

# Test your headers: https://securityheaders.com/

Secrets Management

Never store secrets in Git repositories, Dockerfiles, CI logs (via echoed environment variables), or plain-text config files committed to version control. Even in a private repo, credentials in Git history are permanent.
| Tool | Type | Best For |
| --- | --- | --- |
| Environment variables | Runtime injection | Simple apps (loaded from secure source) |
| AWS Secrets Manager | Cloud-managed | AWS workloads, auto-rotation |
| HashiCorp Vault | Self-hosted/cloud | Multi-cloud, dynamic secrets, PKI |
| SOPS | Encrypted files in Git | GitOps workflows, small teams |
| Doppler / 1Password | SaaS | Teams wanting simple UI |

# Example: Using SOPS to encrypt secrets in Git
# Install: brew install sops age

# Generate an age key
age-keygen -o keys.txt
# Public key: age1xxxxxxx...

# Create .sops.yaml in repo root
creation_rules:
  - path_regex: \.enc\.yaml$
    age: age1xxxxxxx...

# Encrypt a secrets file
sops --encrypt secrets.yaml > secrets.enc.yaml
# secrets.enc.yaml is safe to commit — encrypted at rest

# Decrypt at deploy time
sops --decrypt secrets.enc.yaml > /etc/myapp.env
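
One way to consume the decrypted file, assuming the app runs under systemd (the unit name and binary path are placeholders):

```ini
# /etc/systemd/system/myapp.service (fragment)
[Service]
# Loads KEY=value pairs from the file written by the sops --decrypt step
EnvironmentFile=/etc/myapp.env
ExecStart=/usr/local/bin/myapp
```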

Backup Strategy — The 3-2-1 Rule

Keep 3 copies of your data, on 2 different types of storage, with 1 copy offsite. The script below yields all three: the live database, a local dump, and an offsite copy in S3.

# Automated database backup script
#!/bin/bash
# /opt/scripts/backup-db.sh
set -euo pipefail

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/opt/backups"
S3_BUCKET="s3://myapp-backups-prod"

# Dump database
pg_dump -h localhost -U myapp myapp_prod | gzip > "$BACKUP_DIR/db_$TIMESTAMP.sql.gz"

# Upload to S3 (offsite copy)
aws s3 cp "$BACKUP_DIR/db_$TIMESTAMP.sql.gz" "$S3_BUCKET/db/$TIMESTAMP.sql.gz"

# Retain only the 7 most recent local backups
ls -t "$BACKUP_DIR"/db_*.sql.gz | tail -n +8 | xargs rm -f

# Cron: Run daily at 3 AM
# 0 3 * * * /opt/scripts/backup-db.sh >> /var/log/backup.log 2>&1
Test your backups! A backup you've never restored is not a backup — it's a hope. Schedule monthly restore tests to a separate environment. Automate this if possible.
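
The retention line (`ls -t … | tail -n +8 | xargs rm -f`) is also worth dry-running before you trust it with real backups. A self-contained sketch using fake files in a scratch directory:

```shell
#!/bin/bash
set -euo pipefail

# Create 10 fake backups with distinct, ordered mtimes (touch -t: YYMMDDhhmm)
DIR=$(mktemp -d)
for i in $(seq -w 1 10); do
  touch -t "24${i}010300" "$DIR/db_fake_$i.sql.gz"
done

# The retention rule from the backup script: keep only the 7 newest
ls -t "$DIR"/db_*.sql.gz | tail -n +8 | xargs rm -f

REMAINING=$(ls "$DIR"/db_*.sql.gz | wc -l | tr -d ' ')
echo "backups remaining: $REMAINING"   # → backups remaining: 7
rm -rf "$DIR"
```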

DDoS Mitigation

Production Security Checklist

Before going live, verify:
☐ HTTPS everywhere (HSTS enabled)
☐ SSH: key-only auth, non-standard port, root disabled
☐ Firewall: only required ports open
☐ Security headers configured
☐ Secrets not in Git (use secrets manager)
☐ Database not publicly accessible
☐ Automatic security updates enabled
☐ Fail2ban or equivalent running
☐ Backups automated and tested
☐ Dependencies scanned for vulnerabilities (npm audit, Snyk)
☐ Application logs don't contain sensitive data
☐ Rate limiting on authentication endpoints
☐ CORS configured correctly (not wildcard in production)

Chapter 17 — Maintenance & Operations

Launching is just the beginning. Day-2 operations — keeping the system running, updated, and healthy — is where most of the work lives.

OS & Dependency Updates

# Automatic security updates (Ubuntu)
sudo apt install unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades

# /etc/apt/apt.conf.d/50unattended-upgrades
Unattended-Upgrade::Allowed-Origins {
    "${distro_id}:${distro_codename}-security";
};
Unattended-Upgrade::Automatic-Reboot "true";
Unattended-Upgrade::Automatic-Reboot-Time "03:00";

# Application dependency updates — use Dependabot or Renovate
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "npm"
    directory: "/"
    schedule:
      interval: "weekly"
    open-pull-requests-limit: 5

Zero-Downtime Deployments

# Method 1: Rolling restart with multiple instances behind LB
# Deploy to instance 1, health check passes, deploy to instance 2...

# Method 2: Docker with Nginx upstream reload
# deploy.sh
docker pull myregistry/myapp:$NEW_VERSION
docker run -d --name myapp-new -p 3001:3000 myregistry/myapp:$NEW_VERSION

# Wait for health check (bounded: fail the deploy instead of hanging forever)
for i in $(seq 1 60); do
  curl -sf http://localhost:3001/health && break
  [ "$i" -eq 60 ] && { echo "new container never became healthy" >&2; exit 1; }
  sleep 1
done

# Switch Nginx upstream to the new port (simplified: in practice each deploy
# alternates between 3000 and 3001, so the sed pattern flips accordingly)
sed -i 's/127.0.0.1:3000/127.0.0.1:3001/' /etc/nginx/conf.d/upstream.conf
nginx -s reload

# Stop old container
docker stop myapp-old && docker rm myapp-old
docker rename myapp-new myapp-old
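
For reference, the `upstream.conf` the script rewrites might look like this (a sketch; the upstream name and `proxy_pass` wiring are assumptions):

```nginx
# /etc/nginx/conf.d/upstream.conf — the file deploy.sh rewrites with sed
upstream myapp {
    server 127.0.0.1:3000;
}

# Elsewhere, in the server block:
#   location / { proxy_pass http://myapp; }
```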

Rollback Procedures

# Docker rollback — instant (previous image still cached)
docker stop myapp
docker run -d --name myapp -p 3000:3000 myregistry/myapp:PREVIOUS_VERSION

# Kubernetes rollback
kubectl rollout undo deployment/myapp
kubectl rollout status deployment/myapp    # Watch progress

# Database rollback — this is the hard part
# Always make migrations reversible:
# - migration_001_add_column.up.sql
# - migration_001_add_column.down.sql
# Run: migrate down 1
Database migrations are the #1 cause of failed rollbacks. Rules:
• Never drop columns in the same deploy that removes the code using them
• Use expand-contract pattern: add new column → deploy code using both → remove old column
• Always write a down migration alongside every up migration, and test that it reverses cleanly
• Test migrations against a production-size dataset before deploying
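
A hypothetical expand-contract pair, in the up/down file naming used above (the table and column names are made up):

```sql
-- migration_002_add_email_verified.up.sql  (the "expand" step)
ALTER TABLE users ADD COLUMN email_verified boolean NOT NULL DEFAULT false;

-- migration_002_add_email_verified.down.sql  (reverses it exactly)
ALTER TABLE users DROP COLUMN email_verified;
```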

Disaster Recovery Plan

| Metric | Definition | Your Target |
| --- | --- | --- |
| RTO (Recovery Time Objective) | Max acceptable downtime | e.g., 1 hour |
| RPO (Recovery Point Objective) | Max acceptable data loss | e.g., 15 minutes |

# Disaster Recovery Runbook Template:

## Scenario: Complete server failure
1. Spin up new server from Terraform (5 min)
2. Restore latest database backup (10 min)
3. Deploy latest Docker image (2 min)
4. Update DNS to new server IP (5 min + propagation)
5. Verify application health
6. Notify stakeholders

## Scenario: Database corruption
1. Stop application (prevent further writes)
2. Identify last good backup
3. Restore to point-in-time (RDS: use PITR)
4. Verify data integrity
5. Restart application
6. Post-mortem: identify root cause

## Scenario: Security breach
1. Isolate affected systems (revoke access, block IPs)
2. Preserve evidence (don't destroy logs)
3. Rotate ALL credentials and secrets
4. Assess scope of breach
5. Patch vulnerability
6. Notify affected users (legal requirement in many jurisdictions)
7. Post-mortem and remediation plan

Runbooks & On-Call

Chapter 18 — Cost Management

Cloud bills can spiral out of control fast. Professional operations include cost awareness as a first-class concern.

Cloud Pricing Models

| Model | How It Works | Best For | Savings vs On-Demand |
| --- | --- | --- | --- |
| On-Demand | Pay per hour/second of use | Variable workloads, testing | 0% (baseline) |
| Reserved Instances | 1-3 year commitment, fixed rate | Steady-state production | 30-72% |
| Savings Plans | Commit to $/hr spend (flexible) | Predictable spend, flexible instances | 20-50% |
| Spot/Preemptible | Bid on unused capacity (can be terminated) | Batch jobs, CI runners, stateless workers | 60-90% |

Cost Optimization Strategies

  1. Right-sizing: Monitor actual CPU/memory usage. Most instances are over-provisioned. A t3.medium using 10% CPU should be a t3.small.
  2. Auto-scaling: Scale down during off-hours. Many apps have 10x traffic difference between peak and trough.
  3. Spot instances: Use for CI/CD runners, batch processing, and stateless workers. Save 60-90%.
  4. Reserved capacity: For databases and always-on servers, commit for 1-3 years.
  5. Storage tiering: Move old data to cheaper storage (S3 Glacier, cold storage).
  6. Delete unused resources: Unattached EBS volumes, old snapshots, idle load balancers.
  7. Use ARM instances: Graviton (AWS) / T2A (GCP) are 20% cheaper with same or better performance.

TCO Comparison: Self-Hosted vs Cloud

| Factor | VPS ($40/mo) | AWS (equivalent) | Self-Hosted |
| --- | --- | --- | --- |
| Compute | $40/mo | $70-150/mo | $500 one-time + power |
| Database | Included (self-managed) | $30-100/mo (RDS) | Included |
| Bandwidth | Usually generous | $0.09/GB out (adds up!) | ISP cost |
| Your time | Medium (manage server) | Low (managed services) | High (manage everything) |
| Scaling | Manual (resize/add VPS) | Automatic | Buy more hardware |
| Reliability | 99.9% SLA typical | 99.99% possible | Depends on you |

Billing Alerts

# AWS: Set up billing alarm via CLI
aws cloudwatch put-metric-alarm \
  --alarm-name "MonthlyBillingAlarm" \
  --alarm-description "Alert when bill exceeds $100" \
  --metric-name EstimatedCharges \
  --namespace AWS/Billing \
  --statistic Maximum \
  --period 21600 \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:billing-alerts \
  --dimensions Name=Currency,Value=USD

# Also set alerts at 50%, 80%, 100% of budget
# AWS Budgets is more powerful than CloudWatch for this
The hidden costs of cloud:
• Data transfer out: AWS charges $0.09/GB. A site serving 1TB/month = $90 just in bandwidth.
• NAT Gateway: $0.045/hr + $0.045/GB processed. Can easily be $30-100/mo.
• Load Balancer: $16-25/mo minimum even with zero traffic.
• Managed databases: 3-5x the cost of self-managed on a VPS.

Mitigation: Use Cloudflare (free CDN, absorbs bandwidth), minimize NAT Gateway usage, consider Hetzner/DO for predictable pricing.

Chapter 19 — Decision Framework

This chapter synthesizes everything into actionable decision-making tools. When you get a new project, use these frameworks to systematically choose the right architecture and hosting.

The Decision Flowchart

┌─────────────────────────┐ │ New Website Project │ └────────────┬────────────┘ │ ┌────────────▼────────────┐ │ Does it need a backend? │ └────────────┬────────────┘ │ │ NO YES │ │ ▼ ▼ ┌────────────┐ ┌──────────────────┐ │Static Host │ │ How many users? │ │(CDN/Netlify│ └────────┬─────────┘ │ CloudFlare)│ │ │ │ └────────────┘ <1000 1K-100K >100K │ │ │ ▼ ▼ ▼ ┌────────┐ ┌────────┐ ┌──────────┐ │Single │ │VPS or │ │Cloud IaaS│ │VPS or │ │PaaS │ │or K8s │ │PaaS │ │ │ │ │ └───┬────┘ └───┬────┘ └────┬─────┘ │ │ │ ┌─────────────▼──────────▼───────────▼──────────┐ │ What's your budget? │ └─────────────┬──────────┬───────────┬──────────┘ Minimal Medium Large │ │ │ ▼ ▼ ▼ ┌────────┐ ┌────────┐ ┌──────────┐ │Hetzner │ │DO/AWS │ │AWS/GCP │ │VPS + │ │+ managed│ │full stack│ │Docker │ │services│ │+ support │ └────────┘ └────────┘ └──────────┘

Scoring Matrix

Rate each factor 1-5 for your project, then match to the hosting type that scores highest:

| Factor | Static | VPS | PaaS | IaaS | K8s | Serverless |
| --- | --- | --- | --- | --- | --- | --- |
| Low budget priority | 5 | 4 | 3 | 2 | 1 | 4 |
| Need auto-scaling | 5 | 1 | 3 | 5 | 5 | 5 |
| Minimal ops work | 5 | 2 | 5 | 2 | 1 | 5 |
| Maximum control | 1 | 5 | 1 | 4 | 4 | 1 |
| Fast time-to-market | 5 | 3 | 5 | 2 | 1 | 4 |
| Compliance needs | 2 | 4 | 2 | 5 | 5 | 3 |
| Team > 5 devs | 3 | 2 | 3 | 4 | 5 | 3 |
| Variable traffic | 5 | 1 | 3 | 4 | 4 | 5 |
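
One way to use the matrix: weight each factor 1-5 for your project, multiply by each column, and sum. A toy sketch scoring just the VPS column (the weights describe a hypothetical project where budget dominates):

```shell
#!/bin/bash
set -euo pipefail

# VPS column from the matrix above, top to bottom
vps=(4 1 2 5 3 4 2 1)
# Example project weights for the same eight factors (budget matters most)
weights=(5 2 3 4 4 1 1 2)

total=0
for i in "${!vps[@]}"; do
  total=$(( total + vps[i] * weights[i] ))
done
echo "VPS weighted score: $total"   # → VPS weighted score: 68
```

Repeat per column and compare; the exercise matters more than the exact numbers, since it forces you to state what your project actually values.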

Migration Paths

Typical growth path: PaaS (Railway/Render) ──────────────────────────────────────────────▶ Time │ │ │ "We're paying $200/mo for what a $40 VPS could do" │ ▼ │ VPS + Docker Compose ───────────────────────────────────────────────▶ │ │ │ │ "We need auto-scaling, multiple services, zero-downtime deploys" │ ▼ │ Cloud IaaS (AWS/GCP) + Containers ──────────────────────────────────▶ │ │ │ │ "We have 10+ services, 5+ teams, complex networking" │ ▼ │ Managed Kubernetes ─────────────────────────────────────────────────▶ │ ▼

Key Decision Questions

Ask these for every project:

1. What's the expected traffic? (requests/sec, concurrent users)
2. What's the budget? (monthly hosting budget, one-time setup budget)
3. What's the team size? (who maintains this after launch?)
4. What's the SLA requirement? (99.9%? 99.99%? "best effort"?)
5. Are there compliance requirements? (GDPR, HIPAA, PCI-DSS, data residency)
6. How variable is the traffic? (steady vs. spiky vs. seasonal)
7. What's the time-to-market pressure? (launch in a week vs. 6 months)
8. What's the data sensitivity? (public content vs. financial/health data)
9. Do you need specific geographic presence? (latency requirements, legal)
10. What's the expected growth? (10x in a year? Stable? Unknown?)

Anti-Patterns to Avoid

Chapter 20 — Real-World Scenarios

Let's apply everything to five concrete projects, showing how the decision framework leads to different architectures.

Scenario 1: Personal Technical Blog

| Aspect | Decision |
| --- | --- |
| Traffic | ~1000 visitors/day, spikes when posts hit HN/Reddit |
| Budget | $0-20/month |
| Team | Just you |
| SLA | Best effort (downtime is annoying, not catastrophic) |

Architecture

┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ Markdown │────▶│ Hugo/Astro │────▶│ Cloudflare │ │ files in │ │ (build step)│ │ Pages (CDN) │ │ Git repo │ │ │ │ FREE │ └──────────────┘ └──────────────┘ └──────────────┘ │ GitHub Actions (auto-build on push)
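
The "auto-build on push" arrow could be a workflow like this; a sketch assuming Hugo and deployment to Cloudflare Pages via Wrangler (the action versions and project name are assumptions):

```yaml
# .github/workflows/deploy.yml
name: deploy-blog
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: peaceiris/actions-hugo@v3
      - run: hugo --minify
      - uses: cloudflare/wrangler-action@v3
        with:
          apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }}
          command: pages deploy public --project-name=my-blog
```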

Scenario 2: Startup SaaS MVP

| Aspect | Decision |
| --- | --- |
| Traffic | ~500 users, growing. API-heavy (user auth, CRUD, real-time) |
| Budget | $50-200/month |
| Team | 2-3 developers |
| SLA | 99.9% (paying customers) |

Architecture

┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ React SPA │────▶│ Vercel/ │ │ Railway │ │ (frontend) │ │ Cloudflare │ │ or Render │ └──────────────┘ │ Pages │ │ (backend) │ └──────────────┘ └──────┬───────┘ │ ┌───────▼───────┐ │ Managed │ │ PostgreSQL │ │ + Redis │ └───────────────┘

Scenario 3: E-Commerce Site

| Aspect | Decision |
| --- | --- |
| Traffic | ~10K daily visitors, 5x spikes during sales |
| Budget | $200-500/month |
| Team | 3-5 developers + 1 DevOps |
| SLA | 99.95% (downtime = lost revenue) |
| Compliance | PCI-DSS (handling payments) |

Architecture

┌───────────┐ ┌───────────┐ ┌───────────────────────────────┐ │CloudFront │───▶│ ALB │───▶│ ECS Fargate (containers) │ │ (CDN) │ │(HTTPS/LB) │ │ ┌─────────┐ ┌───────────┐ │ └───────────┘ └───────────┘ │ │ Web │ │ Worker │ │ │ │ (x2-4) │ │ (x1-2) │ │ │ └────┬────┘ └─────┬─────┘ │ └───────┼─────────────┼────────┘ │ │ ┌───────▼─────────────▼────────┐ │ RDS PostgreSQL (Multi-AZ) │ │ ElastiCache Redis │ │ S3 (product images) │ └──────────────────────────────┘

Scenario 4: Enterprise Internal Tool

| Aspect | Decision |
| --- | --- |
| Traffic | ~200 internal users, business hours only |
| Budget | $100-300/month |
| Team | 1-2 developers (part-time maintenance) |
| SLA | 99.5% (business hours) |
| Compliance | Data must stay in EU, SSO required |

Architecture

┌───────────────┐ ┌──────────────────────────────────┐ │ Corporate │────▶│ Hetzner VPS (EU) │ │ VPN / SSO │ │ ┌────────────┐ ┌────────────┐ │ │ (Okta/Azure │ │ │ Caddy │ │ App │ │ │ AD) │ │ │ (proxy) │──▶│ (Docker) │ │ └───────────────┘ │ └────────────┘ └─────┬──────┘ │ │ │ │ │ ┌──────▼──────┐ │ │ │ PostgreSQL │ │ │ │ (Docker) │ │ │ └─────────────┘ │ └──────────────────────────────────┘

Scenario 5: High-Traffic Media/Content Site

| Aspect | Decision |
| --- | --- |
| Traffic | ~1M daily visitors, global audience, viral spikes |
| Budget | $2000-10000/month |
| Team | 8-15 developers, 2-3 SRE/DevOps |
| SLA | 99.99% (ad revenue depends on uptime) |

Architecture

┌──────────────────────────────────────────────────────────────────┐ │ Cloudflare (CDN + WAF + DDoS) │ └────────────────────────────────┬─────────────────────────────────┘ │ ┌────────────────────────────────▼─────────────────────────────────┐ │ AWS Multi-Region │ │ │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │ │ │ EKS Cluster │ │ EKS Cluster │ │ Shared Services │ │ │ │ (Region 1) │ │ (Region 2) │ │ • S3 (media storage) │ │ │ │ │ │ │ │ • CloudFront (assets) │ │ │ │ Services: │ │ Services: │ │ • ElasticSearch │ │ │ │ • CMS API │ │ • CMS API │ │ • SQS (async jobs) │ │ │ │ • Auth │ │ • Auth │ │ • Lambda (image proc) │ │ │ │ • Search │ │ • Search │ └─────────────────────────┘ │ │ └──────┬───────┘ └──────┬───────┘ │ │ │ │ │ │ ┌──────▼───────┐ ┌──────▼───────┐ │ │ │ Aurora Global │ │ Aurora Read │ │ │ │ (Primary) │ │ (Replica) │ │ │ └──────────────┘ └──────────────┘ │ └───────────────────────────────────────────────────────────────────┘
Notice the pattern: Complexity scales with requirements, not ambition. The blog costs $0/mo and takes 30 minutes to set up. The media site costs $5000/mo and takes months to architect. Both are correct for their context. The worst mistake is using Scenario 5's architecture for Scenario 1's requirements.

Further Reading & Resources

Books

| Book | Author | Covers |
| --- | --- | --- |
| The Phoenix Project | Gene Kim | DevOps culture, IT operations (novel format) |
| Site Reliability Engineering | Google (free online) | SRE practices, monitoring, incident response |
| Infrastructure as Code | Kief Morris | IaC principles, patterns, practices |
| Terraform: Up & Running | Yevgeniy Brikman | Practical Terraform (updated regularly) |
| Docker Deep Dive | Nigel Poulton | Docker from basics to production |
| Kubernetes in Action | Marko Lukša | K8s concepts and hands-on |
| Web Scalability for Startup Engineers | Artur Ejsmont | Scaling web apps pragmatically |
| Designing Data-Intensive Applications | Martin Kleppmann | Distributed systems, databases, architecture |

Official Documentation

Free Courses & Tutorials

Tools Reference

| Category | Tools |
| --- | --- |
| Web Servers | Nginx, Caddy, Apache, Traefik |
| Containers | Docker, Podman, containerd |
| Orchestration | Kubernetes, Docker Swarm, Nomad |
| CI/CD | GitHub Actions, GitLab CI, Jenkins, CircleCI, ArgoCD |
| IaC | Terraform, Ansible, Pulumi, CloudFormation |
| Monitoring | Prometheus, Grafana, Datadog, New Relic |
| Logging | Loki, ELK Stack, Fluentd, Vector |
| Secrets | Vault, AWS Secrets Manager, SOPS, Doppler |
| DNS/CDN | Cloudflare, Route 53, Fastly |
| SSL | Let's Encrypt, Certbot, cert-manager (K8s) |

Communities

Knowledge Check

🧠 Test Your Understanding

Q1: A startup with 2 developers needs to launch an MVP in 2 weeks. They expect ~200 users initially. What's the best hosting choice?

Q2: Your website serves static HTML/CSS/JS with no backend logic. What's the most cost-effective and performant hosting?

Q3: What does the "3-2-1 backup rule" mean?

Q4: When should you consider migrating from PaaS to VPS/IaaS?

Q5: What is the primary purpose of a reverse proxy?