Website Architecture & Deployment

A Zero-to-Hero Guide for Backend Developers & System Administrators

Chapter 1 — The Big Picture

Before diving into specific technologies, you need a mental model of what happens when someone visits a website — and what your role is in making that happen reliably, securely, and at scale.

What Happens When a User Visits a Website

User types "example.com" in browser
        │
        ▼
┌─────────────────┐
│ DNS Resolution  │  Browser asks: "What IP is example.com?"
│ (Recursive)     │  Checks: browser cache → OS cache → resolver → root → TLD → authoritative
└────────┬────────┘
         │ Returns: 93.184.216.34
         ▼
┌─────────────────┐
│ TCP Connection  │  Three-way handshake (SYN → SYN-ACK → ACK)
│ + TLS Handshake │  Certificate verification, key exchange, cipher negotiation
└────────┬────────┘
         │ Encrypted tunnel established
         ▼
┌─────────────────┐
│ HTTP Request    │  GET / HTTP/2
│                 │  Host: example.com
│                 │  Accept: text/html
└────────┬────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────┐
│                      YOUR DOMAIN                        │
│                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────────────┐   │
│  │ Load     │───▶│ Reverse  │───▶│ Application      │   │
│  │ Balancer │    │ Proxy    │    │ Server           │   │
│  └──────────┘    └──────────┘    │ (your backend)   │   │
│                                  └────────┬─────────┘   │
│                                           │             │
│                                  ┌────────▼─────────┐   │
│                                  │ Database / Cache │   │
│                                  └──────────────────┘   │
└─────────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────┐
│ HTTP Response   │  200 OK + HTML/JSON/assets
│ (via CDN maybe) │  Cached at edge if configured
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Browser Render  │  Parse HTML → fetch CSS/JS → render page
└─────────────────┘

The Roles in Professional Web Operations

| Role | Responsibility | Cares About |
| --- | --- | --- |
| Frontend Developer | HTML, CSS, JavaScript, UI/UX | User experience, browser compatibility, performance |
| Backend Developer | APIs, business logic, databases | Data integrity, API design, scalability, security |
| DevOps / SRE | CI/CD, infrastructure, reliability | Uptime, deployment speed, automation, monitoring |
| System Administrator | Server management, networking, security | Patching, hardening, backups, capacity |
| Platform Engineer | Internal developer platforms, tooling | Developer productivity, self-service infrastructure |
Your role in this guide: You're the backend developer who also handles deployment, operations, and maintenance. In smaller teams, this is extremely common. In larger organizations, these responsibilities are split across dedicated teams — but understanding the full picture makes you far more effective regardless of team size.

Environments: Dev → Staging → Production

Professional deployments never go straight from a developer's laptop to users. There's a pipeline:

| Environment | Purpose | Who Uses It | Data |
| --- | --- | --- | --- |
| Local / Dev | Active development, debugging | Individual developer | Fake/seed data |
| CI | Automated testing on every commit | Machines (automated) | Test fixtures |
| Staging | Pre-production validation, QA | QA team, stakeholders | Production-like (anonymized) |
| Production | Real users, real data | Everyone (end users) | Real data |
Never test in production. This sounds obvious, but the temptation is real when "it works on my machine." Staging exists to catch the things that only break in production-like conditions: different OS versions, network latency, real database sizes, concurrent users.
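In code, the environment split usually comes down to configuration keyed off an environment variable. A minimal Node sketch (the config keys and values here are illustrative, not a prescribed layout):

```javascript
// config.js — select settings based on NODE_ENV (hypothetical values)
const configs = {
  development: { dbUrl: 'postgres://localhost/myapp_dev', logLevel: 'debug' },
  staging:     { dbUrl: process.env.DATABASE_URL, logLevel: 'info' },
  production:  { dbUrl: process.env.DATABASE_URL, logLevel: 'warn' },
};

function loadConfig(env = process.env.NODE_ENV || 'development') {
  const config = configs[env];
  // Fail loudly on an unknown environment rather than silently falling back
  if (!config) throw new Error(`Unknown environment: ${env}`);
  return { env, ...config };
}

module.exports = { loadConfig };
```

The key design choice is failing fast on an unrecognized environment name: a typo in NODE_ENV should crash at startup, not quietly run staging code against production data.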

What "Deploying a Website" Actually Means

Deployment is not just "putting files on a server." It's a repeatable, automated process that includes:

  1. Build — Compile code, bundle assets, run optimizations
  2. Test — Unit tests, integration tests, security scans
  3. Package — Create a deployable artifact (Docker image, binary, archive)
  4. Deploy — Push artifact to target environment
  5. Verify — Health checks, smoke tests, monitoring
  6. Rollback plan — If something breaks, revert instantly
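Step 5 is worth making concrete: verification is typically a handful of repeated health checks followed by a keep-or-rollback decision. A sketch of that decision logic, with the actual HTTP probe injected as a function (the function names and thresholds are assumptions, not a standard):

```javascript
// verify.js — decide whether a fresh deploy is healthy enough to keep.
// `probe` performs one health check and returns true/false; it is injected
// so the decision logic stays testable without a real server.
async function verifyDeploy(probe, { attempts = 5, required = 3 } = {}) {
  let healthy = 0;
  for (let i = 0; i < attempts; i++) {
    if (await probe()) healthy++;
  }
  // Keep the deploy only if enough checks passed; otherwise signal rollback
  return healthy >= required ? 'keep' : 'rollback';
}

module.exports = { verifyDeploy };
```

In a real pipeline `probe` would hit your `/health` endpoint, and "rollback" would trigger redeploying the previous artifact from step 3.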

The Infrastructure Stack

┌─────────────────────────────────────────────────────────────┐
│ YOUR APPLICATION                                            │
├─────────────────────────────────────────────────────────────┤
│ Runtime (Node.js, Python, Go, Java, etc.)                   │
├─────────────────────────────────────────────────────────────┤
│ Container / Process Manager (Docker, systemd, PM2)          │
├─────────────────────────────────────────────────────────────┤
│ Operating System (Ubuntu, Debian, Alpine, RHEL)             │
├─────────────────────────────────────────────────────────────┤
│ Virtualization / Bare Metal (KVM, Xen, physical hardware)   │
├─────────────────────────────────────────────────────────────┤
│ Network (VPC, firewall, load balancer, DNS)                 │
├─────────────────────────────────────────────────────────────┤
│ Physical Infrastructure (data center, power, cooling)       │
└─────────────────────────────────────────────────────────────┘
  ▲  More abstraction = less control, less work
  │
  ▼  Less abstraction = more control, more responsibility
The key insight: every hosting option in this guide simply draws the line at a different layer. Shared hosting gives you only the top layer. Bare metal gives you everything. PaaS gives you the top two. IaaS gives you the top four. Your job is to pick where to draw that line for each project.

Chapter 2 — Website Architecture Patterns

Before choosing where to host, you need to understand what you're hosting. The architecture pattern determines your infrastructure requirements.

Static Sites

Pre-built HTML/CSS/JS files served as-is. No server-side processing per request.

# Example: Build and deploy a Hugo static site
hugo                                # Generates ./public/ directory
aws s3 sync ./public s3://my-bucket --delete
aws cloudfront create-invalidation --distribution-id EXXX --paths "/*"

Server-Side Rendering (SSR)

Server generates HTML on every request. The traditional model (PHP, Rails, Django, Express with templates).

Single-Page Applications (SPA)

One HTML file + JavaScript bundle. All rendering happens in the browser. Backend is a separate API.

JAMstack (JavaScript, APIs, Markup)

Pre-rendered static pages enhanced with JavaScript calling APIs at runtime. Best of both worlds.

Monolithic Architecture

Single deployable unit containing all functionality. Frontend, backend, and data access in one codebase.

Service-Oriented Architecture (SOA) / Microservices

Application split into independent services communicating over network (HTTP/gRPC/message queues).

Don't start with microservices. Start monolithic, split when you have clear bounded contexts and the team/traffic justifies the operational complexity. Premature microservices is the #1 architecture mistake in startups.

Comparison Table

| Pattern | Server Needed | Scalability | Complexity | SEO | Best For |
| --- | --- | --- | --- | --- | --- |
| Static | No (CDN) | ★★★★★ | ★☆☆☆☆ | ★★★★★ | Content sites, docs |
| SSR | Yes | ★★★☆☆ | ★★★☆☆ | ★★★★★ | Dynamic content + SEO |
| SPA | API only | ★★★★☆ | ★★★☆☆ | ★★☆☆☆ | App-like experiences |
| JAMstack | Partial | ★★★★★ | ★★★☆☆ | ★★★★★ | Content + interactivity |
| Monolith | Yes | ★★★☆☆ | ★★☆☆☆ | Varies | MVPs, small-medium apps |
| Microservices | Yes (many) | ★★★★★ | ★★★★★ | Varies | Large teams, high scale |
How to choose:
• Content-heavy, rarely changes? → Static / JAMstack
• Need SEO + dynamic data? → SSR
• Rich interactive app (logged-in users)? → SPA + API
• Small team, getting started? → Monolith
• Large team, proven bounded contexts, high scale? → Microservices

Chapter 3 — Hosting Taxonomy

This is the complete landscape of where your website can live. Each option trades control for convenience at a different point.

The Control vs. Convenience Spectrum

MORE CONTROL                                               MORE CONVENIENCE
MORE WORK                                                  LESS WORK
◄─────────────────────────────────────────────────────────────────────────►
┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬───────────┐
│ Bare     │ Colo-    │ Dedi-    │ VPS      │ IaaS     │ PaaS     │ Static/   │
│ Metal    │ cation   │ cated    │          │          │          │ Serverless│
│ (own HW) │(your HW, │(rented   │(virtual  │(cloud    │(managed  │(fully     │
│          │their DC) │physical) │ server)  │ VMs)     │platform) │ managed)  │
└──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴───────────┘
You manage:
  Hardware      ✓         ✓         ✗         ✗         ✗         ✗         ✗
  Network       ✓         ✓      Partial      ✗      Partial      ✗         ✗
  OS            ✓         ✓         ✓         ✓         ✓         ✗         ✗
  Runtime       ✓         ✓         ✓         ✓         ✓      Partial      ✗
  App           ✓         ✓         ✓         ✓         ✓         ✓         ✓

Complete Hosting Types

| Type | What You Get | Cost Range | Control | Complexity | Best For |
| --- | --- | --- | --- | --- | --- |
| Shared Hosting | Space on a shared server (cPanel) | $3-15/mo | ★☆☆☆☆ | ★☆☆☆☆ | WordPress blogs, tiny sites |
| VPS | Virtual machine, root access | $5-80/mo | ★★★★☆ | ★★★☆☆ | Most web apps, APIs |
| Dedicated Server | Entire physical server rented | $80-500/mo | ★★★★★ | ★★★★☆ | High-performance, compliance |
| Colocation | Your hardware in their data center | $200-2000/mo | ★★★★★ | ★★★★★ | Maximum control, large scale |
| IaaS | Cloud VMs + managed services | Pay-per-use | ★★★★☆ | ★★★★☆ | Variable workloads, scaling |
| PaaS | Managed platform, push code | $5-500/mo | ★★☆☆☆ | ★★☆☆☆ | Startups, rapid deployment |
| Serverless/FaaS | Functions triggered by events | Pay-per-invocation | ★☆☆☆☆ | ★★☆☆☆ | APIs, event processing |
| Static Hosting | CDN-served static files | $0-20/mo | ★☆☆☆☆ | ★☆☆☆☆ | Static sites, SPAs |
| Managed K8s | Kubernetes cluster (managed control plane) | $70-1000+/mo | ★★★★☆ | ★★★★★ | Microservices at scale |

Shared Hosting — The Beginner Trap

Shared hosting (GoDaddy, Bluehost, Hostinger) puts hundreds of sites on one server. You get a cPanel interface, FTP access, and PHP. That's it.

Not suitable for professional work. Shared hosting is fine for a personal blog. For anything with users, SLAs, or custom backend code — skip it entirely. A $5/mo VPS gives you infinitely more capability.

When to Use What — Quick Reference

Static site / SPA frontend? → Static hosting (Netlify, CloudFront, Cloudflare Pages)
Simple web app, small team? → VPS (DigitalOcean, Hetzner) or PaaS (Railway, Render)
Need auto-scaling, variable traffic? → IaaS (AWS, GCP) or containers
Microservices, large team? → Managed Kubernetes (EKS, GKE)
Event-driven, sporadic traffic? → Serverless (Lambda, Workers)
Compliance/performance requirements? → Dedicated or colocation
Learning / side project? → VPS ($5/mo) — best bang for learning

Chapter 4 — Self-Hosting (Bare Metal / Home Server)

Self-hosting means running a web server on hardware you physically control — a spare PC, a Raspberry Pi, or a rack server in your closet. It's the most educational option and gives maximum control.

When Self-Hosting Makes Sense

It makes sense for learning Linux administration hands-on, for internal tools on your own network, and for development or staging environments where downtime costs nothing.

When it does NOT make sense: Production websites serving external users. Your home internet has no SLA, a dynamic IP, limited upload bandwidth, and a single point of failure (power outage = site down). Use self-hosting for learning, internal tools, and development — not for serving customers.

Setting Up Nginx on Linux

# Install Nginx (Ubuntu/Debian)
sudo apt update && sudo apt install -y nginx

# Start and enable
sudo systemctl start nginx
sudo systemctl enable nginx

# Verify it's running
curl http://localhost
# Should return the Nginx welcome page HTML

Virtual Hosts (Serving Multiple Sites)

# /etc/nginx/sites-available/mysite.conf
server {
    listen 80;
    server_name mysite.example.com;
    root /var/www/mysite;
    index index.html;

    location / {
        try_files $uri $uri/ =404;
    }

    # For a backend app (reverse proxy)
    location /api/ {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

# Enable the site
sudo ln -s /etc/nginx/sites-available/mysite.conf /etc/nginx/sites-enabled/
sudo nginx -t          # Test configuration
sudo systemctl reload nginx

Dynamic DNS (Solving the Dynamic IP Problem)

Home internet usually gives you a dynamic IP that changes periodically. Dynamic DNS services map a hostname to your current IP.

# Using ddclient with Cloudflare (install: sudo apt install ddclient)
# /etc/ddclient.conf
protocol=cloudflare
zone=example.com
login=your-email@example.com
password=your-cloudflare-api-token
use=web, web=https://api.ipify.org
mysite.example.com

# Or use a cron job with curl
*/5 * * * * curl -s "https://api.cloudflare.com/client/v4/zones/ZONE_ID/dns_records/RECORD_ID" \
  -X PATCH \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  --data "{\"content\":\"$(curl -s https://api.ipify.org)\"}"

Port Forwarding

Your router's NAT blocks incoming connections. You need to forward ports 80 (HTTP) and 443 (HTTPS) to your server's local IP.

  1. Give your server a static local IP (e.g., 192.168.1.100) via DHCP reservation
  2. In your router admin panel: forward external port 80 → 192.168.1.100:80
  3. Forward external port 443 → 192.168.1.100:443
  4. Test from outside your network (use your phone on mobile data)

Let's Encrypt (Free SSL/TLS)

# Install Certbot
sudo apt install -y certbot python3-certbot-nginx

# Obtain certificate (Nginx plugin auto-configures)
sudo certbot --nginx -d mysite.example.com

# Auto-renewal is set up automatically via systemd timer
sudo systemctl status certbot.timer

# Manual renewal test
sudo certbot renew --dry-run

Running Your App as a systemd Service

# /etc/systemd/system/myapp.service
[Unit]
Description=My Web Application
After=network.target

[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/node /opt/myapp/server.js
Restart=always
RestartSec=5
Environment=NODE_ENV=production
Environment=PORT=3000

# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/myapp/data

[Install]
WantedBy=multi-user.target

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable myapp
sudo systemctl start myapp
sudo systemctl status myapp    # Check it's running
sudo journalctl -u myapp -f   # View logs
systemd is your process manager on Linux. It handles starting your app on boot, restarting on crash, logging, and resource limits. Learn it well — you'll use it on every VPS and server you manage.

Chapter 5 — VPS & Dedicated Servers

A VPS (Virtual Private Server) is the workhorse of professional web hosting. You get a virtual machine with root access, a public IP, and full control — without managing physical hardware.

Provider Comparison

| Provider | Cheapest VPS | Data Centers | Strengths | Best For |
| --- | --- | --- | --- | --- |
| DigitalOcean | $4/mo (512MB) | 14 regions | Simple UI, great docs, managed DBs | Startups, learning |
| Hetzner | €3.79/mo (2GB) | EU + US | Best price/performance ratio | Price-conscious, EU hosting |
| Linode (Akamai) | $5/mo (1GB) | 11 regions | Reliable, good support | General purpose |
| Vultr | $2.50/mo (512MB) | 32 locations | Most locations, bare metal option | Edge deployments |
| OVH | €3.50/mo (2GB) | EU focused | Cheap dedicated servers too | EU, budget dedicated |

Initial Server Setup (The First 10 Minutes)

Every new VPS should go through this hardening process before deploying anything:

# 1. Connect as root (first time only)
ssh root@YOUR_SERVER_IP

# 2. Update the system
apt update && apt upgrade -y

# 3. Create a non-root user
adduser deploy
usermod -aG sudo deploy

# 4. Set up SSH key authentication for the new user
mkdir -p /home/deploy/.ssh
cp ~/.ssh/authorized_keys /home/deploy/.ssh/
chown -R deploy:deploy /home/deploy/.ssh
chmod 700 /home/deploy/.ssh
chmod 600 /home/deploy/.ssh/authorized_keys

# 5. Harden SSH - edit /etc/ssh/sshd_config
sudo sed -i 's/#PermitRootLogin yes/PermitRootLogin no/' /etc/ssh/sshd_config
sudo sed -i 's/#PasswordAuthentication yes/PasswordAuthentication no/' /etc/ssh/sshd_config
sudo sed -i 's/#Port 22/Port 2222/' /etc/ssh/sshd_config
sudo systemctl restart sshd

# 6. Set up firewall
sudo apt install -y ufw
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 2222/tcp    # SSH (custom port)
sudo ufw allow 80/tcp      # HTTP
sudo ufw allow 443/tcp     # HTTPS
sudo ufw enable

# 7. Install fail2ban (brute-force protection)
sudo apt install -y fail2ban
sudo systemctl enable fail2ban

# 8. Set up automatic security updates
sudo apt install -y unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades
Test SSH access with the new user BEFORE closing your root session. Open a new terminal, SSH as the deploy user on the new port. If it works, you're safe to close root. If not, you still have root access to fix it.

Deploying an Application

# On your server (as deploy user):

# Install your runtime (example: Node.js via nvm)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.0/install.sh | bash
source ~/.bashrc
nvm install --lts

# Clone your application
cd /opt
sudo mkdir myapp && sudo chown deploy:deploy myapp
git clone git@github.com:you/myapp.git /opt/myapp
cd /opt/myapp
npm install --production

# Set up environment variables
sudo cp .env.example /etc/myapp.env
sudo chmod 600 /etc/myapp.env
# Edit with your production values

# Create systemd service (as shown in Chapter 4)
# Set up Nginx reverse proxy (as shown in Chapter 4)
# Obtain SSL certificate with Certbot

Deployment Strategies for VPS

Option A: Git Pull (Simple)

# On server:
cd /opt/myapp
git pull origin main
npm install --production
sudo systemctl restart myapp

Option B: rsync (No Git on Server)

# From your local machine:
rsync -avz --delete \
  --exclude='node_modules' \
  --exclude='.env' \
  ./dist/ deploy@server:/opt/myapp/
ssh deploy@server 'cd /opt/myapp && npm install --production && sudo systemctl restart myapp'

Option C: Docker (Recommended for Production)

# Build locally or in CI, push to registry
docker build -t myregistry/myapp:v1.2.3 .
docker push myregistry/myapp:v1.2.3

# On server:
docker pull myregistry/myapp:v1.2.3
docker stop myapp && docker rm myapp
docker run -d --name myapp -p 3000:3000 --env-file /etc/myapp.env myregistry/myapp:v1.2.3
Docker is the professional standard. It ensures your app runs identically everywhere — your laptop, CI, staging, production. Chapter 8 covers this in depth.

Process Managers

| Tool | Language | Features | When to Use |
| --- | --- | --- | --- |
| systemd | Any | Built into Linux, restart policies, logging | Always (it's already there) |
| PM2 | Node.js | Cluster mode, zero-downtime reload, monitoring | Node.js apps without Docker |
| Supervisor | Any | Simple config, process groups | Legacy systems, multiple processes |

Chapter 6 — Platform-as-a-Service (PaaS)

PaaS abstracts away the server entirely. You push code, the platform handles building, deploying, scaling, SSL, and infrastructure. You focus purely on your application.

What PaaS Manages For You

┌─────────────────────────────────────────────────────────┐
│ What YOU do:                                            │
│                                                         │
│   Write code → git push → Done                          │
│                                                         │
├─────────────────────────────────────────────────────────┤
│ What the PLATFORM does:                                 │
│   • Detects language/framework (buildpacks)             │
│   • Installs dependencies                               │
│   • Builds your app                                     │
│   • Deploys to containers                               │
│   • Provisions SSL certificate                          │
│   • Routes traffic (load balancing)                     │
│   • Manages logs                                        │
│   • Handles OS updates and security patches             │
│   • Scales horizontally (if configured)                 │
└─────────────────────────────────────────────────────────┘

Provider Comparison

| Platform | Free Tier | Paid From | Strengths | Limitations |
| --- | --- | --- | --- | --- |
| Railway | $5 credit/mo | $5/mo | Modern, fast deploys, good DX | Newer, smaller community |
| Render | Static free, services spin down | $7/mo | Heroku alternative, auto-deploy | Cold starts on free tier |
| Fly.io | 3 shared VMs free | Pay-per-use | Edge deployment, Docker-native | More complex than others |
| Heroku | None (removed) | $5/mo | Pioneer, huge ecosystem | Expensive at scale, aging |
| Google App Engine | Limited free | Pay-per-use | Google infrastructure, auto-scale | Vendor lock-in |
| Azure App Service | Limited free | ~$13/mo | Enterprise, .NET native | Complex pricing |

Example: Deploying to Railway

# Your project needs:
# 1. A start command (in package.json, Procfile, or Dockerfile)
# 2. Listen on the PORT environment variable

# package.json
{
  "scripts": {
    "start": "node server.js"
  }
}

# server.js — must use process.env.PORT
const port = process.env.PORT || 3000;
app.listen(port, '0.0.0.0');

# Deploy:
# Option A: Connect GitHub repo in Railway dashboard (auto-deploys on push)
# Option B: Railway CLI
npm install -g @railway/cli
railway login
railway init
railway up

Procfile (Heroku-style Process Declaration)

# Procfile — tells the platform what processes to run
web: node server.js
worker: node worker.js
release: node migrate.js    # Runs before each deploy

When PaaS is the Right Choice

Use PaaS when:
• Small team (1-5 devs) that wants to focus on product, not infrastructure
• Predictable, moderate traffic (not massive spikes)
• Standard web app (HTTP server + database)
• Fast iteration speed matters more than cost optimization
• You don't need custom system-level software

Avoid PaaS when:
• Cost-sensitive at scale (PaaS markup is 3-10x vs raw compute)
• Need custom networking, kernel modules, or system packages
• Compliance requires specific infrastructure control
• Traffic is highly variable (serverless may be cheaper)
• You need persistent local storage or specific hardware

The PaaS Cost Trap

PaaS is cheap to start but expensive to scale. A $7/mo Render service running a Node.js app is great. But when you need 4 instances + a managed database + Redis + background workers, you're suddenly paying $200/mo for what a $40/mo VPS could handle.

The professional pattern: Start on PaaS for speed. When monthly costs exceed what a VPS + your time would cost, migrate to containers on a VPS or IaaS. This is called "graduating" from PaaS.
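The "graduating" decision is ultimately arithmetic: PaaS price versus VPS price plus the value of the admin time the VPS costs you. A rough sketch (all figures below are the illustrative numbers from this section, not real quotes):

```javascript
// Compare PaaS spend against VPS spend plus the ops time a VPS demands.
function paasVsVps({ paasMonthly, vpsMonthly, opsHoursPerMonth, hourlyRate }) {
  // A VPS is only cheaper once you price in the hours you spend managing it
  const vpsTotal = vpsMonthly + opsHoursPerMonth * hourlyRate;
  return {
    vpsTotal,
    recommendation: paasMonthly > vpsTotal ? 'graduate to a VPS' : 'stay on PaaS',
  };
}

// Example: $200/mo on PaaS vs a $40/mo VPS that needs ~2h/mo of admin at $50/h
const verdict = paasVsVps({
  paasMonthly: 200,
  vpsMonthly: 40,
  opsHoursPerMonth: 2,
  hourlyRate: 50,
});
// verdict.recommendation === 'graduate to a VPS' (200 > 40 + 100)
```

The ops-hours estimate is the honest variable here: teams routinely underestimate it, which is exactly why PaaS wins early on.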

Chapter 7 — Infrastructure-as-a-Service (IaaS)

IaaS gives you virtual machines in the cloud with pay-per-use pricing, elastic scaling, and a massive ecosystem of managed services around them. This is where most professional production workloads live.

Core Concepts

Instances (Virtual Machines)

A cloud VM with configurable CPU, RAM, storage, and networking. You choose the OS, install what you want, and pay by the hour/second.

AMIs / Images

Pre-configured OS snapshots. You can use official images (Ubuntu 22.04) or create custom ones with your software pre-installed (golden images).

Security Groups

Virtual firewalls controlling inbound/outbound traffic to your instances. Stateful — if you allow inbound on port 443, the response traffic is automatically allowed.

VPC (Virtual Private Cloud)

Your own isolated network in the cloud. You define subnets (public/private), route tables, and internet gateways. Instances in private subnets can't be reached from the internet directly.

┌─────────────────── VPC (10.0.0.0/16) ────────────────────┐
│                                                          │
│   ┌──── Public Subnet (10.0.1.0/24) ────┐                │
│   │                                     │                │
│   │   ┌──────────┐    ┌──────────┐      │                │
│   │   │ Web      │    │ NAT      │      │                │
│   │   │ Server   │    │ Gateway  │      │                │
│   │   └──────────┘    └────┬─────┘      │                │
│   └────────────────────────┼────────────┘                │
│                            │                             │
│   ┌──── Private Subnet (10.0.2.0/24) ───┐                │
│   │                        │            │                │
│   │   ┌──────────┐    ┌────┴─────┐      │                │
│   │   │ App      │    │ Database │      │                │
│   │   │ Server   │    │ (RDS)    │      │                │
│   │   └──────────┘    └──────────┘      │                │
│   └─────────────────────────────────────┘                │
│                                                          │
└─────────────────────────▲────────────────────────────────┘
                          │ Internet Gateway
                          ▼
                    ┌──────────┐
                    │ Internet │
                    └──────────┘
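The subnet boundaries in a VPC are plain CIDR arithmetic: a /24 fixes the first 24 bits of the address. A minimal sketch of checking whether an IPv4 address falls inside a CIDR block (a simplified illustration, not a production parser; it assumes well-formed dotted-quad input):

```javascript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer
function ipToInt(ip) {
  return ip.split('.').reduce((acc, octet) => (acc << 8) + Number(octet), 0) >>> 0;
}

// Is `ip` inside the CIDR block, e.g. inCidr('10.0.2.5', '10.0.2.0/24')?
function inCidr(ip, cidr) {
  const [base, bitsStr] = cidr.split('/');
  const bits = Number(bitsStr);
  // Build a mask with the top `bits` bits set; /0 matches everything
  const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0;
  return (ipToInt(ip) & mask) === (ipToInt(base) & mask);
}
```

This is exactly the test a route table performs: 10.0.2.5 is in the private subnet (10.0.2.0/24) and in the VPC (10.0.0.0/16), but not in the public subnet (10.0.1.0/24).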

Hands-On: Launching an EC2 Instance (AWS CLI)

# Prerequisites: AWS CLI installed, credentials configured
# aws configure (set access key, secret, region)

# 1. Create a key pair for SSH access
aws ec2 create-key-pair --key-name myapp-key --query 'KeyMaterial' \
  --output text > ~/.ssh/myapp-key.pem
chmod 400 ~/.ssh/myapp-key.pem

# 2. Create a security group
aws ec2 create-security-group \
  --group-name myapp-sg \
  --description "Web server security group"

# Allow SSH, HTTP, HTTPS
aws ec2 authorize-security-group-ingress --group-name myapp-sg \
  --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name myapp-sg \
  --protocol tcp --port 80 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name myapp-sg \
  --protocol tcp --port 443 --cidr 0.0.0.0/0

# 3. Launch the instance
#    AMI IDs are region-specific and change over time; the ID below is only an
#    example. Look up a current Ubuntu AMI for your region before running this.
aws ec2 run-instances \
  --image-id ami-0c55b159cbfafe1f0 \
  --instance-type t3.micro \
  --key-name myapp-key \
  --security-groups myapp-sg \
  --count 1 \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=myapp-web}]'

# 4. Get the public IP
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=myapp-web" \
  --query 'Reservations[0].Instances[0].PublicIpAddress' --output text

# 5. SSH in
ssh -i ~/.ssh/myapp-key.pem ubuntu@INSTANCE_IP

Instance Types (AWS Example)

| Family | Optimized For | Example | Use Case |
| --- | --- | --- | --- |
| t3/t4g | Burstable general purpose | t3.micro (2 vCPU, 1GB) | Low-traffic web apps, dev |
| m6i/m7g | Balanced compute/memory | m6i.large (2 vCPU, 8GB) | General web apps |
| c6i/c7g | Compute-intensive | c6i.xlarge (4 vCPU, 8GB) | API servers, batch processing |
| r6i/r7g | Memory-intensive | r6i.large (2 vCPU, 16GB) | Caching, in-memory DBs |
| Graviton (g suffix) | ARM-based, ~20% cheaper | t4g.micro | Everything (if your app supports ARM) |

Auto Scaling

Auto Scaling automatically adjusts the number of instances based on demand:

# Conceptual flow:
# 1. Create a Launch Template (defines instance config)
# 2. Create an Auto Scaling Group (min/max/desired instances)
# 3. Attach scaling policies (CPU > 70% → add instance)
# 4. Attach to a Load Balancer (distributes traffic)

# CloudWatch alarm triggers scaling:
# CPU > 70% for 5 minutes → scale out (add instances)
# CPU < 30% for 10 minutes → scale in (remove instances)
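Stripped of the AWS machinery, a scaling policy is just a threshold check over a window of recent metric samples. A minimal sketch using the thresholds above (the function shape is illustrative, not any SDK's API):

```javascript
// Decide a scaling action from recent CPU samples (percentages, newest last).
function scalingDecision(cpuSamples, { high = 70, low = 30 } = {}) {
  const avg = cpuSamples.reduce((sum, s) => sum + s, 0) / cpuSamples.length;
  if (avg > high) return 'scale-out'; // sustained load: add an instance
  if (avg < low) return 'scale-in';   // sustained idle: remove an instance
  return 'hold';                      // within the comfortable band
}
```

Real policies also add cooldown periods and minimum/maximum instance counts so the group doesn't oscillate, but the core decision is this comparison.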
Start simple. Don't set up auto-scaling on day one. Start with a single instance, monitor its resource usage, and add auto-scaling when you actually need it. Premature scaling adds complexity without benefit.

Chapter 8 — Containers & Orchestration

Containers package your application with all its dependencies into a portable, reproducible unit. This solves "works on my machine" permanently.

Dockerfile Best Practices

# Multi-stage build — keeps final image small
# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci                 # full install: the build step needs devDependencies
COPY . .
RUN npm run build
RUN npm prune --omit=dev   # drop devDependencies before copying into the final image

# Stage 2: Production image
FROM node:20-alpine
WORKDIR /app

# Don't run as root
RUN addgroup -S appgroup && adduser -S appuser -G appgroup

# Copy only what's needed
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./

USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]
Key Dockerfile principles:
• Use specific base image tags (node:20-alpine, not node:latest)
• Multi-stage builds to minimize image size
• Copy package.json first (layer caching for dependencies)
• Run as non-root user
• Add HEALTHCHECK for orchestrators
• Use .dockerignore to exclude node_modules, .git, etc.

Docker Compose (Multi-Service Apps)

# docker-compose.yml — typical web app stack
services:
  app:
    build: .
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgres://user:pass@db:5432/myapp
      - REDIS_URL=redis://cache:6379
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_started
    restart: unless-stopped

  db:
    image: postgres:16-alpine
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: myapp
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user -d myapp"]
      interval: 5s
      timeout: 3s
      retries: 5

  cache:
    image: redis:7-alpine
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      - app

volumes:
  pgdata:
# Commands
docker compose up -d          # Start all services
docker compose logs -f app    # Follow app logs
docker compose ps             # Status of all services
docker compose down           # Stop and remove
docker compose up -d --build  # Rebuild and restart

Container Registries

| Registry | Free Tier | Best For |
| --- | --- | --- |
| Docker Hub | 1 private repo | Public images, open source |
| GitHub Container Registry (ghcr.io) | Generous free | GitHub-based projects |
| AWS ECR | 500MB free | AWS deployments |
| Google Artifact Registry | 500MB free | GCP deployments |

Kubernetes — When You Need Orchestration

Kubernetes (K8s) manages containers at scale: scheduling, scaling, self-healing, service discovery, rolling updates.

You probably don't need Kubernetes. Docker Compose on a single VPS handles most workloads. K8s is for: multiple services, multiple teams, auto-scaling requirements, or when you need zero-downtime deployments with automated rollbacks. The operational overhead is significant.

Core Kubernetes Concepts

# Pod — smallest deployable unit (one or more containers)
# Deployment — manages replica sets, rolling updates
# Service — stable network endpoint for pods
# Ingress — HTTP routing from outside the cluster

# Example: Deployment + Service + Ingress
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: ghcr.io/you/myapp:v1.2.3
        ports:
        - containerPort: 3000
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 10
          periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-svc
spec:
  selector:
    app: myapp
  ports:
  - port: 80
    targetPort: 3000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
  - hosts:
    - myapp.example.com
    secretName: myapp-tls
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-svc
            port:
              number: 80

Managed Kubernetes Services

| Service | Provider | Starting Cost | Notes |
| --- | --- | --- | --- |
| EKS | AWS | $0.10/hr control plane + nodes | Most popular, complex |
| GKE | Google | 1 free zonal cluster + nodes | Best K8s experience (Google made K8s) |
| AKS | Azure | Free control plane + nodes | Good for Microsoft shops |
| DOKS | DigitalOcean | Free control plane + $12/node | Simplest managed K8s |

Chapter 9 — Serverless & Edge

Serverless means you write functions, not servers. The cloud provider handles all infrastructure — you pay only when your code runs.

How Serverless Works

Traditional Server:                    Serverless:
┌─────────────────────┐                ┌─────────────────────┐
│ Server running 24/7 │                │ No server running   │
│ (paying even idle)  │                │ (paying $0 idle)    │
│                     │                │                     │
│ Request → Process   │                │ Request → Cold Start│
│ Request → Process   │                │           → Process │
│ ...idle...          │                │ ...idle...          │
│ Request → Process   │                │ Request → Process   │
│ ...idle...          │                │           (warm)    │
│ ...nothing...       │                │ Request → Cold Start│
└─────────────────────┘                └─────────────────────┘
Cost: $$$$ (always on)                 Cost: $ (per invocation)

AWS Lambda Example

// handler.js — AWS Lambda function
exports.handler = async (event) => {
    const body = JSON.parse(event.body || '{}');

    // Your business logic here
    const result = await processRequest(body);

    return {
        statusCode: 200,
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(result)
    };
};

// Deploy with AWS SAM or Serverless Framework:
# serverless.yml
service: myapi
provider:
  name: aws
  runtime: nodejs20.x
  region: eu-west-1
functions:
  api:
    handler: handler.handler
    events:
      - httpApi:
          path: /api/{proxy+}
          method: ANY
    memorySize: 256
    timeout: 10

Cloudflare Workers (Edge Computing)

// worker.js — runs at 300+ edge locations worldwide
export default {
  async fetch(request, env) {
    const url = new URL(request.url);

    if (url.pathname === '/api/hello') {
      return new Response(JSON.stringify({ message: 'Hello from the edge!' }), {
        headers: { 'Content-Type': 'application/json' }
      });
    }

    // Proxy to origin for other routes
    return fetch(request);
  }
};

Serverless Comparison

Platform | Cold Start | Max Duration | Free Tier | Best For
-------- | ---------- | ------------ | --------- | --------
AWS Lambda | 100-500ms | 15 min | 1M requests/mo | Full backend APIs, event processing
Cloudflare Workers | ~0ms (no cold start) | 30s (free), 15min (paid) | 100K requests/day | Edge logic, fast APIs
Vercel Functions | ~250ms | 10-60s | 100GB-hrs/mo | Next.js apps, frontend teams
Netlify Functions | ~200ms | 10-26s | 125K requests/mo | JAMstack backends
Google Cloud Functions | 100-400ms | 9-60 min | 2M invocations/mo | GCP ecosystem, event-driven

When Serverless Fits

Use serverless when:
• Traffic is sporadic/unpredictable (pay-per-use saves money)
• Individual requests are short-lived (<30s)
• You want zero infrastructure management
• Event-driven workloads (file uploads, webhooks, scheduled tasks)
• API endpoints with variable traffic

Avoid serverless when:
• Consistent high traffic (a server is cheaper)
• Long-running processes (video encoding, ML training)
• WebSocket connections needed
• You need local filesystem or persistent state
• Cold starts are unacceptable (real-time systems)

Edge Computing

Edge computing runs your code at CDN points-of-presence (PoPs) close to users, reducing latency from ~100ms to ~10ms.

The hybrid pattern: Use edge functions for authentication, A/B testing, geolocation routing, and caching logic. Keep heavy business logic in a traditional server or Lambda. This gives you the best of both worlds — fast edge responses with powerful backend processing.
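
As a concrete sketch of that split, a minimal Workers-style handler (route names and the cookie check are illustrative, not from a real deployment) can gate authentication and geolocation at the edge and fall through to the origin for everything else:

```javascript
// Edge half of the hybrid pattern. Return a Response to answer at the
// edge, or null to fall through and proxy to the origin.
function handleAtEdge(request) {
  const url = new URL(request.url);
  const cookie = request.headers.get('Cookie') || '';

  // Authentication gate: reject unauthenticated /account traffic
  // without ever waking the origin.
  if (url.pathname.startsWith('/account') && !cookie.includes('session=')) {
    return new Response('Unauthorized', { status: 401 });
  }

  // Geolocation routing: Cloudflare attaches country info as request.cf.
  if (url.pathname === '/api/region') {
    const country = request.cf?.country || 'unknown';
    return new Response(JSON.stringify({ country }), {
      headers: { 'Content-Type': 'application/json' },
    });
  }

  return null; // heavy business logic stays on the backend
}

// The Worker entry point would wire it up like this:
// export default {
//   async fetch(request) {
//     return handleAtEdge(request) ?? fetch(request); // proxy to origin
//   },
// };
```

The decision function stays pure, which also makes it testable without an edge runtime.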

Chapter 10 — Cloud Providers Deep Dive

The "Big Three" (AWS, GCP, Azure) plus strong alternatives. Understanding their service ecosystems lets you pick the right provider and avoid vendor lock-in traps.

Service Mapping Across Providers

Category | AWS | GCP | Azure | DigitalOcean
-------- | --- | --- | ----- | ------------
Compute (VMs) | EC2 | Compute Engine | Virtual Machines | Droplets
Containers | ECS / EKS | Cloud Run / GKE | ACI / AKS | DOKS / App Platform
Serverless | Lambda | Cloud Functions | Azure Functions | Functions (beta)
Object Storage | S3 | Cloud Storage | Blob Storage | Spaces
SQL Database | RDS / Aurora | Cloud SQL / Spanner | Azure SQL | Managed Databases
NoSQL | DynamoDB | Firestore / Bigtable | Cosmos DB | MongoDB (managed)
CDN | CloudFront | Cloud CDN | Azure CDN | Spaces CDN
DNS | Route 53 | Cloud DNS | Azure DNS | DNS (basic)
Load Balancer | ALB / NLB | Cloud Load Balancing | Azure LB | Load Balancers
Secrets | Secrets Manager | Secret Manager | Key Vault | n/a
Monitoring | CloudWatch | Cloud Monitoring | Azure Monitor | Built-in metrics
IaC | CloudFormation | Deployment Manager | ARM / Bicep | Terraform only

Pricing Models

Model | Description | Savings | Commitment
----- | ----------- | ------- | ----------
On-Demand | Pay by the hour/second, no commitment | 0% (baseline) | None
Reserved / Committed | 1-3 year commitment for lower rate | 30-72% | 1-3 years
Spot / Preemptible | Unused capacity, can be terminated anytime | 60-90% | None (but unreliable)
Savings Plans | Commit to $/hr spend, flexible instance types | 20-50% | 1-3 years
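
The table translates into simple arithmetic. A back-of-envelope sketch (the hourly rate and discount figures below are illustrative placeholders, not real list prices):

```javascript
// Rough monthly cost of one always-on VM under each pricing model.
const HOURS_PER_MONTH = 730; // average hours in a month

function monthlyCost(hourlyRate, discount = 0) {
  return HOURS_PER_MONTH * hourlyRate * (1 - discount);
}

const rate = 0.0416;                      // illustrative on-demand $/hour
const onDemand = monthlyCost(rate);       // baseline
const reserved = monthlyCost(rate, 0.40); // ~40% off for a 1-year commitment
const spot = monthlyCost(rate, 0.70);     // ~70% off, but can be reclaimed

console.log(`on-demand: $${onDemand.toFixed(2)}/mo`);
console.log(`reserved:  $${reserved.toFixed(2)}/mo`);
console.log(`spot:      $${spot.toFixed(2)}/mo`);
```

The break-even question is always the same: will this instance actually run long enough, and predictably enough, to beat on-demand at the committed rate?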

Hands-On: Full Stack on AWS

Deploying a web app with EC2 + RDS + S3 + CloudFront:

# Architecture:
# CloudFront (CDN) → ALB → EC2 (app) → RDS (PostgreSQL)
#                         → S3 (static assets/uploads)

# Step 1: Create VPC with public/private subnets
aws ec2 create-vpc --cidr-block 10.0.0.0/16
# (In practice, use Terraform — shown in Chapter 14)

# Step 2: Launch RDS in private subnet
aws rds create-db-instance \
  --db-instance-identifier myapp-db \
  --db-instance-class db.t3.micro \
  --engine postgres \
  --engine-version 16 \
  --master-username admin \
  --master-user-password "$(openssl rand -base64 24)" \
  --allocated-storage 20 \
  --no-publicly-accessible \
  --vpc-security-group-ids sg-xxxxx

# Step 3: Create S3 bucket for assets
aws s3 mb s3://myapp-assets-prod
# New buckets block public bucket policies by default, so allow them first:
aws s3api delete-public-access-block --bucket myapp-assets-prod
aws s3api put-bucket-policy --bucket myapp-assets-prod \
  --policy '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":"*","Action":"s3:GetObject","Resource":"arn:aws:s3:::myapp-assets-prod/*"}]}'

# Step 4: Create CloudFront distribution
aws cloudfront create-distribution \
  --origin-domain-name myapp-assets-prod.s3.amazonaws.com \
  --default-root-object index.html

# Step 5: Set up ALB + EC2 (use launch template + auto-scaling group)
# Step 6: Configure Route 53 to point domain to CloudFront

Hands-On: Full Stack on DigitalOcean

# Architecture:
# Load Balancer → Droplet(s) → Managed PostgreSQL
#                            → Spaces (S3-compatible storage)

# Step 1: Create a Droplet
doctl compute droplet create myapp-web \
  --image ubuntu-22-04-x64 \
  --size s-1vcpu-2gb \
  --region fra1 \
  --ssh-keys YOUR_KEY_FINGERPRINT

# Step 2: Create managed database
doctl databases create myapp-db \
  --engine pg \
  --version 16 \
  --size db-s-1vcpu-1gb \
  --region fra1 \
  --num-nodes 1

# Step 3: Create Spaces bucket (S3-compatible)
# Done via web console or s3cmd with DO endpoint

# Step 4: Create load balancer
doctl compute load-balancer create \
  --name myapp-lb \
  --region fra1 \
  --forwarding-rules "entry_protocol:https,entry_port:443,target_protocol:http,target_port:3000,certificate_id:YOUR_CERT" \
  --droplet-ids DROPLET_ID

# Step 5: Point domain DNS to load balancer IP

Which Provider to Choose

AWS: Largest ecosystem, most services, best for enterprise. Steep learning curve, complex pricing.
GCP: Best for data/ML, Kubernetes (they invented it), clean APIs. Smaller market share.
Azure: Best for Microsoft/.NET shops, enterprise AD integration. Complex portal.
DigitalOcean: Simplest UX, predictable pricing, great docs. Fewer services, smaller scale.
Hetzner: Best price/performance in EU. Minimal managed services but unbeatable value.

Rule of thumb: Start with DigitalOcean or Hetzner for simplicity. Move to AWS/GCP when you need managed services (ML, analytics, complex networking) that simpler providers don't offer.

Chapter 11 — Domain Names & DNS

DNS (Domain Name System) translates human-readable names to IP addresses. It's the phone book of the internet, and misconfiguring it is one of the most common causes of "my site is down."

How DNS Resolution Works

User types: www.example.com
           │
           ▼
┌─────────────────────┐
│  Browser DNS Cache  │ ← Checked first (TTL-based)
└──────────┬──────────┘
           │ Cache miss
           ▼
┌─────────────────────┐
│    OS DNS Cache     │ ← /etc/hosts, systemd-resolved
└──────────┬──────────┘
           │ Cache miss
           ▼
┌─────────────────────┐
│ Recursive Resolver  │ ← Your ISP's DNS or 1.1.1.1 / 8.8.8.8
│ (e.g., Cloudflare)  │
└──────────┬──────────┘
           │ Queries hierarchy:
           ▼
┌─────────────────────┐
│  Root Name Servers  │ ← "Who handles .com?"
│ (13 clusters, a-m)  │ → "Ask the .com TLD servers"
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│  TLD Name Servers   │ ← "Who handles example.com?"
│  (.com, .org, .io)  │ → "Ask ns1.cloudflare.com"
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│  Authoritative NS   │ ← "What's the A record for www.example.com?"
│ (your DNS provider) │ → "93.184.216.34, TTL 300"
└──────────┬──────────┘
           │
           ▼
Result cached at each level for TTL duration
Browser connects to 93.184.216.34

DNS Record Types

Type | Purpose | Example
---- | ------- | -------
A | Maps name to IPv4 address | example.com → 93.184.216.34
AAAA | Maps name to IPv6 address | example.com → 2606:2800:220:1:...
CNAME | Alias to another name | www.example.com → example.com
MX | Mail server for the domain | example.com → mail.example.com (priority 10)
TXT | Arbitrary text (verification, SPF, DKIM) | example.com → "v=spf1 include:_spf.google.com ~all"
NS | Nameservers for the domain | example.com → ns1.cloudflare.com
SRV | Service location (port + priority) | _sip._tcp.example.com → sipserver.example.com:5060
CAA | Which CAs can issue certificates | example.com → "0 issue letsencrypt.org"

Practical DNS Configuration

# Typical DNS setup for a web app:

# Root domain → your server
example.com.        A       93.184.216.34
example.com.        AAAA    2606:2800:220:1::248

# www subdomain → alias to root
www.example.com.    CNAME   example.com.

# API subdomain → different server or load balancer
api.example.com.    A       10.20.30.40

# Email (Google Workspace example)
example.com.        MX      1  aspmx.l.google.com.
example.com.        MX      5  alt1.aspmx.l.google.com.
example.com.        TXT     "v=spf1 include:_spf.google.com ~all"

# DKIM (email authentication)
google._domainkey.example.com.  TXT  "v=DKIM1; k=rsa; p=MIGfMA0..."

# DMARC (email policy)
_dmarc.example.com. TXT     "v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com"

# Let's Encrypt verification
_acme-challenge.example.com.  TXT  "random-verification-string"

TTL (Time To Live)

TTL tells resolvers how long they may cache a record, in seconds. Common values: 300 (5 minutes) for records you expect to change soon, 3600 (1 hour) as a sensible default, 86400 (1 day) for records that rarely change.

Before a migration: Lower TTL to 300 seconds 24-48 hours before changing the record. This ensures the old high TTL expires, and when you make the change, it propagates within 5 minutes instead of hours.
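
The caching behavior every resolver in the chain applies can be sketched in a few lines (a toy in-memory cache, not a real resolver):

```javascript
// TTL-based cache: entries expire ttlSeconds after being stored,
// exactly like a DNS record aging out of a resolver's cache.
class TtlCache {
  constructor() {
    this.entries = new Map();
  }

  set(name, value, ttlSeconds) {
    this.entries.set(name, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }

  get(name) {
    const entry = this.entries.get(name);
    if (!entry) return null; // miss: would query upstream
    if (Date.now() >= entry.expiresAt) {
      this.entries.delete(name); // TTL elapsed: treat as a miss
      return null;
    }
    return entry.value; // hit: answer without querying upstream
  }
}

const cache = new TtlCache();
cache.set('example.com', '93.184.216.34', 300); // cached for 5 minutes
console.log(cache.get('example.com'));
```

A record cached with a one-hour TTL keeps being served, stale or not, until the hour is up; that lag is exactly why you lower TTLs before a migration.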

DNS Providers

Provider | Free Tier | Best Feature | Best For
-------- | --------- | ------------ | --------
Cloudflare | Unlimited zones | CDN + DDoS + DNS in one | Most websites (recommended default)
AWS Route 53 | None ($0.50/zone) | Latency/geo routing, health checks | AWS-heavy infrastructure
Google Cloud DNS | None | Low latency, DNSSEC | GCP infrastructure
NS1 | Limited free | Advanced traffic management | Complex routing needs

CDN Integration

A CDN (Content Delivery Network) caches your content at edge locations worldwide. DNS is how you route users to the nearest edge:

# Without CDN: Users hit your origin server directly
example.com.  A  YOUR_SERVER_IP

# With Cloudflare (proxy mode): Users hit Cloudflare edge, which proxies to origin
# Just enable the orange cloud icon in Cloudflare dashboard
# DNS resolves to Cloudflare's anycast IPs, not your server

# With AWS CloudFront: Point domain to CloudFront distribution
example.com.  ALIAS  d1234567890.cloudfront.net.
# (ALIAS is AWS-specific; equivalent to CNAME at zone apex)

Chapter 12 — Reverse Proxies & Load Balancing

A reverse proxy sits between the internet and your application servers. It handles SSL termination, load balancing, caching, rate limiting, and request routing — so your app doesn't have to.

Why Use a Reverse Proxy

• SSL termination: certificates live in one place, not in every app
• Load balancing: spread traffic across multiple app instances
• Caching and compression: serve static and repeat content without touching the app
• Rate limiting: absorb abusive traffic before it reaches your code
• Request routing: one public endpoint in front of many internal services

Nginx — Full Production Configuration

# /etc/nginx/sites-available/myapp.conf
upstream app_backend {
    # Load balancing across multiple app instances
    server 127.0.0.1:3000 weight=3;
    server 127.0.0.1:3001 weight=2;
    server 127.0.0.1:3002 backup;    # Only used if others are down

    # Health checks (Nginx Plus) or use passive checks
    keepalive 32;    # Persistent connections to backend
}

# Redirect HTTP to HTTPS
server {
    listen 80;
    server_name example.com www.example.com;
    return 301 https://$host$request_uri;    # $host preserves the requested name (www or apex)
}

# Main HTTPS server
server {
    listen 443 ssl http2;
    server_name example.com www.example.com;

    # SSL Configuration
    ssl_certificate /etc/letsencrypt/live/example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
    ssl_prefer_server_ciphers off;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 1d;

    # Security headers
    add_header Strict-Transport-Security "max-age=63072000; includeSubDomains" always;
    add_header X-Frame-Options DENY always;
    add_header X-Content-Type-Options nosniff always;
    add_header X-XSS-Protection "1; mode=block" always;

    # Gzip compression
    gzip on;
    gzip_types text/plain text/css application/json application/javascript text/xml;
    gzip_min_length 1000;

    # Static files — served directly by Nginx (fast)
    location /static/ {
        alias /var/www/myapp/static/;
        expires 30d;
        add_header Cache-Control "public, immutable";
    }

    # API — proxy to backend
    location /api/ {
        proxy_pass http://app_backend;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Connection "";

        # Timeouts
        proxy_connect_timeout 5s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;

        # Rate limiting
        limit_req zone=api burst=20 nodelay;
    }

    # WebSocket support
    location /ws/ {
        proxy_pass http://app_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_read_timeout 86400;    # Keep WebSocket alive
    }
}

# Rate limiting zone (defined in nginx.conf http block)
# limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

Caddy — The Modern Alternative (Auto-HTTPS)

# Caddyfile — entire config for production site with auto-SSL
example.com {
    # Automatic HTTPS (Let's Encrypt) — zero configuration needed!

    # Reverse proxy to app
    reverse_proxy /api/* localhost:3000

    # Static files
    root * /var/www/myapp/static
    file_server

    # Compression
    encode gzip zstd

    # Security headers
    header {
        Strict-Transport-Security "max-age=63072000; includeSubDomains"
        X-Frame-Options DENY
        X-Content-Type-Options nosniff
    }

    # Rate limiting (requires the caddy-ratelimit plugin)
    rate_limit {
        zone api {
            key    {remote_host}
            events 10
            window 1s
        }
    }

    # Logging
    log {
        output file /var/log/caddy/access.log
        format json
    }
}

Caddy vs Nginx: Caddy automatically obtains and renews SSL certificates with zero configuration. For new projects, Caddy is often the better choice — less config, automatic HTTPS, modern defaults. Nginx is better when you need maximum performance tuning or have complex routing requirements.

Load Balancing Algorithms

Algorithm | How It Works | Best For
--------- | ------------ | --------
Round Robin | Requests distributed sequentially | Identical servers, stateless apps
Weighted Round Robin | More requests to higher-weight servers | Mixed server capacities
Least Connections | Send to server with fewest active connections | Variable request durations
IP Hash | Same client IP always goes to same server | Session affinity (sticky sessions)
Random | Random server selection | Large clusters, simple
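
Three of these algorithms are small enough to sketch directly (an in-memory server list with illustrative host names):

```javascript
const servers = [
  { host: 'app1', active: 0 },
  { host: 'app2', active: 0 },
  { host: 'app3', active: 0 },
];

// Round robin: hand out servers in order, wrapping around.
let rrIndex = 0;
function roundRobin() {
  const server = servers[rrIndex];
  rrIndex = (rrIndex + 1) % servers.length;
  return server;
}

// Least connections: pick the server with the fewest in-flight requests.
function leastConnections() {
  return servers.reduce((best, s) => (s.active < best.active ? s : best));
}

// IP hash: the same client IP always lands on the same server (sticky).
function ipHash(clientIp) {
  let h = 0;
  for (const c of clientIp) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return servers[h % servers.length];
}
```

Real proxies layer health checks on top of the same ideas — the Nginx upstream block above adds weights and a backup server to plain round robin.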

Cloud Load Balancers

Type | AWS | Layer | Use Case
---- | --- | ----- | --------
Application LB | ALB | Layer 7 (HTTP) | Web apps, path-based routing, WebSocket
Network LB | NLB | Layer 4 (TCP/UDP) | High performance, static IP, non-HTTP
Gateway LB | GWLB | Layer 3 | Network appliances (firewalls, IDS)

Chapter 13 — CI/CD Pipelines

CI/CD (Continuous Integration / Continuous Deployment) automates the path from code commit to production. No more manual deployments, no more "I forgot to run the tests."

CI vs CD

Term | What It Does | Triggered By
---- | ------------ | ------------
Continuous Integration (CI) | Automatically build and test every commit/PR | Every push or pull request
Continuous Delivery (CD) | Automatically prepare releases (deploy to staging) | Merge to main branch
Continuous Deployment (CD) | Automatically deploy to production | After all checks pass

Developer pushes code
          │
          ▼
┌────────────────────────── CI Pipeline ──────────────────────────┐
│                                                                 │
│  ┌────────┐  ┌────────┐  ┌────────┐  ┌─────────┐  ┌────────┐   │
│  │  Lint  │─▶│  Test  │─▶│ Build  │─▶│  Scan   │─▶│Artifact│   │
│  │        │  │(unit + │  │(compile│  │(security│  │(Docker │   │
│  │        │  │integr.)│  │ bundle)│  │  vulns) │  │ image) │   │
│  └────────┘  └────────┘  └────────┘  └─────────┘  └────────┘   │
│                                                                 │
└───────────────────────────────┬─────────────────────────────────┘
                                │ All green ✓
                                ▼
┌────────────────────────── CD Pipeline ──────────────────────────┐
│                                                                 │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────────┐  │
│  │  Deploy  │─▶│  Smoke   │─▶│  Deploy  │─▶│    Monitor     │  │
│  │  Staging │  │  Tests   │  │Production│  │+ Auto-rollback │  │
│  └──────────┘  └──────────┘  └──────────┘  └────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

GitHub Actions — Complete Workflow

# .github/workflows/deploy.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  # ─── CI: Test & Build ───────────────────────────────────
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: test
          POSTGRES_PASSWORD: test
        ports: ['5432:5432']
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'

      - run: npm ci
      - run: npm run lint
      - run: npm run test
        env:
          DATABASE_URL: postgres://postgres:test@localhost:5432/test

  # ─── Build & Push Docker Image ─────────────────────────
  build:
    needs: test
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    permissions:
      contents: read
      packages: write
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4

      - uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=
            type=raw,value=latest

      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  # ─── Deploy to Production ──────────────────────────────
  deploy:
    needs: build
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    environment: production    # Requires approval if configured
    steps:
      - name: Deploy to server via SSH
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.SERVER_HOST }}
          username: deploy
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script: |
            docker pull ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
            docker stop myapp || true
            docker rm myapp || true
            docker run -d \
              --name myapp \
              --restart unless-stopped \
              -p 3000:3000 \
              --env-file /etc/myapp.env \
              ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
            # Health check
            sleep 5
            curl -f http://localhost:3000/health || (docker logs myapp && exit 1)

Deployment Strategies

Strategy | How It Works | Downtime | Risk | Rollback Speed
-------- | ------------ | -------- | ---- | --------------
Recreate | Stop old, start new | Yes (seconds-minutes) | Low complexity | Redeploy old version
Rolling | Replace instances one by one | No | Medium (mixed versions briefly) | Continue rolling with old
Blue-Green | Run two identical environments, switch traffic | No | Low (instant switch) | Switch back instantly
Canary | Route small % of traffic to new version | No | Lowest (limited blast radius) | Route 100% back to old

Blue-Green Deployment:

  Before:  [Load Balancer] ──100%──▶ [Blue  v1.0] (active)
                                     [Green v1.1] (idle, being tested)

  Switch:  [Load Balancer] ──100%──▶ [Green v1.1] (now active)
                                     [Blue  v1.0] (idle, rollback ready)

Canary Deployment:

  Step 1:  [Load Balancer] ──95%───▶ [v1.0] (stable)
                           ──5%────▶ [v1.1] (canary)

  Step 2:  [Load Balancer] ──50%───▶ [v1.0]
                           ──50%───▶ [v1.1] (looking good)

  Step 3:  [Load Balancer] ──100%──▶ [v1.1] (fully rolled out)
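
Canary routing hinges on one detail: the split must be deterministic per user, or people flip between versions on every request. A sketch (the hash and version labels are illustrative):

```javascript
// Map a stable user ID to a bucket 0-99, then compare against the
// canary percentage. The same user always gets the same answer.
function bucket(userId) {
  let h = 0;
  for (const c of userId) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h % 100;
}

function chooseVersion(userId, canaryPercent) {
  return bucket(userId) < canaryPercent ? 'v1.1-canary' : 'v1.0-stable';
}

// Ramping the rollout is just raising canaryPercent: 5 → 50 → 100.
console.log(chooseVersion('user-42', 5));
```

In production this logic usually lives in the load balancer or service mesh, but the principle is identical.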

GitLab CI Example

# .gitlab-ci.yml
stages:
  - test
  - build
  - deploy

test:
  stage: test
  image: node:20
  services:
    - postgres:16
  variables:
    DATABASE_URL: postgres://postgres:test@postgres:5432/test
  script:
    - npm ci
    - npm run lint
    - npm run test

build:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  only:
    - main

deploy:
  stage: deploy
  script:
    - ssh deploy@$SERVER "docker pull $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA && docker-compose up -d"
  only:
    - main
  environment:
    name: production

Chapter 14 — Infrastructure as Code

Infrastructure as Code (IaC) means defining your servers, networks, databases, and all cloud resources in version-controlled configuration files — not clicking through web consoles.

Why IaC

• Reproducible: rebuild an identical environment from the same files
• Reviewable: infrastructure changes go through the same pull-request review as code
• Versioned: every change lives in Git history and can be rolled back with a revert
• Self-documenting: the code is an always-current description of what exists

Terraform — The Industry Standard

# main.tf — Deploy a web app on AWS

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  # Store state remotely (never in git!)
  backend "s3" {
    bucket = "myapp-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "eu-west-1"
  }
}

provider "aws" {
  region = var.region
}

# ─── VPC ──────────────────────────────────────────────────
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0"

  name = "${var.app_name}-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["${var.region}a", "${var.region}b"]
  public_subnets  = ["10.0.1.0/24", "10.0.2.0/24"]
  private_subnets = ["10.0.10.0/24", "10.0.11.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = true    # Cost saving for non-prod
}

# ─── Database ─────────────────────────────────────────────
resource "aws_db_instance" "main" {
  identifier     = "${var.app_name}-db"
  engine         = "postgres"
  engine_version = "16"
  instance_class = "db.t3.micro"

  allocated_storage = 20
  storage_encrypted = true

  db_name  = var.app_name
  username = "admin"
  password = var.db_password    # From secrets, never hardcoded

  vpc_security_group_ids = [aws_security_group.db.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name

  skip_final_snapshot = false
  final_snapshot_identifier = "${var.app_name}-final-snapshot"

  backup_retention_period = 7
  multi_az               = var.environment == "prod"
}

# ─── EC2 Instance ─────────────────────────────────────────
resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type

  subnet_id              = module.vpc.public_subnets[0]
  vpc_security_group_ids = [aws_security_group.web.id]
  key_name               = aws_key_pair.deploy.key_name

  user_data = templatefile("${path.module}/userdata.sh", {
    app_name = var.app_name
    db_host  = aws_db_instance.main.endpoint
  })

  tags = {
    Name        = "${var.app_name}-web"
    Environment = var.environment
  }
}

# ─── Variables ────────────────────────────────────────────
# variables.tf
variable "app_name" { default = "myapp" }
variable "region" { default = "eu-west-1" }
variable "environment" { default = "prod" }
variable "instance_type" { default = "t3.small" }
variable "db_password" { sensitive = true }
# Terraform workflow:
terraform init          # Download providers, initialize backend
terraform plan          # Preview changes (ALWAYS review this)
terraform apply         # Apply changes (creates/modifies resources)
terraform destroy       # Tear down everything (careful!)

Ansible — Configuration Management

Terraform creates infrastructure. Ansible configures it (installs software, deploys apps, manages configs).

# playbook.yml — Configure a web server
---
- name: Configure web server
  hosts: webservers
  become: yes
  vars:
    app_name: myapp
    app_port: 3000
    node_version: "20"

  tasks:
    - name: Update apt cache
      apt:
        update_cache: yes
        cache_valid_time: 3600

    - name: Install required packages
      apt:
        name: [nginx, certbot, python3-certbot-nginx, ufw]
        state: present

    - name: Configure UFW firewall
      ufw:
        rule: allow
        port: "{{ item }}"
        proto: tcp
      loop: ['22', '80', '443']

    - name: Enable UFW
      ufw:
        state: enabled
        policy: deny

    - name: Install Node.js
      shell: |
        curl -fsSL https://deb.nodesource.com/setup_{{ node_version }}.x | bash -
        apt-get install -y nodejs
      args:
        creates: /usr/bin/node

    - name: Deploy application
      git:
        repo: "https://github.com/you/{{ app_name }}.git"
        dest: "/opt/{{ app_name }}"
        version: main
      notify: restart app

    - name: Install dependencies
      npm:
        path: "/opt/{{ app_name }}"
        production: yes

    - name: Create systemd service
      template:
        src: templates/app.service.j2
        dest: "/etc/systemd/system/{{ app_name }}.service"
      notify: restart app

    - name: Configure Nginx
      template:
        src: templates/nginx.conf.j2
        dest: "/etc/nginx/sites-available/{{ app_name }}"
      notify: reload nginx

  handlers:
    - name: restart app
      systemd:
        name: "{{ app_name }}"
        state: restarted
        daemon_reload: yes

    - name: reload nginx
      systemd:
        name: nginx
        state: reloaded

# Run Ansible:
ansible-playbook -i inventory.yml playbook.yml

# inventory.yml
webservers:
  hosts:
    web1:
      ansible_host: 93.184.216.34
      ansible_user: deploy

GitOps Principles

GitOps extends IaC with a simple discipline: Git is the single source of truth for both code and infrastructure, every change arrives as a pull request, and an automated process applies whatever is merged. If the live environment drifts from what Git describes, tooling such as Argo CD or Flux reconciles it back.

Terraform vs Ansible vs CloudFormation:
Terraform: Multi-cloud, creates infrastructure (VMs, networks, DBs). Industry standard.
Ansible: Configures existing servers (install software, deploy apps). Agentless, SSH-based.
CloudFormation: AWS-only IaC. Use if you're 100% AWS and want native integration.
Pulumi: IaC in real programming languages (TypeScript, Python, Go). Good for developers who dislike HCL.

Common combo: Terraform to create infrastructure + Ansible to configure it. Or Terraform + Docker (no config management needed — the container IS the config).

Chapter 15 — Monitoring, Logging & Observability

If you can't see what's happening in production, you can't fix it. Observability is the ability to understand your system's internal state from its external outputs.

The Three Pillars

Pillar | What | Tools | Answers
------ | ---- | ----- | -------
Metrics | Numeric measurements over time | Prometheus, CloudWatch, Datadog | "How much?" "How fast?" "How often?"
Logs | Discrete events with context | Loki, ELK, CloudWatch Logs | "What happened?" "Why did it fail?"
Traces | Request flow across services | Jaeger, Zipkin, AWS X-Ray | "Where is the bottleneck?" "Which service is slow?"
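
For the logs pillar, the single highest-leverage habit is emitting one JSON object per event instead of free-form text, so Loki or CloudWatch can filter on fields. A dependency-free sketch (field names are illustrative):

```javascript
// Structured logging: one parseable JSON line per event.
function formatLog(level, message, fields = {}) {
  return JSON.stringify({
    ts: new Date().toISOString(),
    level,
    message,
    ...fields,
  });
}

function log(level, message, fields) {
  console.log(formatLog(level, message, fields));
}

log('info', 'request completed', { method: 'GET', path: '/api/users', status: 200, duration_ms: 42 });
log('error', 'db query failed', { error: 'timeout', retries: 3 });
```

In production you would reach for a library like pino or winston, but the output contract is the same: fields a log pipeline can query, not prose you have to grep.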

Prometheus + Grafana (The Open-Source Standard)

# docker-compose.yml — Monitoring stack
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3001:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: changeme

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro

volumes:
  prometheus_data:
  grafana_data:
# prometheus.yml — Scrape configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'myapp'
    static_configs:
      - targets: ['myapp:3000']
    metrics_path: '/metrics'

Application Metrics (What to Measure)

The RED method for services and USE method for resources:

Method | Metric | What It Tells You
------ | ------ | -----------------
RED (Services) | Rate | Requests per second
 | Errors | Failed requests per second
 | Duration | Response time (p50, p95, p99)
USE (Resources) | Utilization | % of resource capacity used
 | Saturation | Queue depth, waiting work
 | Errors | Error count on the resource

# Example: Exposing metrics in Node.js (using prom-client)
const client = require('prom-client');

// Default metrics (CPU, memory, event loop)
client.collectDefaultMetrics();

// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});

// Middleware to track requests
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    end({ method: req.method, route: req.route?.path || req.path, status_code: res.statusCode });
  });
  next();
});

// Metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

Alerting Rules

# alerts.yml — Prometheus alerting rules
groups:
  - name: webapp
    rules:
      - alert: HighErrorRate
        expr: rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]) / rate(http_request_duration_seconds_count[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate (> 5%)"
          description: "{{ $labels.instance }} has {{ $value | humanizePercentage }} error rate"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency (> 2s)"

      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"

SLIs, SLOs, and SLAs

Term | Definition | Example
---- | ---------- | -------
SLI (Indicator) | A measurable metric of service quality | 99.2% of requests complete in <500ms
SLO (Objective) | Target value for an SLI (internal goal) | "99.9% availability over 30 days"
SLA (Agreement) | Contract with customers (with consequences) | "99.9% uptime or we credit your bill"

Error budgets: If your SLO is 99.9% uptime (43 minutes downtime/month), you have a 0.1% "error budget." Use it for deployments, experiments, and maintenance. When the budget is exhausted, freeze deployments and focus on reliability.
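
The arithmetic behind that 43 minutes:

```javascript
// Error budget: the downtime an SLO permits over a window.
function errorBudgetMinutes(sloPercent, days = 30) {
  const totalMinutes = days * 24 * 60; // 43,200 minutes in 30 days
  return totalMinutes * (1 - sloPercent / 100);
}

console.log(errorBudgetMinutes(99.9));  // ≈ 43.2 minutes per 30 days
console.log(errorBudgetMinutes(99.99)); // ≈ 4.3 minutes: one more nine is ten times stricter
```

Each additional nine cuts the budget by a factor of ten, which is why "five nines" is an engineering commitment, not a checkbox.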

Chapter 16 — Security & Hardening

Security is not a feature you add later — it's a practice woven into every layer. This chapter covers the essential security measures for any production website.

SSL/TLS — Encrypting Traffic

How TLS Works (Simplified)

Client                                           Server
  │                                                │
  │──── ClientHello (supported ciphers) ──────────▶│
  │◀─── ServerHello (chosen cipher) ───────────────│
  │◀─── Certificate (public key) ──────────────────│
  │                                                │
  │  Client verifies certificate chain:            │
  │  Server cert → Intermediate CA → Root CA       │
  │                                                │
  │──── Key Exchange (encrypted) ─────────────────▶│
  │                                                │
  │◀═══ Encrypted communication begins ═══════════▶│
  │     (symmetric encryption, fast)               │

Certificate Options

Provider | Cost | Validation | Best For
-------- | ---- | ---------- | --------
Let's Encrypt | Free | Domain Validation (DV) | Everything (90-day auto-renewal)
Cloudflare | Free (with proxy) | DV | Sites behind Cloudflare
AWS ACM | Free (with AWS services) | DV | AWS ALB/CloudFront
Commercial CAs | $10-1000/yr | OV/EV | Enterprise, legal requirements

HSTS (HTTP Strict Transport Security)

# Force browsers to always use HTTPS (add to Nginx/response headers)
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload

# Once set, browsers will NEVER make HTTP requests to your domain
# Submit to HSTS preload list: https://hstspreload.org/
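
If your Node process terminates TLS itself (no Nginx in front), you can emit the same header from app code; a small helper (the function name and defaults are illustrative):

```javascript
// Build a Strict-Transport-Security header value.
function hstsHeader({ maxAge = 63072000, includeSubDomains = true, preload = false } = {}) {
  const parts = [`max-age=${maxAge}`];
  if (includeSubDomains) parts.push('includeSubDomains');
  if (preload) parts.push('preload'); // preload also requires includeSubDomains
  return parts.join('; ');
}

// In a request handler:
// res.setHeader('Strict-Transport-Security', hstsHeader({ preload: true }));
console.log(hstsHeader({ preload: true }));
// → max-age=63072000; includeSubDomains; preload
```

Start with a short max-age while testing; once a browser has seen a long-lived HSTS header, there is no quick way to undo it.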

Firewall Configuration

# UFW (Uncomplicated Firewall) — Ubuntu/Debian
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp      # SSH (or your custom port)
sudo ufw allow 80/tcp      # HTTP
sudo ufw allow 443/tcp     # HTTPS
sudo ufw enable
sudo ufw status verbose

# iptables (lower level, more control)
# Allow established connections
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# Allow loopback
iptables -A INPUT -i lo -j ACCEPT
# Allow SSH, HTTP, HTTPS
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
# Drop everything else
iptables -A INPUT -j DROP

# Save rules (persist across reboot)
sudo apt install iptables-persistent
sudo netfilter-persistent save

SSH Hardening

# /etc/ssh/sshd_config — Production SSH configuration
Port 2222                          # Non-standard port (reduces noise)
PermitRootLogin no                 # Never allow root SSH
PasswordAuthentication no          # Keys only
PubkeyAuthentication yes
MaxAuthTries 3
ClientAliveInterval 300
ClientAliveCountMax 2
AllowUsers deploy                  # Whitelist specific users
Protocol 2                         # Ignored by modern OpenSSH (protocol 1 was removed); kept for old hosts

# Optional: Restrict to specific IPs (if you have static IP)
# AllowUsers deploy@YOUR_IP
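
Pair the above with Fail2ban to ban IPs that hammer sshd. A minimal jail sketch; the values are examples, and the port must match the `Port` line in sshd_config:

```ini
# /etc/fail2ban/jail.local
[sshd]
enabled  = true
port     = 2222
maxretry = 5
findtime = 10m
bantime  = 1h
```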

Security Headers

# Add to Nginx server block or application responses:
add_header X-Frame-Options "DENY" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;  # legacy header; modern browsers ignore it and rely on CSP
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
add_header Permissions-Policy "camera=(), microphone=(), geolocation=()" always;
add_header Content-Security-Policy "default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline';" always;

# Test your headers: https://securityheaders.com/

Secrets Management

Never store secrets in Git repositories, Dockerfiles, CI logs (via echoed environment variables), or plain-text config files committed to version control. Even in a private repo, credentials in Git history are permanent.
| Tool | Type | Best For |
| --- | --- | --- |
| Environment variables | Runtime injection | Simple apps (loaded from secure source) |
| AWS Secrets Manager | Cloud-managed | AWS workloads, auto-rotation |
| HashiCorp Vault | Self-hosted/cloud | Multi-cloud, dynamic secrets, PKI |
| SOPS | Encrypted files in Git | GitOps workflows, small teams |
| Doppler / 1Password | SaaS | Teams wanting simple UI |

# Example: Using SOPS to encrypt secrets in Git
# Install: brew install sops age

# Generate an age key
age-keygen -o keys.txt
# Public key: age1xxxxxxx...

# Create .sops.yaml in repo root
creation_rules:
  - path_regex: \.enc\.yaml$
    age: age1xxxxxxx...

# Encrypt a secrets file
sops --encrypt secrets.yaml > secrets.enc.yaml
# secrets.enc.yaml is safe to commit — encrypted at rest

# Decrypt at deploy time
sops --decrypt secrets.enc.yaml > /etc/myapp.env
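
One way to consume the decrypted file, assuming the app runs under systemd (the unit name and binary path are placeholders):

```ini
# /etc/systemd/system/myapp.service (fragment)
[Service]
# Loads KEY=value pairs from the file written by the sops --decrypt step
EnvironmentFile=/etc/myapp.env
ExecStart=/usr/local/bin/myapp
```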

Backup Strategy — The 3-2-1 Rule

Keep 3 copies of your data, on 2 different types of storage, with 1 copy offsite. The script below yields all three: the live database, a local dump, and an offsite copy in S3.

# Automated database backup script
#!/bin/bash
# /opt/scripts/backup-db.sh
set -euo pipefail

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/opt/backups"
S3_BUCKET="s3://myapp-backups-prod"

# Dump database
pg_dump -h localhost -U myapp myapp_prod | gzip > "$BACKUP_DIR/db_$TIMESTAMP.sql.gz"

# Upload to S3 (offsite copy)
aws s3 cp "$BACKUP_DIR/db_$TIMESTAMP.sql.gz" "$S3_BUCKET/db/$TIMESTAMP.sql.gz"

# Retain only the 7 most recent local backups
ls -t "$BACKUP_DIR"/db_*.sql.gz | tail -n +8 | xargs rm -f

# Cron: Run daily at 3 AM
# 0 3 * * * /opt/scripts/backup-db.sh >> /var/log/backup.log 2>&1
Test your backups! A backup you've never restored is not a backup — it's a hope. Schedule monthly restore tests to a separate environment. Automate this if possible.
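
The retention line (`ls -t … | tail -n +8 | xargs rm -f`) is also worth dry-running before you trust it with real backups. A self-contained sketch using fake files in a scratch directory:

```shell
#!/bin/bash
set -euo pipefail

# Create 10 fake backups with distinct, ordered mtimes (touch -t: YYMMDDhhmm)
DIR=$(mktemp -d)
for i in $(seq -w 1 10); do
  touch -t "24${i}010300" "$DIR/db_fake_$i.sql.gz"
done

# The retention rule from the backup script: keep only the 7 newest
ls -t "$DIR"/db_*.sql.gz | tail -n +8 | xargs rm -f

REMAINING=$(ls "$DIR"/db_*.sql.gz | wc -l | tr -d ' ')
echo "backups remaining: $REMAINING"   # → backups remaining: 7
rm -rf "$DIR"
```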

DDoS Mitigation

Production Security Checklist

Before going live, verify:
☐ HTTPS everywhere (HSTS enabled)
☐ SSH: key-only auth, non-standard port, root disabled
☐ Firewall: only required ports open
☐ Security headers configured
☐ Secrets not in Git (use secrets manager)
☐ Database not publicly accessible
☐ Automatic security updates enabled
☐ Fail2ban or equivalent running
☐ Backups automated and tested
☐ Dependencies scanned for vulnerabilities (npm audit, Snyk)
☐ Application logs don't contain sensitive data
☐ Rate limiting on authentication endpoints
☐ CORS configured correctly (not wildcard in production)

Chapter 17 — Maintenance & Operations

Launching is just the beginning. Day-2 operations — keeping the system running, updated, and healthy — is where most of the work lives.

OS & Dependency Updates

# Automatic security updates (Ubuntu)
sudo apt install unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades

# /etc/apt/apt.conf.d/50unattended-upgrades
Unattended-Upgrade::Allowed-Origins {
    "${distro_id}:${distro_codename}-security";
};
Unattended-Upgrade::Automatic-Reboot "true";
Unattended-Upgrade::Automatic-Reboot-Time "03:00";

# Application dependency updates — use Dependabot or Renovate
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "npm"
    directory: "/"
    schedule:
      interval: "weekly"
    open-pull-requests-limit: 5

Zero-Downtime Deployments

# Method 1: Rolling restart with multiple instances behind LB
# Deploy to instance 1, health check passes, deploy to instance 2...

# Method 2: Docker with Nginx upstream reload
# deploy.sh
docker pull myregistry/myapp:$NEW_VERSION
docker run -d --name myapp-new -p 3001:3000 myregistry/myapp:$NEW_VERSION

# Wait for health check (bounded: fail the deploy instead of hanging forever)
for i in $(seq 1 60); do
  curl -sf http://localhost:3001/health && break
  [ "$i" -eq 60 ] && { echo "new container never became healthy" >&2; exit 1; }
  sleep 1
done

# Switch Nginx upstream to the new port (simplified: in practice each deploy
# alternates between 3000 and 3001, so the sed pattern flips accordingly)
sed -i 's/127.0.0.1:3000/127.0.0.1:3001/' /etc/nginx/conf.d/upstream.conf
nginx -s reload

# Stop old container
docker stop myapp-old && docker rm myapp-old
docker rename myapp-new myapp-old
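
For reference, the `upstream.conf` the script rewrites might look like this (a sketch; the upstream name and `proxy_pass` wiring are assumptions):

```nginx
# /etc/nginx/conf.d/upstream.conf — the file deploy.sh rewrites with sed
upstream myapp {
    server 127.0.0.1:3000;
}

# Elsewhere, in the server block:
#   location / { proxy_pass http://myapp; }
```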

Rollback Procedures

# Docker rollback — instant (previous image still cached)
docker stop myapp
docker run -d --name myapp -p 3000:3000 myregistry/myapp:PREVIOUS_VERSION

# Kubernetes rollback
kubectl rollout undo deployment/myapp
kubectl rollout status deployment/myapp    # Watch progress

# Database rollback — this is the hard part
# Always make migrations reversible:
# - migration_001_add_column.up.sql
# - migration_001_add_column.down.sql
# Run: migrate down 1
Database migrations are the #1 cause of failed rollbacks. Rules:
• Never drop columns in the same deploy that removes the code using them
• Use expand-contract pattern: add new column → deploy code using both → remove old column
• Always write a down migration alongside every up migration, and test that it reverses cleanly
• Test migrations against a production-size dataset before deploying
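
A hypothetical expand-contract pair, in the up/down file naming used above (the table and column names are made up):

```sql
-- migration_002_add_email_verified.up.sql  (the "expand" step)
ALTER TABLE users ADD COLUMN email_verified boolean NOT NULL DEFAULT false;

-- migration_002_add_email_verified.down.sql  (reverses it exactly)
ALTER TABLE users DROP COLUMN email_verified;
```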

Disaster Recovery Plan

| Metric | Definition | Your Target |
| --- | --- | --- |
| RTO (Recovery Time Objective) | Max acceptable downtime | e.g., 1 hour |
| RPO (Recovery Point Objective) | Max acceptable data loss | e.g., 15 minutes |

# Disaster Recovery Runbook Template:

## Scenario: Complete server failure
1. Spin up new server from Terraform (5 min)
2. Restore latest database backup (10 min)
3. Deploy latest Docker image (2 min)
4. Update DNS to new server IP (5 min + propagation)
5. Verify application health
6. Notify stakeholders

## Scenario: Database corruption
1. Stop application (prevent further writes)
2. Identify last good backup
3. Restore to point-in-time (RDS: use PITR)
4. Verify data integrity
5. Restart application
6. Post-mortem: identify root cause

## Scenario: Security breach
1. Isolate affected systems (revoke access, block IPs)
2. Preserve evidence (don't destroy logs)
3. Rotate ALL credentials and secrets
4. Assess scope of breach
5. Patch vulnerability
6. Notify affected users (legal requirement in many jurisdictions)
7. Post-mortem and remediation plan

Runbooks & On-Call

Chapter 18 — Cost Management

Cloud bills can spiral out of control fast. Professional operations include cost awareness as a first-class concern.

Cloud Pricing Models

| Model | How It Works | Best For | Savings vs On-Demand |
| --- | --- | --- | --- |
| On-Demand | Pay per hour/second of use | Variable workloads, testing | 0% (baseline) |
| Reserved Instances | 1-3 year commitment, fixed rate | Steady-state production | 30-72% |
| Savings Plans | Commit to $/hr spend (flexible) | Predictable spend, flexible instances | 20-50% |
| Spot/Preemptible | Bid on unused capacity (can be terminated) | Batch jobs, CI runners, stateless workers | 60-90% |

Cost Optimization Strategies

  1. Right-sizing: Monitor actual CPU/memory usage. Most instances are over-provisioned. A t3.medium using 10% CPU should be a t3.small.
  2. Auto-scaling: Scale down during off-hours. Many apps have 10x traffic difference between peak and trough.
  3. Spot instances: Use for CI/CD runners, batch processing, and stateless workers. Save 60-90%.
  4. Reserved capacity: For databases and always-on servers, commit for 1-3 years.
  5. Storage tiering: Move old data to cheaper storage (S3 Glacier, cold storage).
  6. Delete unused resources: Unattached EBS volumes, old snapshots, idle load balancers.
  7. Use ARM instances: Graviton (AWS) / T2A (GCP) are 20% cheaper with same or better performance.

TCO Comparison: Self-Hosted vs Cloud

| Factor | VPS ($40/mo) | AWS (equivalent) | Self-Hosted |
| --- | --- | --- | --- |
| Compute | $40/mo | $70-150/mo | $500 one-time + power |
| Database | Included (self-managed) | $30-100/mo (RDS) | Included |
| Bandwidth | Usually generous | $0.09/GB out (adds up!) | ISP cost |
| Your time | Medium (manage server) | Low (managed services) | High (manage everything) |
| Scaling | Manual (resize/add VPS) | Automatic | Buy more hardware |
| Reliability | 99.9% SLA typical | 99.99% possible | Depends on you |

Billing Alerts

# AWS: Set up billing alarm via CLI
aws cloudwatch put-metric-alarm \
  --alarm-name "MonthlyBillingAlarm" \
  --alarm-description "Alert when bill exceeds $100" \
  --metric-name EstimatedCharges \
  --namespace AWS/Billing \
  --statistic Maximum \
  --period 21600 \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:billing-alerts \
  --dimensions Name=Currency,Value=USD

# Also set alerts at 50%, 80%, 100% of budget
# AWS Budgets is more powerful than CloudWatch for this
The hidden costs of cloud:
• Data transfer out: AWS charges $0.09/GB. A site serving 1TB/month = $90 just in bandwidth.
• NAT Gateway: $0.045/hr + $0.045/GB processed. Can easily be $30-100/mo.
• Load Balancer: $16-25/mo minimum even with zero traffic.
• Managed databases: 3-5x the cost of self-managed on a VPS.

Mitigation: Use Cloudflare (free CDN, absorbs bandwidth), minimize NAT Gateway usage, consider Hetzner/DO for predictable pricing.

Chapter 19 — Decision Framework

This chapter synthesizes everything into actionable decision-making tools. When you get a new project, use these frameworks to systematically choose the right architecture and hosting.

The Decision Flowchart

┌─────────────────────────┐ │ New Website Project │ └────────────┬────────────┘ │ ┌────────────▼────────────┐ │ Does it need a backend? │ └────────────┬────────────┘ │ │ NO YES │ │ ▼ ▼ ┌────────────┐ ┌──────────────────┐ │Static Host │ │ How many users? │ │(CDN/Netlify│ └────────┬─────────┘ │ CloudFlare)│ │ │ │ └────────────┘ <1000 1K-100K >100K │ │ │ ▼ ▼ ▼ ┌────────┐ ┌────────┐ ┌──────────┐ │Single │ │VPS or │ │Cloud IaaS│ │VPS or │ │PaaS │ │or K8s │ │PaaS │ │ │ │ │ └───┬────┘ └───┬────┘ └────┬─────┘ │ │ │ ┌─────────────▼──────────▼───────────▼──────────┐ │ What's your budget? │ └─────────────┬──────────┬───────────┬──────────┘ Minimal Medium Large │ │ │ ▼ ▼ ▼ ┌────────┐ ┌────────┐ ┌──────────┐ │Hetzner │ │DO/AWS │ │AWS/GCP │ │VPS + │ │+ managed│ │full stack│ │Docker │ │services│ │+ support │ └────────┘ └────────┘ └──────────┘

Scoring Matrix

Rate each factor 1-5 for your project, then match to the hosting type that scores highest:

| Factor | Static | VPS | PaaS | IaaS | K8s | Serverless |
| --- | --- | --- | --- | --- | --- | --- |
| Low budget priority | 5 | 4 | 3 | 2 | 1 | 4 |
| Need auto-scaling | 5 | 1 | 3 | 5 | 5 | 5 |
| Minimal ops work | 5 | 2 | 5 | 2 | 1 | 5 |
| Maximum control | 1 | 5 | 1 | 4 | 4 | 1 |
| Fast time-to-market | 5 | 3 | 5 | 2 | 1 | 4 |
| Compliance needs | 2 | 4 | 2 | 5 | 5 | 3 |
| Team > 5 devs | 3 | 2 | 3 | 4 | 5 | 3 |
| Variable traffic | 5 | 1 | 3 | 4 | 4 | 5 |
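
One way to use the matrix: weight each factor 1-5 for your project, multiply by each column, and sum. A toy sketch scoring just the VPS column (the weights describe a hypothetical project where budget dominates):

```shell
#!/bin/bash
set -euo pipefail

# VPS column from the matrix above, top to bottom
vps=(4 1 2 5 3 4 2 1)
# Example project weights for the same eight factors (budget matters most)
weights=(5 2 3 4 4 1 1 2)

total=0
for i in "${!vps[@]}"; do
  total=$(( total + vps[i] * weights[i] ))
done
echo "VPS weighted score: $total"   # → VPS weighted score: 68
```

Repeat per column and compare; the exercise matters more than the exact numbers, since it forces you to state what your project actually values.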

Migration Paths

Typical growth path: PaaS (Railway/Render) ──────────────────────────────────────────────▶ Time │ │ │ "We're paying $200/mo for what a $40 VPS could do" │ ▼ │ VPS + Docker Compose ───────────────────────────────────────────────▶ │ │ │ │ "We need auto-scaling, multiple services, zero-downtime deploys" │ ▼ │ Cloud IaaS (AWS/GCP) + Containers ──────────────────────────────────▶ │ │ │ │ "We have 10+ services, 5+ teams, complex networking" │ ▼ │ Managed Kubernetes ─────────────────────────────────────────────────▶ │ ▼

Key Decision Questions

Ask these for every project:

1. What's the expected traffic? (requests/sec, concurrent users)
2. What's the budget? (monthly hosting budget, one-time setup budget)
3. What's the team size? (who maintains this after launch?)
4. What's the SLA requirement? (99.9%? 99.99%? "best effort"?)
5. Are there compliance requirements? (GDPR, HIPAA, PCI-DSS, data residency)
6. How variable is the traffic? (steady vs. spiky vs. seasonal)
7. What's the time-to-market pressure? (launch in a week vs. 6 months)
8. What's the data sensitivity? (public content vs. financial/health data)
9. Do you need specific geographic presence? (latency requirements, legal)
10. What's the expected growth? (10x in a year? Stable? Unknown?)

Anti-Patterns to Avoid

Chapter 20 — Real-World Scenarios

Let's apply everything to five concrete projects, showing how the decision framework leads to different architectures.

Scenario 1: Personal Technical Blog

| Aspect | Decision |
| --- | --- |
| Traffic | ~1000 visitors/day, spikes when posts hit HN/Reddit |
| Budget | $0-20/month |
| Team | Just you |
| SLA | Best effort (downtime is annoying, not catastrophic) |

Architecture

┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ Markdown │────▶│ Hugo/Astro │────▶│ Cloudflare │ │ files in │ │ (build step)│ │ Pages (CDN) │ │ Git repo │ │ │ │ FREE │ └──────────────┘ └──────────────┘ └──────────────┘ │ GitHub Actions (auto-build on push)
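
The "auto-build on push" arrow could be a workflow like this; a sketch assuming Hugo and deployment to Cloudflare Pages via Wrangler (the action versions and project name are assumptions):

```yaml
# .github/workflows/deploy.yml
name: deploy-blog
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: peaceiris/actions-hugo@v3
      - run: hugo --minify
      - uses: cloudflare/wrangler-action@v3
        with:
          apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }}
          command: pages deploy public --project-name=my-blog
```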

Scenario 2: Startup SaaS MVP

| Aspect | Decision |
| --- | --- |
| Traffic | ~500 users, growing. API-heavy (user auth, CRUD, real-time) |
| Budget | $50-200/month |
| Team | 2-3 developers |
| SLA | 99.9% (paying customers) |

Architecture

┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ React SPA │────▶│ Vercel/ │ │ Railway │ │ (frontend) │ │ Cloudflare │ │ or Render │ └──────────────┘ │ Pages │ │ (backend) │ └──────────────┘ └──────┬───────┘ │ ┌───────▼───────┐ │ Managed │ │ PostgreSQL │ │ + Redis │ └───────────────┘

Scenario 3: E-Commerce Site

| Aspect | Decision |
| --- | --- |
| Traffic | ~10K daily visitors, 5x spikes during sales |
| Budget | $200-500/month |
| Team | 3-5 developers + 1 DevOps |
| SLA | 99.95% (downtime = lost revenue) |
| Compliance | PCI-DSS (handling payments) |

Architecture

┌───────────┐ ┌───────────┐ ┌───────────────────────────────┐ │CloudFront │───▶│ ALB │───▶│ ECS Fargate (containers) │ │ (CDN) │ │(HTTPS/LB) │ │ ┌─────────┐ ┌───────────┐ │ └───────────┘ └───────────┘ │ │ Web │ │ Worker │ │ │ │ (x2-4) │ │ (x1-2) │ │ │ └────┬────┘ └─────┬─────┘ │ └───────┼─────────────┼────────┘ │ │ ┌───────▼─────────────▼────────┐ │ RDS PostgreSQL (Multi-AZ) │ │ ElastiCache Redis │ │ S3 (product images) │ └──────────────────────────────┘

Scenario 4: Enterprise Internal Tool

| Aspect | Decision |
| --- | --- |
| Traffic | ~200 internal users, business hours only |
| Budget | $100-300/month |
| Team | 1-2 developers (part-time maintenance) |
| SLA | 99.5% (business hours) |
| Compliance | Data must stay in EU, SSO required |

Architecture

┌───────────────┐ ┌──────────────────────────────────┐ │ Corporate │────▶│ Hetzner VPS (EU) │ │ VPN / SSO │ │ ┌────────────┐ ┌────────────┐ │ │ (Okta/Azure │ │ │ Caddy │ │ App │ │ │ AD) │ │ │ (proxy) │──▶│ (Docker) │ │ └───────────────┘ │ └────────────┘ └─────┬──────┘ │ │ │ │ │ ┌──────▼──────┐ │ │ │ PostgreSQL │ │ │ │ (Docker) │ │ │ └─────────────┘ │ └──────────────────────────────────┘

Scenario 5: High-Traffic Media/Content Site

| Aspect | Decision |
| --- | --- |
| Traffic | ~1M daily visitors, global audience, viral spikes |
| Budget | $2000-10000/month |
| Team | 8-15 developers, 2-3 SRE/DevOps |
| SLA | 99.99% (ad revenue depends on uptime) |

Architecture

┌──────────────────────────────────────────────────────────────────┐ │ Cloudflare (CDN + WAF + DDoS) │ └────────────────────────────────┬─────────────────────────────────┘ │ ┌────────────────────────────────▼─────────────────────────────────┐ │ AWS Multi-Region │ │ │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │ │ │ EKS Cluster │ │ EKS Cluster │ │ Shared Services │ │ │ │ (Region 1) │ │ (Region 2) │ │ • S3 (media storage) │ │ │ │ │ │ │ │ • CloudFront (assets) │ │ │ │ Services: │ │ Services: │ │ • ElasticSearch │ │ │ │ • CMS API │ │ • CMS API │ │ • SQS (async jobs) │ │ │ │ • Auth │ │ • Auth │ │ • Lambda (image proc) │ │ │ │ • Search │ │ • Search │ └─────────────────────────┘ │ │ └──────┬───────┘ └──────┬───────┘ │ │ │ │ │ │ ┌──────▼───────┐ ┌──────▼───────┐ │ │ │ Aurora Global │ │ Aurora Read │ │ │ │ (Primary) │ │ (Replica) │ │ │ └──────────────┘ └──────────────┘ │ └───────────────────────────────────────────────────────────────────┘
Notice the pattern: Complexity scales with requirements, not ambition. The blog costs $0/mo and takes 30 minutes to set up. The media site costs $5000/mo and takes months to architect. Both are correct for their context. The worst mistake is using Scenario 5's architecture for Scenario 1's requirements.

Further Reading & Resources

Books

| Book | Author | Covers |
| --- | --- | --- |
| The Phoenix Project | Gene Kim | DevOps culture, IT operations (novel format) |
| Site Reliability Engineering | Google (free online) | SRE practices, monitoring, incident response |
| Infrastructure as Code | Kief Morris | IaC principles, patterns, practices |
| Terraform: Up & Running | Yevgeniy Brikman | Practical Terraform (updated regularly) |
| Docker Deep Dive | Nigel Poulton | Docker from basics to production |
| Kubernetes in Action | Marko Lukša | K8s concepts and hands-on |
| Web Scalability for Startup Engineers | Artur Ejsmont | Scaling web apps pragmatically |
| Designing Data-Intensive Applications | Martin Kleppmann | Distributed systems, databases, architecture |

Official Documentation

Free Courses & Tutorials

Tools Reference

| Category | Tools |
| --- | --- |
| Web Servers | Nginx, Caddy, Apache, Traefik |
| Containers | Docker, Podman, containerd |
| Orchestration | Kubernetes, Docker Swarm, Nomad |
| CI/CD | GitHub Actions, GitLab CI, Jenkins, CircleCI, ArgoCD |
| IaC | Terraform, Ansible, Pulumi, CloudFormation |
| Monitoring | Prometheus, Grafana, Datadog, New Relic |
| Logging | Loki, ELK Stack, Fluentd, Vector |
| Secrets | Vault, AWS Secrets Manager, SOPS, Doppler |
| DNS/CDN | Cloudflare, Route 53, Fastly |
| SSL | Let's Encrypt, Certbot, cert-manager (K8s) |

Communities

Knowledge Check

🧠 Test Your Understanding

Q1: A startup with 2 developers needs to launch an MVP in 2 weeks. They expect ~200 users initially. What's the best hosting choice?

Q2: Your website serves static HTML/CSS/JS with no backend logic. What's the most cost-effective and performant hosting?

Q3: What does the "3-2-1 backup rule" mean?

Q4: When should you consider migrating from PaaS to VPS/IaaS?

Q5: What is the primary purpose of a reverse proxy?