Website Architecture & Deployment
A Zero-to-Hero Guide for Backend Developers & System Administrators
Chapter 1 — The Big Picture
Before diving into specific technologies, you need a mental model of what happens when someone visits a website — and what your role is in making that happen reliably, securely, and at scale.
What Happens When a User Visits a Website
At a high level: the browser resolves the domain to an IP address via DNS, opens a TCP connection, negotiates TLS, and sends an HTTP request; a server (often behind a CDN and a load balancer) processes it and returns a response the browser renders. Every layer in that chain is something you can host, configure, and break.
The Roles in Professional Web Operations
| Role | Responsibility | Cares About |
|---|---|---|
| Frontend Developer | HTML, CSS, JavaScript, UI/UX | User experience, browser compatibility, performance |
| Backend Developer | APIs, business logic, databases | Data integrity, API design, scalability, security |
| DevOps / SRE | CI/CD, infrastructure, reliability | Uptime, deployment speed, automation, monitoring |
| System Administrator | Server management, networking, security | Patching, hardening, backups, capacity |
| Platform Engineer | Internal developer platforms, tooling | Developer productivity, self-service infrastructure |
Environments: Dev → Staging → Production
Professional deployments never go straight from a developer's laptop to users. There's a pipeline:
| Environment | Purpose | Who Uses It | Data |
|---|---|---|---|
| Local / Dev | Active development, debugging | Individual developer | Fake/seed data |
| CI | Automated testing on every commit | Machines (automated) | Test fixtures |
| Staging | Pre-production validation, QA | QA team, stakeholders | Production-like (anonymized) |
| Production | Real users, real data | Everyone (end users) | Real data |
What "Deploying a Website" Actually Means
Deployment is not just "putting files on a server." It's a repeatable, automated process that includes:
- Build — Compile code, bundle assets, run optimizations
- Test — Unit tests, integration tests, security scans
- Package — Create a deployable artifact (Docker image, binary, archive)
- Deploy — Push artifact to target environment
- Verify — Health checks, smoke tests, monitoring
- Rollback plan — If something breaks, revert instantly
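On a single server, the deploy and rollback steps above can be sketched with nothing but a release directory and a symlink flip. This is a minimal illustration — the layout under `APP_ROOT` and the tarball artifact are assumptions for the example, not any specific platform's convention:

```shell
#!/usr/bin/env bash
# Minimal atomic-release sketch: each deploy unpacks into releases/<id>,
# then a symlink flip makes it live; rollback is another symlink flip.
set -euo pipefail

APP_ROOT="${APP_ROOT:-/srv/myapp}"   # hypothetical layout

deploy() {
  local release_id="$1" artifact="$2"
  local target="$APP_ROOT/releases/$release_id"
  mkdir -p "$target"
  tar -xzf "$artifact" -C "$target"           # Package -> unpack artifact
  ln -sfn "$target" "$APP_ROOT/current.tmp"   # Deploy -> atomic symlink swap
  mv -T "$APP_ROOT/current.tmp" "$APP_ROOT/current"
}

rollback() {
  # Point 'current' back at the second-newest release directory
  local prev
  prev=$(ls -1t "$APP_ROOT/releases" | sed -n '2p')
  [ -n "$prev" ] || { echo "no previous release" >&2; return 1; }
  ln -sfn "$APP_ROOT/releases/$prev" "$APP_ROOT/current"
}
```

Because the symlink swap is a single atomic rename, the web server never serves a half-copied release, and rollback is instant.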
The Infrastructure Stack
A typical production request passes through several layers: DNS → CDN → load balancer → web server / reverse proxy → application server → database and cache. The chapters that follow cover where each of those layers can live and who manages it.
Chapter 2 — Website Architecture Patterns
Before choosing where to host, you need to understand what you're hosting. The architecture pattern determines your infrastructure requirements.
Static Sites
Pre-built HTML/CSS/JS files served as-is. No server-side processing per request.
- How it works: Build step generates HTML files → upload to web server or CDN → served directly
- Tools: Hugo, Jekyll, Eleventy, Astro (static mode), Next.js (export)
- Examples: Documentation sites, blogs, portfolios, landing pages
- Hosting: Cheapest option — CDN, S3+CloudFront, Netlify, GitHub Pages
# Example: Build and deploy a Hugo static site
hugo # Generates ./public/ directory
aws s3 sync ./public s3://my-bucket --delete
aws cloudfront create-invalidation --distribution-id EXXX --paths "/*"
Server-Side Rendering (SSR)
Server generates HTML on every request. The traditional model (PHP, Rails, Django, Express with templates).
- How it works: Request → server processes → queries DB → renders HTML → sends response
- Tools: Next.js (SSR mode), Nuxt.js, Django, Rails, Laravel, Express+EJS
- Examples: E-commerce product pages, news sites, dashboards
- Hosting: Needs a running server process — VPS, PaaS, containers
Single-Page Applications (SPA)
One HTML file + JavaScript bundle. All rendering happens in the browser. Backend is a separate API.
- How it works: Browser loads JS bundle → JS fetches data from API → renders UI client-side
- Tools: React, Vue, Angular, Svelte
- Examples: Gmail, Trello, Figma
- Hosting: Static files (CDN) for frontend + API server for backend
JAMstack (JavaScript, APIs, Markup)
Pre-rendered static pages enhanced with JavaScript calling APIs at runtime. Best of both worlds.
- How it works: Build-time rendering + client-side API calls for dynamic content
- Tools: Next.js (ISR/SSG), Gatsby, Astro
- Examples: Marketing sites with dynamic forms, blogs with comments
- Hosting: CDN for pages + serverless functions or API for dynamic parts
Monolithic Architecture
Single deployable unit containing all functionality. Frontend, backend, and data access in one codebase.
- How it works: One application handles everything — routing, business logic, data, rendering
- Tools: Rails, Django, Laravel, Spring Boot, ASP.NET
- Pros: Simple to develop, test, deploy. One thing to monitor.
- Cons: Scales as a unit (can't scale just the hot path). Deployment = all or nothing.
Service-Oriented Architecture (SOA) / Microservices
Application split into independent services communicating over network (HTTP/gRPC/message queues).
- How it works: Each service owns its domain, data, and deployment lifecycle
- Tools: Any language per service. Kubernetes for orchestration. Service mesh (Istio, Linkerd).
- Pros: Independent scaling, independent deployment, technology diversity
- Cons: Network complexity, distributed debugging, operational overhead
Comparison Table
| Pattern | Server Needed | Scalability | Complexity | SEO | Best For |
|---|---|---|---|---|---|
| Static | No (CDN) | ★★★★★ | ★☆☆☆☆ | ★★★★★ | Content sites, docs |
| SSR | Yes | ★★★☆☆ | ★★★☆☆ | ★★★★★ | Dynamic content + SEO |
| SPA | API only | ★★★★☆ | ★★★☆☆ | ★★☆☆☆ | App-like experiences |
| JAMstack | Partial | ★★★★★ | ★★★☆☆ | ★★★★★ | Content + interactivity |
| Monolith | Yes | ★★★☆☆ | ★★☆☆☆ | Varies | MVPs, small-medium apps |
| Microservices | Yes (many) | ★★★★★ | ★★★★★ | Varies | Large teams, high scale |
• Content-heavy, rarely changes? → Static / JAMstack
• Need SEO + dynamic data? → SSR
• Rich interactive app (logged-in users)? → SPA + API
• Small team, getting started? → Monolith
• Large team, proven bounded contexts, high scale? → Microservices
Chapter 3 — Hosting Taxonomy
This is the complete landscape of where your website can live. Each option trades control for convenience at a different point.
The Control vs. Convenience Spectrum
Colocation and dedicated servers sit at the maximum-control end; serverless and static hosting sit at the maximum-convenience end. Every option in the table below is a different point on that line.
Complete Hosting Types
| Type | What You Get | Cost Range | Control | Complexity | Best For |
|---|---|---|---|---|---|
| Shared Hosting | Space on a shared server (cPanel) | $3-15/mo | ★☆☆☆☆ | ★☆☆☆☆ | WordPress blogs, tiny sites |
| VPS | Virtual machine, root access | $5-80/mo | ★★★★☆ | ★★★☆☆ | Most web apps, APIs |
| Dedicated Server | Entire physical server rented | $80-500/mo | ★★★★★ | ★★★★☆ | High-performance, compliance |
| Colocation | Your hardware in their data center | $200-2000/mo | ★★★★★ | ★★★★★ | Maximum control, large scale |
| IaaS | Cloud VMs + managed services | Pay-per-use | ★★★★☆ | ★★★★☆ | Variable workloads, scaling |
| PaaS | Managed platform, push code | $5-500/mo | ★★☆☆☆ | ★★☆☆☆ | Startups, rapid deployment |
| Serverless/FaaS | Functions triggered by events | Pay-per-invocation | ★☆☆☆☆ | ★★☆☆☆ | APIs, event processing |
| Static Hosting | CDN-served static files | $0-20/mo | ★☆☆☆☆ | ★☆☆☆☆ | Static sites, SPAs |
| Managed K8s | Kubernetes cluster (managed control plane) | $70-1000+/mo | ★★★★☆ | ★★★★★ | Microservices at scale |
Shared Hosting — The Beginner Trap
Shared hosting (GoDaddy, Bluehost, Hostinger) puts hundreds of sites on one server. You get a cPanel interface, FTP access, and PHP. That's it.
- Pros: Cheapest, zero server management, one-click WordPress
- Cons: No root access, limited languages (usually PHP only), noisy neighbors affect performance, can't install custom software, terrible for anything beyond WordPress
When to Use What — Quick Reference
Simple web app, small team? → VPS (DigitalOcean, Hetzner) or PaaS (Railway, Render)
Need auto-scaling, variable traffic? → IaaS (AWS, GCP) or containers
Microservices, large team? → Managed Kubernetes (EKS, GKE)
Event-driven, sporadic traffic? → Serverless (Lambda, Workers)
Compliance/performance requirements? → Dedicated or colocation
Learning / side project? → VPS ($5/mo) — best bang for learning
Chapter 4 — Self-Hosting (Bare Metal / Home Server)
Self-hosting means running a web server on hardware you physically control — a spare PC, a Raspberry Pi, or a rack server in your closet. It's the most educational option and gives maximum control.
When Self-Hosting Makes Sense
- Learning and experimentation (best way to understand the full stack)
- Internal tools (home automation, media server, dev environments)
- Data sovereignty requirements (data never leaves your premises)
- One-time cost preference over recurring cloud bills
Setting Up Nginx on Linux
# Install Nginx (Ubuntu/Debian)
sudo apt update && sudo apt install -y nginx
# Start and enable
sudo systemctl start nginx
sudo systemctl enable nginx
# Verify it's running
curl http://localhost
# Should return the Nginx welcome page HTML
Virtual Hosts (Serving Multiple Sites)
# /etc/nginx/sites-available/mysite.conf
server {
    listen 80;
    server_name mysite.example.com;
    root /var/www/mysite;
    index index.html;
    location / {
        try_files $uri $uri/ =404;
    }
    # For a backend app (reverse proxy)
    location /api/ {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
# Enable the site
sudo ln -s /etc/nginx/sites-available/mysite.conf /etc/nginx/sites-enabled/
sudo nginx -t # Test configuration
sudo systemctl reload nginx
Dynamic DNS (Solving the Dynamic IP Problem)
Home internet usually gives you a dynamic IP that changes periodically. Dynamic DNS services map a hostname to your current IP.
# Using ddclient with Cloudflare (install: sudo apt install ddclient)
# /etc/ddclient.conf
protocol=cloudflare
zone=example.com
login=your-email@example.com
password=your-cloudflare-api-token
use=web, web=https://api.ipify.org
mysite.example.com
# Or use a cron job with curl
*/5 * * * * curl -s "https://api.cloudflare.com/client/v4/zones/ZONE_ID/dns_records/RECORD_ID" \
-X PATCH \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
--data "{\"content\":\"$(curl -s https://api.ipify.org)\"}"
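Whichever client you use, the core logic is "update only when the IP actually changed" — calling the API every five minutes with the same IP is wasteful and can hit rate limits. A minimal sketch; `get_ip` and `update_dns` are placeholders to wire up to ipify and your DNS provider's API (e.g. the Cloudflare call above):

```shell
#!/usr/bin/env bash
# Only hit the DNS API when the public IP changed since the last run.
# get_ip / update_dns are placeholders; GET_IP_CMD / UPDATE_CMD exist
# so the logic can be exercised without network access.
set -euo pipefail

CACHE_FILE="${CACHE_FILE:-/var/cache/ddns-last-ip}"

get_ip() { curl -fs https://api.ipify.org; }      # placeholder source
update_dns() { echo "would update DNS to $1"; }   # placeholder action

sync_ip() {
  local current cached=""
  current=$("${GET_IP_CMD:-get_ip}")
  if [ -f "$CACHE_FILE" ]; then cached=$(cat "$CACHE_FILE"); fi
  if [ "$current" != "$cached" ]; then
    # Cache the new IP only after the update succeeded
    "${UPDATE_CMD:-update_dns}" "$current" && echo "$current" > "$CACHE_FILE"
  fi
}
```

Run `sync_ip` from the same `*/5` cron schedule; most runs become a no-op.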
Port Forwarding
Your router's NAT blocks incoming connections. You need to forward ports 80 (HTTP) and 443 (HTTPS) to your server's local IP.
- Give your server a static local IP (e.g., 192.168.1.100) via DHCP reservation
- In your router admin panel: forward external port 80 → 192.168.1.100:80
- Forward external port 443 → 192.168.1.100:443
- Test from outside your network (use your phone on mobile data)
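A quick way to check a forwarded port without installing anything is bash's built-in `/dev/tcp` pseudo-device. A sketch — and remember to run it from outside your network, since NAT hairpinning often gives a false negative from inside:

```shell
#!/usr/bin/env bash
# Returns success if the TCP port accepts connections, using bash's
# /dev/tcp pseudo-device (no netcat required); 3s connect timeout.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Example:
# port_open example.com 443 && echo reachable || echo blocked
```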
Let's Encrypt (Free SSL/TLS)
# Install Certbot
sudo apt install -y certbot python3-certbot-nginx
# Obtain certificate (Nginx plugin auto-configures)
sudo certbot --nginx -d mysite.example.com
# Auto-renewal is set up automatically via systemd timer
sudo systemctl status certbot.timer
# Manual renewal test
sudo certbot renew --dry-run
Running Your App as a systemd Service
# /etc/systemd/system/myapp.service
[Unit]
Description=My Web Application
After=network.target
[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/node /opt/myapp/server.js
Restart=always
RestartSec=5
Environment=NODE_ENV=production
Environment=PORT=3000
# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/myapp/data
[Install]
WantedBy=multi-user.target
# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable myapp
sudo systemctl start myapp
sudo systemctl status myapp # Check it's running
sudo journalctl -u myapp -f # View logs
Chapter 5 — VPS & Dedicated Servers
A VPS (Virtual Private Server) is the workhorse of professional web hosting. You get a virtual machine with root access, a public IP, and full control — without managing physical hardware.
Provider Comparison
| Provider | Cheapest VPS | Data Centers | Strengths | Best For |
|---|---|---|---|---|
| DigitalOcean | $4/mo (512MB) | 14 regions | Simple UI, great docs, managed DBs | Startups, learning |
| Hetzner | €3.79/mo (2GB) | EU + US | Best price/performance ratio | Price-conscious, EU hosting |
| Linode (Akamai) | $5/mo (1GB) | 11 regions | Reliable, good support | General purpose |
| Vultr | $2.50/mo (512MB) | 32 locations | Most locations, bare metal option | Edge deployments |
| OVH | €3.50/mo (2GB) | EU focused | Cheap dedicated servers too | EU, budget dedicated |
Initial Server Setup (The First 10 Minutes)
Every new VPS should go through this hardening process before deploying anything:
# 1. Connect as root (first time only)
ssh root@YOUR_SERVER_IP
# 2. Update the system
apt update && apt upgrade -y
# 3. Create a non-root user
adduser deploy
usermod -aG sudo deploy
# 4. Set up SSH key authentication for the new user
mkdir -p /home/deploy/.ssh
cp ~/.ssh/authorized_keys /home/deploy/.ssh/
chown -R deploy:deploy /home/deploy/.ssh
chmod 700 /home/deploy/.ssh
chmod 600 /home/deploy/.ssh/authorized_keys
# 5. Harden SSH - edit /etc/ssh/sshd_config
# Keep this session open and confirm a NEW connection works before
# logging out - a typo here can lock you out of the server
sudo sed -i 's/#PermitRootLogin yes/PermitRootLogin no/' /etc/ssh/sshd_config
sudo sed -i 's/#PasswordAuthentication yes/PasswordAuthentication no/' /etc/ssh/sshd_config
sudo sed -i 's/#Port 22/Port 2222/' /etc/ssh/sshd_config
grep -E '^(PermitRootLogin|PasswordAuthentication|Port)' /etc/ssh/sshd_config # verify the edits took
sudo systemctl restart sshd
# 6. Set up firewall
sudo apt install -y ufw
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 2222/tcp # SSH (custom port)
sudo ufw allow 80/tcp # HTTP
sudo ufw allow 443/tcp # HTTPS
sudo ufw enable
# 7. Install fail2ban (brute-force protection)
sudo apt install -y fail2ban
sudo systemctl enable fail2ban
# 8. Set up automatic security updates
sudo apt install -y unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades
Deploying an Application
# On your server (as deploy user):
# Install your runtime (example: Node.js via nvm)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.0/install.sh | bash
source ~/.bashrc
nvm install --lts
# Clone your application
cd /opt
sudo mkdir myapp && sudo chown deploy:deploy myapp
git clone git@github.com:you/myapp.git /opt/myapp
cd /opt/myapp
npm ci --omit=dev # clean install from the lockfile, no devDependencies
# Set up environment variables
sudo cp .env.example /etc/myapp.env
sudo chmod 600 /etc/myapp.env
# Edit with your production values
# Create systemd service (as shown in Chapter 4)
# Set up Nginx reverse proxy (as shown in Chapter 4)
# Obtain SSL certificate with Certbot
Deployment Strategies for VPS
Option A: Git Pull (Simple)
# On server:
cd /opt/myapp
git pull origin main
npm ci --omit=dev
sudo systemctl restart myapp
Option B: rsync (No Git on Server)
# From your local machine:
rsync -avz --delete \
--exclude='node_modules' \
--exclude='.env' \
./dist/ deploy@server:/opt/myapp/
ssh deploy@server 'cd /opt/myapp && npm ci --omit=dev && sudo systemctl restart myapp'
Option C: Docker (Recommended for Production)
# Build locally or in CI, push to registry
docker build -t myregistry/myapp:v1.2.3 .
docker push myregistry/myapp:v1.2.3
# On server:
docker pull myregistry/myapp:v1.2.3
docker stop myapp && docker rm myapp
docker run -d --name myapp --restart unless-stopped -p 3000:3000 --env-file /etc/myapp.env myregistry/myapp:v1.2.3
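A deploy is only done when the new container actually serves traffic. Here is a hedged sketch wrapping the pull/stop/run cycle with a health gate — the container name, port, and `/health` endpoint are assumptions carried over from the examples above, and `DOCKER`/`HEALTH_CHECK` are overridable so the flow can be dry-run:

```shell
#!/usr/bin/env bash
# Pull/replace/run with a post-deploy health gate. DOCKER and
# HEALTH_CHECK are overridable (useful for testing the flow);
# the /health endpoint is an assumption about your app.
set -euo pipefail

DOCKER="${DOCKER:-docker}"
HEALTH_CHECK="${HEALTH_CHECK:-curl -fs http://localhost:3000/health}"

deploy_image() {
  local image="$1"
  "$DOCKER" pull "$image"
  "$DOCKER" rm -f myapp 2>/dev/null || true
  "$DOCKER" run -d --name myapp --restart unless-stopped \
    -p 3000:3000 --env-file /etc/myapp.env "$image"
  # Give the app a moment, then verify it actually serves traffic
  for _ in 1 2 3 4 5; do
    sleep 2
    if $HEALTH_CHECK >/dev/null 2>&1; then
      echo "deploy ok: $image"
      return 0
    fi
  done
  echo "health check failed for $image" >&2
  return 1
}
```

If the health check never passes, the script exits nonzero — your cue to re-run with the previous tag.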
Process Managers
| Tool | Language | Features | When to Use |
|---|---|---|---|
| systemd | Any | Built into Linux, restart policies, logging | Always (it's already there) |
| PM2 | Node.js | Cluster mode, zero-downtime reload, monitoring | Node.js apps without Docker |
| Supervisor | Any | Simple config, process groups | Legacy systems, multiple processes |
Chapter 6 — Platform-as-a-Service (PaaS)
PaaS abstracts away the server entirely. You push code, the platform handles building, deploying, scaling, SSL, and infrastructure. You focus purely on your application.
What PaaS Manages For You
Build pipelines, runtime patching, TLS certificates, load balancing, scaling, log aggregation, and rollbacks — everything below your application code.
Provider Comparison
| Platform | Free Tier | Paid From | Strengths | Limitations |
|---|---|---|---|---|
| Railway | $5 credit/mo | $5/mo | Modern, fast deploys, good DX | Newer, smaller community |
| Render | Static free, services spin down | $7/mo | Heroku alternative, auto-deploy | Cold starts on free tier |
| Fly.io | 3 shared VMs free | Pay-per-use | Edge deployment, Docker-native | More complex than others |
| Heroku | None (removed) | $5/mo | Pioneer, huge ecosystem | Expensive at scale, aging |
| Google App Engine | Limited free | Pay-per-use | Google infrastructure, auto-scale | Vendor lock-in |
| Azure App Service | Limited free | ~$13/mo | Enterprise, .NET native | Complex pricing |
Example: Deploying to Railway
# Your project needs:
# 1. A start command (in package.json, Procfile, or Dockerfile)
# 2. Listen on the PORT environment variable
# package.json
{
"scripts": {
"start": "node server.js"
}
}
# server.js — must use process.env.PORT
const port = process.env.PORT || 3000;
app.listen(port, '0.0.0.0');
# Deploy:
# Option A: Connect GitHub repo in Railway dashboard (auto-deploys on push)
# Option B: Railway CLI
npm install -g @railway/cli
railway login
railway init
railway up
Procfile (Heroku-style Process Declaration)
# Procfile — tells the platform which processes to run
# (the release process runs once before each deploy)
web: node server.js
worker: node worker.js
release: node migrate.js
When PaaS is the Right Choice
• Small team (1-5 devs) that wants to focus on product, not infrastructure
• Predictable, moderate traffic (not massive spikes)
• Standard web app (HTTP server + database)
• Fast iteration speed matters more than cost optimization
• You don't need custom system-level software
Avoid PaaS when:
• Cost-sensitive at scale (PaaS markup is 3-10x vs raw compute)
• Need custom networking, kernel modules, or system packages
• Compliance requires specific infrastructure control
• Traffic is highly variable (serverless may be cheaper)
• You need persistent local storage or specific hardware
The PaaS Cost Trap
PaaS is cheap to start but expensive to scale. A $7/mo Render service running a Node.js app is great. But when you need 4 instances + a managed database + Redis + background workers, you're suddenly paying $200/mo for what a $40/mo VPS could handle.
Chapter 7 — Infrastructure-as-a-Service (IaaS)
IaaS gives you virtual machines in the cloud with pay-per-use pricing, elastic scaling, and a massive ecosystem of managed services around them. This is where most professional production workloads live.
Core Concepts
Instances (Virtual Machines)
A cloud VM with configurable CPU, RAM, storage, and networking. You choose the OS, install what you want, and pay by the hour/second.
AMIs / Images
Pre-configured OS snapshots. You can use official images (Ubuntu 22.04) or create custom ones with your software pre-installed (golden images).
Security Groups
Virtual firewalls controlling inbound/outbound traffic to your instances. Stateful — if you allow inbound on port 443, the response traffic is automatically allowed.
VPC (Virtual Private Cloud)
Your own isolated network in the cloud. You define subnets (public/private), route tables, and internet gateways. Instances in private subnets can't be reached from the internet directly.
Hands-On: Launching an EC2 Instance (AWS CLI)
# Prerequisites: AWS CLI installed, credentials configured
# aws configure (set access key, secret, region)
# 1. Create a key pair for SSH access
aws ec2 create-key-pair --key-name myapp-key --query 'KeyMaterial' \
--output text > ~/.ssh/myapp-key.pem
chmod 400 ~/.ssh/myapp-key.pem
# 2. Create a security group
aws ec2 create-security-group \
--group-name myapp-sg \
--description "Web server security group"
# Allow SSH, HTTP, HTTPS
# (in production, restrict SSH to your own IP, e.g. --cidr 203.0.113.5/32)
aws ec2 authorize-security-group-ingress --group-name myapp-sg \
--protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name myapp-sg \
--protocol tcp --port 80 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name myapp-sg \
--protocol tcp --port 443 --cidr 0.0.0.0/0
# 3. Launch the instance
# (AMI IDs are region-specific — look up a current Ubuntu AMI ID for
# your region; the one below is only a placeholder)
aws ec2 run-instances \
--image-id ami-0c55b159cbfafe1f0 \
--instance-type t3.micro \
--key-name myapp-key \
--security-groups myapp-sg \
--count 1 \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=myapp-web}]'
# 4. Get the public IP
aws ec2 describe-instances \
--filters "Name=tag:Name,Values=myapp-web" \
--query 'Reservations[0].Instances[0].PublicIpAddress' --output text
# 5. SSH in ('ubuntu' for Ubuntu AMIs, 'ec2-user' for Amazon Linux)
ssh -i ~/.ssh/myapp-key.pem ubuntu@INSTANCE_IP
Instance Types (AWS Example)
| Family | Optimized For | Example | Use Case |
|---|---|---|---|
| t3/t4g | Burstable general purpose | t3.micro (2 vCPU, 1GB) | Low-traffic web apps, dev |
| m6i/m7g | Balanced compute/memory | m6i.large (2 vCPU, 8GB) | General web apps |
| c6i/c7g | Compute-intensive | c6i.xlarge (4 vCPU, 8GB) | API servers, batch processing |
| r6i/r7g | Memory-intensive | r6i.large (2 vCPU, 16GB) | Caching, in-memory DBs |
| Graviton (g suffix) | ARM-based, 20% cheaper | t4g.micro | Everything (if your app supports ARM) |
Auto Scaling
Auto Scaling automatically adjusts the number of instances based on demand:
# Conceptual flow:
# 1. Create a Launch Template (defines instance config)
# 2. Create an Auto Scaling Group (min/max/desired instances)
# 3. Attach scaling policies (CPU > 70% → add instance)
# 4. Attach to a Load Balancer (distributes traffic)
# CloudWatch alarm triggers scaling:
# CPU > 70% for 5 minutes → scale out (add instances)
# CPU < 30% for 10 minutes → scale in (remove instances)
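Stripped of the AWS machinery, the alarm-driven policy above is plain threshold logic. A sketch using the same 70%/30% thresholds; the min/max instance bounds are illustrative:

```shell
#!/usr/bin/env bash
# The scaling decision as threshold logic: given current CPU% and
# instance count, emit the action an Auto Scaling policy would take.
# Thresholds mirror the example alarms (70% out, 30% in).
scale_decision() {
  local cpu="$1" count="$2" min="${3:-2}" max="${4:-10}"
  if   [ "$cpu" -gt 70 ] && [ "$count" -lt "$max" ]; then echo "scale-out"
  elif [ "$cpu" -lt 30 ] && [ "$count" -gt "$min" ]; then echo "scale-in"
  else echo "steady"
  fi
}

# scale_decision 85 4   → scale-out
# scale_decision 20 4   → scale-in
# scale_decision 50 4   → steady
```

The real service adds what this sketch omits: evaluation periods (sustained load, not a spike) and cooldowns (don't flap between out and in).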
Chapter 8 — Containers & Orchestration
Containers package your application with all its dependencies into a portable, reproducible unit. This solves "works on my machine" permanently.
Why Containers
- Reproducibility: Same image runs identically everywhere
- Isolation: Each container has its own filesystem, network, processes
- Efficiency: Lighter than VMs (shared kernel, no guest OS)
- Speed: Start in seconds, not minutes
- Immutability: Never patch a running container — replace it
Dockerfile Best Practices
# Multi-stage build — keeps final image small
# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies so only runtime deps reach the final stage
RUN npm prune --omit=dev
# Stage 2: Production image
FROM node:20-alpine
WORKDIR /app
# Don't run as root
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
# Copy only what's needed
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]
• Use specific base image tags (node:20-alpine, not node:latest)
• Multi-stage builds to minimize image size
• Copy package.json first (layer caching for dependencies)
• Run as non-root user
• Add HEALTHCHECK for orchestrators
• Use .dockerignore to exclude node_modules, .git, etc.
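A starting `.dockerignore` for a Node project might look like this (entries are typical, not mandated — keep anything your build actually needs in the context):

```
# .dockerignore — keep the build context small and secrets out of images
node_modules
.git
.env
dist
npm-debug.log
*.md
Dockerfile
docker-compose.yml
```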
Docker Compose (Multi-Service Apps)
# docker-compose.yml — typical web app stack
services:
  app:
    build: .
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgres://user:pass@db:5432/myapp
      - REDIS_URL=redis://cache:6379
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_started
    restart: unless-stopped
  db:
    image: postgres:16-alpine
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: myapp
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user -d myapp"]
      interval: 5s
      timeout: 3s
      retries: 5
  cache:
    image: redis:7-alpine
    restart: unless-stopped
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      - app
volumes:
  pgdata:
# Commands
docker compose up -d # Start all services
docker compose logs -f app # Follow app logs
docker compose ps # Status of all services
docker compose down # Stop and remove
docker compose up -d --build # Rebuild and restart
Container Registries
| Registry | Free Tier | Best For |
|---|---|---|
| Docker Hub | 1 private repo | Public images, open source |
| GitHub Container Registry (ghcr.io) | Generous free | GitHub-based projects |
| AWS ECR | 500MB free | AWS deployments |
| Google Artifact Registry | 500MB free | GCP deployments |
Kubernetes — When You Need Orchestration
Kubernetes (K8s) manages containers at scale: scheduling, scaling, self-healing, service discovery, rolling updates.
Core Kubernetes Concepts
# Pod — smallest deployable unit (one or more containers)
# Deployment — manages replica sets, rolling updates
# Service — stable network endpoint for pods
# Ingress — HTTP routing from outside the cluster
# Example: Deployment + Service + Ingress
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: ghcr.io/you/myapp:v1.2.3
          ports:
            - containerPort: 3000
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-svc
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 3000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - myapp.example.com
      secretName: myapp-tls
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-svc
                port:
                  number: 80
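To make the Deployment above scale itself, you can add a HorizontalPodAutoscaler. A sketch assuming the manifests above and a cluster with metrics-server installed:

```yaml
# HorizontalPodAutoscaler — scales the Deployment above between 3 and 10
# replicas based on average CPU; the bounds and target are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Note the CPU `requests` set on the container matter here: utilization is measured against the request, not the limit.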
Managed Kubernetes Services
| Service | Provider | Starting Cost | Notes |
|---|---|---|---|
| EKS | AWS | $0.10/hr control plane + nodes | Most popular, complex |
| GKE | Google Cloud | 1 free zonal cluster + nodes | Best K8s experience (Google created K8s) |
| AKS | Azure | Free control plane + nodes | Good for Microsoft shops |
| DOKS | DigitalOcean | Free control plane + $12/node | Simplest managed K8s |
Chapter 9 — Serverless & Edge
Serverless means you write functions, not servers. The cloud provider handles all infrastructure — you pay only when your code runs.
How Serverless Works
An event — an HTTP request, a queue message, a file upload — triggers the platform to spin up an isolated instance of your function, run the handler, and bill for the milliseconds it ran. Idle functions cost nothing; scaling up and down is the platform's problem, not yours.
AWS Lambda Example
// handler.js — AWS Lambda function
exports.handler = async (event) => {
  const body = JSON.parse(event.body || '{}');
  // Your business logic here
  const result = await processRequest(body);
  return {
    statusCode: 200,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(result)
  };
};
// Deploy with the Serverless Framework (AWS SAM is the AWS-native alternative):
# serverless.yml
service: myapi
provider:
  name: aws
  runtime: nodejs20.x
  region: eu-west-1
functions:
  api:
    handler: handler.handler
    memorySize: 256
    timeout: 10
    events:
      - httpApi:
          path: /api/{proxy+}
          method: ANY
Cloudflare Workers (Edge Computing)
// worker.js — runs at 300+ edge locations worldwide
export default {
  async fetch(request, env) {
    const url = new URL(request.url);
    if (url.pathname === '/api/hello') {
      return new Response(JSON.stringify({ message: 'Hello from the edge!' }), {
        headers: { 'Content-Type': 'application/json' }
      });
    }
    // Proxy to origin for other routes
    return fetch(request);
  }
};
Serverless Comparison
| Platform | Cold Start | Max Duration | Free Tier | Best For |
|---|---|---|---|---|
| AWS Lambda | 100-500ms | 15 min | 1M requests/mo | Full backend APIs, event processing |
| Cloudflare Workers | ~0ms (no cold start) | 10ms CPU (free), 30s CPU (paid) | 100K requests/day | Edge logic, fast APIs |
| Vercel Functions | ~250ms | 10-60s | 100GB-hrs/mo | Next.js apps, frontend teams |
| Netlify Functions | ~200ms | 10-26s | 125K requests/mo | JAMstack backends |
| Google Cloud Functions | 100-400ms | 9-60 min | 2M invocations/mo | GCP ecosystem, event-driven |
When Serverless Fits
• Traffic is sporadic/unpredictable (pay-per-use saves money)
• Individual requests are short-lived (<30s)
• You want zero infrastructure management
• Event-driven workloads (file uploads, webhooks, scheduled tasks)
• API endpoints with variable traffic
Avoid serverless when:
• Consistent high traffic (a server is cheaper)
• Long-running processes (video encoding, ML training)
• WebSocket connections needed
• You need local filesystem or persistent state
• Cold starts are unacceptable (real-time systems)
Edge Computing
Edge computing runs your code at CDN points-of-presence (PoPs) close to users, reducing latency from ~100ms to ~10ms.
- Cloudflare Workers: 300+ locations, V8 isolates (not containers), near-zero cold start
- AWS CloudFront Functions: Lightweight, runs at CloudFront edge locations
- Deno Deploy: 35+ regions, TypeScript-native
- Vercel Edge Functions: Built on Cloudflare Workers
Chapter 10 — Cloud Providers Deep Dive
The "Big Three" (AWS, GCP, Azure) plus strong alternatives. Understanding their service ecosystems lets you pick the right provider and avoid vendor lock-in traps.
Service Mapping Across Providers
| Category | AWS | GCP | Azure | DigitalOcean |
|---|---|---|---|---|
| Compute (VMs) | EC2 | Compute Engine | Virtual Machines | Droplets |
| Containers | ECS / EKS | Cloud Run / GKE | ACI / AKS | DOKS / App Platform |
| Serverless | Lambda | Cloud Functions | Azure Functions | Functions (beta) |
| Object Storage | S3 | Cloud Storage | Blob Storage | Spaces |
| SQL Database | RDS / Aurora | Cloud SQL / Spanner | Azure SQL | Managed Databases |
| NoSQL | DynamoDB | Firestore / Bigtable | Cosmos DB | MongoDB (managed) |
| CDN | CloudFront | Cloud CDN | Azure CDN | Spaces CDN |
| DNS | Route 53 | Cloud DNS | Azure DNS | DNS (basic) |
| Load Balancer | ALB / NLB | Cloud Load Balancing | Azure LB | Load Balancers |
| Secrets | Secrets Manager | Secret Manager | Key Vault | — |
| Monitoring | CloudWatch | Cloud Monitoring | Azure Monitor | Built-in metrics |
| IaC | CloudFormation | Deployment Manager | ARM / Bicep | Terraform only |
Pricing Models
| Model | Description | Savings | Commitment |
|---|---|---|---|
| On-Demand | Pay by the hour/second, no commitment | 0% (baseline) | None |
| Reserved / Committed | 1-3 year commitment for lower rate | 30-72% | 1-3 years |
| Spot / Preemptible | Unused capacity, can be terminated anytime | 60-90% | None (but unreliable) |
| Savings Plans | Commit to $/hr spend, flexible instance types | 20-50% | 1-3 years |
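The discount columns translate into concrete monthly numbers. A small helper for a hypothetical $0.10/hr instance (730 hours ≈ one month; the discount percentages below are illustrative mid-points from the table):

```shell
#!/usr/bin/env bash
# Monthly cost of one always-on instance: hourly rate x 730 hours,
# reduced by the pricing model's discount percentage.
monthly_cost() {
  local hourly="$1" discount_pct="$2"
  awk -v h="$hourly" -v d="$discount_pct" \
    'BEGIN { printf "%.2f", h * 730 * (100 - d) / 100 }'
}

# monthly_cost 0.10 0    → 73.00   (on-demand)
# monthly_cost 0.10 40   → 43.80   (1-yr reserved, ~40% off)
# monthly_cost 0.10 70   → 21.90   (spot, ~70% off)
```

The spread is why steady baseline load belongs on reserved capacity, burst load on on-demand, and interruptible batch work on spot.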
Hands-On: Full Stack on AWS
Deploying a web app with EC2 + RDS + S3 + CloudFront:
# Architecture:
# CloudFront (CDN) → ALB → EC2 (app) → RDS (PostgreSQL)
# → S3 (static assets/uploads)
# Step 1: Create VPC with public/private subnets
aws ec2 create-vpc --cidr-block 10.0.0.0/16
# (In practice, use Terraform — shown in Chapter 14)
# Step 2: Launch RDS in private subnet
# (generate the master password first and save it — it cannot be
# retrieved from AWS later)
DB_PASSWORD="$(openssl rand -base64 24)"
aws rds create-db-instance \
--db-instance-identifier myapp-db \
--db-instance-class db.t3.micro \
--engine postgres \
--engine-version 16 \
--master-username myappadmin \
--master-user-password "$DB_PASSWORD" \
--allocated-storage 20 \
--no-publicly-accessible \
--vpc-security-group-ids sg-xxxxx
# Step 3: Create S3 bucket for assets
aws s3 mb s3://myapp-assets-prod
# New buckets block public policies by default — lift that first
aws s3api delete-public-access-block --bucket myapp-assets-prod
aws s3api put-bucket-policy --bucket myapp-assets-prod \
--policy '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":"*","Action":"s3:GetObject","Resource":"arn:aws:s3:::myapp-assets-prod/*"}]}'
# Step 4: Create CloudFront distribution
aws cloudfront create-distribution \
--origin-domain-name myapp-assets-prod.s3.amazonaws.com \
--default-root-object index.html
# Step 5: Set up ALB + EC2 (use launch template + auto-scaling group)
# Step 6: Configure Route 53 to point domain to CloudFront
Hands-On: Full Stack on DigitalOcean
# Architecture:
# Load Balancer → Droplet(s) → Managed PostgreSQL
# → Spaces (S3-compatible storage)
# Step 1: Create a Droplet
doctl compute droplet create myapp-web \
--image ubuntu-22-04-x64 \
--size s-1vcpu-2gb \
--region fra1 \
--ssh-keys YOUR_KEY_FINGERPRINT
# Step 2: Create managed database
doctl databases create myapp-db \
--engine pg \
--version 16 \
--size db-s-1vcpu-1gb \
--region fra1 \
--num-nodes 1
# Step 3: Create Spaces bucket (S3-compatible)
# Done via web console or s3cmd with DO endpoint
# Step 4: Create load balancer
doctl compute load-balancer create \
--name myapp-lb \
--region fra1 \
--forwarding-rules "entry_protocol:https,entry_port:443,target_protocol:http,target_port:3000,certificate_id:YOUR_CERT" \
--droplet-ids DROPLET_ID
# Step 5: Point domain DNS to load balancer IP
Which Provider to Choose
AWS: Largest service catalog, the industry default, deepest talent pool. Steep learning curve, notoriously complex pricing.
GCP: Best for data/ML, Kubernetes (they invented it), clean APIs. Smaller market share.
Azure: Best for Microsoft/.NET shops, enterprise AD integration. Complex portal.
DigitalOcean: Simplest UX, predictable pricing, great docs. Fewer services, smaller scale.
Hetzner: Best price/performance in EU. Minimal managed services but unbeatable value.
Rule of thumb: Start with DigitalOcean or Hetzner for simplicity. Move to AWS/GCP when you need managed services (ML, analytics, complex networking) that simpler providers don't offer.
Cost Calculator Links
- AWS Pricing Calculator
- GCP Pricing Calculator
- Azure Pricing Calculator
- DigitalOcean Pricing (simple flat rates)
Chapter 11 — Domain Names & DNS
DNS (Domain Name System) translates human-readable names to IP addresses. It's the phone book of the internet, and misconfiguring it is one of the most common causes of "my site is down."
How DNS Resolution Works
DNS Record Types
| Type | Purpose | Example |
|---|---|---|
| A | Maps name to IPv4 address | example.com → 93.184.216.34 |
| AAAA | Maps name to IPv6 address | example.com → 2606:2800:220:1:... |
| CNAME | Alias to another name | www.example.com → example.com |
| MX | Mail server for the domain | example.com → mail.example.com (priority 10) |
| TXT | Arbitrary text (verification, SPF, DKIM) | example.com → "v=spf1 include:_spf.google.com ~all" |
| NS | Nameservers for the domain | example.com → ns1.cloudflare.com |
| SRV | Service location (port + priority) | _sip._tcp.example.com → sipserver.example.com:5060 |
| CAA | Which CAs can issue certificates | example.com → "0 issue letsencrypt.org" |
Practical DNS Configuration
# Typical DNS setup for a web app:
# Root domain → your server
example.com. A 93.184.216.34
example.com. AAAA 2606:2800:220:1::248
# www subdomain → alias to root
www.example.com. CNAME example.com.
# API subdomain → different server or load balancer
api.example.com. A 10.20.30.40
# Email (Google Workspace example)
example.com. MX 1 aspmx.l.google.com.
example.com. MX 5 alt1.aspmx.l.google.com.
example.com. TXT "v=spf1 include:_spf.google.com ~all"
# DKIM (email authentication)
google._domainkey.example.com. TXT "v=DKIM1; k=rsa; p=MIGfMA0..."
# DMARC (email policy)
_dmarc.example.com. TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com"
# Let's Encrypt verification
_acme-challenge.example.com. TXT "random-verification-string"
TTL (Time To Live)
TTL tells resolvers how long to cache a record (in seconds):
- 300 (5 min): Good for records that might change (during migrations)
- 3600 (1 hr): Standard for most records
- 86400 (24 hr): Stable records (MX, NS)
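TTL is easiest to reason about as a cache-entry lifetime. A minimal sketch of a resolver's cache (the class and names here are illustrative, not any real resolver's API):

```javascript
// Minimal sketch: a resolver caches a record until its TTL expires.
class DnsCache {
  constructor() { this.store = new Map(); }
  put(name, value, ttlSeconds, nowMs) {
    this.store.set(name, { value, expiresAt: nowMs + ttlSeconds * 1000 });
  }
  get(name, nowMs) {
    const entry = this.store.get(name);
    if (!entry || nowMs >= entry.expiresAt) return null; // expired → must re-query authoritative NS
    return entry.value;
  }
}

const cache = new DnsCache();
const t0 = Date.now();
cache.put('example.com', '93.184.216.34', 300, t0); // 5-minute TTL
console.log(cache.get('example.com', t0 + 299_000)); // still served from cache
console.log(cache.get('example.com', t0 + 301_000)); // null → resolver re-resolves
```

This is why you lower TTL to 300 *before* a migration: resolvers may serve the old IP for up to one full previous TTL after you change the record, so drop the TTL at least that long in advance.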
DNS Providers
| Provider | Free Tier | Best Feature | Best For |
|---|---|---|---|
| Cloudflare | Unlimited zones | CDN + DDoS + DNS in one | Most websites (recommended default) |
| AWS Route 53 | None ($0.50/zone) | Latency/geo routing, health checks | AWS-heavy infrastructure |
| Google Cloud DNS | None | Low latency, DNSSEC | GCP infrastructure |
| NS1 | Limited free | Advanced traffic management | Complex routing needs |
CDN Integration
A CDN (Content Delivery Network) caches your content at edge locations worldwide. DNS is how you route users to the nearest edge:
# Without CDN: Users hit your origin server directly
example.com. A YOUR_SERVER_IP
# With Cloudflare (proxy mode): Users hit Cloudflare edge, which proxies to origin
# Just enable the orange cloud icon in Cloudflare dashboard
# DNS resolves to Cloudflare's anycast IPs, not your server
# With AWS CloudFront: Point domain to CloudFront distribution
example.com. ALIAS d1234567890.cloudfront.net.
# (ALIAS is AWS-specific; equivalent to CNAME at zone apex)
Chapter 12 — Reverse Proxies & Load Balancing
A reverse proxy sits between the internet and your application servers. It handles SSL termination, load balancing, caching, rate limiting, and request routing — so your app doesn't have to.
Why Use a Reverse Proxy
- SSL termination: Handle HTTPS at the proxy, app speaks plain HTTP internally
- Load balancing: Distribute requests across multiple app instances
- Static file serving: Serve assets directly without hitting your app
- Caching: Cache responses to reduce backend load
- Security: Hide backend topology, add rate limiting, filter bad requests
- Compression: Gzip/Brotli responses automatically
Nginx — Full Production Configuration
# /etc/nginx/sites-available/myapp.conf
upstream app_backend {
# Load balancing across multiple app instances
server 127.0.0.1:3000 weight=3;
server 127.0.0.1:3001 weight=2;
server 127.0.0.1:3002 backup; # Only used if others are down
# Health checks (Nginx Plus) or use passive checks
keepalive 32; # Persistent connections to backend
}
# Redirect HTTP to HTTPS
server {
listen 80;
server_name example.com www.example.com;
return 301 https://$server_name$request_uri;
}
# Main HTTPS server
server {
listen 443 ssl http2;
server_name example.com www.example.com;
# SSL Configuration
ssl_certificate /etc/letsencrypt/live/example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
ssl_prefer_server_ciphers off;
ssl_session_cache shared:SSL:10m;
ssl_session_timeout 1d;
# Security headers
add_header Strict-Transport-Security "max-age=63072000; includeSubDomains" always;
add_header X-Frame-Options DENY always;
add_header X-Content-Type-Options nosniff always;
add_header X-XSS-Protection "1; mode=block" always;  # Legacy header; superseded by CSP but harmless
# Gzip compression
gzip on;
gzip_types text/plain text/css application/json application/javascript text/xml;
gzip_min_length 1000;
# Static files — served directly by Nginx (fast)
location /static/ {
alias /var/www/myapp/static/;
expires 30d;
add_header Cache-Control "public, immutable";
}
# API — proxy to backend
location /api/ {
proxy_pass http://app_backend;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header Connection "";
# Timeouts
proxy_connect_timeout 5s;
proxy_send_timeout 60s;
proxy_read_timeout 60s;
# Rate limiting
limit_req zone=api burst=20 nodelay;
}
# WebSocket support
location /ws/ {
proxy_pass http://app_backend;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_read_timeout 86400; # Keep WebSocket alive
}
}
# Rate limiting zone (defined in nginx.conf http block)
# limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
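The enforcement behind `limit_req` is a leaky bucket: excess requests accumulate, leak away at the configured rate, and anything beyond `burst` is rejected. A behavioral sketch of those semantics (illustrative, not nginx's actual code):

```javascript
// Sketch of limit_req's leaky-bucket behavior (rate=10r/s, burst=20, nodelay).
function makeLimiter(ratePerSec, burst) {
  let excess = null; // requests currently "in the bucket"; null = no state yet
  let last = 0;
  return function allow(nowMs) {
    if (excess === null) { excess = 0; last = nowMs; return true; } // first request always passes
    excess = Math.max(0, excess - (ratePerSec * (nowMs - last)) / 1000); // bucket leaks over time
    last = nowMs;
    if (excess + 1 > burst) return false; // bucket full → nginx answers 503
    excess += 1;
    return true;
  };
}

const allow = makeLimiter(10, 20);
let accepted = 0;
for (let i = 0; i < 40; i++) if (allow(0)) accepted += 1; // 40 simultaneous requests
console.log(accepted); // 21: the first request plus a burst of 20
```

With `nodelay`, burst requests are served immediately (as modeled here); without it, nginx queues them and releases them at the configured rate.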
Caddy — The Modern Alternative (Auto-HTTPS)
# Caddyfile — entire config for production site with auto-SSL
example.com {
# Automatic HTTPS (Let's Encrypt) — zero configuration needed!
# Reverse proxy to app
reverse_proxy /api/* localhost:3000
# Static files
root * /var/www/myapp/static
file_server
# Compression
encode gzip zstd
# Security headers
header {
Strict-Transport-Security "max-age=63072000; includeSubDomains"
X-Frame-Options DENY
X-Content-Type-Options nosniff
}
    # Rate limiting (requires the mholt/caddy-ratelimit plugin; syntax sketch)
    rate_limit {
        zone api {
            key {remote_host}
            events 10
            window 1s
        }
    }
# Logging
log {
output file /var/log/caddy/access.log
format json
}
}
Load Balancing Algorithms
| Algorithm | How It Works | Best For |
|---|---|---|
| Round Robin | Requests distributed sequentially | Identical servers, stateless apps |
| Weighted Round Robin | More requests to higher-weight servers | Mixed server capacities |
| Least Connections | Send to server with fewest active connections | Variable request durations |
| IP Hash | Same client IP always goes to same server | Session affinity (sticky sessions) |
| Random | Random server selection | Large clusters, simple |
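As a rough sketch of how three of these pickers behave (illustrative toy implementations, not any load balancer's internals):

```javascript
// Toy versions of three selection strategies from the table above.
const servers = ['app1', 'app2', 'app3'];

// Round robin: walk the list sequentially, wrapping around.
function makeRoundRobin(list) {
  let i = 0;
  return () => list[i++ % list.length];
}

// Least connections: pick the server with the fewest active connections.
function leastConnections(active) {
  // active: { serverName: currentConnectionCount }
  return Object.keys(active).reduce((a, b) => (active[a] <= active[b] ? a : b));
}

// IP hash: the same client IP always maps to the same server (sticky sessions).
function ipHash(ip, list) {
  const sum = ip.split('.').reduce((acc, octet) => acc + Number(octet), 0);
  return list[sum % list.length];
}

const next = makeRoundRobin(servers);
console.log(next(), next(), next(), next()); // app1 app2 app3 app1
console.log(leastConnections({ app1: 12, app2: 3, app3: 7 })); // app2
console.log(ipHash('203.0.113.7', servers) === ipHash('203.0.113.7', servers)); // true
```

Note the trade-off visible even in the toy version: IP hash gives affinity but uneven load; least connections balances load but needs shared connection counts.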
Cloud Load Balancers
| Type | AWS | Layer | Use Case |
|---|---|---|---|
| Application LB | ALB | Layer 7 (HTTP) | Web apps, path-based routing, WebSocket |
| Network LB | NLB | Layer 4 (TCP/UDP) | High performance, static IP, non-HTTP |
| Gateway LB | GWLB | Layer 3 | Network appliances (firewalls, IDS) |
Chapter 13 — CI/CD Pipelines
CI/CD (Continuous Integration / Continuous Deployment) automates the path from code commit to production. No more manual deployments, no more "I forgot to run the tests."
CI vs CD
| Term | What It Does | Triggered By |
|---|---|---|
| Continuous Integration (CI) | Automatically build and test every commit/PR | Every push or pull request |
| Continuous Delivery (CD) | Automatically prepare releases (deploy to staging) | Merge to main branch |
| Continuous Deployment (CD) | Automatically deploy to production | After all checks pass |
GitHub Actions — Complete Workflow
# .github/workflows/deploy.yml
name: CI/CD Pipeline
on:
push:
branches: [main]
pull_request:
branches: [main]
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
# ─── CI: Test & Build ───────────────────────────────────
test:
runs-on: ubuntu-latest
services:
postgres:
image: postgres:16
env:
POSTGRES_DB: test
POSTGRES_PASSWORD: test
ports: ['5432:5432']
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: 20
cache: 'npm'
- run: npm ci
- run: npm run lint
- run: npm run test
env:
DATABASE_URL: postgres://postgres:test@localhost:5432/test
# ─── Build & Push Docker Image ─────────────────────────
build:
needs: test
runs-on: ubuntu-latest
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
permissions:
contents: read
packages: write
outputs:
image-tag: ${{ steps.meta.outputs.tags }}
steps:
- uses: actions/checkout@v4
- uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=sha,prefix=
type=raw,value=latest
- uses: docker/build-push-action@v5
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
cache-from: type=gha
cache-to: type=gha,mode=max
# ─── Deploy to Production ──────────────────────────────
deploy:
needs: build
runs-on: ubuntu-latest
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
environment: production # Requires approval if configured
steps:
- name: Deploy to server via SSH
uses: appleboy/ssh-action@v1
with:
host: ${{ secrets.SERVER_HOST }}
username: deploy
key: ${{ secrets.SSH_PRIVATE_KEY }}
script: |
docker pull ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
docker stop myapp || true
docker rm myapp || true
docker run -d \
--name myapp \
--restart unless-stopped \
-p 3000:3000 \
--env-file /etc/myapp.env \
${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
# Health check
sleep 5
curl -f http://localhost:3000/health || (docker logs myapp && exit 1)
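The `sleep 5` before the health check is a fixed guess; production deploy scripts usually poll with backoff instead. A generic sketch of that pattern (the helper name and options are hypothetical, not from any library):

```javascript
// Retry an async health check with exponential backoff until it passes
// or the attempt budget is exhausted.
async function waitForHealthy(check, { attempts = 5, baseDelayMs = 200 } = {}) {
  for (let i = 0; i < attempts; i++) {
    if (await check()) return i + 1; // number of probes used
    await new Promise(r => setTimeout(r, baseDelayMs * 2 ** i)); // 200, 400, 800ms...
  }
  throw new Error('service did not become healthy in time');
}

// Demo: a service that only starts answering on its third probe.
let probes = 0;
const flaky = async () => ++probes >= 3;
waitForHealthy(flaky, { baseDelayMs: 1 }).then(n => console.log(`healthy after ${n} probes`));
```

In a real pipeline `check` would be an HTTP request to `/health`, and a thrown error would trigger the rollback path instead of leaving a broken container serving traffic.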
Deployment Strategies
| Strategy | How It Works | Downtime | Risk | Rollback Speed |
|---|---|---|---|---|
| Recreate | Stop old, start new | Yes (seconds-minutes) | Low (simple, but has downtime) | Redeploy old version |
| Rolling | Replace instances one by one | No | Medium (mixed versions briefly) | Continue rolling with old |
| Blue-Green | Run two identical environments, switch traffic | No | Low (instant switch) | Switch back instantly |
| Canary | Route small % of traffic to new version | No | Lowest (limited blast radius) | Route 100% back to old |
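Canary routing is often done by hashing a stable user key, so each user consistently sees one version while the rollout percentage grows. A sketch under that assumption (the hash and names are illustrative):

```javascript
// Deterministic canary split: hash a stable user key into a 0-99 bucket and
// compare it to the rollout percentage, so each user sticks to one version.
function bucket(userId) {
  let h = 0;
  for (const ch of String(userId)) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % 100;
}

function routeVersion(userId, canaryPercent) {
  return bucket(userId) < canaryPercent ? 'v2-canary' : 'v1-stable';
}

// Widening the rollout from 10% to 50% never flips an existing canary user
// back to stable, because buckets < 10 are still < 50.
const users = ['alice', 'bob', 'carol', 'dave'];
console.log(users.map(u => routeVersion(u, 10)));
```

The same property is what lets you watch error rates on the canary cohort over time: the cohort is stable, not resampled per request.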
GitLab CI Example
# .gitlab-ci.yml
stages:
- test
- build
- deploy
test:
stage: test
image: node:20
services:
- postgres:16
variables:
DATABASE_URL: postgres://postgres:test@postgres:5432/test
script:
- npm ci
- npm run lint
- npm run test
build:
stage: build
image: docker:24
services:
- docker:24-dind
  script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
only:
- main
deploy:
stage: deploy
script:
- ssh deploy@$SERVER "docker pull $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA && docker-compose up -d"
only:
- main
environment:
name: production
Chapter 14 — Infrastructure as Code
Infrastructure as Code (IaC) means defining your servers, networks, databases, and all cloud resources in version-controlled configuration files — not clicking through web consoles.
Why IaC
- Reproducibility: Spin up identical environments (dev/staging/prod) from the same code
- Version control: Track every infrastructure change in Git (who changed what, when, why)
- Review process: Infrastructure changes go through pull requests like code
- Disaster recovery: Rebuild entire infrastructure from scratch in minutes
- Documentation: The code IS the documentation of your infrastructure
Terraform — The Industry Standard
# main.tf — Deploy a web app on AWS
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
# Store state remotely (never in git!)
backend "s3" {
bucket = "myapp-terraform-state"
key = "prod/terraform.tfstate"
region = "eu-west-1"
}
}
provider "aws" {
region = var.region
}
# ─── VPC ──────────────────────────────────────────────────
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "5.0"
name = "${var.app_name}-vpc"
cidr = "10.0.0.0/16"
azs = ["${var.region}a", "${var.region}b"]
public_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
private_subnets = ["10.0.10.0/24", "10.0.11.0/24"]
enable_nat_gateway = true
single_nat_gateway = true # Cost saving for non-prod
}
# ─── Database ─────────────────────────────────────────────
resource "aws_db_instance" "main" {
identifier = "${var.app_name}-db"
engine = "postgres"
engine_version = "16"
instance_class = "db.t3.micro"
allocated_storage = 20
storage_encrypted = true
db_name = var.app_name
  username = "dbadmin" # "admin" is reserved on RDS for PostgreSQL
password = var.db_password # From secrets, never hardcoded
vpc_security_group_ids = [aws_security_group.db.id]
db_subnet_group_name = aws_db_subnet_group.main.name
skip_final_snapshot = false
final_snapshot_identifier = "${var.app_name}-final-snapshot"
backup_retention_period = 7
multi_az = var.environment == "prod" ? true : false
}
# ─── EC2 Instance ─────────────────────────────────────────
resource "aws_instance" "web" {
ami = data.aws_ami.ubuntu.id
instance_type = var.instance_type
subnet_id = module.vpc.public_subnets[0]
vpc_security_group_ids = [aws_security_group.web.id]
key_name = aws_key_pair.deploy.key_name
user_data = templatefile("${path.module}/userdata.sh", {
app_name = var.app_name
db_host = aws_db_instance.main.endpoint
})
tags = {
Name = "${var.app_name}-web"
Environment = var.environment
}
}
# ─── Variables ────────────────────────────────────────────
# variables.tf
variable "app_name" { default = "myapp" }
variable "region" { default = "eu-west-1" }
variable "environment" { default = "prod" }
variable "instance_type" { default = "t3.small" }
variable "db_password" { sensitive = true }
# Terraform workflow:
terraform init # Download providers, initialize backend
terraform plan # Preview changes (ALWAYS review this)
terraform apply # Apply changes (creates/modifies resources)
terraform destroy # Tear down everything (careful!)
Ansible — Configuration Management
Terraform creates infrastructure. Ansible configures it (installs software, deploys apps, manages configs).
# playbook.yml — Configure a web server
---
- name: Configure web server
hosts: webservers
become: yes
vars:
app_name: myapp
app_port: 3000
node_version: "20"
tasks:
- name: Update apt cache
apt:
update_cache: yes
cache_valid_time: 3600
- name: Install required packages
apt:
name: [nginx, certbot, python3-certbot-nginx, ufw]
state: present
- name: Configure UFW firewall
ufw:
rule: allow
port: "{{ item }}"
proto: tcp
loop: ['22', '80', '443']
- name: Enable UFW
ufw:
state: enabled
policy: deny
- name: Install Node.js
shell: |
curl -fsSL https://deb.nodesource.com/setup_{{ node_version }}.x | bash -
apt-get install -y nodejs
args:
creates: /usr/bin/node
- name: Deploy application
git:
repo: "https://github.com/you/{{ app_name }}.git"
dest: "/opt/{{ app_name }}"
version: main
notify: restart app
- name: Install dependencies
npm:
path: "/opt/{{ app_name }}"
production: yes
- name: Create systemd service
template:
src: templates/app.service.j2
dest: "/etc/systemd/system/{{ app_name }}.service"
notify: restart app
- name: Configure Nginx
template:
src: templates/nginx.conf.j2
dest: "/etc/nginx/sites-available/{{ app_name }}"
notify: reload nginx
handlers:
- name: restart app
systemd:
name: "{{ app_name }}"
state: restarted
daemon_reload: yes
- name: reload nginx
systemd:
name: nginx
state: reloaded
# Run Ansible:
ansible-playbook -i inventory.yml playbook.yml
# inventory.yml
webservers:
hosts:
web1:
ansible_host: 93.184.216.34
ansible_user: deploy
GitOps Principles
- Declarative: Describe desired state, not steps to get there
- Versioned: All infrastructure definitions in Git
- Automated: Changes applied automatically when Git changes
- Observable: Drift detection — alert when actual state ≠ desired state
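Drift detection reduces to diffing desired state against actual state. A minimal sketch (the resource attributes are made up for illustration):

```javascript
// Drift detection in miniature: diff desired state (from Git) against
// actual state (from the cloud API) and report mismatched keys.
function detectDrift(desired, actual) {
  const drift = [];
  for (const key of new Set([...Object.keys(desired), ...Object.keys(actual)])) {
    if (desired[key] !== actual[key]) {
      drift.push({ key, desired: desired[key] ?? '(absent)', actual: actual[key] ?? '(absent)' });
    }
  }
  return drift;
}

const desired = { instance_type: 't3.small', multi_az: 'true', port: '3000' };
const actual  = { instance_type: 't3.medium', multi_az: 'true', port: '3000' }; // someone clicked in the console
console.log(detectDrift(desired, actual));
// → [{ key: 'instance_type', desired: 't3.small', actual: 't3.medium' }]
```

This is conceptually what `terraform plan` does: refresh actual state, diff it against the configuration, and propose actions to close the gap.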
• Terraform: Multi-cloud, creates infrastructure (VMs, networks, DBs). Industry standard.
• Ansible: Configures existing servers (install software, deploy apps). Agentless, SSH-based.
• CloudFormation: AWS-only IaC. Use if you're 100% AWS and want native integration.
• Pulumi: IaC in real programming languages (TypeScript, Python, Go). Good for developers who dislike HCL.
Common combo: Terraform to create infrastructure + Ansible to configure it. Or Terraform + Docker (no config management needed — the container IS the config).
Chapter 15 — Monitoring, Logging & Observability
If you can't see what's happening in production, you can't fix it. Observability is the ability to understand your system's internal state from its external outputs.
The Three Pillars
| Pillar | What | Tools | Answers |
|---|---|---|---|
| Metrics | Numeric measurements over time | Prometheus, CloudWatch, Datadog | "How much?" "How fast?" "How often?" |
| Logs | Discrete events with context | Loki, ELK, CloudWatch Logs | "What happened?" "Why did it fail?" |
| Traces | Request flow across services | Jaeger, Zipkin, AWS X-Ray | "Where is the bottleneck?" "Which service is slow?" |
Prometheus + Grafana (The Open-Source Standard)
# docker-compose.yml — Monitoring stack
services:
prometheus:
image: prom/prometheus:latest
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus_data:/prometheus
ports:
- "9090:9090"
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention.time=30d'
grafana:
image: grafana/grafana:latest
volumes:
- grafana_data:/var/lib/grafana
ports:
- "3001:3000"
environment:
GF_SECURITY_ADMIN_PASSWORD: changeme
node-exporter:
image: prom/node-exporter:latest
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
volumes:
prometheus_data:
grafana_data:
# prometheus.yml — Scrape configuration
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "alerts.yml"
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
scrape_configs:
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
- job_name: 'myapp'
static_configs:
- targets: ['myapp:3000']
metrics_path: '/metrics'
Application Metrics (What to Measure)
The RED method for services and USE method for resources:
| Method | Metric | What It Tells You |
|---|---|---|
| RED (Services) | Rate | Requests per second |
| RED (Services) | Errors | Failed requests per second |
| RED (Services) | Duration | Response time (p50, p95, p99) |
| USE (Resources) | Utilization | % of resource capacity used |
| USE (Resources) | Saturation | Queue depth, waiting work |
| USE (Resources) | Errors | Error count on the resource |
# Example: Exposing metrics in Node.js (using prom-client)
const client = require('prom-client');
// Default metrics (CPU, memory, event loop)
client.collectDefaultMetrics();
// Custom metrics
const httpRequestDuration = new client.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});
// Middleware to track requests
app.use((req, res, next) => {
const end = httpRequestDuration.startTimer();
res.on('finish', () => {
end({ method: req.method, route: req.route?.path || req.path, status_code: res.statusCode });
});
next();
});
// Metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
res.set('Content-Type', client.register.contentType);
res.end(await client.register.metrics());
});
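The `histogram_quantile` function you'll meet in alert rules estimates quantiles from those cumulative buckets by linear interpolation. A sketch of the idea (simplified from Prometheus's actual implementation):

```javascript
// Estimate a quantile (e.g. p95) from cumulative histogram buckets —
// the same shape prom-client exposes: le = upper bound, count = cumulative.
function histogramQuantile(q, buckets) {
  // buckets: sorted [{ le, count }]; the last bucket's le is +Infinity.
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevLe = 0, prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      if (le === Infinity) return prevLe; // cannot interpolate into +Inf
      // Linear interpolation inside the bucket containing the rank.
      return prevLe + (le - prevLe) * ((rank - prevCount) / (count - prevCount));
    }
    prevLe = le; prevCount = count;
  }
  return NaN;
}

const buckets = [
  { le: 0.1, count: 50 },   // 50 requests ≤ 100ms
  { le: 0.5, count: 90 },   // 90 requests ≤ 500ms
  { le: 1.0, count: 99 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // estimate lands between 0.5s and 1.0s
```

This is also why bucket boundaries matter: a p95 alert threshold of 2s is only as precise as the buckets around 2s.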
Alerting Rules
# alerts.yml — Prometheus alerting rules
groups:
- name: webapp
rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate (> 5%)"
description: "{{ $labels.instance }} has {{ $value | humanizePercentage }} error rate"
- alert: HighLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "High p95 latency (> 2s)"
- alert: InstanceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} is down"
SLIs, SLOs, and SLAs
| Term | Definition | Example |
|---|---|---|
| SLI (Indicator) | A measurable metric of service quality | 99.2% of requests complete in <500ms |
| SLO (Objective) | Target value for an SLI (internal goal) | "99.9% availability over 30 days" |
| SLA (Agreement) | Contract with customers (with consequences) | "99.9% uptime or we credit your bill" |
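The practical consequence of an SLO is its error budget: the concrete amount of downtime you may spend before violating it. The arithmetic:

```javascript
// Error-budget arithmetic: an SLO of 99.9% over 30 days leaves a small,
// concrete allowance of downtime to "spend" on incidents and risky deploys.
function errorBudgetMinutes(sloPercent, windowDays) {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - sloPercent / 100);
}

console.log(errorBudgetMinutes(99.9, 30).toFixed(1));  // 43.2 minutes/month
console.log(errorBudgetMinutes(99.99, 30).toFixed(1)); // 4.3 minutes/month
console.log(errorBudgetMinutes(99.0, 30).toFixed(0));  // 432 minutes (7.2 hours)
```

Notice the jump from "three nines" to "four nines": 43 minutes is survivable with a pager and a runbook; 4 minutes essentially requires automated failover.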
Chapter 16 — Security & Hardening
Security is not a feature you add later — it's a practice woven into every layer. This chapter covers the essential security measures for any production website.
SSL/TLS — Encrypting Traffic
How TLS Works (Simplified)
Certificate Options
| Provider | Cost | Validation | Best For |
|---|---|---|---|
| Let's Encrypt | Free | Domain Validation (DV) | Everything (90-day auto-renewal) |
| Cloudflare | Free (with proxy) | DV | Sites behind Cloudflare |
| AWS ACM | Free (with AWS services) | DV | AWS ALB/CloudFront |
| Commercial CAs | $10-1000/yr | OV/EV | Enterprise, legal requirements |
HSTS (HTTP Strict Transport Security)
# Force browsers to always use HTTPS (add to Nginx/response headers)
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
# Once cached, browsers refuse plain-HTTP requests to your domain until max-age expires
# Submit to HSTS preload list: https://hstspreload.org/
Firewall Configuration
# UFW (Uncomplicated Firewall) — Ubuntu/Debian
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp # SSH (or your custom port)
sudo ufw allow 80/tcp # HTTP
sudo ufw allow 443/tcp # HTTPS
sudo ufw enable
sudo ufw status verbose
# iptables (lower level, more control)
# Allow established connections
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# Allow loopback
iptables -A INPUT -i lo -j ACCEPT
# Allow SSH, HTTP, HTTPS
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
# Drop everything else
iptables -A INPUT -j DROP
# Save rules (persist across reboot)
sudo apt install iptables-persistent
sudo netfilter-persistent save
SSH Hardening
# /etc/ssh/sshd_config — Production SSH configuration
Port 2222 # Non-standard port (reduces noise)
PermitRootLogin no # Never allow root SSH
PasswordAuthentication no # Keys only
PubkeyAuthentication yes
MaxAuthTries 3
ClientAliveInterval 300
ClientAliveCountMax 2
AllowUsers deploy # Whitelist specific users
# (The old "Protocol 2" directive is obsolete; SSH protocol 1 was removed in OpenSSH 7.6)
# Optional: Restrict to specific IPs (if you have static IP)
# AllowUsers deploy@YOUR_IP
Security Headers
# Add to Nginx server block or application responses:
add_header X-Frame-Options "DENY" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;  # Legacy; superseded by CSP but harmless
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
add_header Permissions-Policy "camera=(), microphone=(), geolocation=()" always;
add_header Content-Security-Policy "default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline';" always;
# Test your headers: https://securityheaders.com/
Secrets Management
| Tool | Type | Best For |
|---|---|---|
| Environment variables | Runtime injection | Simple apps (loaded from secure source) |
| AWS Secrets Manager | Cloud-managed | AWS workloads, auto-rotation |
| HashiCorp Vault | Self-hosted/cloud | Multi-cloud, dynamic secrets, PKI |
| SOPS | Encrypted files in Git | GitOps workflows, small teams |
| Doppler / 1Password | SaaS | Teams wanting simple UI |
# Example: Using SOPS to encrypt secrets in Git
# Install: brew install sops age
# Generate an age key
age-keygen -o keys.txt
# Public key: age1xxxxxxx...
# Create .sops.yaml in repo root
creation_rules:
- path_regex: \.enc\.yaml$
age: age1xxxxxxx...
# Encrypt a secrets file
sops --encrypt secrets.yaml > secrets.enc.yaml
# secrets.enc.yaml is safe to commit — encrypted at rest
# Decrypt at deploy time
sops --decrypt secrets.enc.yaml > /etc/myapp.env
Backup Strategy — The 3-2-1 Rule
- 3 copies of your data
- 2 different storage media/types
- 1 copy offsite (different geographic location)
# Automated database backup script
#!/bin/bash
# /opt/scripts/backup-db.sh
set -euo pipefail
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/opt/backups"
S3_BUCKET="s3://myapp-backups-prod"
# Dump database
pg_dump -h localhost -U myapp myapp_prod | gzip > "$BACKUP_DIR/db_$TIMESTAMP.sql.gz"
# Upload to S3 (offsite copy)
aws s3 cp "$BACKUP_DIR/db_$TIMESTAMP.sql.gz" "$S3_BUCKET/db/$TIMESTAMP.sql.gz"
# Retain only last 7 local backups
ls -t "$BACKUP_DIR"/db_*.sql.gz | tail -n +8 | xargs -r rm -f
# Cron: Run daily at 3 AM
# 0 3 * * * /opt/scripts/backup-db.sh >> /var/log/backup.log 2>&1
DDoS Mitigation
- Cloudflare (free tier): Proxies traffic, absorbs L3/L4 attacks, rate limiting
- AWS Shield: Standard (free, basic protection) or Advanced ($3000/mo, dedicated response team)
- Rate limiting: Nginx limit_req, application-level throttling
- Fail2ban: Automatically ban IPs with suspicious patterns
Production Security Checklist
☐ HTTPS everywhere (HSTS enabled)
☐ SSH: key-only auth, non-standard port, root disabled
☐ Firewall: only required ports open
☐ Security headers configured
☐ Secrets not in Git (use secrets manager)
☐ Database not publicly accessible
☐ Automatic security updates enabled
☐ Fail2ban or equivalent running
☐ Backups automated and tested
☐ Dependencies scanned for vulnerabilities (npm audit, Snyk)
☐ Application logs don't contain sensitive data
☐ Rate limiting on authentication endpoints
☐ CORS configured correctly (not wildcard in production)
Chapter 17 — Maintenance & Operations
Launching is just the beginning. Day-2 operations — keeping the system running, updated, and healthy — is where most of the work lives.
OS & Dependency Updates
# Automatic security updates (Ubuntu)
sudo apt install unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades
# /etc/apt/apt.conf.d/50unattended-upgrades
Unattended-Upgrade::Allowed-Origins {
"${distro_id}:${distro_codename}-security";
};
Unattended-Upgrade::Automatic-Reboot "true";
Unattended-Upgrade::Automatic-Reboot-Time "03:00";
# Application dependency updates — use Dependabot or Renovate
# .github/dependabot.yml
version: 2
updates:
- package-ecosystem: "npm"
directory: "/"
schedule:
interval: "weekly"
open-pull-requests-limit: 5
Zero-Downtime Deployments
# Method 1: Rolling restart with multiple instances behind LB
# Deploy to instance 1, health check passes, deploy to instance 2...
# Method 2: Docker with Nginx upstream reload
# deploy.sh
docker pull myregistry/myapp:$NEW_VERSION
docker run -d --name myapp-new -p 3001:3000 myregistry/myapp:$NEW_VERSION
# Wait for health check
until curl -sf http://localhost:3001/health; do sleep 1; done
# Switch Nginx upstream (a real script must alternate between ports 3000/3001 on successive deploys)
sed -i 's/127.0.0.1:3000/127.0.0.1:3001/' /etc/nginx/conf.d/upstream.conf
nginx -s reload
# Stop old container
docker stop myapp-old && docker rm myapp-old
docker rename myapp-new myapp-old
Rollback Procedures
# Docker rollback — instant (previous image still cached)
docker stop myapp
docker run -d --name myapp -p 3000:3000 myregistry/myapp:PREVIOUS_VERSION
# Kubernetes rollback
kubectl rollout undo deployment/myapp
kubectl rollout status deployment/myapp # Watch progress
# Database rollback — this is the hard part
# Always make migrations reversible:
# - migration_001_add_column.up.sql
# - migration_001_add_column.down.sql
# Run: migrate down 1
• Never drop columns in the same deploy that removes the code using them
• Use expand-contract pattern: add new column → deploy code using both → remove old column
• Always write a down migration for every up migration (so rollbacks are possible)
• Test migrations against a production-size dataset before deploying
Disaster Recovery Plan
| Metric | Definition | Your Target |
|---|---|---|
| RTO (Recovery Time Objective) | Max acceptable downtime | e.g., 1 hour |
| RPO (Recovery Point Objective) | Max acceptable data loss | e.g., 15 minutes |
# Disaster Recovery Runbook Template:
## Scenario: Complete server failure
1. Spin up new server from Terraform (5 min)
2. Restore latest database backup (10 min)
3. Deploy latest Docker image (2 min)
4. Update DNS to new server IP (5 min + propagation)
5. Verify application health
6. Notify stakeholders
## Scenario: Database corruption
1. Stop application (prevent further writes)
2. Identify last good backup
3. Restore to point-in-time (RDS: use PITR)
4. Verify data integrity
5. Restart application
6. Post-mortem: identify root cause
## Scenario: Security breach
1. Isolate affected systems (revoke access, block IPs)
2. Preserve evidence (don't destroy logs)
3. Rotate ALL credentials and secrets
4. Assess scope of breach
5. Patch vulnerability
6. Notify affected users (legal requirement in many jurisdictions)
7. Post-mortem and remediation plan
Runbooks & On-Call
- Runbooks: Step-by-step procedures for common incidents. Written for 3 AM brain — clear, no ambiguity.
- On-call rotation: Tools like PagerDuty, Opsgenie, or Grafana OnCall route alerts to the right person.
- Escalation policy: If primary doesn't acknowledge in 5 min → page secondary → page manager.
- Post-mortems: Blameless analysis after every incident. Focus on systemic fixes, not individual blame.
Chapter 18 — Cost Management
Cloud bills can spiral out of control fast. Professional operations include cost awareness as a first-class concern.
Cloud Pricing Models
| Model | How It Works | Best For | Savings vs On-Demand |
|---|---|---|---|
| On-Demand | Pay per hour/second of use | Variable workloads, testing | 0% (baseline) |
| Reserved Instances | 1-3 year commitment, fixed rate | Steady-state production | 30-72% |
| Savings Plans | Commit to $/hr spend (flexible) | Predictable spend, flexible instances | 20-50% |
| Spot/Preemptible | Bid on unused capacity (can be terminated) | Batch jobs, CI runners, stateless workers | 60-90% |
Cost Optimization Strategies
- Right-sizing: Monitor actual CPU/memory usage. Most instances are over-provisioned. A t3.medium using 10% CPU should be a t3.small.
- Auto-scaling: Scale down during off-hours. Many apps have 10x traffic difference between peak and trough.
- Spot instances: Use for CI/CD runners, batch processing, and stateless workers. Save 60-90%.
- Reserved capacity: For databases and always-on servers, commit for 1-3 years.
- Storage tiering: Move old data to cheaper storage (S3 Glacier, cold storage).
- Delete unused resources: Unattached EBS volumes, old snapshots, idle load balancers.
- Use ARM instances: Graviton (AWS) / Tau T2A (GCP) are typically ~20% cheaper at the same or better performance.
TCO Comparison: Self-Hosted vs Cloud
| Factor | VPS ($40/mo) | AWS (equivalent) | Self-Hosted |
|---|---|---|---|
| Compute | $40/mo | $70-150/mo | $500 one-time + power |
| Database | Included (self-managed) | $30-100/mo (RDS) | Included |
| Bandwidth | Usually generous | $0.09/GB out (adds up!) | ISP cost |
| Your time | Medium (manage server) | Low (managed services) | High (manage everything) |
| Scaling | Manual (resize/add VPS) | Automatic | Buy more hardware |
| Reliability | 99.9% SLA typical | 99.99% possible | Depends on you |
Billing Alerts
# AWS: Set up a billing alarm via the CLI.
# Note: billing metrics are published only in us-east-1, and you must first
# enable "Receive Billing Alerts" in the account's billing preferences.
aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name "MonthlyBillingAlarm" \
  --alarm-description "Alert when estimated charges exceed $100" \
  --metric-name EstimatedCharges \
  --namespace AWS/Billing \
  --statistic Maximum \
  --period 21600 \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:billing-alerts \
  --dimensions Name=Currency,Value=USD
# Also set alerts at 50%, 80%, and 100% of your budget.
# AWS Budgets is more flexible than CloudWatch for this (forecasted-spend alerts, per-service budgets).
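The advice to alert at 50%, 80%, and 100% of budget is easy to script as a loop. A minimal sketch, with the `put-metric-alarm` call left as a comment since the budget and alarm names here are assumptions:

```shell
# Compute one alarm threshold per budget fraction.
budget=100   # monthly budget in dollars (assumed)
for pct in 50 80 100; do
  threshold=$(( budget * pct / 100 ))
  echo "Would create alarm 'Billing-${pct}pct' with threshold \$${threshold}"
  # aws cloudwatch put-metric-alarm --region us-east-1 \
  #   --alarm-name "Billing-${pct}pct" --threshold "$threshold" \
  #   ... (same remaining flags as the single-alarm example)
done
```

Staggered thresholds give you an early warning at 50% rather than a single surprise at month's end.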
Hidden Cost Traps
- Data transfer out: AWS charges $0.09/GB. A site serving 1 TB/month pays $90 in bandwidth alone.
- NAT Gateway: $0.045/hr plus $0.045/GB processed; this can easily reach $30-100/mo.
- Load Balancer: $16-25/mo minimum, even with zero traffic.
- Managed databases: often 3-5x the cost of self-managed on a VPS.
Mitigation: put Cloudflare in front (its free tier absorbs most bandwidth), minimize traffic through NAT Gateways, and consider Hetzner or DigitalOcean for predictable pricing.
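The egress figure above is worth internalizing, since it scales linearly with traffic. A quick sketch using the $0.09/GB rate, again in cents to stay in integer arithmetic:

```shell
# Monthly egress cost at $0.09/GB (the rate cited above).
gb_out=1000            # 1 TB served per month
rate_cents_per_gb=9
cost_dollars=$(( gb_out * rate_cents_per_gb / 100 ))
echo "Serving ${gb_out} GB/month costs ~\$${cost_dollars} in bandwidth alone"
```

Double the traffic and the bill doubles with it, which is exactly why a CDN in front pays for itself quickly.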
Chapter 19 — Decision Framework
This chapter synthesizes everything into actionable decision-making tools. When you get a new project, use these frameworks to systematically choose the right architecture and hosting.
The Decision Flowchart
Scoring Matrix
Rate how important each factor is to your project (1-5), multiply each rating by the scores in its row, and total each column; the hosting type with the highest total is your best-fit starting point:
| Factor | Static | VPS | PaaS | IaaS | K8s | Serverless |
|---|---|---|---|---|---|---|
| Low budget priority | 5 | 4 | 3 | 2 | 1 | 4 |
| Need auto-scaling | 5 | 1 | 3 | 5 | 5 | 5 |
| Minimal ops work | 5 | 2 | 5 | 2 | 1 | 5 |
| Maximum control | 1 | 5 | 1 | 4 | 4 | 1 |
| Fast time-to-market | 5 | 3 | 5 | 2 | 1 | 4 |
| Compliance needs | 2 | 4 | 2 | 5 | 5 | 3 |
| Team > 5 devs | 3 | 2 | 3 | 4 | 5 | 3 |
| Variable traffic | 5 | 1 | 3 | 4 | 4 | 5 |
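The matrix lends itself to a tiny weighted-sum script. A POSIX-sh sketch; the weights below are an assumed example project (low budget, spiky traffic, solo developer), and the column values are copied from the matrix:

```shell
# Importance per factor, in the matrix's top-to-bottom row order (assumed example).
weights="5 4 5 1 4 1 1 5"

# score "<weights>" "<column scores>" -> weighted sum
score() {
  ws=$1; total=0
  set -- $2               # positional params now hold the column's scores
  for w in $ws; do
    total=$(( total + w * $1 ))
    shift
  done
  echo "$total"
}

echo "Static:     $(score "$weights" "5 5 5 1 5 2 3 5")"
echo "Serverless: $(score "$weights" "4 5 5 1 4 3 3 5")"
```

For this weighting, Static scores 121 and Serverless 113, matching the intuition that a budget-constrained, spiky-traffic solo project belongs on a CDN-backed static host. Treat the result as a conversation starter, not a verdict.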
Migration Paths
Key Decision Questions
1. What's the expected traffic? (requests/sec, concurrent users)
2. What's the budget? (monthly hosting budget, one-time setup budget)
3. What's the team size? (who maintains this after launch?)
4. What's the SLA requirement? (99.9%? 99.99%? "best effort"?)
5. Are there compliance requirements? (GDPR, HIPAA, PCI-DSS, data residency)
6. How variable is the traffic? (steady vs. spiky vs. seasonal)
7. What's the time-to-market pressure? (launch in a week vs. 6 months)
8. What's the data sensitivity? (public content vs. financial/health data)
9. Do you need specific geographic presence? (latency requirements, legal)
10. What's the expected growth? (10x in a year? Stable? Unknown?)
Anti-Patterns to Avoid
- Premature Kubernetes: Using K8s for a single service with 100 users. Use a VPS.
- Premature microservices: Splitting a monolith before you understand the domain boundaries.
- Over-engineering for scale: Building for 1M users when you have 100. Optimize when you have data.
- Ignoring managed services: Running your own PostgreSQL/Redis/Elasticsearch when managed versions exist and your time is expensive.
- Vendor lock-in without awareness: Using proprietary services is fine — just know the exit cost.
- No staging environment: Testing in production is not a strategy.
- Manual deployments: If deploying requires SSH and running commands, you'll eventually make a mistake at 2 AM.
Chapter 20 — Real-World Scenarios
Let's apply everything to five concrete projects, showing how the decision framework leads to different architectures.
Scenario 1: Personal Technical Blog
| Aspect | Decision |
|---|---|
| Traffic | ~1000 visitors/day, spikes when posts hit HN/Reddit |
| Budget | $0-20/month |
| Team | Just you |
| SLA | Best effort (downtime is annoying, not catastrophic) |
Architecture
- Pattern: Static site (JAMstack)
- Hosting: Cloudflare Pages (free) or Netlify (free tier)
- CI/CD: GitHub Actions builds on push to main
- Cost: $0/month (domain: $10/year)
- Why: Zero server management, handles traffic spikes effortlessly (CDN), free
Scenario 2: Startup SaaS MVP
| Aspect | Decision |
|---|---|
| Traffic | ~500 users, growing. API-heavy (user auth, CRUD, real-time) |
| Budget | $50-200/month |
| Team | 2-3 developers |
| SLA | 99.9% (paying customers) |
Architecture
- Pattern: SPA + API backend (monolith)
- Hosting: PaaS (Railway/Render) for backend, Vercel for frontend
- Database: Managed PostgreSQL (Railway or Neon)
- CI/CD: Auto-deploy on push (PaaS built-in)
- Cost: ~$50-100/month
- Why: Maximum development speed, minimal ops, easy to iterate. Graduate to VPS/cloud when costs justify it.
Scenario 3: E-Commerce Site
| Aspect | Decision |
|---|---|
| Traffic | ~10K daily visitors, 5x spikes during sales |
| Budget | $200-500/month |
| Team | 3-5 developers + 1 DevOps |
| SLA | 99.95% (downtime = lost revenue) |
| Compliance | PCI-DSS (handling payments) |
Architecture
- Pattern: SSR monolith (Next.js or Django) with background workers
- Hosting: AWS ECS Fargate (containers without managing servers)
- Database: RDS PostgreSQL Multi-AZ (automatic failover)
- Payments: Stripe (PCI compliance handled by them)
- CI/CD: GitHub Actions → ECR → ECS rolling deployment
- Cost: ~$300-500/month
- Why: Auto-scaling for sales spikes, managed services reduce ops burden, Multi-AZ for reliability.
Scenario 4: Enterprise Internal Tool
| Aspect | Decision |
|---|---|
| Traffic | ~200 internal users, business hours only |
| Budget | $100-300/month |
| Team | 1-2 developers (part-time maintenance) |
| SLA | 99.5% (business hours) |
| Compliance | Data must stay in EU, SSO required |
Architecture
- Pattern: Monolith (Django/Rails/Next.js)
- Hosting: Single Hetzner VPS in EU (€20/mo for 4GB RAM)
- Auth: SAML/OIDC integration with corporate IdP
- CI/CD: GitHub Actions → Docker → SSH deploy
- Cost: ~€40/month (VPS + backups)
- Why: Simple, cheap, EU data residency, VPN restricts access. No need for cloud complexity for 200 users.
Scenario 5: High-Traffic Media/Content Site
| Aspect | Decision |
|---|---|
| Traffic | ~1M daily visitors, global audience, viral spikes |
| Budget | $2000-10000/month |
| Team | 8-15 developers, 2-3 SRE/DevOps |
| SLA | 99.99% (ad revenue depends on uptime) |
Architecture
- Pattern: Microservices (CMS, Auth, Search, Media Processing)
- Hosting: AWS EKS (Kubernetes) multi-region
- Database: Aurora Global Database (cross-region replication)
- CDN: Cloudflare + CloudFront (multi-layer caching)
- CI/CD: GitLab CI → ECR → ArgoCD (GitOps) → EKS
- Monitoring: Datadog or Prometheus + Grafana + PagerDuty
- Cost: ~$5000-8000/month
- Why: Global audience needs multi-region. Viral spikes need auto-scaling. Multiple teams need independent deployments (microservices). Revenue justifies the infrastructure investment.
Further Reading & Resources
Books
| Book | Author | Covers |
|---|---|---|
| The Phoenix Project | Gene Kim | DevOps culture, IT operations (novel format) |
| Site Reliability Engineering | Google (free online) | SRE practices, monitoring, incident response |
| Infrastructure as Code | Kief Morris | IaC principles, patterns, practices |
| Terraform: Up & Running | Yevgeniy Brikman | Practical Terraform (updated regularly) |
| Docker Deep Dive | Nigel Poulton | Docker from basics to production |
| Kubernetes in Action | Marko Lukša | K8s concepts and hands-on |
| Web Scalability for Startup Engineers | Artur Ejsmont | Scaling web apps pragmatically |
| Designing Data-Intensive Applications | Martin Kleppmann | Distributed systems, databases, architecture |
Official Documentation
- AWS Documentation — Comprehensive, with tutorials
- GCP Documentation — Clean, well-organized
- DigitalOcean Tutorials — Best beginner-friendly guides
- Nginx Documentation
- Caddy Documentation
- Terraform Documentation
- Ansible Documentation
- Kubernetes Documentation
- Prometheus Documentation
- Docker Documentation
Free Courses & Tutorials
- DigitalOcean Community Tutorials — Step-by-step server guides
- Google SRE Books (free) — SRE, workbook, building secure systems
- AWS Skill Builder — Free AWS training
- HashiCorp Learn — Terraform, Vault, Consul tutorials
- KillerCoda (ex-Katacoda) — Interactive Linux/Docker/K8s labs
- DevOps Exercises (GitHub) — Practice questions and scenarios
Tools Reference
| Category | Tools |
|---|---|
| Web Servers | Nginx, Caddy, Apache, Traefik |
| Containers | Docker, Podman, containerd |
| Orchestration | Kubernetes, Docker Swarm, Nomad |
| CI/CD | GitHub Actions, GitLab CI, Jenkins, CircleCI, ArgoCD |
| IaC | Terraform, Ansible, Pulumi, CloudFormation |
| Monitoring | Prometheus, Grafana, Datadog, New Relic |
| Logging | Loki, ELK Stack, Fluentd, Vector |
| Secrets | Vault, AWS Secrets Manager, SOPS, Doppler |
| DNS/CDN | Cloudflare, Route 53, Fastly |
| SSL | Let's Encrypt, Certbot, cert-manager (K8s) |
Communities
- /r/selfhosted — Self-hosting community
- /r/devops — DevOps discussions
- /r/sysadmin — System administration
- Hacker News — Tech discussions, deployment war stories
- DEV Community — Tutorials and discussions
Knowledge Check
🧠 Test Your Understanding
Q1: A startup with 2 developers needs to launch an MVP in 2 weeks. They expect ~200 users initially. What's the best hosting choice?
Q2: Your website serves static HTML/CSS/JS with no backend logic. What's the most cost-effective and performant hosting?
Q3: What does the "3-2-1 backup rule" mean?
Q4: When should you consider migrating from PaaS to VPS/IaaS?
Q5: What is the primary purpose of a reverse proxy?