Perf & Flame Graphs — Complete Learning Guide
From zero to production profiling on Kubernetes clusters. Written for developers with a C/telecom background.
Chapter 1: What is Profiling?
The Problem
Your microservice is using 80% CPU. Where is that time going? You have thousands of functions — which ones are hot? Profiling answers: "Where does my program spend its time?"
Profiling vs. Tracing vs. Logging
| Technique | Answers | Overhead | Granularity |
|---|---|---|---|
| Logging | What happened? | Variable | Per-event |
| Tracing | What path did a request take? | Low-Medium | Per-request |
| Profiling | Where is time/CPU spent? | Very Low (1-5%) | Statistical |
Two Approaches to Profiling
Instrumentation: Insert measurement code at function entry/exit. Accurate but high overhead, and the measurement itself changes program behavior (observer effect).
Sampling: Periodically interrupt the program and record the call stack. Low overhead, statistical accuracy. This is what perf does.
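To make the contrast concrete, here is a minimal sketch of both approaches, assuming gcc, gprof, and perf are installed and my_program.c stands in for your own code:
# Instrumentation: -pg compiles counting code into every function;
# gprof reads the counts the program writes to gmon.out at exit
gcc -pg -O2 -o my_program_instr my_program.c
./my_program_instr
gprof ./my_program_instr gmon.out | head
# Sampling: the binary is untouched; perf interrupts it ~99 times per second
gcc -O2 -g -fno-omit-frame-pointer -o my_program my_program.c
perf record -F 99 -g -- ./my_program
perf report --stdio | head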
Why This Matters for Telecom
In AMF/5GC services, you deal with high-throughput message processing (NAS, NGAP). A single hot function in the encoding/decoding path can dominate CPU. Profiling finds it in minutes instead of days of code review.
Chapter 2: The Linux perf Subsystem
What is perf?
perf is the official Linux kernel profiling tool. It's part of the kernel source tree (tools/perf/). It leverages hardware Performance Monitoring Units (PMUs) built into every modern CPU.
Architecture
Key Concepts
- PMU (Performance Monitoring Unit): Hardware counters in the CPU. Can count cycles, cache misses, branch mispredictions, etc.
- perf_event_open(): The syscall that configures what to monitor
- Ring buffer: Kernel writes samples here; userspace reads them with minimal overhead
- NMI (Non-Maskable Interrupt): Used for sampling — can't be blocked, so you get accurate samples even in interrupt-disabled code
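You can watch perf use this syscall directly; a quick sketch, assuming strace is installed:
# Trace only the perf_event_open() calls made while counting cycles for a
# trivial command; each line is one counter being configured in the kernel
strace -f -e trace=perf_event_open perf stat -e cycles -- true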
perf vs. Other Tools
| Tool | Mechanism | Pros | Cons |
|---|---|---|---|
| perf | PMU + kernel | Lowest overhead, kernel-integrated | Linux only, needs privileges |
| gprof | Instrumentation | Simple | High overhead, inaccurate |
| Valgrind/Callgrind | Emulation | Exact counts | 20-50x slowdown |
| DTrace/BPF | Dynamic tracing | Programmable | More complex |
perf is the gold standard. Its overhead is typically <2%.
Chapter 3: What Are Flame Graphs?
The Visualization Problem
perf report gives you a flat list or a tree. But with thousands of functions and deep call stacks, it's hard to see the big picture. Flame graphs solve this.
Invented by Brendan Gregg
Brendan Gregg (Netflix, now Intel) created flame graphs in 2011. They're now the standard way to visualize profiling data across all languages and platforms.
Anatomy of a Flame Graph
What to Look For
- Wide plateaus at the top: Functions that are themselves consuming CPU (leaf functions)
- Wide towers: Deep call stacks that are frequently hit
- Unexpected functions: Why is malloc() taking 15%? Memory allocation issue!
- Missing frames: Gaps in the stack usually mean missing debug symbols
Interactive SVG Flame Graphs
The standard output is an interactive SVG where you can:
- Click a frame to zoom into that subtree
- Hover to see exact sample count and percentage
- Search (Ctrl+F) to highlight matching functions
- Reset Zoom by clicking the bottom frame
Chapter 4: How Sampling Works
The Sampling Loop
When you run perf record, here's what happens at the hardware/kernel level:
- Kernel programs a PMU counter to overflow after N events (e.g., every 10,000,000 CPU cycles)
- When counter overflows → hardware generates an NMI (Non-Maskable Interrupt)
- NMI handler captures: instruction pointer (IP), full call stack, PID, TID, timestamp
- Sample is written to a per-CPU ring buffer
- The perf record process reads the ring buffer and writes samples to the perf.data file
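You can see those recorded fields for yourself; a small sketch against a throwaway CPU-bound workload:
perf record -F 99 -g -- dd if=/dev/zero of=/dev/null bs=1M count=4096
# Each sample block shows comm, PID/TID, timestamp, the sampled IP and the
# call stack that was captured
perf script | head -20
# Total number of samples that landed in perf.data
perf report --stdio 2>/dev/null | grep -m1 'Samples:'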
Sample Rate vs. Frequency
# Fixed frequency: 99 samples/second (the usual choice for flame graphs)
perf record -F 99 -g ./my_program
# Fixed period: sample every 1,000,000 cycles
perf record -c 1000000 -g ./my_program
Call Stack Unwinding
Getting the full call stack from a sample is non-trivial. Three methods:
| Method | Flag | Pros | Cons |
|---|---|---|---|
| Frame pointers | --call-graph fp | Fast, simple | Requires -fno-omit-frame-pointer |
| DWARF | --call-graph dwarf | Works without frame pointers | Larger perf.data, slower |
| LBR | --call-graph lbr | Hardware-assisted, fast | Limited stack depth (~32) |
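Frame-pointer unwinding only works if the code actually keeps a frame pointer register. A rough heuristic for checking an existing x86-64 binary (not definitive; individual functions can still differ):
# Count function prologues that save %rbp as a frame pointer
objdump -d ./my_program | grep -c 'push.*%rbp'
# A count near zero usually means the binary was built with
# -fomit-frame-pointer; fall back to --call-graph dwarf in that case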
Pro tip: add -fno-omit-frame-pointer to your CFLAGS. This gives you reliable stacks with minimal overhead. Many distros now enable this by default (Fedora 38+, Ubuntu 24.04+).
How Many Samples Do You Need?
Statistical rule of thumb:
- ~1000 samples: Functions using >5% CPU are visible
- ~10,000 samples: Functions using >1% CPU are visible
- ~100,000 samples: Fine-grained analysis possible
At 99 Hz, recording for 10 seconds gives ~990 samples per CPU. For a 4-core system under load, that's ~4000 samples — usually enough for a first look.
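The same rule of thumb as quick shell arithmetic, if you want to sanity-check a planned recording (numbers are illustrative):
FREQ=99; DUR=30; BUSY_CPUS=4
echo $(( FREQ * DUR * BUSY_CPUS ))   # ~11880 samples: enough to see >1% functions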
🧪 Quiz 1: Foundations (Chapters 1–4)
1. What type of profiling does perf use?
2. In a flame graph, what does the WIDTH of a box represent?
3. Why do profiling examples sample at 99 Hz instead of a round 100 Hz?
4. What compiler flag ensures reliable frame-pointer-based stack unwinding?
5. In a flame graph, where do you find the functions that are directly consuming CPU?
Chapter 5: Installing perf
On Your Linux Machine
# Ubuntu/Debian
sudo apt-get install linux-tools-common linux-tools-$(uname -r)
# RHEL/CentOS/Rocky
sudo yum install perf
# SUSE/SLES
sudo zypper install perf
# Verify installation
perf version
Version Matching
perf must match your kernel version. If you get "WARNING: perf not found for kernel X.Y.Z", install the matching linux-tools-X.Y.Z package.
Permissions
By default, non-root users have limited access. The kernel parameter perf_event_paranoid controls this:
# Check current setting
cat /proc/sys/kernel/perf_event_paranoid
# Values (each level adds restrictions for unprivileged users):
# -1 = No restrictions (allow everything)
#  0 = Disallow raw tracepoint / ftrace access
#  1 = Also disallow CPU-wide (system-wide) event access
#  2 = Also disallow kernel profiling (upstream default since kernel 4.6)
#  3 = Disallow perf_event_open() entirely (Debian/Android patch, hardened systems)
# Temporarily allow full access (requires root)
sudo sysctl -w kernel.perf_event_paranoid=-1
# Or run perf as root
sudo perf record ...
In Docker/Containers
Containers share the host kernel, so you need:
# Option 1: Run container with SYS_ADMIN capability
docker run --cap-add SYS_ADMIN ...
# Option 2: Run privileged (not recommended for production)
docker run --privileged ...
# Option 3: Specific perf capabilities
docker run --cap-add SYS_PTRACE --cap-add SYS_ADMIN \
--security-opt seccomp=unconfined ...
Chapter 6: perf stat — Counting Events
Your First perf Command
perf stat counts hardware events without recording samples. It's the simplest way to get a performance overview:
# Profile a command
perf stat ./my_program
# Profile a running process for 10 seconds
perf stat -p $(pidof my_service) sleep 10
# Example output:
Performance counter stats for './my_program':
12,453.21 msec task-clock # 3.892 CPUs utilized
14,221 context-switches # 1.142 K/sec
312 cpu-migrations # 25.054 /sec
45,678 page-faults # 3.667 K/sec
38,234,567,890 cycles # 3.069 GHz
21,456,789,012 instructions # 0.56 insn per cycle
3,234,567,890 branches # 259.711 M/sec
123,456,789 branch-misses # 3.82% of all branches
3.200123456 seconds time elapsed
Key Metrics to Understand
| Metric | What It Means | Good Value |
|---|---|---|
| instructions per cycle (IPC) | How efficiently the CPU executes | >1.0 is good, >2.0 is great |
| branch-misses % | CPU mispredicted branches | <5% is normal |
| cache-misses % | L1/LLC cache miss rate | Depends on workload |
| context-switches | Kernel preempted your thread | Lower is better for latency |
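If you want these numbers in a script rather than eyeballing the output, perf stat -x, prints machine-readable CSV. A sketch; the value,unit,event field order is an assumption about recent perf versions, so check your own output first:
PID=$(pidof my_service)
perf stat -x, -e cycles,instructions,branches,branch-misses -p "$PID" -- sleep 5 2> stat.csv
awk -F, '
  $3 == "cycles"        { cycles = $1 }
  $3 == "instructions"  { insns  = $1 }
  $3 == "branches"      { br     = $1 }
  $3 == "branch-misses" { brm    = $1 }
  END { printf "IPC: %.2f  branch-miss%%: %.2f\n", insns/cycles, 100*brm/br }
' stat.csv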
Specific Event Groups
# Cache analysis
perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses \
./my_program
# Memory bandwidth
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses \
-p $(pidof my_service) sleep 5
# List all available events
perf list
Pro tip: run perf stat before diving into flame graphs. If IPC is low (<0.5), you likely have a memory/cache problem. If IPC is high but throughput is low, you have an algorithmic problem. This guides where to look in the flame graph.
Chapter 7: perf record & perf report
Recording Samples
# Basic recording with call graphs (most common usage)
perf record -g -p $(pidof my_service) sleep 30
# Flags explained:
# -g Enable call-graph (stack) recording
# -p PID Attach to running process
# sleep 30 Record for 30 seconds then stop
# Output: perf.data (in current directory)
# Record a specific command from start to finish
perf record -g --call-graph dwarf -- ./my_program arg1 arg2
# Record all CPUs system-wide
sudo perf record -g -a sleep 10
# Higher frequency for short-lived programs
perf record -F 999 -g ./short_program
Important Flags
| Flag | Purpose | When to Use |
|---|---|---|
| -g | Record call stacks | Always (needed for flame graphs) |
| -F <hz> | Sampling frequency | Default 4000; use 99 for low overhead |
| --call-graph dwarf | DWARF unwinding | When frame pointers are missing |
| -p <pid> | Target process | Profiling a running service |
| -a | All CPUs | System-wide profiling |
| -o <file> | Output filename | When you want a specific name |
Analyzing with perf report
# Interactive TUI (terminal UI)
perf report
# Flat profile (no hierarchy)
perf report --stdio --sort=dso,symbol
# Accumulate callee overhead into callers (the 'Children' column)
perf report --children
# Filter to specific DSO (shared library)
perf report --dso=libmylib.so
Understanding perf report Output
# Example perf report --stdio output:
# Overhead Command Shared Object Symbol
# ........ ......... .................. .............................
23.45% my_service libprotobuf.so [.] google::protobuf::internal::WireFormat::ReadTag
12.34% my_service my_service [.] process_nas_message
8.76% my_service libc.so.6 [.] __memcpy_avx2
6.54% my_service my_service [.] encode_ngap_pdu
5.43% my_service [kernel.kallsyms] [k] copy_user_enhanced_fast_string
Column meanings:
- Overhead: Percentage of total samples in this function
- Command: Process name
- Shared Object: Binary/library containing the function
- [.] vs [k]: User-space vs kernel function
- Symbol: Function name (demangled for C++)
The perf.data file contains raw samples. You can copy it to another machine for analysis — but you'll need the same binaries with debug symbols for proper symbol resolution.
Chapter 8: perf Events Deep Dive
Event Types
perf can monitor many types of events beyond CPU cycles:
# List all available events
perf list
# Categories:
# 1. Hardware events (from PMU)
perf list hw
# cpu-cycles, instructions, cache-references, cache-misses,
# branch-instructions, branch-misses, bus-cycles
# 2. Software events (kernel counters)
perf list sw
# cpu-clock, task-clock, page-faults, context-switches,
# cpu-migrations, minor-faults, major-faults
# 3. Hardware cache events
perf list cache
# L1-dcache-loads, L1-dcache-load-misses,
# LLC-loads, LLC-load-misses, dTLB-loads, dTLB-load-misses
# 4. Tracepoints (kernel instrumentation points)
perf list tracepoint
# sched:sched_switch, syscalls:sys_enter_write, net:net_dev_xmit
# 5. Dynamic probes (you define them)
# uprobe: user-space function entry
# kprobe: kernel function entry
Profiling Different Bottlenecks
# CPU-bound: use default (cycles)
perf record -g -p $PID sleep 10
# Memory-bound: profile on cache misses
perf record -e cache-misses -g -p $PID sleep 10
# I/O-bound: profile on block device events
perf record -e block:block_rq_issue -g -a sleep 10
# Lock contention: profile on context switches
perf record -e context-switches -g -p $PID sleep 10
# Network: profile on network tracepoints
perf record -e net:net_dev_xmit -g -a sleep 10
Pro tip: if you sample on cache-misses instead of cycles, the resulting flame graph shows you where cache misses happen, not where time is spent. This is incredibly powerful for memory-bound workloads.
Chapter 9: Generating Flame Graphs
The Toolchain
Step-by-Step
# Step 0: Clone Brendan Gregg's FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph.git
cd FlameGraph
# Step 1: Record (already done)
perf record -F 99 -g -p $(pidof my_service) sleep 30
# Step 2: Convert perf.data to text
perf script > out.perf
# Step 3: Fold stacks (collapse identical stacks into counts)
./stackcollapse-perf.pl out.perf > out.folded
# Step 4: Generate SVG
./flamegraph.pl out.folded > flamegraph.svg
# ─── Or as a one-liner: ───
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flamegraph.svg
Customizing the Output
# Title and colors
./flamegraph.pl --title="My Service CPU Profile" \
--colors=java \
out.folded > flamegraph.svg
# Minimum width (hide tiny frames)
./flamegraph.pl --minwidth=0.5 out.folded > flamegraph.svg
# Reverse (icicle graph — root at top)
./flamegraph.pl --inverted out.folded > icicle.svg
# Count display (show sample counts, not percentages)
./flamegraph.pl --countname="samples" out.folded > flamegraph.svg
The Folded Stack Format
Understanding this intermediate format is key:
# Each line: semicolon-separated stack (bottom to top) followed by count
main;process_msg;decode_nas;asn1_parse 1234
main;process_msg;encode_ngap;alloc_buffer 567
main;handle_timer;check_expiry 89
main;idle_loop 4500
# This means:
# - 1234 samples had the stack: main → process_msg → decode_nas → asn1_parse
# - 567 samples had: main → process_msg → encode_ngap → alloc_buffer
# etc.
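Because the folded format is plain text, you can slice it with ordinary tools before rendering. A small sketch using the hypothetical stacks above:
# How many samples contain decode_nas anywhere in their stack?
grep 'decode_nas' out.folded | awk '{ sum += $NF } END { print sum }'
# Render only the process_msg subtree as its own flame graph
grep ';process_msg;' out.folded | ./flamegraph.pl > process_msg.svg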
Modern Alternatives
# Firefox Profiler (web-based, interactive)
perf script -F +pid > out.perf
# Upload to https://profiler.firefox.com/
# Speedscope (web-based)
# https://www.speedscope.app/ — drag and drop perf script output
# Hotspot (Qt GUI for Linux)
# https://github.com/KDAB/hotspot
hotspot perf.data
Chapter 10: Reading Flame Graphs — Practical Guide
The Mental Model
Think of a flame graph as an X-ray of your program's execution. Every pixel of width represents CPU time.
Reading Strategy
- Look at the top edge first. Wide plateaus at the top = hot leaf functions. These are your optimization targets.
- Look for unexpected width. Is malloc() 15% wide? That's a lot of allocation. Is memcpy() 10%? You're copying too much data.
- Trace down from hot spots. Click a hot function and look at its callers below. Who's calling it so much?
- Look for missing frames. Gaps or [unknown] frames mean missing debug symbols.
Example Analysis
Common Patterns
| Pattern | What It Looks Like | Likely Cause |
|---|---|---|
| Wide malloc/free | Allocation functions dominate top | Too many small allocations; use pools |
| Wide memcpy | memcpy/memmove at top | Unnecessary data copying |
| Wide lock functions | pthread_mutex_lock at top | Lock contention (but see off-CPU) |
| Kernel frames dominate | [kernel.kallsyms] everywhere | Syscall-heavy; reduce syscalls |
| Single tall tower | One deep narrow stack | Recursive function or deep call chain |
Chapter 11: Differential Flame Graphs
Comparing Before and After
You made a code change. Did it help? Differential flame graphs show the difference between two profiles.
# Record before
perf record -F 99 -g -p $PID -o before.data sleep 30
# ... make your code change, restart service ...
# Record after
perf record -F 99 -g -p $PID -o after.data sleep 30
# Generate folded stacks for both
perf script -i before.data | ./stackcollapse-perf.pl > before.folded
perf script -i after.data | ./stackcollapse-perf.pl > after.folded
# Generate differential flame graph
./difffolded.pl before.folded after.folded | ./flamegraph.pl > diff.svg
Reading Differential Flame Graphs
- Red frames: Increased in the "after" profile (regression)
- Blue frames: Decreased in the "after" profile (improvement)
- White/neutral: No significant change
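One caveat: code that exists only in the "before" profile has nowhere to be drawn when the "after" profile is used for the layout. The FlameGraph tools suggest also rendering the comparison the other way around; a sketch using the folded files from above:
# Swap the inputs and flip the hue meaning with --negate, so code that
# disappeared after the fix is still visible (shown in blue)
./difffolded.pl after.folded before.folded | ./flamegraph.pl --negate > diff_reverse.svg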
Chapter 12: Off-CPU Flame Graphs
The Missing Half
Standard perf record only captures where the CPU is running. But what about time spent:
- Waiting for a mutex/lock
- Sleeping on I/O (disk, network)
- Waiting for a condition variable
- Blocked on a futex
This is off-CPU time — and it's often where latency hides.
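A quick way to tell which half you are missing before reaching for the heavier tools below (a sketch assuming the sysstat package's pidstat; my_service is a placeholder):
# %usr + %system far below what the latency suggests => time is going off-CPU
pidstat -u -t -p "$(pidof my_service)" 1 5
# Rapidly climbing voluntary context switches (cswch/s) are another hint of blocking
pidstat -w -p "$(pidof my_service)" 1 5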
Capturing Off-CPU Data
# Method 1: perf with sched tracepoints
sudo perf record -e sched:sched_switch -g -p $PID sleep 30
sudo perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl --color=io --title="Off-CPU" out.folded > offcpu.svg
# Method 2: Using bpftrace (modern, lower overhead)
# A sketch modeled on bpftrace's offcputime tool; assumes a BTF-enabled kernel
# so the task_struct cast resolves (newer kernels may expose the probe as
# finish_task_switch.isra.0)
sudo bpftrace -e '
kprobe:finish_task_switch {
  // arg0 is the task that just went off-CPU: remember when it left
  $prev = (struct task_struct *)arg0;
  @start[$prev->pid] = nsecs;
  // the current task is back on-CPU: account how long it was blocked
  $last = @start[tid];
  if ($last != 0) {
    @offcpu_ns[kstack, ustack, comm] = sum(nsecs - $last);
    delete(@start[tid]);
  }
}'
# Prints per-stack sums on Ctrl-C; reshape into folded format before flamegraph.pl
# Method 3: Using BCC tools
sudo /usr/share/bcc/tools/offcputime -df -p $PID 30 > offcpu.folded
./flamegraph.pl --color=io offcpu.folded > offcpu.svg
When to Use Off-CPU Analysis
Reach for off-CPU analysis when latency is high but CPU utilization is low: the time is being spent blocked on locks, I/O, or the scheduler, which an on-CPU flame graph cannot show.
🧪 Quiz 2: perf Tool & Flame Graphs (Chapters 5–12)
1. What does perf stat do differently from perf record?
2. What is the correct pipeline to generate a flame graph from perf.data?
3. You see pthread_mutex_lock taking 30% in your on-CPU flame graph. What does this mean?
4. In a differential flame graph, what do RED frames indicate?
5. To profile cache misses specifically, which command would you use?
Chapter 13: Profiling in Containers
The Container Challenge
Containers add complexity to profiling because:
- The process runs in a different PID namespace
- Security profiles (seccomp, AppArmor) may block perf_event_open()
- The container filesystem doesn't have the host's kernel symbols
- Debug symbols may not be in the container image
How perf Works with Containers
Approach A: Profiling from the Host
# Find the container's PID on the host
docker inspect --format '{{.State.Pid}}' my_container
# Or for Kubernetes:
# Get the container ID, then use crictl
crictl inspect <container-id> | jq '.info.pid'
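# One-liner combining both steps (a sketch; assumes jq is installed, a single
# matching container, and a CRI runtime that exposes .info.pid, e.g. containerd)
PID=$(crictl inspect "$(crictl ps --name my-service -q)" | jq -r '.info.pid')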
# Profile using host PID
sudo perf record -g -p <host-pid> sleep 30
# Problem: symbols! perf looks for binaries in /proc/<pid>/root/
# which maps to the container's filesystem
# Solution: use --symfs or copy binaries out
Approach B: Profiling from Inside
# Dockerfile addition for profiling
FROM my-base-image
RUN apt-get update && apt-get install -y linux-tools-generic
# Or for Alpine: apk add perf
# Run with required capabilities
docker run --cap-add SYS_ADMIN \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
my_image
Chapter 14: Kubernetes Pod Profiling Setup
Option 1: Ephemeral Debug Container
Kubernetes 1.23+ supports ephemeral containers — temporary containers added to a running pod for debugging:
# Attach an ephemeral container with perf tools
kubectl debug -it pod/my-service-pod \
--image=ubuntu:22.04 \
--target=my-service-container \
-- bash
# Inside the ephemeral container:
apt-get update && apt-get install -y linux-tools-generic
# Now you share the PID namespace with the target container
perf record -g -p 1 sleep 30
This relies on the EphemeralContainers feature (on by default since K8s 1.23, GA in 1.25). --target only works when the container runtime supports process-namespace targeting; otherwise set shareProcessNamespace: true on the pod so you can see the target process.
Option 2: Privileged DaemonSet (Host Profiling)
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: perf-profiler
namespace: debug
spec:
selector:
matchLabels:
app: perf-profiler
template:
metadata:
labels:
app: perf-profiler
spec:
hostPID: true # See all host PIDs
hostNetwork: true
containers:
- name: profiler
image: ubuntu:22.04
command: ["sleep", "infinity"]
securityContext:
privileged: true # Full access to perf
volumeMounts:
- name: host-root
mountPath: /host
readOnly: true
- name: sys-kernel
mountPath: /sys/kernel
volumes:
- name: host-root
hostPath:
path: /
- name: sys-kernel
hostPath:
path: /sys/kernel
# Deploy and exec into it
kubectl apply -f perf-daemonset.yaml
kubectl exec -it -n debug perf-profiler-xxxxx -- bash
# Install perf
apt-get update && apt-get install -y linux-tools-generic linux-tools-$(uname -r)
# Find your target process (visible because hostPID: true)
ps aux | grep my_service
# Profile it
perf record -g -p <pid> -o /tmp/perf.data sleep 30
# Copy data out
kubectl cp debug/perf-profiler-xxxxx:/tmp/perf.data ./perf.data
Option 3: Sidecar Container
apiVersion: v1
kind: Pod
metadata:
name: my-service-profiled
spec:
shareProcessNamespace: true # Critical! Allows seeing other containers' processes
containers:
- name: my-service
image: my-service:latest
# ... normal config ...
- name: profiler
image: my-profiler-tools:latest # Image with perf + flamegraph tools
command: ["sleep", "infinity"]
securityContext:
capabilities:
add: ["SYS_ADMIN", "SYS_PTRACE"]
seccompProfile:
type: Unconfined
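Once the pod is running, you drive perf from the sidecar. A sketch using the names from the manifest above; note that with a shared process namespace the service is usually not PID 1 (the pause container is), so find its PID first:
# Find the service's PID as seen inside the shared namespace
kubectl exec -it my-service-profiled -c profiler -- ps -ef
# Record against that PID, then pull the data out
kubectl exec my-service-profiled -c profiler -- \
  perf record -F 99 -g -p <pid> -o /tmp/perf.data sleep 30
kubectl cp my-service-profiled:/tmp/perf.data ./perf.data -c profiler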
Security Considerations
| Capability | Why Needed | Risk |
|---|---|---|
| SYS_ADMIN | perf_event_open() syscall | High — broad capability |
| SYS_PTRACE | Reading other process memory/stacks | Medium |
| seccomp=unconfined | Default seccomp blocks perf syscalls | Medium |
| hostPID | See processes outside container | Medium |
Chapter 15: Collecting perf Data in Kubernetes
Complete Workflow: Pod → Flame Graph
# ═══════════════════════════════════════════════════════════════
# STEP 1: Identify the target pod and node
# ═══════════════════════════════════════════════════════════════
kubectl get pod my-service-pod -o wide
# Note the NODE column — you need to profile on that node
# ═══════════════════════════════════════════════════════════════
# STEP 2: Get the container's PID (from the node)
# ═══════════════════════════════════════════════════════════════
# Option A: If you have node access
ssh worker-node-1
# Find container ID
crictl ps | grep my-service
# Get PID from container ID
crictl inspect <container-id> | jq '.info.pid'
# Option B: From a privileged DaemonSet on that node
kubectl exec -it perf-profiler-xxxxx -- \
bash -c "ps aux | grep my_service | grep -v grep"
# ═══════════════════════════════════════════════════════════════
# STEP 3: Record profile
# ═══════════════════════════════════════════════════════════════
# From privileged DaemonSet or node:
perf record -F 99 -g -p <pid> -o /tmp/perf.data -- sleep 30
# Verify recording
perf report -i /tmp/perf.data --stdio | head -20
# ═══════════════════════════════════════════════════════════════
# STEP 4: Extract perf.data from the cluster
# ═══════════════════════════════════════════════════════════════
kubectl cp debug/perf-profiler-xxxxx:/tmp/perf.data ./perf.data
# ═══════════════════════════════════════════════════════════════
# STEP 5: Generate flame graph (on your workstation)
# ═══════════════════════════════════════════════════════════════
perf script -i perf.data | \
./FlameGraph/stackcollapse-perf.pl | \
./FlameGraph/flamegraph.pl --title="my-service K8s profile" \
> my-service-flamegraph.svg
# Open in browser
firefox my-service-flamegraph.svg
Handling Symbols in Containers
The biggest challenge: perf script needs the original binaries to resolve symbols.
# Problem: perf.data references /usr/bin/my_service inside the container
# but your workstation doesn't have that binary
# Solution 1: Copy the binary out of the container into a local symfs tree
mkdir -p ./symbols/usr/bin
kubectl cp my-pod:/usr/bin/my_service ./symbols/usr/bin/my_service
# Then use --symfs (paths under it must mirror the container's filesystem)
perf script -i perf.data --symfs=./symbols/ > out.perf
# Solution 2: Build a debug image with symbols
# In your Dockerfile, keep debug symbols:
# Don't strip! Or create a separate debug package
# Solution 3: Use buildid-based symbol resolution
# perf stores build-ids in perf.data
perf buildid-list -i perf.data
# Copy matching .debug files to ~/.debug/
Automation Script
#!/bin/bash
# profile-k8s-pod.sh — One-shot profiling of a K8s pod
set -e
POD_NAME=${1:?Usage: $0 <pod-name> [namespace] [duration]}
NAMESPACE=${2:-default}
DURATION=${3:-30}
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
OUTPUT="flamegraph_${POD_NAME}_${TIMESTAMP}.svg"
echo "==> Profiling pod $POD_NAME in namespace $NAMESPACE for ${DURATION}s"
# Create an ephemeral debug container and profile
# (-i without -t: no TTY, so the captured perf script output stays clean)
kubectl debug -i "pod/${POD_NAME}" \
-n "$NAMESPACE" \
--image=ubuntu:22.04 \
--target="$(kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].name}')" \
-- bash -c "
apt-get update -qq && apt-get install -y -qq linux-tools-generic > /dev/null 2>&1
perf record -F 99 -g -p 1 -o /tmp/perf.data sleep $DURATION
perf script -i /tmp/perf.data
" > /tmp/perf_script_output.txt
# Generate flame graph locally
cat /tmp/perf_script_output.txt | \
./FlameGraph/stackcollapse-perf.pl | \
./FlameGraph/flamegraph.pl --title="$POD_NAME ($TIMESTAMP)" > "$OUTPUT"
echo "==> Flame graph saved to: $OUTPUT"
Chapter 16: Continuous Profiling
Why Continuous?
Ad-hoc profiling catches problems you know about. Continuous profiling catches problems you don't — regressions, gradual degradation, rare spikes.
Tools for Kubernetes
| Tool | Type | Language Support | Storage |
|---|---|---|---|
| Parca | Open source | C/C++/Go/Rust/Java | Built-in |
| Pyroscope | Open source (Grafana) | Many | Object storage |
| Polar Signals | Commercial (Parca) | Many | Cloud |
| Google Cloud Profiler | Commercial | Many | GCP |
Parca — Open Source Continuous Profiling
# Install Parca Agent as DaemonSet (uses eBPF, no code changes needed)
kubectl apply -f https://github.com/parca-dev/parca-agent/releases/latest/download/kubernetes-manifest.yaml
# Install Parca Server
helm repo add parca https://parca-dev.github.io/helm-charts
helm install parca parca/parca
# Access the UI
kubectl port-forward svc/parca 7070:7070
# Open http://localhost:7070
🧪 Quiz 3: Kubernetes Profiling (Chapters 13–16)
1. What Linux capability is required for perf_event_open() in a container?
2. What pod spec field allows containers in the same pod to see each other's processes?
3. What is the main advantage of eBPF-based continuous profiling (Parca Agent)?
4. When profiling a container from the host, what is the main challenge?
5. Which Kubernetes feature (1.23+) lets you attach a temporary debug container to a running pod?
Chapter 17: Symbols & Debug Info
Why Symbols Matter
Without symbols, your flame graph shows hex addresses instead of function names:
# Without symbols:
0x7f3a2b4c5d6e;0x7f3a2b4c1234;0x55a1b2c3d4e5 42
# With symbols:
main;process_message;decode_nas_pdu 42
Types of Symbol Information
| Type | Contains | How to Get |
|---|---|---|
| Symbol table (.symtab) | Function names + addresses | Default in most builds |
| Dynamic symbols (.dynsym) | Exported function names | Always present in shared libs |
| DWARF debug info | Line numbers, variables, types | -g flag |
| Separate debug files (.debug) | Stripped debug info | objcopy --only-keep-debug |
Checking Symbol Availability
# Check if binary has symbols
file my_service
# "not stripped" = has symbols
# "stripped" = no symbols
# List symbols
nm my_service | head
# Or for dynamic symbols in shared libs:
nm -D /usr/lib/libmylib.so | grep my_function
# Check DWARF info
readelf --debug-dump=info my_service | head
# Check build-id (used by perf for symbol matching)
readelf -n my_service | grep "Build ID"
Getting Symbols for Profiling
# Method 1: Compile with debug info (best)
gcc -g -fno-omit-frame-pointer -O2 -o my_service my_service.c
# Method 2: Separate debug file (production-friendly)
gcc -g -O2 -o my_service my_service.c
objcopy --only-keep-debug my_service my_service.debug
strip my_service
objcopy --add-gnu-debuglink=my_service.debug my_service
# Method 3: Install debuginfo packages
# RHEL/CentOS:
debuginfo-install my-package
# Ubuntu/Debian:
apt-get install my-package-dbgsym
# Method 4: Use perf's buildid cache
perf buildid-cache --add my_service.debug
Symbol Resolution Path
perf looks for symbols in this order:
- The ~/.debug/ buildid cache
- /usr/lib/debug/ (system debug packages)
- The binary itself (if not stripped)
- The --symfs path (if specified)
- /proc/<pid>/root/ (the container's filesystem)
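To check where your samples are actually resolving, a quick sketch against an existing perf.data:
# Per-DSO breakdown: objects whose symbols show up only as hex addresses
# are the ones missing symbol information
perf report --stdio --sort dso,symbol -i perf.data | head -40
# Build-ids referenced by perf.data vs. what is in the local buildid cache
perf buildid-list -i perf.data
ls ~/.debug/.build-id/ 2>/dev/null | head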
Chapter 18: Profiling C/C++ Services
Compilation Flags for Profiling
# Recommended CFLAGS for profilable builds:
#   -O2                           keep optimizations (profile real behavior)
#   -g                            debug symbols (DWARF)
#   -fno-omit-frame-pointer       reliable stack unwinding
#   -mno-omit-leaf-frame-pointer  even leaf functions get frame pointers
CFLAGS = -O2 -g -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer
# In CMake:
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fno-omit-frame-pointer -g")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-omit-frame-pointer -g")
# In Makefile:
CFLAGS += -fno-omit-frame-pointer -g
Warning: do not profile a build compiled with -O0 (no optimization). The profile will be meaningless because the code structure is completely different from production. Always profile optimized code (-O2).
Common C/C++ Profiling Scenarios
Scenario 1: High CPU in Message Processing
# Record during load test
perf record -F 99 -g --call-graph fp -p $(pidof amf_service) sleep 60
# Generate flame graph
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg
# Look for:
# - Serialization/deserialization (protobuf, ASN.1)
# - String operations (strcmp, strlen in loops)
# - Memory allocation (malloc/free churn)
# - Logging (fprintf, snprintf in hot path)
Scenario 2: Memory Allocation Overhead
# Profile malloc/free specifically
perf probe -x /lib/x86_64-linux-gnu/libc.so.6 malloc
perf record -e probe_libc:malloc -g -p $PID sleep 10
# Or watch kernel-side allocations via the kmem tracepoint
perf record -e kmem:kmalloc -g -p $PID sleep 10
# Better: use tcmalloc/jemalloc profiling
LD_PRELOAD=/usr/lib/libtcmalloc.so HEAPPROFILE=/tmp/heap ./my_service
Scenario 3: Lock Contention
# On-CPU: see spinning
perf record -g -p $PID sleep 30
# Look for pthread_mutex_lock, __lll_lock_wait in flame graph
# Off-CPU: see blocking
sudo perf record -e sched:sched_switch -g -p $PID sleep 30
# Or with BCC:
sudo offcputime-bpfcc -df -p $PID 30 > offcpu.folded
Annotating Hot Functions
# See assembly-level profile for a specific function
perf annotate -i perf.data -s decode_nas_message
# Example output:
Percent │ Disassembly of decode_nas_message
─────────┼────────────────────────────────────────
│ push %rbp
│ mov %rsp,%rbp
2.34% │ mov (%rdi),%eax ← 2.34% of samples here
15.67% │ call asn1_decode_ie ← 15.67% here!
8.90% │ test %eax,%eax
│ je 0x4012a0
12.45% │ call validate_ie ← another hot spot
│ ...
perf annotate is incredibly powerful for C developers. It shows you exactly which instructions are hot, helping you understand if the bottleneck is a function call, a memory access, or a branch.
Chapter 19: Common Pitfalls
Pitfall 1: Missing Stacks / [unknown] Frames
# Symptom: flame graph shows [unknown] or very shallow stacks
# Cause: missing frame pointers or debug info
# Fix: recompile with frame pointers
gcc -fno-omit-frame-pointer -g -O2 ...
# Or use DWARF unwinding (slower but works without frame pointers)
perf record --call-graph dwarf -p $PID sleep 30
Pitfall 2: Profiling the Wrong Thing
# Symptom: flame graph looks "normal" but service is slow
# Cause: you're profiling on-CPU but the problem is off-CPU (I/O, locks)
# Fix: check if CPU is actually high
top -p $PID
# If CPU is low but latency is high → off-CPU profiling
Pitfall 3: Too Few Samples
# Symptom: flame graph is sparse, functions show 1-2 samples
# Cause: recording too short or frequency too low
# Fix: record longer or increase frequency
perf record -F 999 -g -p $PID sleep 60 # 999 Hz for 60 seconds
Pitfall 4: Kernel Symbols Missing
# Symptom: kernel frames show as hex addresses
# Cause: /proc/kallsyms is restricted
# Fix:
sudo sysctl -w kernel.kptr_restrict=0
# Or run perf as root
Pitfall 5: Container Symbol Mismatch
# Symptom: wrong function names or [unknown] in container profiles
# Cause: perf resolves symbols from host, not container filesystem
# Fix: point perf to container's root filesystem
perf script --symfs=/proc/<pid>/root/ > out.perf
# Or copy binaries and use local symfs
mkdir -p ./symfs/usr/bin
cp container_binary ./symfs/usr/bin/
perf script --symfs=./symfs/ > out.perf
Pitfall 6: Inlined Functions Disappear
# Symptom: you know function X is hot but it doesn't appear
# Cause: compiler inlined it — it's merged into the caller
# Fix: use DWARF info to recover inlined frames
perf script --inline > out.perf
# Or compile with -fno-inline for profiling (but changes behavior!)
Chapter 20: Real-World Workflow
The Complete Profiling Workflow
Step-by-Step: Profiling a Service in Your K8s Cluster
# ═══════════════════════════════════════════════════════════════
# 1. Identify the problem
# ═══════════════════════════════════════════════════════════════
# Check Grafana: which pod has high CPU?
# Or: kubectl top pods -n my-namespace
# ═══════════════════════════════════════════════════════════════
# 2. Quick sanity check with perf stat
# ═══════════════════════════════════════════════════════════════
kubectl exec -it profiler-pod -- perf stat -p $PID sleep 5
# Check IPC: low IPC = memory-bound, high IPC = compute-bound
# ═══════════════════════════════════════════════════════════════
# 3. Record profile
# ═══════════════════════════════════════════════════════════════
kubectl exec -it profiler-pod -- \
perf record -F 99 -g --call-graph fp -p $PID -o /tmp/perf.data sleep 30
# ═══════════════════════════════════════════════════════════════
# 4. Extract and generate flame graph
# ═══════════════════════════════════════════════════════════════
kubectl exec profiler-pod -- sh -c 'perf script -i /tmp/perf.data > /tmp/out.perf'
# On your workstation:
kubectl cp profiler-pod:/tmp/out.perf ./out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl --title="my-service $(date)" out.folded > flamegraph.svg
# ═══════════════════════════════════════════════════════════════
# 5. Analyze
# ═══════════════════════════════════════════════════════════════
# Open flamegraph.svg in browser
# Look for wide plateaus at the top
# Search for known hot functions (Ctrl+F in SVG)
# ═══════════════════════════════════════════════════════════════
# 6. After fixing — verify with differential flame graph
# ═══════════════════════════════════════════════════════════════
# Record again after fix
# Generate diff:
./difffolded.pl before.folded after.folded | \
./flamegraph.pl --title="Before vs After" > diff.svg
Quick Reference Card
# ─── ESSENTIAL COMMANDS ───────────────────────────────────────
# Quick CPU profile:
perf record -F 99 -g -p $PID sleep 30
# Generate flame graph:
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
# System-wide profile:
sudo perf record -F 99 -g -a sleep 10
# Profile with DWARF (no frame pointers):
perf record -F 99 --call-graph dwarf -p $PID sleep 30
# Off-CPU (what's blocking):
sudo offcputime-bpfcc -df -p $PID 30 > off.folded
# Cache miss profile:
perf record -e cache-misses -g -p $PID sleep 30
# Annotate hot function:
perf annotate -s hot_function_name
# ─── KUBERNETES ───────────────────────────────────────────────
# Debug a pod:
kubectl debug -it pod/NAME --image=ubuntu --target=CONTAINER -- bash
# Copy perf data out:
kubectl cp NAMESPACE/POD:/tmp/perf.data ./perf.data
alias flame='perf script | stackcollapse-perf.pl | flamegraph.pl > /tmp/flame.svg && xdg-open /tmp/flame.svg'
Further Reading
- Brendan Gregg's Flame Graph page — the definitive resource
- Brendan Gregg's perf page — comprehensive perf examples
- perf Wiki — official documentation
- FlameGraph GitHub repo — the tools
- Parca documentation — continuous profiling for K8s
- "Systems Performance" by Brendan Gregg (book) — the bible of Linux performance
🧪 Quiz 4: Advanced & Workflow (Chapters 17–20)
1. Why should you NOT profile with -O0 (no optimization)?
2. A function you know is hot doesn't appear in the flame graph. Most likely cause?
3. What does perf annotate show you?
4. Your service has high latency but LOW CPU usage. What should you do?
5. What is the purpose of --symfs in perf script?