Perf & Flame Graphs — Complete Learning Guide

From zero to production profiling on Kubernetes clusters. Written for developers with C/telecom background.

Chapter 1: What is Profiling?

The Problem

Your microservice is using 80% CPU. Where is that time going? You have thousands of functions — which ones are hot? Profiling answers: "Where does my program spend its time?"

Profiling vs. Tracing vs. Logging

Technique  | Answers                        | Overhead         | Granularity
-----------|--------------------------------|------------------|------------
Logging    | What happened?                 | Variable         | Per-event
Tracing    | What path did a request take?  | Low-Medium       | Per-request
Profiling  | Where is time/CPU spent?       | Very low (1-5%)  | Statistical

Two Approaches to Profiling

Instrumentation: Insert measurement code at function entry/exit. Accurate but high overhead. Changes program behavior (Heisenberg effect).

Sampling: Periodically interrupt the program and record the call stack. Low overhead, statistical accuracy. This is what perf does.

Sampling Profiling — Conceptual Model

Time ────────────────────────────────────────────────►
Program:  [func_A][func_B][func_A][func_C][func_A][func_A]
Samples:      ↑       ↑       ↑       ↑       ↑
             A(1)    B(1)    A(2)    C(1)    A(3)

Result:   func_A = 60%  (3/5 samples)
          func_B = 20%  (1/5 samples)
          func_C = 20%  (1/5 samples)
Sampling profiling is statistical. With enough samples (thousands), the results converge to the true distribution. The more samples, the more accurate.
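
You can see this convergence on a toy program whose CPU split you already know. A minimal sketch (the file spin.c, the function names, and the 70/30 split are made up for illustration; prefix perf with sudo if your kernel's perf_event_paranoid setting blocks unprivileged profiling, see Chapter 5):

# Toy program: hot() does ~70% of the work, warm() ~30%
cat > spin.c <<'EOF'
volatile unsigned long sink;
__attribute__((noinline)) void hot(void)  { for (unsigned long i = 0; i < 700000UL; i++) sink += i; }
__attribute__((noinline)) void warm(void) { for (unsigned long i = 0; i < 300000UL; i++) sink += i; }
int main(void) { for (int r = 0; r < 3000; r++) { hot(); warm(); } return 0; }
EOF
gcc -O2 -g -fno-omit-frame-pointer -o spin spin.c

# Sample it and check that the measured split converges on roughly 70/30
perf record -F 99 -g -- ./spin
perf report --stdio | head -20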

Why This Matters for Telecom

In AMF/5GC services, you deal with high-throughput message processing (NAS, NGAP). A single hot function in the encoding/decoding path can dominate CPU. Profiling finds it in minutes instead of days of code review.

Chapter 2: The Linux perf Subsystem

What is perf?

perf is the official Linux kernel profiling tool. It's part of the kernel source tree (tools/perf/). It leverages hardware Performance Monitoring Units (PMUs) built into every modern CPU.

Architecture

User space:    perf CLI  (perf record, perf report, perf stat, ...)
                   │
                   │  perf_event_open() syscall
                   ▼
Kernel space:  perf_events subsystem (kernel/events/core.c)
                   │
                   │  per-CPU ring buffers (perf record reads these and writes perf.data)
                   ▼
Hardware:      PMU (Performance Monitoring Unit)
               - cycle counter
               - cache-miss counter
               - branch-mispredict counter

Key Concepts

Event: anything perf can count or sample, such as a hardware counter (cycles, cache misses), a software counter (page faults, context switches), or a kernel tracepoint.

Counting vs. sampling: perf stat counts how often events occur over a run; perf record additionally samples the call stack every N events.

perf_event_open(): the single syscall behind everything perf does.

Ring buffer: per-CPU buffers where the kernel deposits samples for perf record to read and write out to perf.data.

perf vs. Other Tools

Tool               | Mechanism       | Pros                               | Cons
-------------------|-----------------|------------------------------------|------------------------------
perf               | PMU + kernel    | Lowest overhead, kernel-integrated | Linux only, needs privileges
gprof              | Instrumentation | Simple                             | High overhead, inaccurate
Valgrind/Callgrind | Emulation       | Exact counts                       | 20-50x slowdown
DTrace/BPF         | Dynamic tracing | Programmable                       | More complex
For production profiling of C/C++ services in containers, perf is the gold standard. Its overhead is typically <2%.
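
That overhead claim is easy to sanity-check on your own workload: time a run with and without perf attached. A minimal sketch, where ./bench stands in for any repeatable CPU-bound command of yours:

# Baseline
time ./bench

# Same run, sampled at 99 Hz with call graphs
time perf record -F 99 -g -o /tmp/overhead.data -- ./bench
rm -f /tmp/overhead.data

# Compare the two wall-clock times; at 99 Hz the difference is normally a few percent at most.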

Chapter 3: What Are Flame Graphs?

The Visualization Problem

perf report gives you a flat list or a tree. But with thousands of functions and deep call stacks, it's hard to see the big picture. Flame graphs solve this.

Invented by Brendan Gregg

Brendan Gregg (Netflix, now Intel) created flame graphs in 2011. They're now the standard way to visualize profiling data across all languages and platforms.

Anatomy of a Flame Graph

How to Read a Flame Graph

┌──────────────────────────────────────────────────────┐
│                        main()                        │  ← Bottom: entry point
├────────────────────────────┬─────────────────────────┤  ← Callees sit above callers
│       process_msg()        │     handle_timer()      │
├─────────────┬──────────────┼─────────────────────────┤
│  decode()   │   encode()   │     check_expiry()      │
├─────────────┼──────┬───────┤                         │
│ asn1_parse  │alloc │ copy  │                         │  ← Top: leaf functions
└─────────────┴──────┴───────┴─────────────────────────┘     (where CPU burns)
 ◄──────────────── WIDTH = TIME (% of samples) ────────────────►

KEY RULES:
• Y-axis = stack depth (bottom = root, top = leaf)
• X-axis = population (NOT time order!) — frames are sorted alphabetically
• Width of a box = % of total samples containing that function
• Color = random (or can encode: red = user, orange = kernel, etc.)
• A wide box at the TOP = that function itself is hot
• A wide box at the BOTTOM = many things are called through it

What to Look For

  1. Wide plateaus at the top: Functions that are themselves consuming CPU (leaf functions)
  2. Wide towers: Deep call stacks that are frequently hit
  3. Unexpected functions: Why is malloc() taking 15%? Memory allocation issue!
  4. Missing frames: Gaps in the stack usually mean missing debug symbols
The X-axis is NOT a timeline! Frames are sorted alphabetically. Don't read left-to-right as "first this, then that." Width is the only meaningful horizontal metric.

Interactive SVG Flame Graphs

The standard output is an interactive SVG where you can:

Click any frame to zoom into that subtree (click the root frame or "Reset Zoom" to zoom back out)
Hover over a frame to see the function name, sample count, and percentage
Search (Ctrl+F) to highlight every frame matching a pattern, with the combined percentage shown in the corner

Chapter 4: How Sampling Works

The Sampling Loop

When you run perf record, here's what happens at the hardware/kernel level:

  1. Kernel programs a PMU counter to overflow after N events (e.g., every 10,000,000 CPU cycles)
  2. When counter overflows → hardware generates an NMI (Non-Maskable Interrupt)
  3. NMI handler captures: instruction pointer (IP), full call stack, PID, TID, timestamp
  4. Sample is written to a per-CPU ring buffer
  5. perf record process reads ring buffer and writes to perf.data file

Sample Rate vs. Frequency

# Fixed frequency: 99 samples/second (note: perf's built-in default is 4000 Hz)
perf record -F 99 -g ./my_program

# Fixed period: sample every 1,000,000 cycles
perf record -c 1000000 -g ./my_program
Why 99 Hz and not 100 Hz? To avoid lockstep sampling: if your program has a 100 Hz timer, sampling at 100 Hz would always hit the same point in that cycle. An offset frequency such as 99 Hz (or 999 Hz for finer profiles) drifts relative to such timers and avoids this aliasing.
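
Two related kernel knobs are worth knowing: the kernel caps the sampling rate, and it silently lowers that cap if perf's interrupts start consuming too much CPU (you will see "lowering kernel.perf_event_max_sample_rate" warnings in dmesg). A quick check, with an optional override:

# Current ceiling on the sampling rate (samples per second)
cat /proc/sys/kernel/perf_event_max_sample_rate

# How much CPU time perf's sampling interrupts may consume (percent)
cat /proc/sys/kernel/perf_event_cpu_time_max_percent

# Raise the ceiling temporarily if you really need a high -F value (requires root)
sudo sysctl -w kernel.perf_event_max_sample_rate=100000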

Call Stack Unwinding

Getting the full call stack from a sample is non-trivial. Three methods:

Method         | Flag               | Pros                         | Cons
---------------|--------------------|------------------------------|----------------------------------
Frame pointers | --call-graph fp    | Fast, simple                 | Requires -fno-omit-frame-pointer
DWARF          | --call-graph dwarf | Works without frame pointers | Larger perf.data, slower
LBR            | --call-graph lbr   | Hardware-assisted, fast      | Limited stack depth (~32)
For C programs compiled with GCC, add -fno-omit-frame-pointer to CFLAGS. This gives you perfect stacks with minimal overhead. Many distros now enable this by default (Fedora 38+, Ubuntu 24.04+).

How Many Samples Do You Need?

Statistical rule of thumb: aim for at least a few thousand samples overall, and remember that a function's percentage is only trustworthy once it has a reasonable number of its own samples (a 1% function in a 1,000-sample profile is only ~10 samples).

At 99 Hz, recording for 10 seconds gives ~990 samples per CPU. For a 4-core system under load, that's ~4000 samples — usually enough for a first look.
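
The same arithmetic, plus a quick way to confirm how many samples a recording actually captured before you spend time on a flame graph (a sketch; my_service is a placeholder):

# Rule-of-thumb math: 99 Hz x 10 s x 4 busy CPUs
echo $((99 * 10 * 4))        # ≈ 3960 samples

# After recording, check the sample count in the report header
perf record -F 99 -g -p $(pidof my_service) -- sleep 10
perf report --stdio -i perf.data | grep -m1 'Samples:'
# e.g. "# Samples: 4K of event 'cycles'". Only a few hundred? Record longer.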

🧪 Quiz 1: Foundations (Chapters 1–4)

1. What type of profiling does perf use?

2. In a flame graph, what does the WIDTH of a box represent?

3. Why is the default sampling frequency 99 Hz instead of 100 Hz?

4. What compiler flag ensures reliable frame-pointer-based stack unwinding?

5. In a flame graph, where do you find the functions that are directly consuming CPU?

Chapter 5: Installing perf

On Your Linux Machine

# Ubuntu/Debian
sudo apt-get install linux-tools-common linux-tools-$(uname -r)

# RHEL/CentOS/Rocky
sudo yum install perf

# SUSE/SLES
sudo zypper install perf

# Verify installation
perf version

Version Matching

On Debian/Ubuntu, /usr/bin/perf is a wrapper that expects a linux-tools package matching your running kernel. If you get "WARNING: perf not found for kernel X.Y.Z", install the matching linux-tools-X.Y.Z package. Most other distros ship a single perf package that keeps working across kernel updates.

Permissions

By default, non-root users have limited access. The kernel parameter perf_event_paranoid controls this:

# Check current setting
cat /proc/sys/kernel/perf_event_paranoid

# Values (each level adds restrictions for unprivileged users):
#  -1 = No restrictions (allow almost everything)
#   0 = Disallow raw tracepoint and ftrace function tracepoint access
#   1 = Also disallow CPU-wide (system-wide) event access
#   2 = Also disallow kernel profiling (upstream kernel default)
#   3+ = Disallow perf_event_open() entirely
#        (hardening patch carried by some kernels, e.g. Android and some Debian-based distros)

# Temporarily allow full access (requires root)
sudo sysctl -w kernel.perf_event_paranoid=-1

# Or run perf as root
sudo perf record ...
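
sysctl -w only lasts until reboot. To keep a relaxed setting across reboots, drop it into /etc/sysctl.d (a sketch; pick the least permissive value that works for your use case, e.g. 1 rather than -1):

echo 'kernel.perf_event_paranoid = 1' | sudo tee /etc/sysctl.d/99-perf.conf
sudo sysctl --system    # reload sysctl configuration from all files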

In Docker/Containers

Containers share the host kernel, so you need:

# Option 1: Run container with SYS_ADMIN capability
docker run --cap-add SYS_ADMIN ...

# Option 2: Run privileged (not recommended for production)
docker run --privileged ...

# Option 3: Specific perf capabilities
docker run --cap-add SYS_PTRACE --cap-add SYS_ADMIN \
           --security-opt seccomp=unconfined ...

Chapter 6: perf stat — Counting Events

Your First perf Command

perf stat counts hardware events without recording samples. It's the simplest way to get a performance overview:

# Profile a command
perf stat ./my_program

# Profile a running process for 10 seconds
perf stat -p $(pidof my_service) sleep 10

# Example output:
 Performance counter stats for './my_program':

         12,453.21 msec  task-clock                #    3.892 CPUs utilized
            14,221       context-switches          #    1.142 K/sec
               312       cpu-migrations            #   25.054 /sec
            45,678       page-faults               #    3.667 K/sec
    38,234,567,890       cycles                    #    3.069 GHz
    21,456,789,012       instructions              #    0.56  insn per cycle
     3,234,567,890       branches                  #  259.711 M/sec
       123,456,789       branch-misses             #    3.82% of all branches

       3.200123456 seconds time elapsed

Key Metrics to Understand

Metric                       | What It Means                    | Good Value
-----------------------------|----------------------------------|-----------------------------
instructions per cycle (IPC) | How efficiently the CPU executes | >1.0 is good, >2.0 is great
branch-misses %              | CPU mispredicted branches        | <5% is normal
cache-misses %               | L1/LLC cache miss rate           | Depends on workload
context-switches             | Kernel preempted your thread     | Lower is better for latency

Specific Event Groups

# Cache analysis
perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses \
    ./my_program

# Memory bandwidth
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses \
    -p $(pidof my_service) sleep 5

# List all available events
perf list
Start with perf stat before diving into flame graphs. If IPC is low (<0.5), you likely have a memory/cache problem. If IPC is high but throughput is low, you have an algorithmic problem. This guides where to look in the flame graph.
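
That triage is easy to script. A minimal sketch that pulls IPC out of perf stat's machine-readable output (-x, selects CSV mode; the 0.5 threshold is the same rough cut-off as above):

#!/bin/bash
# ipc-check.sh - quick IPC triage for a running process (sketch)
PID=${1:?usage: $0 <pid>}

# CSV mode (-x,) prints "value,unit,event,..." lines on stderr
STATS=$(perf stat -x, -e instructions,cycles -p "$PID" -- sleep 5 2>&1)
INSNS=$(echo "$STATS"  | awk -F, '/,instructions/ {print $1}')
CYCLES=$(echo "$STATS" | awk -F, '/,cycles/       {print $1}')

IPC=$(awk -v i="$INSNS" -v c="$CYCLES" 'BEGIN { printf "%.2f", i / c }')
echo "IPC = $IPC"
if awk -v ipc="$IPC" 'BEGIN { exit !(ipc < 0.5) }'; then
  echo "Low IPC: likely memory/cache bound; try a cache-miss profile (Chapter 8)"
else
  echo "Decent IPC: likely compute bound; start with a CPU flame graph (Chapter 9)"
fi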

Chapter 7: perf record & perf report

Recording Samples

# Basic recording with call graphs (most common usage)
perf record -g -p $(pidof my_service) sleep 30

# Flags explained:
#   -g              Enable call-graph (stack) recording
#   -p PID          Attach to running process
#   sleep 30        Record for 30 seconds then stop
#   Output: perf.data (in current directory)

# Record a specific command from start to finish
perf record -g --call-graph dwarf -- ./my_program    # "--" separates perf options from the command and its args

# Record all CPUs system-wide
sudo perf record -g -a sleep 10

# Higher frequency for short-lived programs
perf record -F 999 -g ./short_program

Important Flags

Flag               | Purpose            | When to Use
-------------------|--------------------|---------------------------------------
-g                 | Record call stacks | Always (needed for flame graphs)
-F <hz>            | Sampling frequency | Default 4000; use 99 for low overhead
--call-graph dwarf | DWARF unwinding    | When frame pointers are missing
-p <pid>           | Target process     | Profiling a running service
-a                 | All CPUs           | System-wide profiling
-o <file>          | Output filename    | When you want a specific name

Analyzing with perf report

# Interactive TUI (terminal UI)
perf report

# Flat profile (no hierarchy)
perf report --stdio --sort=dso,symbol

# Accumulate callees' overhead into their callers (the "children" view, default in newer perf)
perf report --children

# Filter to specific DSO (shared library)
perf report --dso=libmylib.so

Understanding perf report Output

# Example perf report --stdio output:
# Overhead  Command    Shared Object       Symbol
# ........  .........  ..................  .............................
    23.45%  my_service libprotobuf.so      [.] google::protobuf::internal::WireFormat::ReadTag
    12.34%  my_service my_service          [.] process_nas_message
     8.76%  my_service libc.so.6           [.] __memcpy_avx2
     6.54%  my_service my_service          [.] encode_ngap_pdu
     5.43%  my_service [kernel.kallsyms]   [k] copy_user_enhanced_fast_string

Column meanings:

Overhead: the percentage of all samples that fell in this symbol
Command: the process (thread) name
Shared Object: the binary or shared library the sample was resolved to
Symbol: the function name; [.] marks user space, [k] marks kernel space

The perf.data file contains raw samples. You can copy it to another machine for analysis — but you'll need the same binaries with debug symbols for proper symbol resolution.
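
perf ships a helper for exactly this hand-off: perf archive bundles the binaries and build-ids referenced by perf.data so another machine can resolve symbols from its build-id cache. A sketch (hostnames are placeholders):

# On the machine where you recorded:
perf archive perf.data          # creates perf.data.tar.bz2 next to perf.data

# On your analysis workstation:
scp target-host:perf.data target-host:perf.data.tar.bz2 .
mkdir -p ~/.debug
tar xf perf.data.tar.bz2 -C ~/.debug
perf report -i perf.data        # symbols now resolve via the build-id cache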

Chapter 8: perf Events Deep Dive

Event Types

perf can monitor many types of events beyond CPU cycles:

# List all available events
perf list

# Categories:
# 1. Hardware events (from PMU)
perf list hw
#    cpu-cycles, instructions, cache-references, cache-misses,
#    branch-instructions, branch-misses, bus-cycles

# 2. Software events (kernel counters)
perf list sw
#    cpu-clock, task-clock, page-faults, context-switches,
#    cpu-migrations, minor-faults, major-faults

# 3. Hardware cache events
perf list cache
#    L1-dcache-loads, L1-dcache-load-misses,
#    LLC-loads, LLC-load-misses, dTLB-loads, dTLB-load-misses

# 4. Tracepoints (kernel instrumentation points)
perf list tracepoint
#    sched:sched_switch, syscalls:sys_enter_write, net:net_dev_xmit

# 5. Dynamic probes (you define them)
#    uprobe: user-space function entry
#    kprobe: kernel function entry

Profiling Different Bottlenecks

# CPU-bound: use default (cycles)
perf record -g -p $PID sleep 10

# Memory-bound: profile on cache misses
perf record -e cache-misses -g -p $PID sleep 10

# I/O-bound: profile on block device events
perf record -e block:block_rq_issue -g -a sleep 10

# Lock contention: profile on context switches
perf record -e context-switches -g -p $PID sleep 10

# Network: profile on network tracepoints
perf record -e net:net_dev_xmit -g -a sleep 10
When you profile on cache-misses instead of cycles, the resulting flame graph shows you where cache misses happen — not where time is spent. This is incredibly powerful for memory-bound workloads.
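
Combined with the flame graph pipeline from the next chapter, a cache-miss flame graph is the same three commands with a different event ("mem" is one of flamegraph.pl's built-in palettes):

perf record -e cache-misses -F 99 -g -p $PID -- sleep 30
perf script | ./FlameGraph/stackcollapse-perf.pl | \
  ./FlameGraph/flamegraph.pl --title="cache-miss flame graph" --colors=mem > cachemiss.svg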

Chapter 9: Generating Flame Graphs

The Toolchain

perf record            →  perf.data        (raw samples)
perf script            →  text output      (one stack trace per sample)
stackcollapse-perf.pl  →  folded stacks    (one line per unique stack, with a count)
flamegraph.pl          →  flamegraph.svg   (interactive SVG)

Step-by-Step

# Step 0: Clone Brendan Gregg's FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph.git
cd FlameGraph

# Step 1: Record (already done)
perf record -F 99 -g -p $(pidof my_service) sleep 30

# Step 2: Convert perf.data to text
perf script > out.perf

# Step 3: Fold stacks (collapse identical stacks into counts)
./stackcollapse-perf.pl out.perf > out.folded

# Step 4: Generate SVG
./flamegraph.pl out.folded > flamegraph.svg

# ─── Or as a one-liner: ───
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flamegraph.svg

Customizing the Output

# Title and colors
./flamegraph.pl --title="My Service CPU Profile" \
                --colors=java \
                out.folded > flamegraph.svg

# Minimum width (hide tiny frames)
./flamegraph.pl --minwidth=0.5 out.folded > flamegraph.svg

# Reverse (icicle graph — root at top)
./flamegraph.pl --inverted out.folded > icicle.svg

# Count display (show sample counts, not percentages)
./flamegraph.pl --countname="samples" out.folded > flamegraph.svg

The Folded Stack Format

Understanding this intermediate format is key:

# Each line: semicolon-separated stack (bottom to top) followed by count
main;process_msg;decode_nas;asn1_parse 1234
main;process_msg;encode_ngap;alloc_buffer 567
main;handle_timer;check_expiry 89
main;idle_loop 4500

# This means:
# - 1234 samples had the stack: main → process_msg → decode_nas → asn1_parse
# - 567 samples had: main → process_msg → encode_ngap → alloc_buffer
# etc.
You can manually create or edit folded stack files! This is useful for combining profiles, filtering, or creating synthetic test data.
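
A couple of sketches of that kind of manipulation using only grep and awk (file names are examples):

# Keep only stacks that pass through process_msg, to focus the graph on one path
grep ';process_msg;' out.folded > process_msg.folded
./flamegraph.pl process_msg.folded > process_msg.svg

# Merge two folded files by summing the counts of identical stacks
awk '{ n = $NF; sub(/ [0-9]+$/, ""); sum[$0] += n }
     END { for (s in sum) print s, sum[s] }' run1.folded run2.folded > merged.folded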

Modern Alternatives

# Firefox Profiler (web-based, interactive)
perf script -F +pid > out.perf
# Upload to https://profiler.firefox.com/

# Speedscope (web-based)
# https://www.speedscope.app/ — drag and drop perf script output

# Hotspot (Qt GUI for Linux)
# https://github.com/KDAB/hotspot
hotspot perf.data

Chapter 10: Reading Flame Graphs — Practical Guide

The Mental Model

Think of a flame graph as an X-ray of your program's execution. Every pixel of width represents CPU time.

Reading Strategy

  1. Look at the top edge first. Wide plateaus at the top = hot leaf functions. These are your optimization targets.
  2. Look for unexpected width. Is malloc() 15% wide? That's a lot of allocation. Is memcpy() 10%? You're copying too much data.
  3. Trace down from hot spots. Click a hot function and look at its callers below. Who's calling it so much?
  4. Look for missing frames. Gaps or [unknown] frames mean missing debug symbols.

Example Analysis

Example: A 5GC NAS message handler (widths = % of samples)

main()
├── event_loop()
│   └── handle_nas_msg()
│       ├── decode_nas()
│       │   ├── asn1_dec    15%
│       │   └── malloc      20%
│       └── send_response()
│           ├── proto       10%
│           └── tcp_send()   5%
└── signal_handler()        (unexpectedly wide)

DIAGNOSIS:
• malloc is 20% — excessive memory allocation in the decode path
• asn1_dec is 15% — expected for NAS decoding, but check if it's re-parsing
• signal_handler is wide — unexpected! Why is signal handling taking time?

Common Patterns

Pattern                | What It Looks Like                | Likely Cause
-----------------------|-----------------------------------|---------------------------------------
Wide malloc/free       | Allocation functions dominate top | Too many small allocations; use pools
Wide memcpy            | memcpy/memmove at top             | Unnecessary data copying
Wide lock functions    | pthread_mutex_lock at top         | Lock contention (but see off-CPU)
Kernel frames dominate | [kernel.kallsyms] everywhere      | Syscall-heavy; reduce syscalls
Single tall tower      | One deep narrow stack             | Recursive function or deep call chain
Lock contention trap: If threads are waiting on a lock, they're OFF-CPU and won't appear in a standard (on-CPU) flame graph! You need an off-CPU flame graph (Chapter 12) to see blocking.

Chapter 11: Differential Flame Graphs

Comparing Before and After

You made a code change. Did it help? Differential flame graphs show the difference between two profiles.

# Record before
perf record -F 99 -g -p $PID -o before.data sleep 30

# ... make your code change, restart service ...

# Record after
perf record -F 99 -g -p $PID -o after.data sleep 30

# Generate folded stacks for both
perf script -i before.data | ./stackcollapse-perf.pl > before.folded
perf script -i after.data  | ./stackcollapse-perf.pl > after.folded

# Generate differential flame graph
./difffolded.pl before.folded after.folded | ./flamegraph.pl > diff.svg

Reading Differential Flame Graphs

The graph keeps the frame layout of the "after" profile and colors each frame by its change: red means the frame gained samples (hotter after the change), blue means it lost samples. Run the diff in the other direction (or use flamegraph.pl --negate) to highlight code paths you eliminated. Differential flame graphs are perfect for code review. Attach one to your Gerrit change to show the performance impact of your optimization.

Chapter 12: Off-CPU Flame Graphs

The Missing Half

Standard perf record only captures where the CPU is running. But what about time spent:

Blocked on a mutex or futex
Sleeping in read()/write() or other blocking I/O
Waiting in poll()/epoll_wait() for work to arrive
Waiting in the run queue for the scheduler to run the thread again

This is off-CPU time — and it's often where latency hides.

Capturing Off-CPU Data

# Method 1: perf with sched tracepoints
sudo perf record -e sched:sched_switch -g -p $PID sleep 30
sudo perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl --color=io --title="Off-CPU" out.folded > offcpu.svg

# Method 2: Using bpftrace (modern, lower overhead)
# Simplified sketch of the offcputime.bt idea: timestamp a thread when it is
# switched out, and charge the elapsed time to its stacks when it runs again.
# (On some kernels the symbol is finish_task_switch.isra.0.)
sudo bpftrace -e '
  kprobe:finish_task_switch {
    // thread being switched out: record when it went off-CPU
    $prev = (struct task_struct *)arg0;
    @start[$prev->pid] = nsecs;

    // thread coming back on-CPU: sum how long it was off
    if (@start[tid]) {
      @offcpu_ns[kstack, ustack, comm] = sum(nsecs - @start[tid]);
      delete(@start[tid]);
    }
  }'
# Ctrl-C prints the aggregated @offcpu_ns map (see offcputime.bt in the bpftrace repo for the full tool)

# Method 3: Using BCC tools
sudo /usr/share/bcc/tools/offcputime -df -p $PID 30 > offcpu.folded
./flamegraph.pl --color=io offcpu.folded > offcpu.svg

When to Use Off-CPU Analysis

Decision Tree: On-CPU vs Off-CPU

Is your service using high CPU?
│
├── YES → On-CPU flame graph (standard perf record)
│         "Where is CPU time going?"
│
└── NO, but latency is high → Off-CPU flame graph
          "Where is the service WAITING?"
          │
          ├── Waiting on mutex      → lock contention
          ├── Waiting on read()     → I/O bottleneck
          ├── Waiting on futex      → thread synchronization
          └── Waiting on poll/epoll → no work available

🧪 Quiz 2: perf Tool & Flame Graphs (Chapters 5–12)

1. What does perf stat do differently from perf record?

2. What is the correct pipeline to generate a flame graph from perf.data?

3. You see pthread_mutex_lock taking 30% in your on-CPU flame graph. What does this mean?

4. In a differential flame graph, what do RED frames indicate?

5. To profile cache misses specifically, which command would you use?

Chapter 13: Profiling in Containers

The Container Challenge

Containers add complexity to profiling because:

The process runs in its own PID namespace, so its PID inside the container differs from its PID on the host
The binaries and libraries (and their symbols) live in the container's filesystem, which the host's perf does not search by default
Seccomp profiles and dropped capabilities usually block the perf_event_open() syscall
Minimal container images rarely ship the perf binary at all

How perf Works with Containers

Container Profiling — Two Approaches

Approach A: profile FROM THE HOST
  perf runs on the host and attaches to the container's process by its host PID
  (e.g. PID 12345 on the host, even though the same process is PID 1 inside the
  container's PID namespace).

Approach B: profile FROM INSIDE the container
  The perf binary must be present in the container image, and the container needs
  SYS_ADMIN plus a seccomp profile that permits perf_event_open()
  (e.g. seccomp=unconfined). Then: perf record -g -p 1 sleep 30

Approach A: Profiling from the Host

# Find the container's PID on the host
docker inspect --format '{{.State.Pid}}' my_container
# Or for Kubernetes:
# Get the container ID, then use crictl
crictl inspect <container-id> | jq '.info.pid'

# Profile using host PID
sudo perf record -g -p <host-pid> sleep 30

# Problem: symbols! perf looks for binaries in /proc/<pid>/root/
# which maps to the container's filesystem
# Solution: use --symfs or copy binaries out
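
Putting Approach A together for Docker (my_container is an example name; --symfs comes up again in Chapter 19):

# Resolve the container's main process to a host PID and profile it from the host
PID=$(docker inspect --format '{{.State.Pid}}' my_container)
sudo perf record -F 99 -g -p "$PID" -o perf.data -- sleep 30

# Resolve symbols against the container's filesystem instead of the host's
sudo perf script -i perf.data --symfs=/proc/$PID/root > out.perf
./FlameGraph/stackcollapse-perf.pl out.perf | ./FlameGraph/flamegraph.pl > container.svg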

Approach B: Profiling from Inside

# Dockerfile addition for profiling
FROM my-base-image
RUN apt-get update && apt-get install -y linux-tools-generic
# Or for Alpine: apk add perf

# Run with required capabilities
docker run --cap-add SYS_ADMIN \
           --cap-add SYS_PTRACE \
           --security-opt seccomp=unconfined \
           my_image
In Kubernetes, you typically use Approach B with a sidecar or ephemeral container, or Approach A from a privileged DaemonSet. We'll cover both in the next chapters.

Chapter 14: Kubernetes Pod Profiling Setup

Option 1: Ephemeral Debug Container

Kubernetes 1.23+ supports ephemeral containers — temporary containers added to a running pod for debugging:

# Attach an ephemeral container with perf tools
kubectl debug -it pod/my-service-pod \
  --image=ubuntu:22.04 \
  --target=my-service-container \
  -- bash

# Inside the ephemeral container:
apt-get update && apt-get install -y linux-tools-generic
# Now you share the PID namespace with the target container
perf record -g -p 1 sleep 30
Ephemeral containers have been enabled by default since Kubernetes 1.23 (GA in 1.25). To see the target container's processes, either pass --target so the debug container joins that container's PID namespace (requires container-runtime support, available in containerd and CRI-O), or set shareProcessNamespace: true on the pod.

Option 2: Privileged DaemonSet (Host Profiling)

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: perf-profiler
  namespace: debug
spec:
  selector:
    matchLabels:
      app: perf-profiler
  template:
    metadata:
      labels:
        app: perf-profiler
    spec:
      hostPID: true          # See all host PIDs
      hostNetwork: true
      containers:
      - name: profiler
        image: ubuntu:22.04
        command: ["sleep", "infinity"]
        securityContext:
          privileged: true   # Full access to perf
        volumeMounts:
        - name: host-root
          mountPath: /host
          readOnly: true
        - name: sys-kernel
          mountPath: /sys/kernel
      volumes:
      - name: host-root
        hostPath:
          path: /
      - name: sys-kernel
        hostPath:
          path: /sys/kernel
# Deploy and exec into it
kubectl apply -f perf-daemonset.yaml
kubectl exec -it -n debug perf-profiler-xxxxx -- bash

# Install perf
apt-get update && apt-get install -y linux-tools-generic linux-tools-$(uname -r)

# Find your target process (visible because hostPID: true)
ps aux | grep my_service

# Profile it
perf record -g -p <PID> -o /tmp/perf.data sleep 30

# Copy data out
kubectl cp debug/perf-profiler-xxxxx:/tmp/perf.data ./perf.data

Option 3: Sidecar Container

apiVersion: v1
kind: Pod
metadata:
  name: my-service-profiled
spec:
  shareProcessNamespace: true  # Critical! Allows seeing other containers' processes
  containers:
  - name: my-service
    image: my-service:latest
    # ... normal config ...
  - name: profiler
    image: my-profiler-tools:latest  # Image with perf + flamegraph tools
    command: ["sleep", "infinity"]
    securityContext:
      capabilities:
        add: ["SYS_ADMIN", "SYS_PTRACE"]
      seccompProfile:
        type: Unconfined

Security Considerations

Capability         | Why Needed                           | Risk
-------------------|--------------------------------------|------------------------
SYS_ADMIN          | perf_event_open() syscall            | High (broad capability)
SYS_PTRACE         | Reading other process memory/stacks  | Medium
seccomp=unconfined | Default seccomp blocks perf syscalls | Medium
hostPID            | See processes outside container      | Medium
Never leave profiling capabilities enabled in production permanently. Use them temporarily for debugging, then remove. Consider using a separate namespace with RBAC restrictions.

Chapter 15: Collecting perf Data in Kubernetes

Complete Workflow: Pod → Flame Graph

# ═══════════════════════════════════════════════════════════════
# STEP 1: Identify the target pod and node
# ═══════════════════════════════════════════════════════════════
kubectl get pod my-service-pod -o wide
# Note the NODE column — you need to profile on that node

# ═══════════════════════════════════════════════════════════════
# STEP 2: Get the container's PID (from the node)
# ═══════════════════════════════════════════════════════════════
# Option A: If you have node access
ssh worker-node-1
# Find container ID
crictl ps | grep my-service
# Get PID from container ID
crictl inspect <container-id> | jq '.info.pid'

# Option B: From a privileged DaemonSet on that node
kubectl exec -it perf-profiler-xxxxx -- \
  bash -c "ps aux | grep my_service | grep -v grep"

# ═══════════════════════════════════════════════════════════════
# STEP 3: Record profile
# ═══════════════════════════════════════════════════════════════
# From privileged DaemonSet or node:
perf record -F 99 -g -p <PID> -o /tmp/perf.data -- sleep 30

# Verify recording
perf report -i /tmp/perf.data --stdio | head -20

# ═══════════════════════════════════════════════════════════════
# STEP 4: Extract perf.data from the cluster
# ═══════════════════════════════════════════════════════════════
kubectl cp debug/perf-profiler-xxxxx:/tmp/perf.data ./perf.data

# ═══════════════════════════════════════════════════════════════
# STEP 5: Generate flame graph (on your workstation)
# ═══════════════════════════════════════════════════════════════
perf script -i perf.data | \
  ./FlameGraph/stackcollapse-perf.pl | \
  ./FlameGraph/flamegraph.pl --title="my-service K8s profile" \
  > my-service-flamegraph.svg

# Open in browser
firefox my-service-flamegraph.svg

Handling Symbols in Containers

The biggest challenge: perf script needs the original binaries to resolve symbols.

# Problem: perf.data references /usr/bin/my_service inside the container
# but your workstation doesn't have that binary

# Solution 1: Copy the binary from the container into a local "symfs" tree
#             that mirrors its in-container path
mkdir -p ./symbols/usr/bin
kubectl cp my-pod:/usr/bin/my_service ./symbols/usr/bin/my_service
# Then point perf at that tree
perf script -i perf.data --symfs=./symbols/ > out.perf

# Solution 2: Build a debug image with symbols
# In your Dockerfile, keep debug symbols:
# Don't strip! Or create a separate debug package

# Solution 3: Use buildid-based symbol resolution
# perf stores build-ids in perf.data
perf buildid-list -i perf.data
# Copy matching .debug files to ~/.debug/

Automation Script

#!/bin/bash
# profile-k8s-pod.sh — One-shot profiling of a K8s pod
set -e

POD_NAME=${1:?Usage: $0 <pod-name> [namespace] [duration]}
NAMESPACE=${2:-default}
DURATION=${3:-30}
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
OUTPUT="flamegraph_${POD_NAME}_${TIMESTAMP}.svg"

echo "==> Profiling pod $POD_NAME in namespace $NAMESPACE for ${DURATION}s"

# Create ephemeral debug container and profile
# (-i without -t: a TTY would inject carriage returns into the captured output)
kubectl debug -i "pod/${POD_NAME}" \
  -n "$NAMESPACE" \
  --image=ubuntu:22.04 \
  --target="$(kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].name}')" \
  -- bash -c "
    apt-get update -qq && apt-get install -y -qq linux-tools-generic > /dev/null 2>&1
    perf record -F 99 -g -p 1 -o /tmp/perf.data sleep $DURATION
    perf script -i /tmp/perf.data
  " > /tmp/perf_script_output.txt

# Generate flame graph locally
cat /tmp/perf_script_output.txt | \
  ./FlameGraph/stackcollapse-perf.pl | \
  ./FlameGraph/flamegraph.pl --title="$POD_NAME ($TIMESTAMP)" > "$OUTPUT"

echo "==> Flame graph saved to: $OUTPUT"

Chapter 16: Continuous Profiling

Why Continuous?

Ad-hoc profiling catches problems you know about. Continuous profiling catches problems you don't — regressions, gradual degradation, rare spikes.

Tools for Kubernetes

Tool                  | Type                  | Language Support   | Storage
----------------------|-----------------------|--------------------|----------------
Parca                 | Open source           | C/C++/Go/Rust/Java | Built-in
Pyroscope             | Open source (Grafana) | Many               | Object storage
Polar Signals         | Commercial (Parca)    | Many               | Cloud
Google Cloud Profiler | Commercial            | Many               | GCP

Parca — Open Source Continuous Profiling

# Install Parca Agent as DaemonSet (uses eBPF, no code changes needed)
kubectl apply -f https://github.com/parca-dev/parca-agent/releases/latest/download/kubernetes-manifest.yaml

# Install Parca Server
helm repo add parca https://parca-dev.github.io/helm-charts
helm install parca parca/parca

# Access the UI
kubectl port-forward svc/parca 7070:7070
# Open http://localhost:7070
Continuous Profiling Architecture (Parca)

Parca Agent (DaemonSet on each node):
- Uses eBPF to sample all processes on the node
- No code changes, no sidecars needed
- ~1% overhead
- Sends profiles to the Parca Server over gRPC (roughly every 10 s)

Parca Server:
- Stores profiles (columnar format)
- Query UI with flame graphs
- Diff between time ranges
- Label-based filtering (pod, container, node)
Continuous profiling with eBPF-based agents (Parca, Pyroscope) is the modern approach for Kubernetes. No code changes, no capabilities needed on your pods, and always-on with minimal overhead.

🧪 Quiz 3: Kubernetes Profiling (Chapters 13–16)

1. What Linux capability is required for perf_event_open() in a container?

2. What pod spec field allows containers in the same pod to see each other's processes?

3. What is the main advantage of eBPF-based continuous profiling (Parca Agent)?

4. When profiling a container from the host, what is the main challenge?

5. Which Kubernetes feature (1.23+) lets you attach a temporary debug container to a running pod?

Chapter 17: Symbols & Debug Info

Why Symbols Matter

Without symbols, your flame graph shows hex addresses instead of function names:

# Without symbols:
0x7f3a2b4c5d6e;0x7f3a2b4c1234;0x55a1b2c3d4e5 42

# With symbols:
main;process_message;decode_nas_pdu 42

Types of Symbol Information

Type                          | Contains                        | How to Get
------------------------------|---------------------------------|------------------------------
Symbol table (.symtab)        | Function names + addresses      | Default in most builds
Dynamic symbols (.dynsym)     | Exported function names         | Always present in shared libs
DWARF debug info              | Line numbers, variables, types  | -g flag
Separate debug files (.debug) | Stripped debug info             | objcopy --only-keep-debug

Checking Symbol Availability

# Check if binary has symbols
file my_service
# "not stripped" = has symbols
# "stripped" = no symbols

# List symbols
nm my_service | head
# Or for dynamic symbols in shared libs:
nm -D /usr/lib/libmylib.so | grep my_function

# Check DWARF info
readelf --debug-dump=info my_service | head

# Check build-id (used by perf for symbol matching)
readelf -n my_service | grep "Build ID"

Getting Symbols for Profiling

# Method 1: Compile with debug info (best)
gcc -g -fno-omit-frame-pointer -O2 -o my_service my_service.c

# Method 2: Separate debug file (production-friendly)
gcc -g -O2 -o my_service my_service.c
objcopy --only-keep-debug my_service my_service.debug
strip my_service
objcopy --add-gnu-debuglink=my_service.debug my_service

# Method 3: Install debuginfo packages
# RHEL/CentOS:
debuginfo-install my-package
# Ubuntu/Debian:
apt-get install my-package-dbgsym

# Method 4: Use perf's buildid cache
perf buildid-cache --add my_service.debug
For Kubernetes profiling, build your container images with a debug variant that includes symbols. Use multi-stage builds: one stage for the stripped production binary, another for the debug binary. Deploy the debug variant when you need to profile.

Symbol Resolution Path

perf looks for symbols in this order:

  1. ~/.debug/ directory (buildid-based)
  2. /usr/lib/debug/ (system debug packages)
  3. The binary itself (if not stripped)
  4. --symfs path (if specified)
  5. /proc/<pid>/root/ (container filesystem)

Chapter 18: Profiling C/C++ Services

Compilation Flags for Profiling

# Recommended CFLAGS for profilable builds:
CFLAGS = -O2 \                    # Keep optimizations (profile real behavior)
         -g \                      # Debug symbols (DWARF)
         -fno-omit-frame-pointer \ # Reliable stack unwinding
         -mno-omit-leaf-frame-pointer  # Even leaf functions get frame pointers

# In CMake:
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fno-omit-frame-pointer -g")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-omit-frame-pointer -g")

# In Makefile:
CFLAGS += -fno-omit-frame-pointer -g
Do NOT profile with -O0 (no optimization). The profile will be meaningless because the code structure is completely different from production. Always profile optimized code (-O2).
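
For CMake projects, the equivalent of "optimized plus symbols" is the RelWithDebInfo build type (-O2 -g); frame pointers still need to be added explicitly. A sketch:

cmake -S . -B build-prof \
      -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DCMAKE_C_FLAGS="-fno-omit-frame-pointer" \
      -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer"
cmake --build build-prof -j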

Common C/C++ Profiling Scenarios

Scenario 1: High CPU in Message Processing

# Record during load test
perf record -F 99 -g --call-graph fp -p $(pidof amf_service) sleep 60

# Generate flame graph
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg

# Look for:
# - Serialization/deserialization (protobuf, ASN.1)
# - String operations (strcmp, strlen in loops)
# - Memory allocation (malloc/free churn)
# - Logging (fprintf, snprintf in hot path)

Scenario 2: Memory Allocation Overhead

# Profile malloc/free specifically
perf probe -x /lib/x86_64-linux-gnu/libc.so.6 malloc
perf record -e probe_libc:malloc -g -p $PID sleep 10

# Or trace kernel-side allocations via the kmem tracepoint (kernel memory, not user-space malloc)
perf record -e kmem:kmalloc -g -p $PID sleep 10

# Better: use tcmalloc/jemalloc profiling
LD_PRELOAD=/usr/lib/libtcmalloc.so HEAPPROFILE=/tmp/heap ./my_service

Scenario 3: Lock Contention

# On-CPU: see spinning
perf record -g -p $PID sleep 30
# Look for pthread_mutex_lock, __lll_lock_wait in flame graph

# Off-CPU: see blocking
sudo perf record -e sched:sched_switch -g -p $PID sleep 30
# Or with BCC:
sudo offcputime-bpfcc -df -p $PID 30 > offcpu.folded

Annotating Hot Functions

# See assembly-level profile for a specific function
perf annotate -i perf.data -s decode_nas_message

# Example output:
 Percent │      Disassembly of decode_nas_message
─────────┼────────────────────────────────────────
         │      push   %rbp
         │      mov    %rsp,%rbp
   2.34% │      mov    (%rdi),%eax        ← 2.34% of samples here
  15.67% │      call   asn1_decode_ie     ← 15.67% here!
   8.90% │      test   %eax,%eax
         │      je     0x4012a0
  12.45% │      call   validate_ie        ← another hot spot
         │      ...
perf annotate is incredibly powerful for C developers. It shows you exactly which instructions are hot, helping you understand if the bottleneck is a function call, a memory access, or a branch.

Chapter 19: Common Pitfalls

Pitfall 1: Missing Stacks / [unknown] Frames

# Symptom: flame graph shows [unknown] or very shallow stacks
# Cause: missing frame pointers or debug info

# Fix: recompile with frame pointers
gcc -fno-omit-frame-pointer -g -O2 ...

# Or use DWARF unwinding (slower but works without frame pointers)
perf record --call-graph dwarf -p $PID sleep 30

Pitfall 2: Profiling the Wrong Thing

# Symptom: flame graph looks "normal" but service is slow
# Cause: you're profiling on-CPU but the problem is off-CPU (I/O, locks)

# Fix: check if CPU is actually high
top -p $PID
# If CPU is low but latency is high → off-CPU profiling

Pitfall 3: Too Few Samples

# Symptom: flame graph is sparse, functions show 1-2 samples
# Cause: recording too short or frequency too low

# Fix: record longer or increase frequency
perf record -F 999 -g -p $PID sleep 60  # 999 Hz for 60 seconds

Pitfall 4: Kernel Symbols Missing

# Symptom: kernel frames show as hex addresses
# Cause: /proc/kallsyms is restricted

# Fix:
sudo sysctl -w kernel.kptr_restrict=0
# Or run perf as root

Pitfall 5: Container Symbol Mismatch

# Symptom: wrong function names or [unknown] in container profiles
# Cause: perf resolves symbols from host, not container filesystem

# Fix: point perf to container's root filesystem
perf script --symfs=/proc/<pid>/root/ > out.perf

# Or copy binaries and use local symfs
mkdir -p ./symfs/usr/bin
cp container_binary ./symfs/usr/bin/
perf script --symfs=./symfs/ > out.perf

Pitfall 6: Inlined Functions Disappear

# Symptom: you know function X is hot but it doesn't appear
# Cause: compiler inlined it — it's merged into the caller

# Fix: use DWARF info to recover inlined frames
perf script --inline > out.perf
# Or compile with -fno-inline for profiling (but changes behavior!)

Chapter 20: Real-World Workflow

The Complete Profiling Workflow

Production Profiling Workflow

1. OBSERVE:  Grafana dashboards, alerts, SLO breach
2. MEASURE:  perf stat, top/htop, IPC check
3. PROFILE:  perf record with -g, 30-60 s
4. ANALYZE:  flame graph, find the hot path, root cause
5. FIX:      code change, review, test
6. VERIFY:   re-profile, differential flame graph, load test
   (then loop back to 1. OBSERVE)

Step-by-Step: Profiling a Service in Your K8s Cluster

# ═══════════════════════════════════════════════════════════════
# 1. Identify the problem
# ═══════════════════════════════════════════════════════════════
# Check Grafana: which pod has high CPU?
# Or: kubectl top pods -n my-namespace

# ═══════════════════════════════════════════════════════════════
# 2. Quick sanity check with perf stat
# ═══════════════════════════════════════════════════════════════
kubectl exec -it profiler-pod -- perf stat -p $PID sleep 5
# Check IPC: low IPC = memory-bound, high IPC = compute-bound

# ═══════════════════════════════════════════════════════════════
# 3. Record profile
# ═══════════════════════════════════════════════════════════════
kubectl exec -it profiler-pod -- \
  perf record -F 99 -g --call-graph fp -p $PID -o /tmp/perf.data sleep 30

# ═══════════════════════════════════════════════════════════════
# 4. Extract and generate flame graph
# ═══════════════════════════════════════════════════════════════
kubectl exec profiler-pod -- perf script -i /tmp/perf.data > /tmp/out.perf
# On your workstation:
kubectl cp profiler-pod:/tmp/out.perf ./out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl --title="my-service $(date)" out.folded > flamegraph.svg

# ═══════════════════════════════════════════════════════════════
# 5. Analyze
# ═══════════════════════════════════════════════════════════════
# Open flamegraph.svg in browser
# Look for wide plateaus at the top
# Search for known hot functions (Ctrl+F in SVG)

# ═══════════════════════════════════════════════════════════════
# 6. After fixing — verify with differential flame graph
# ═══════════════════════════════════════════════════════════════
# Record again after fix
# Generate diff:
./difffolded.pl before.folded after.folded | \
  ./flamegraph.pl --title="Before vs After" > diff.svg

Quick Reference Card

# ─── ESSENTIAL COMMANDS ───────────────────────────────────────
# Quick CPU profile:
perf record -F 99 -g -p $PID sleep 30

# Generate flame graph:
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

# System-wide profile:
sudo perf record -F 99 -g -a sleep 10

# Profile with DWARF (no frame pointers):
perf record -F 99 --call-graph dwarf -p $PID sleep 30

# Off-CPU (what's blocking):
sudo offcputime-bpfcc -df -p $PID 30 > off.folded

# Cache miss profile:
perf record -e cache-misses -g -p $PID sleep 30

# Annotate hot function:
perf annotate -s hot_function_name

# ─── KUBERNETES ───────────────────────────────────────────────
# Debug a pod:
kubectl debug -it pod/NAME --image=ubuntu --target=CONTAINER -- bash

# Copy perf data out:
kubectl cp NAMESPACE/POD:/tmp/perf.data ./perf.data
Pro tip: Create a shell alias for the full pipeline:
alias flame='perf script | stackcollapse-perf.pl | flamegraph.pl > /tmp/flame.svg && xdg-open /tmp/flame.svg'

Further Reading

Brendan Gregg, Flame Graphs: https://www.brendangregg.com/flamegraphs.html
Brendan Gregg, Systems Performance (2nd ed.) and BPF Performance Tools
The perf wiki: https://perf.wiki.kernel.org/
FlameGraph tools: https://github.com/brendangregg/FlameGraph

🧪 Quiz 4: Advanced & Workflow (Chapters 17–20)

1. Why should you NOT profile with -O0 (no optimization)?

2. A function you know is hot doesn't appear in the flame graph. Most likely cause?

3. What does perf annotate show you?

4. Your service has high latency but LOW CPU usage. What should you do?

5. What is the purpose of --symfs in perf script?