strace

Tags: debugging, Linux, system-calls, monitoring, tracing, performance, troubleshooting

Debugging Tool

Overview

strace is a system call tracer for Linux. It records the system calls a process makes and the signals it receives, showing how the program interacts with the kernel.

Details

strace is a diagnostic, instructional, and debugging tool that traces system calls and signals, originally developed for SunOS and later ported to Linux. First appearing in 1991, it has become an indispensable tool for system administrators and developers troubleshooting programs without access to source code. strace works by using the ptrace system call to attach to a target process, allowing it to intercept and log every interaction between the process and the Linux kernel.

The fundamental principle of strace is that every user-space program must communicate with the kernel through system calls to perform operations like file I/O, network communication, memory allocation, and process management. By intercepting these calls, strace provides a complete picture of what a program is actually doing at the system level. This makes it invaluable for diagnosing issues like "Why won't this program start?", "What files is it trying to access?", or "Why is it running slowly?"
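
As a rough illustration of this principle, the sketch below performs a create, write, and close sequence through Python's low-level os wrappers. The file name demo.txt and the helper name write_demo_file are invented for this example. On Linux these wrappers are thin layers over the corresponding system calls, so running the script under strace should show an open or openat call, a write, and a close for the file (the exact call names depend on the C library and architecture).

```python
import os
import tempfile


def write_demo_file(directory: str) -> str:
    """Create a small file using the thin os-level wrappers."""
    path = os.path.join(directory, "demo.txt")  # hypothetical file name
    # os.open() ends up in an open/openat system call
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    try:
        os.write(fd, b"Testing strace\n")  # write system call
    finally:
        os.close(fd)  # close system call
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print("wrote", write_demo_file(tmp))
```

Saved as a script, it can be traced with something like strace -e trace=openat,write,close python3 script.py to see only those calls.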

strace operates non-intrusively, requiring no modification to the target program. It can attach to running processes or start new ones under its control. The tool displays system call names, arguments, return values, and error codes, making it possible to understand program behavior without examining source code. It supports filtering specific system calls, saving output to files, and generating statistical summaries of system call usage, helping identify performance bottlenecks and resource usage patterns.
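
The name, arguments, return value, and error code on each output line follow a regular enough layout that they can be pulled apart with a regular expression. The following is a minimal sketch, assuming the default single-process output format; SyscallRecord and parse_line are names made up for this example, and argument strings that themselves contain ") = " would confuse it.

```python
import re
from typing import NamedTuple, Optional


class SyscallRecord(NamedTuple):
    name: str
    args: str
    result: str
    errno: Optional[str]


# Layout assumed: name(args) = result [ERRNO (description)]
LINE_RE = re.compile(
    r"^(?P<name>\w+)\((?P<args>.*)\)\s+=\s+"
    r"(?P<result>0x[0-9a-f]+|-?\d+|\?)"
    r"(?:\s+(?P<errno>E[A-Z0-9]+))?"
)


def parse_line(line: str) -> Optional[SyscallRecord]:
    """Split one strace line into its parts, or return None if it is not a call."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return SyscallRecord(m["name"], m["args"], m["result"], m["errno"])


if __name__ == "__main__":
    sample = r'write(1, "Hello, strace!\n", 15)        = 15'
    print(parse_line(sample))
```

Lines that are not system calls, such as signal notifications or the final exit message, simply yield None.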

While strace is extremely useful for debugging and for learning how programs work, it introduces significant overhead: every traced system call stops the process twice (on entry and on exit) so that strace can inspect it, and syscall-heavy programs can run 10-50 times slower. This makes it unsuitable for production performance analysis, where lower-overhead tools such as perf trace or eBPF-based tracers are preferred. However, for understanding program behavior, diagnosing configuration issues, and learning about Linux system programming, strace remains one of the most important tools in a developer's arsenal.

Pros and Cons

Pros

  • No Source Required: Works with any binary without source code
  • Non-intrusive: No need to modify or recompile programs
  • Comprehensive Tracing: Captures all system calls and signals
  • Real-time Monitoring: Shows system calls as they happen
  • Process Attachment: Can attach to running processes
  • Detailed Information: Shows arguments, return values, and errno
  • Educational Value: Excellent for learning system programming
  • Wide Availability: Pre-installed or easily available on most Linux distributions

Cons

  • Performance Impact: Slows down traced processes significantly (10-50x)
  • Linux Specific: Modern strace runs only on Linux; other Unix-like systems provide their own tracers, such as truss or ktrace
  • Output Volume: Can generate overwhelming amounts of output
  • Interpretation Needed: Requires knowledge of system calls
  • Not for Production: Too slow for production performance analysis
  • Limited to System Calls: Doesn't show internal program logic
  • No Real-time Systems: Unsuitable for debugging real-time applications

Usage Examples

Basic System Call Tracing

# Simple program to trace
# hello.c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main() {
    printf("Hello, strace!\n");
    
    // File operations
    int fd = open("test.txt", O_CREAT | O_WRONLY, 0644);
    if (fd != -1) {
        write(fd, "Testing strace\n", 15);
        close(fd);
    }
    
    return 0;
}

# Compile and trace
gcc -o hello hello.c

# Basic strace usage
strace ./hello

# Output shows system calls:
# execve("./hello", ["./hello"], 0x7ffe3f6a8230 /* 48 vars */) = 0
# brk(NULL)                               = 0x55de27e97000
# arch_prctl(0x3001 /* ARCH_??? */, 0x7ffee377e6b0) = -1 EINVAL
# access("/etc/ld.so.preload", R_OK)      = -1 ENOENT
# openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
# ...
# write(1, "Hello, strace!\n", 15)        = 15
# openat(AT_FDCWD, "test.txt", O_WRONLY|O_CREAT, 0644) = 3
# write(3, "Testing strace\n", 15)        = 15
# close(3)                                = 0
# exit_group(0)                           = ?

# Save output to file
strace -o strace_output.txt ./hello

# Follow child processes
strace -f ./hello

# Show timestamps
strace -t ./hello                       # Time of day
strace -r ./hello                       # Relative time between calls
strace -T ./hello                       # Time spent in each call

Filtering Specific System Calls

# Trace only file operations
strace -e trace=file ls /tmp
# Traces: open, stat, chmod, unlink, etc.

# Trace only network operations
strace -e trace=network curl https://example.com
# Traces: socket, connect, send, recv, etc.

# Trace only process operations
strace -e trace=process bash -c "echo hello | grep h"
# Traces: fork, execve, wait4, exit, etc.

# Trace only memory operations
strace -e trace=memory ./memory_intensive_app
# Traces: mmap, munmap, brk, etc.

# Trace specific system calls
# (modern glibc opens files with openat, so list it alongside open)
strace -e trace=open,openat,read,write,close cat /etc/passwd

# Exclude specific calls
strace -e trace=\!futex,poll ./application

# Multiple filters combined
strace -e trace=open,openat,read,write -e signal=none ./hello

# Common filter expressions:
# -e trace=file     # File operations
# -e trace=process  # Process management
# -e trace=network  # Network operations
# -e trace=signal   # Signal handling
# -e trace=ipc      # IPC operations
# -e trace=desc     # File descriptor operations
# -e trace=memory   # Memory mapping

Attaching to Running Processes

# Start a long-running process
# server.py
import os
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('localhost', 8080))
server.listen(5)
print(f"Server listening on port 8080, PID: {os.getpid()}")

while True:
    client, addr = server.accept()
    data = client.recv(1024)
    client.send(b"Echo: " + data)
    client.close()

# Run the server
python3 server.py &
SERVER_PID=$!

# Attach strace to running process
sudo strace -p $SERVER_PID

# Attach and follow children
sudo strace -p $SERVER_PID -f

# Attach with specific filters
sudo strace -p $SERVER_PID -e trace=network

# Detach with Ctrl+C
# The process continues running after detachment

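# Trace every process matching a name, plus their children
# (quote the PID list so strace receives it as a single -p argument)
sudo strace -f -p "$(pgrep -f apache2)"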

# Trace with output to file
sudo strace -p $SERVER_PID -o server_trace.log

# Attach to multiple PIDs
sudo strace -p $PID1 -p $PID2 -p $PID3
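
When several processes are traced at once and the output goes to a file with -o, each line starts with a PID, and a call that blocks while another process runs is split into an "unfinished" line and a later "resumed" line. The sketch below stitches such pairs back together. It assumes the usual "PID call(...  <unfinished ...>" and "PID <... name resumed>rest" layout; merge_interleaved is a name invented here, and the exact spacing can differ between strace versions.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

PID_RE = re.compile(r"^(?P<pid>\d+)\s+(?P<rest>.*)$")
UNFINISHED = " <unfinished ...>"
RESUMED_RE = re.compile(r"^<\.\.\. (?P<name>\w+) resumed>\s?(?P<tail>.*)$")


def merge_interleaved(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (pid, complete_call) pairs, joining unfinished/resumed halves."""
    pending: Dict[int, str] = {}  # pid -> first half of a split call
    for raw in lines:
        m = PID_RE.match(raw.rstrip("\n"))
        if m is None:
            continue  # not a PID-prefixed line
        pid, rest = int(m["pid"]), m["rest"]
        if rest.endswith(UNFINISHED):
            pending[pid] = rest[: -len(UNFINISHED)]
            continue
        resumed = RESUMED_RE.match(rest)
        if resumed and pid in pending:
            yield pid, pending.pop(pid) + resumed["tail"]
        else:
            yield pid, rest
```

A resumed line with no stored first half (for example when strace attached in the middle of a call) is passed through unchanged.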

Debugging File Access Issues

# Program with file access issues
# file_access.c
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>

int main() {
    // Try to open configuration file
    int fd = open("/etc/myapp/config.conf", O_RDONLY);
    if (fd == -1) {
        printf("Failed to open config: %s\n", strerror(errno));
    }
    
    // Try to access user file
    fd = open("~/.myapp/data.txt", O_RDONLY);
    if (fd == -1) {
        printf("Failed to open user data: %s\n", strerror(errno));
    }
    
    return 0;
}

# Trace to see actual file paths and errors
strace ./file_access 2>&1 | grep -E "open|access"

# Output reveals:
# openat(AT_FDCWD, "/etc/myapp/config.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
# openat(AT_FDCWD, "~/.myapp/data.txt", O_RDONLY) = -1 ENOENT (No such file or directory)
# Note: ~ is not expanded - literal path used!

# Check library loading issues
strace -e trace=open,openat,access ./application 2>&1 | grep -v ENOENT

# Find where program looks for files
# (the "file" class covers every call that takes a path, including stat variants)
strace -e trace=file ./application 2>&1 | grep -E "(conf|config|rc)"

# Debug permission issues
strace -e trace=open,openat,access -o perms.log ./application
grep "EACCES\|EPERM" perms.log
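
For larger logs, grouping the failed paths by error code can be easier to read than raw grep output. The following is a small sketch under the assumption that the log uses the default format shown above, with the path as the first quoted argument; failed_paths is a made-up helper name, and paths containing escaped quotes are not handled.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# First quoted argument is taken as the path; the call must end in "= -1 ERRNO"
FAILED_PATH_RE = re.compile(
    r'^(?P<name>\w+)\(.*?"(?P<path>[^"]*)".*\)\s+=\s+-1\s+(?P<errno>E[A-Z0-9]+)'
)


def failed_paths(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each errno name to the distinct paths that failed with it."""
    by_errno: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        m = FAILED_PATH_RE.match(line.strip())
        if m and m["path"] not in by_errno[m["errno"]]:
            by_errno[m["errno"]].append(m["path"])
    return dict(by_errno)


if __name__ == "__main__":
    import sys

    # Usage: python3 failed_paths.py perms.log
    with open(sys.argv[1]) as log:
        for errno_name, paths in failed_paths(log).items():
            print(errno_name, *paths, sep="\n  ")
```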

Performance Analysis with strace

# Generate system call statistics
strace -c ls -la /usr/bin > /dev/null

# Output:
# % time     seconds  usecs/call     calls    errors syscall
# ------ ----------- ----------- --------- --------- ----------------
#  30.00    0.000120           2        48           mmap
#  20.00    0.000080           1        53           close
#  15.00    0.000060           1        45           openat
#  10.00    0.000040           0        97           lstat
#   8.00    0.000032           0       106           getdents64
# ...

# Trace with timing information
strace -T -o timing.log ./application
# Each line shows time spent: <syscall> = result <time>

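# Find slow system calls (longer than 1 ms)
# The -T timing is the last <...> field on each line
strace -T ./application 2>&1 | \
    awk -F'[<>]' 'NF >= 3 && $(NF-1) + 0 > 0.001'

# Analyze I/O patterns
# (the syscalls are named pread64/pwrite64, not pread/pwrite)
strace -e trace=read,write,pread64,pwrite64 -T ./application 2>&1 | \
    awk -F'[<>]' 'NF >= 3 {sum += $(NF-1); count++}
                  END {if (count) print "Avg I/O time:", sum/count}'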

# Create syscall histogram
strace -c -S time ./application
# Sorts by time spent in each syscall

# Trace with syscall counts
strace -C ./application
# Shows both regular output and summary
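
The per-call times written by -T can also be totalled per system call after the fact, which gives a view similar to -c but from a saved log. This is only a sketch: time_per_syscall is a hypothetical helper, and it assumes single-process output where each line ends with the elapsed time in angle brackets.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Syscall name at the start, elapsed seconds in <...> at the very end (strace -T)
TIMED_RE = re.compile(r"^(?P<name>\w+)\(.*<(?P<secs>\d+\.\d+)>\s*$")


def time_per_syscall(lines: Iterable[str]) -> List[Tuple[str, int, float]]:
    """Return (name, calls, total_seconds) rows, slowest syscall first."""
    calls: Dict[str, int] = defaultdict(int)
    total: Dict[str, float] = defaultdict(float)
    for line in lines:
        m = TIMED_RE.match(line.strip())
        if m is None:
            continue  # signal lines, exit messages, lines without timing
        calls[m["name"]] += 1
        total[m["name"]] += float(m["secs"])
    return sorted(
        ((name, calls[name], total[name]) for name in calls),
        key=lambda row: row[2],
        reverse=True,
    )


if __name__ == "__main__":
    import sys

    # Usage: python3 time_per_syscall.py timing.log
    with open(sys.argv[1]) as log:
        for name, count, seconds in time_per_syscall(log):
            print(f"{name:<16} {count:>6} calls {seconds:.6f} s")
```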

Network Debugging with strace

# Debug network connections
# network_client.c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <unistd.h>

int main() {
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    
    struct sockaddr_in server;
    server.sin_addr.s_addr = inet_addr("8.8.8.8");
    server.sin_family = AF_INET;
    server.sin_port = htons(80);
    
    if (connect(sock, (struct sockaddr *)&server, sizeof(server)) < 0) {
        perror("connect failed");
        return 1;
    }
    
    char *message = "GET / HTTP/1.0\r\n\r\n";
    send(sock, message, strlen(message), 0);
    
    char response[2000];
    recv(sock, response, 2000, 0);
    
    close(sock);
    return 0;
}

# Trace network operations
strace -e trace=network ./network_client

# Detailed network debugging
strace -e trace=socket,connect,bind,listen,accept,send,recv,sendto,recvfrom \
       -e read=all -e write=all ./network_client 2>&1

# Show full data transferred
strace -e trace=network -s 1024 -x ./network_client

# Debug DNS resolution
# (files are opened with openat on modern systems)
strace -e trace=network,openat,read dig google.com

# Trace HTTP client
strace -e trace=network -f curl https://example.com 2>&1 | \
    grep -E "socket|connect|send|recv"

# Monitor connection attempts
strace -e connect -f ./application 2>&1 | \
    awk -F'[{}]' '/connect/ {print $2}'
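
The socket address printed inside a connect() line can be reduced to plain address and port pairs. The sketch below handles only IPv4 and assumes the decoded sockaddr layout strace normally prints for AF_INET (sa_family, sin_port=htons(...), sin_addr=inet_addr("...")); ipv4_targets is a name invented for this example.

```python
import re
from typing import Iterable, List, Tuple

# Matches the decoded AF_INET sockaddr inside a connect() line
CONNECT_RE = re.compile(
    r"connect\(\d+, \{sa_family=AF_INET, "
    r"sin_port=htons\((?P<port>\d+)\), "
    r'sin_addr=inet_addr\("(?P<addr>[\d.]+)"\)\}'
)


def ipv4_targets(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Collect (address, port) for every IPv4 connect() attempt, in order."""
    targets: List[Tuple[str, int]] = []
    for line in lines:
        m = CONNECT_RE.search(line)
        if m:
            targets.append((m["addr"], int(m["port"])))
    return targets


if __name__ == "__main__":
    import sys

    # Usage: strace -e connect -f ./application 2>&1 | python3 ipv4_targets.py
    for addr, port in ipv4_targets(sys.stdin):
        print(f"{addr}:{port}")
```

Unix-domain and IPv6 connections are skipped; they use different sockaddr fields and would need their own patterns.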

Advanced Debugging Techniques

#!/bin/bash
# advanced_strace.sh

# 1. Debug program startup issues
debug_startup() {
    local program="$1"
    echo "=== Debugging startup of $program ==="
    
    # Check library dependencies
    strace -e trace=open,openat "$program" 2>&1 | \
        grep -E "\.so|\.conf" | grep -v ENOENT
    
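    # Check path lookups that mention PATH, HOME, or USER
    # (environment variables themselves are read from memory, not via syscalls)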
    strace -e trace=access,stat "$program" 2>&1 | \
        grep -E "PATH|HOME|USER"
}

# 2. Monitor file modifications
monitor_file_changes() {
    local program="$1"
    strace -e trace=open,openat,creat,unlink,rename \
           -e signal=none "$program" 2>&1 | \
        grep -v "O_RDONLY"
}

# 3. Trace signal handling
trace_signals() {
    local pid="$1"
    strace -e trace=signal -p "$pid" 2>&1
}

# 4. Debug permission problems
debug_permissions() {
    local program="$1"
    strace -e trace=access,open,openat,stat,lstat \
           "$program" 2>&1 | \
        grep -E "EACCES|EPERM"
}

# 5. Find configuration files
find_config_files() {
    local program="$1"
    strace -e trace=open,openat,access,stat \
           "$program" 2>&1 | \
        grep -E "(config|conf|rc|ini)" | \
        grep -v ENOENT | \
        awk -F'"' '{print $2}' | sort -u
}

# 6. Analyze system call patterns
analyze_patterns() {
    local logfile="$1"
    echo "=== System Call Analysis ==="
    
    # Most frequent syscalls (the name is everything before the first parenthesis)
    echo "Top 10 system calls:"
    sed -nE 's/^([a-z_0-9]+)\(.*/\1/p' "$logfile" | \
        sort | uniq -c | sort -rn | head -10
    
    # Error analysis: count (syscall, errno) pairs
    echo -e "\nErrors by syscall:"
    sed -nE 's/^([a-z_0-9]+)\(.* = -[0-9]+ (E[A-Z0-9]+).*/\1 \2/p' "$logfile" | \
        sort | uniq -c | sort -rn
}

# 7. Compare two program runs
compare_traces() {
    local prog1="$1"
    local prog2="$2"
    
    strace -c "$prog1" 2>&1 | grep -A50 "syscall" > trace1.tmp
    strace -c "$prog2" 2>&1 | grep -A50 "syscall" > trace2.tmp
    
    diff -u trace1.tmp trace2.tmp
    rm -f trace1.tmp trace2.tmp
}

# Usage examples
# debug_startup /usr/bin/myapp
# monitor_file_changes "./batch_processor"
# find_config_files /usr/sbin/nginx

Production-Safe Alternatives

# For production environments, consider these alternatives:

# 1. Using perf trace (lower overhead)
# Quote the glob so the shell does not expand it
perf trace -e 'open*' ./application

# 2. Using BPF tools (minimal overhead)
# Install bcc-tools first
opensnoop-bpfcc          # Trace open() syscalls
execsnoop-bpfcc          # Trace new processes
tcpconnect-bpfcc         # Trace TCP connections
biolatency-bpfcc         # Block I/O latency

# 3. Using SystemTap (scriptable)
stap -e 'probe syscall.open { printf("%s opened %s\n", execname(), filename) }'

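# 4. Using ftrace (kernel tracer, run as root)
# The function name depends on kernel version and architecture
# (for example __x64_sys_openat on recent x86_64 kernels), so match with a glob
echo '*sys_openat' > /sys/kernel/debug/tracing/set_ftrace_filter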
echo function > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace

# 5. Container-aware tracing with traceloop
# For Kubernetes/container environments
traceloop --pod my-pod-name
traceloop --container my-container-id

# 6. Selective strace with reduced overhead
# Use seccomp-bpf acceleration (strace 5.3+, Linux 4.8+; only takes effect together with -f)
strace -f --seccomp-bpf -e trace=openat,close ./application

# Compare overhead
time ./application                       # Baseline
time strace -c ./application 2>/dev/null # With strace
time perf trace -s ./application         # With perf

Practical Debugging Scenarios

# Scenario 1: "Why won't this program start?"
strace ./failing_program 2>&1 | head -50
# Look for failed opens, missing libraries, permission errors

# Scenario 2: "What files is this program accessing?"
strace -e trace=open,openat,access -f ./program 2>&1 | \
    grep -v ENOENT | cut -d'"' -f2 | sort -u

# Scenario 3: "Why is this program slow?"
strace -c -S time ./slow_program
# Identify syscalls taking the most time

# Scenario 4: "What's this program doing right now?"
sudo strace -s 80 -p "$(pgrep program_name)"
# Shows current activity with 80-char string limit

# Scenario 5: "Debug daemon startup"
strace -o /tmp/daemon.trace -f service mydaemon start
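# Review /tmp/daemon.trace for issues
# Note: on systemd hosts, service hands the request to systemd, so the daemon
# is not a child of strace; trace the daemon binary directly or attach with -p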

# Scenario 6: "Trace database connections"
strace -e trace=network -f -s 200 ./database_app 2>&1 | \
    grep -E "connect.*3306|connect.*5432"

# Scenario 7: "Look for signs of a memory leak"
strace -e trace=brk,mmap,munmap ./application 2>&1 | \
    awk '/^brk\(/ {brk++} /^mmap\(/ {mmap++} /^munmap\(/ {munmap++}
         END {print "brk:", brk+0, "mmap:", mmap+0, "munmap:", munmap+0}'

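# Scenario 8: "Debug container issues"
# (the stock alpine image does not ship strace, so install it first)
docker run --rm --cap-add SYS_PTRACE \
    --security-opt seccomp=unconfined \
    alpine sh -c 'apk add --no-cache strace >/dev/null && strace echo "Hello from container"'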