# BPFtrace and eBPF Tools Guide ## Table of Contents 1. [Introduction](#introduction) 2. [System Layer Overview](#system-layer-overview) 3. [Tool Categories](#tool-categories) 4. [Detailed Tool Analysis](#detailed-tool-analysis) 5. [Command Reference](#command-reference) ## Introduction This guide covers the comprehensive set of bpftrace and eBPF tools available for Linux system analysis and performance monitoring across different layers of the system stack. ## System Layer Overview The tools are organized across these main layers: 1. Applications & Runtimes 2. System Libraries 3. System Call Interface 4. Kernel Subsystems: - VFS (Virtual File System) - Network Stack (Sockets, TCP/UDP, IP) - Scheduler - Virtual Memory - Device Drivers ## Tool Categories ### Application Level Tools | Tool | Purpose | Layer | |------|---------|-------| | opensnoop | Trace file opens | Application | | statsnoop | Trace stat() syscalls | Application | | syncsnoop | Trace sync operations | Application | | bashreadline | Trace bash commands | Application | | gethostlatency | DNS latency analysis | System Libraries | ### System Call Interface Tools | Tool | Purpose | Layer | |------|---------|-------| | syscount | Count syscalls | System Call | | execsnoop | Trace new processes | System Call | | killsnoop | Trace kill() syscalls | System Call | | pidpersec | New processes per second | System Call | ### File System Tools | Tool | Purpose | Layer | |------|---------|-------| | vfscount | VFS operation counts | VFS | | vfsstat | VFS operation stats | VFS | | writeback | Trace file writeback | File Systems | | xfsdist | XFS operation latency | File Systems | | mdflush | Trace md RAID flush events | Volume Manager | ### Block Device Tools | Tool | Purpose | Layer | |------|---------|-------| | biosnoop | Trace block I/O | Block Device | | biolatency | Block I/O latency | Block Device | | bitesize | Block I/O size analysis | Block Device | ### Network Tools | Tool | Purpose | Layer | |------|---------|-------| | tcpconnect | Trace TCP connections | TCP/UDP | | tcpaccept | Trace TCP accepts | TCP/UDP | | tcpretrans | Trace TCP retransmits | TCP/UDP | | tcpdrop | Trace TCP drops | TCP/UDP | ### CPU/Scheduler Tools | Tool | Purpose | Layer | |------|---------|-------| | cpuwalk | CPU instruction analysis | Scheduler | | runqlat | Run queue latency | Scheduler | | runqlen | Run queue length | Scheduler | | offcputime | Off-CPU analysis | Scheduler | ### Memory Management Tools | Tool | Purpose | Layer | |------|---------|-------| | oomkill | Trace OOM killer | Virtual Memory | | capable | Trace capability checks | System | ## Detailed Tool Analysis ### Application Monitoring Tools #### opensnoop ```bash # Trace all file opens opensnoop # Trace specific process opensnoop -p 1234 # Include stack traces opensnoop --stack # Filter by file name opensnoop -n "*.txt" ``` #### statsnoop ```bash # Trace all stat() calls statsnoop # Show failed stats only statsnoop -x # Filter by process name statsnoop -n "nginx" # Include extended details statsnoop -v ``` #### bashreadline ```bash # Trace all bash commands bashreadline # Include timestamps bashreadline -t # Trace specific shell PID bashreadline -p 1234 ``` ### Network Analysis Tools #### tcpconnect ```bash # Trace all TCP connections tcpconnect # Show port numbers tcpconnect -p # Include timestamps tcpconnect -t # Filter by port tcpconnect -P 80 ``` #### tcpretrans ```bash # Trace TCP retransmissions tcpretrans # Include TCP state tcpretrans -s # Show stack traces tcpretrans --stack # Filter by IP tcpretrans -i 192.168.1.1 ``` ### File System Analysis #### vfscount ```bash # Count VFS operations vfscount # Group by operation type vfscount -g # Include stack traces vfscount --stack ``` #### writeback ```bash # Trace file writeback writeback # Show per-device stats writeback -d # Include process info writeback -p ``` ### Block Device Analysis #### biosnoop ```bash # Trace block I/O biosnoop # Show queued time biosnoop -q # Filter by device biosnoop -d sda # Include process info biosnoop -p ``` #### biolatency ```bash # Show block I/O latency biolatency # Use microsecond units biolatency -u # Create histogram biolatency -h # Filter by device biolatency -d sda ``` ### CPU and Scheduler Analysis #### runqlat ```bash # Show run queue latency runqlat # Use microsecond units runqlat -u # Filter by CPU runqlat -c 0 # Create histogram runqlat --hist ``` #### offcputime ```bash # Trace off-CPU time offcputime # Filter by process offcputime -p 1234 # Set duration offcputime -d 10 # Include user stacks offcputime -u ``` ## Command Reference ### General Options Most bpftrace tools support these common options: ```bash -h # Show help message -v # Verbose output -d # Debug output -p PID # Filter by process ID -t # Include timestamps --stack # Show stack traces ``` ### Advanced Usage #### Custom Scripts ```bash # Create custom bpftrace script cat > custom.bt << 'EOF' #!/usr/bin/bpftrace tracepoint:syscalls:sys_enter_open { printf("%s opened %s\n", comm, str(args->filename)); } EOF # Run custom script bpftrace custom.bt ``` #### Performance Monitoring ```bash # Monitor system calls syscount -i 1 # Monitor process creation pidpersec -i 5 # Track OOM kills oomkill -t ``` ### Best Practices 1. **Resource Usage** - Be cautious with stack traces in production - Use sampling for high-frequency events - Monitor overhead with top/htop 2. **Filtering** - Use specific filters to reduce overhead - Combine multiple conditions when possible - Consider using time-based filters 3. **Output Control** - Use appropriate output formats - Consider logging to files for analysis - Use aggregation for high-volume data 4. **Troubleshooting** - Start with broad tools - Narrow down to specific events - Use multiple tools for correlation ## Performance Considerations ### Overhead Management ```bash # Reduce overhead with sampling biolatency --sample-rate 10 # Use efficient filters opensnoop -n '*.log' # Limit stack traces tcpconnect --stack --stack-storage-size 1024 ``` ### Production Usage 1. Test tools in development first 2. Use appropriate filtering 3. Monitor system impact 4. Set appropriate buffer sizes 5. Use time-based execution limits ## Common Use Cases ### Performance Analysis ```bash # Analyze disk I/O biolatency -h biosnoop -p # Network performance tcpretrans -s tcpconnect -t # CPU scheduling runqlat --hist offcputime -p 1234 ``` ### Troubleshooting ```bash # File system issues opensnoop -t vfscount # Network problems tcpdrop tcpretrans # Memory issues oomkill -t ``` ### Security Monitoring ```bash # Track capability checks capable -v # Monitor process creation execsnoop -t # Track file access opensnoop -t ```