URL: https://dzone.com/articles/advanced-linux-troubleshooting-techniques-for-sres


Advanced Linux Troubleshooting Techniques for Site Reliability Engineers

Explore advanced troubleshooting techniques beyond basic tools and commands, focusing on kernel debugging, system call tracing, performance analysis, and eBPF.

In Site Reliability Engineering (SRE), the ability to quickly and effectively troubleshoot issues within Linux systems is crucial. This article explores advanced troubleshooting techniques beyond basic tools and commands, focusing on kernel debugging, system call tracing, performance analysis, and using the Extended Berkeley Packet Filter (eBPF) for real-time data gathering.

Kernel Debugging

Kernel debugging is a fundamental skill for any SRE working with Linux. It allows for deep inspection of the kernel's behavior, which is critical when diagnosing system crashes or performance bottlenecks.

Tools and Techniques

GDB (GNU Debugger)

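GDB can debug the Linux kernel and its loadable modules when it is given a vmlinux image built with debug symbols. Against a live kernel, it needs a stub such as KGDB (described next) before it can set breakpoints, step through code, and inspect variables.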

KGDB

KGDB is the kernel's built-in debugger stub. It allows the running kernel to be debugged with GDB from a second machine over a serial connection or a network. The kernel documentation page "Using kgdb, kdb and the kernel debugger internals" explains in detail how kgdb can be enabled and configured.
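
Before attempting a KGDB session, it helps to confirm that the kernel was built with the required options. The following is a minimal sketch, assuming the distribution ships the kernel config as /boot/config-<release>; the function name kgdb_config_status is made up for illustration.

```shell
# Report whether a kernel config file enables the options kgdb relies on.
# Usage: kgdb_config_status [CONFIG_FILE]
# Assumption: the config lives at /boot/config-$(uname -r) when no file is given.
kgdb_config_status() {
    config="${1:-/boot/config-$(uname -r)}"
    if [ ! -r "$config" ]; then
        echo "cannot read $config" >&2
        return 2
    fi
    rc=0
    for opt in CONFIG_KGDB CONFIG_KGDB_SERIAL_CONSOLE; do
        # Accept built-in (=y) or modular (=m) options.
        if grep -Eq "^${opt}=[ym]\$" "$config"; then
            echo "$opt: enabled"
        else
            echo "$opt: missing"
            rc=1
        fi
    done
    return "$rc"
}
```

If both options are enabled, the kernel documentation describes booting the target with parameters such as kgdboc=ttyS0,115200 kgdbwait and then, on the development machine, running gdb ./vmlinux followed by target remote /dev/ttyS0. The serial device and baud rate shown here are placeholders for your own setup.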

Dynamic Debugging (dyndbg)

Linux's dynamic debug feature lets you switch individual kernel debug messages (pr_debug() and dev_dbg() call sites) on and off at runtime, which helps trace kernel operations without rebuilding the kernel or rebooting the system. The official "Dynamic debug" page in the kernel documentation describes how to use the feature.
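
As a rough illustration, dynamic debug is driven by writing short queries to a control file. The sketch below assumes a kernel built with CONFIG_DYNAMIC_DEBUG and debugfs mounted at /sys/kernel/debug. The helper names dyndbg_query and dyndbg_apply are hypothetical, and e1000e is only an example module name.

```shell
# Path of the dynamic debug control file (assumes debugfs is mounted here).
DYNDBG_CONTROL="${DYNDBG_CONTROL:-/sys/kernel/debug/dynamic_debug/control}"

# Build a query that turns a module's debug call sites on ("+p") or off ("-p").
# Usage: dyndbg_query MODULE on|off
dyndbg_query() {
    module="$1"
    action="$2"
    case "$action" in
        on)  flag="+p" ;;
        off) flag="-p" ;;
        *)   echo "usage: dyndbg_query MODULE on|off" >&2; return 2 ;;
    esac
    printf 'module %s %s\n' "$module" "$flag"
}

# Write the query to the control file. This normally requires root.
dyndbg_apply() {
    query="$(dyndbg_query "$1" "$2")" || return 2
    if [ ! -w "$DYNDBG_CONTROL" ]; then
        echo "cannot write $DYNDBG_CONTROL (need root and dynamic debug)" >&2
        return 1
    fi
    printf '%s\n' "$query" > "$DYNDBG_CONTROL"
}

# Example, as root:
#   dyndbg_apply e1000e on     # enable the module's debug messages
#   dmesg --follow             # watch them arrive
#   dyndbg_apply e1000e off    # disable them again
```

Turning the messages off again when you are done keeps the kernel log readable, since some drivers are very chatty at debug level.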

Tracing System Calls With strace

strace is a diagnostic tool that records the system calls a program makes and the signals it receives. It is instrumental in understanding the interaction between applications and the Linux kernel.

Usage

To trace system calls, you can either attach strace to a running process or start a new process under it. strace logs every system call along with its arguments and return value, and the log can be analyzed to find failing or unexpectedly slow operations.

Example:

Shell
root@ubuntu:~# strace -p 2009
strace: Process 2009 attached
munmap(0xe02057400000, 134221824) = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824) = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824) = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824) = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824) = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824) = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824) = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0


In the example above, the -p flag tells strace to attach to an already running process, and 2009 is that process's PID. Similarly, you can use the -o flag to write the trace to a file instead of printing everything to the terminal. For a deeper introduction, see an article on understanding system calls on Linux with strace.
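
A long trace is easier to read once it is condensed. The sketch below counts how often each system call appears in a saved strace log; the function name strace_syscall_counts is an illustrative choice, and the parsing assumes the default output format without timestamp options.

```shell
# Count how often each system call appears in a strace log.
# Accepts plain lines ("mmap(...) = 0") as well as lines carrying a PID
# prefix from -f ("2009 mmap(...)" or "[pid  2009] mmap(...)").
# Usage: strace_syscall_counts [LOGFILE...]   (reads stdin if no file is given)
strace_syscall_counts() {
    sed -E -e 's/^\[pid +[0-9]+\] +//' -e 's/^[0-9]+ +//' "$@" |
        awk -F'(' '/^[a-z_][a-z0-9_]*\(/ { count[$1]++ }
                   END { for (name in count) print count[name], name }' |
        sort -k1,1nr -k2,2
}

# Typical ways to produce a log (2009 is the PID from the example above):
#   strace -f -o /tmp/trace.log -p 2009       # follow child processes, write to a file
#   strace -e trace=mmap,munmap -p 2009       # restrict the trace to selected calls
#   strace -c -p 2009                         # let strace print its own summary table
```

Fed the listing above, the function would report 8 munmap calls and 7 mmap calls, which points to a program that repeatedly allocates and frees the same region of roughly 128 MiB.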

Performance Analysis With perf

perf is a versatile tool used for system performance analysis. It provides a rich set of commands to collect, analyze, and report on hardware and software events.

Key Features

  1. perf record: Collects performance data into a file (perf.data by default) for later analysis.
  2. perf report: Reads the data collected by perf record and shows where most of the time was spent, which helps identify hotspots and performance bottlenecks.
  3. Event-based sampling: perf can record data based on specific events, such as cache misses or CPU cycles, which helps pinpoint performance issues more accurately.
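
The three features above can be combined in a small wrapper. This is a sketch rather than a recommended workflow: the function name perf_sample is hypothetical, and whether an event such as cache-misses is available depends on the hardware and on the system's perf permissions.

```shell
# Sample a command on a chosen event, then print a text report grouped by
# command and shared object.
# Usage: perf_sample EVENT DATA_FILE COMMAND [ARGS...]
perf_sample() {
    event="$1"
    data="$2"
    shift 2
    if ! command -v perf >/dev/null 2>&1; then
        echo "perf not found; install your distribution's perf package" >&2
        return 127
    fi
    # -e selects the event, -g records call graphs, -o names the output file.
    perf record -e "$event" -g -o "$data" -- "$@" &&
        perf report -i "$data" --stdio --sort comm,dso
}

# Example:
#   perf_sample cache-misses /tmp/perf.data sleep 10
#   perf_sample cycles /tmp/perf.data ./my-workload    # ./my-workload is a placeholder
```

Running perf list shows which events the current machine supports.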

Example:

Shell
root@ubuntu:/tmp# perf record
^C[ perf record: Woken up 17 times to write data ]
[ perf record: Captured and wrote 4.619 MB perf.data (83123 samples) ]

root@ubuntu:/tmp#

root@ubuntu:/tmp# perf report
Samples: 83K of event 'cpu-clock:ppp', Event count (approx.): 20780750000
Overhead Command Shared Object Symbol
 17.74% swapper [kernel.kallsyms] [k] cpuidle_idle_call
 8.36% stress [kernel.kallsyms] [k] __do_softirq
 7.17% stress [kernel.kallsyms] [k] finish_task_switch.isra.0
 6.90% stress [kernel.kallsyms] [k] el0_da
 5.73% stress libc.so.6 [.] random_r
 3.92% stress [kernel.kallsyms] [k] flush_end_io
 3.87% stress libc.so.6 [.] random
 3.71% stress libc.so.6 [.] 0x00000000001405bc
 2.71% kworker/0:2H-kb [kernel.kallsyms] [k] ata_scsi_queuecmd
 2.58% stress libm.so.6 [.] __sqrt_finite
 2.45% stress stress [.] 0x0000000000000f14
 1.62% stress stress [.] 0x000000000000168c
 1.46% stress [kernel.kallsyms] [k] __pi_clear_page
 1.37% stress libc.so.6 [.] rand
 1.34% stress libc.so.6 [.] 0x00000000001405c4
 1.22% stress stress [.] 0x0000000000000e94
 1.20% stress [kernel.kallsyms] [k] folio_batch_move_lru
 1.20% stress stress [.] 0x0000000000000f10
 1.16% stress libc.so.6 [.] 0x00000000001408d4
 0.84% stress [kernel.kallsyms] [k] handle_mm_fault
 0.77% stress [kernel.kallsyms] [k] release_pages
 0.65% stress [kernel.kallsyms] [k] super_lock
 0.62% stress [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
 0.61% stress [kernel.kallsyms] [k] blk_done_softirq
 0.61% stress [kernel.kallsyms] [k] _raw_spin_lock
 0.60% stress [kernel.kallsyms] [k] folio_add_lru
 0.58% kworker/0:2H-kb [kernel.kallsyms] [k] finish_task_switch.isra.0
 0.55% stress [kernel.kallsyms] [k] __rcu_read_lock
 0.52% stress [kernel.kallsyms] [k] percpu_ref_put_many.constprop.0
 0.46% stress stress [.] 0x00000000000016e0
 0.45% stress [kernel.kallsyms] [k] __rcu_read_unlock
 0.45% stress [kernel.kallsyms] [k] dynamic_might_resched
 0.42% stress [kernel.kallsyms] [k] _raw_spin_unlock
 0.41% stress [kernel.kallsyms] [k] __mod_memcg_lruvec_state
 0.40% stress [kernel.kallsyms] [k] mas_walk
 0.39% stress [kernel.kallsyms] [k] arch_counter_get_cntvct
 0.39% stress [kernel.kallsyms] [k] rwsem_read_trylock
 0.39% stress [kernel.kallsyms] [k] up_read
 0.38% stress [kernel.kallsyms] [k] down_read
 0.37% stress [kernel.kallsyms] [k] get_mem_cgroup_from_mm
 0.36% stress [kernel.kallsyms] [k] free_unref_page_commit
 0.34% stress [kernel.kallsyms] [k] memset
 0.32% stress libc.so.6 [.] 0x00000000001408c8
 0.30% stress [kernel.kallsyms] [k] sync_inodes_sb
 0.29% stress [kernel.kallsyms] [k] iterate_supers
 0.29% stress [kernel.kallsyms] [k] percpu_counter_add_batch
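
A per-symbol report like the one above can be long. One way to get a coarser view is to sum the overhead per command and shared object. The sketch below does this with awk on text output. It assumes the column layout shown in the listing and command names without spaces; perf_overhead_by_dso is an illustrative name.

```shell
# Sum the Overhead column per (command, shared object) pair.
# Input: text in the layout "<pct>% <command> <shared object> [k|.] <symbol>".
# Usage: perf report --stdio | perf_overhead_by_dso
#        perf_overhead_by_dso report.txt
perf_overhead_by_dso() {
    awk '$1 ~ /^[0-9.]+%$/ {
             pct = $1
             sub(/%$/, "", pct)              # strip the trailing percent sign
             total[$2 " " $3] += pct
         }
         END { for (key in total) printf "%.2f%% %s\n", total[key], key }' "$@" |
        sort -rn
}
```

perf can produce a similar grouping itself with perf report --sort comm,dso; the awk version is mainly useful when only a saved text report is available.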


Real-Time Data Gathering With eBPF

eBPF lets you write small programs that run inside the Linux kernel in a sandboxed environment, after being checked by the in-kernel verifier. These programs can attach to events such as system calls and network packets, providing real-time insights into system behavior.

Applications

  • Network monitoring: eBPF can monitor network traffic in real-time, providing insights into packet flow and protocol usage without significant performance overhead.
  • Security: eBPF helps implement security policies by monitoring system calls and network activity to detect and prevent malicious activities.
  • Performance monitoring: It can track application performance by monitoring function calls and system resource usage, helping SREs optimize performance.
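
As a concrete example of the monitoring use case, the sketch below uses bpftrace, one of several eBPF front ends, to count system calls per process name. The article does not prescribe a particular tool, so bpftrace is an assumption here, and the function names are made up for illustration.

```shell
# Print a bpftrace program that counts system calls per process name.
# With a PID argument, only that process is counted.
# Usage: bpf_syscall_count_prog [PID]
bpf_syscall_count_prog() {
    if [ -n "${1:-}" ]; then
        printf 'tracepoint:raw_syscalls:sys_enter /pid == %d/ { @[comm] = count(); }\n' "$1"
    else
        printf 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }\n'
    fi
}

# Run the program until interrupted; bpftrace prints the counts on Ctrl-C.
# Usage: run_bpf_syscall_count [PID]
run_bpf_syscall_count() {
    if ! command -v bpftrace >/dev/null 2>&1; then
        echo "bpftrace not found" >&2
        return 127
    fi
    if [ "$(id -u)" -ne 0 ]; then
        echo "bpftrace needs root privileges" >&2
        return 1
    fi
    bpftrace -e "$(bpf_syscall_count_prog "${1:-}")"
}

# Example, as root:
#   run_bpf_syscall_count          # all processes
#   run_bpf_syscall_count 2009     # only PID 2009
```

Unlike strace, this approach aggregates the counts inside the kernel instead of stopping the traced process on every call, which is why it is generally better suited to busy production systems.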

Conclusion

Advanced troubleshooting in Linux involves a combination of tools and techniques that provide deep insights into system operations. Tools like GDB, strace, perf, and eBPF are essential for any SRE looking to enhance their troubleshooting capabilities. By leveraging these tools, SREs can ensure the high reliability and performance of Linux systems in production environments.
