Using user-space tracepoints with BPF
Much has been written on LWN about dynamically instrumenting kernel code. The same capability is available to user-space code with a special kind of probe known as a User Statically-Defined Tracing (USDT) probe. These probes offer a low-overhead way of instrumenting user-space code and a convenient way to debug applications running in production. In this final article of the BPF and BCC series we'll look at where USDT probes come from and how you can use them to understand the behavior of your own applications.
The origins of USDT probes can be found in Sun's DTrace utility. While DTrace can't claim to have invented static tracepoints (various implementations are described in the "related work" section of the original DTrace paper), it certainly made them much more popular. With the emergence of DTrace, many applications began adding USDT probes to important functions to aid with tracing and diagnosing run-time behavior. Given that, it's perhaps not surprising that these probes are usually enabled (as part of configuring the build) with the --enable-dtrace switch.
For example, MySQL provides a number of probes to aid database administrators in understanding who is connecting to the database, which SQL commands they're running, and low-level details of the data transferred between clients and servers. Other popular projects, such as Java, PostgreSQL, Node.js, and even the GNU C Library, also come with the option of enabling probes. These probes cover a wide range of activities, from memory allocation to garbage-collection events.
There is a variety of tools on Linux to work with USDT probes. SystemTap is a popular choice and an alternative to DTrace since it's only recently that DTrace has been supported on Linux. Support for USDT probes (termed "statically defined traces" inside the kernel) for perf was merged in v4.8-rc1, and even LTTng has been able to emit USDT-compatible probes since 2012. The most recent additions to the developer's USDT tool chest — and arguably the most user-friendly — are the tools and scripts in the BPF Compiler Collection (BCC).
BCC tools for working with USDT probes
BCC has had support for USDT probes since March 2016, when Sasha Goldshtein sent a pull request to the GitHub project adding support to the existing tplist and trace tools.
The tplist tool allows you to see which probes (if any) are available for the kernel, an application, or a library, and it can be used to discover names of probes to enable with trace. Running it on a version of the C library compiled with SDT support shows:
    # tplist.py -l /lib64/libc-2.27.so
    /lib64/libc-2.27.so libc:setjmp
    /lib64/libc-2.27.so libc:longjmp
    /lib64/libc-2.27.so libc:longjmp_target
    /lib64/libc-2.27.so libc:memory_mallopt_arena_max
    /lib64/libc-2.27.so libc:memory_mallopt_arena_test
    /lib64/libc-2.27.so libc:memory_tunable_tcache_max_bytes
    /lib64/libc-2.27.so libc:memory_tunable_tcache_count
    /lib64/libc-2.27.so libc:memory_tunable_tcache_unsorted_limit
    /lib64/libc-2.27.so libc:memory_mallopt_trim_threshold
    /lib64/libc-2.27.so libc:memory_mallopt_top_pad
    [ ... ]
The -l parameter tells tplist that the file argument is a library or executable. Omitting -l instructs tplist to print the list of kernel tracepoints.
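For example, the full kernel tracepoint list (which can run to well over a thousand entries on a modern kernel; output omitted here) is printed with:

    # tplist.py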
A filter can be applied to the list of tracepoints and probes, which helps to shorten the (potentially very long) default output of tplist. For example, using the filter sbrk prints only those probes with the string "sbrk" in their name. And using the -vv parameter prints the arguments available at the probe site. For example:
    # tplist.py -vv -l /lib64/libc-2.27.so sbrk
    /lib64/libc-2.27.so libc:memory_sbrk_less [sema 0x0]
      location #1 0x816dd
        argument #1 8 unsigned bytes @ ax
        argument #2 8 signed bytes @ bp
    /lib64/libc-2.27.so libc:memory_sbrk_more [sema 0x0]
      location #1 0x826af
        argument #1 8 unsigned bytes @ ax
        argument #2 8 signed bytes @ r12
The argument details are necessary to understand which registers hold the probe arguments. Knowing where the arguments live allows us to print their contents using the BCC trace tool, with a command like:
    # trace.py 'u:/lib64/libc-2.27.so:memory_sbrk_more "%u", arg1' -T
    TIME     PID     TID     COMM            FUNC             -
    21:46:51 12781   12781   ls              memory_sbrk_more 114974720
The trace utility takes a number of arguments and accepts a probe specifier, an example of which was used above. Probe specifiers allow users to describe exactly what they want to be printed when the probe fires. A list of examples (and a more thorough explanation of the format) is provided in the trace_example.txt file in the BCC repository. The output above shows one hit when a process running ls hit the memory_sbrk_more probe.
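Probe specifiers can also carry a predicate in parentheses that must be true for an event to be printed. As a sketch (the 100MB threshold is arbitrary), the following variant of the earlier command would only report memory_sbrk_more events whose first argument exceeds that value:

    # trace.py 'u:/lib64/libc-2.27.so:memory_sbrk_more (arg1 > 104857600) "%u", arg1' -T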
Additional tools in BCC enable USDT probes for popular high-level languages like Java, Python, Ruby, and PHP:

- lib/calls.py summarizes method calls
- lib/uflow.py traces function entry and exit and prints a visual flow graph
- lib/ugc.py traces garbage-collection events
- lib/uobjnew.py prints summary statistics for new object allocations
- lib/uthreads.py prints details on thread creation

lib/ustat.py is a monitoring tool that pulls all of these together and displays their events with a top-like interface:
    # ustat.py
    Tracing... Output every 10 secs. Hit Ctrl-C to end
    12:17:17 loadavg: 0.33 0.08 0.02 5/211 26284

    PID    CMDLINE        METHOD/s   GC/s   OBJNEW/s   CLOAD/s   EXC/s   THR/s
    3018   node/node      0          3      0          0         0       0
The output above shows that pid 3018, a node process, generated three garbage collection events within a ten-second period. Like most of these scripts, ustat.py runs until interrupted by the user.
In addition to the language-specific tools, BCC also includes specialized scripts for specific applications. For example, bashreadline.py prints commands from all running bash shells:
    # bashreadline.py
    TIME      PID    COMMAND
    05:28:25  21176  ls -l
    05:28:28  21176  date
    05:28:35  21176  echo hello world
    05:28:43  21176  foo this command failed
    05:28:45  21176  df -h
    05:29:04  3059   echo another shell
    05:29:13  21176  echo first shell again
dbslower.py prints database (MySQL or PostgreSQL) operations with a latency that exceeds a specified threshold:
    # dbslower.py mysql -m 1000
    Tracing database queries for pids 25776 slower than 1000 ms...
    TIME(s)        PID          MS QUERY
    1.421264       25776  2002.183 call getproduct(97)
    3.572617       25776  2001.381 call getproduct(97)
    5.661411       25776  2001.867 call getproduct(97)
    7.748296       25776  2001.329 call getproduct(97)
Adding USDT probes to your application
SystemTap provides an API for adding static probes to an application. To create them, you'll need the systemtap-sdt-devel package, which provides the sys/sdt.h header file. The documentation for the SystemTap project provides an example of adding a probe, but we'll add one to a simple C program and use the BCC tools to list and enable the probe:
    #include <sys/sdt.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        struct timeval tv;

        while (1) {
            gettimeofday(&tv, NULL);
            DTRACE_PROBE1(test-app, test-probe, tv.tv_sec);
            sleep(1);
        }

        return 0;
    }
This simple program runs until interrupted. On each loop iteration it fires a probe and then calls sleep() to wait for one second. The DTRACE_PROBE1() macro creates a probe point at the desired location, in this case immediately before sleeping. The macro takes a provider name, a probe name, and arguments as parameters; there is a separate DTRACE_PROBEn() macro for each argument count. For example, if your probe has three arguments, you need to use DTRACE_PROBE3(), as shown in the sketch below.
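A hypothetical probe with three arguments might look like this (the probe name and variables are illustrative and not part of the example program above):

    /* Hypothetical: report a request ID, its latency, and a status code. */
    DTRACE_PROBE3(test-app, request-done, req_id, latency_usec, status);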
The DTRACE_PROBEn() macros are implemented by placing a no-op assembly instruction at the probe site and writing an ELF note into the application image that includes things like the probe address and name. Since the run-time overhead of an inactive probe is just the cost of executing a no-op instruction, and the ELF note isn't loaded into memory, the impact on performance is minimal.
The provider name allows you to create a namespace for your probe. The most common choice (and the one suggested in the SystemTap example) is the name of your application or library. In the example above I've used test-app, and the probe was imaginatively named test-probe. The probe's single argument is the current time in seconds.
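Assuming the source above is saved as simple-c.c, no special compiler flags are needed to build it; the probe machinery comes entirely from the sys/sdt.h header:

    $ gcc -o simple-c simple-c.c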
Using tplist, we can see the probe and its argument:
    # tplist.py -vv -l ./simple-c
    simple-c test-app:test-probe [sema 0x0]
      location #1 0x40057b
        argument #1 8 signed bytes @ ax
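The same probe metadata is recorded in the binary's .note.stapsdt ELF section, so it can also be inspected with standard binutils (output omitted here; the notes list each probe's provider, name, location, and arguments):

    $ readelf -n ./simple-c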
We can then construct a probe specifier to print the first argument with trace, assuming the simple-c program above is running:
    # trace.py 'u:./simple-c:test-probe "%u", arg1' -T -p $(pidof simple-c)
    TIME     PID     TID     COMM            FUNC             -
    21:55:44 13450   13450   simple-c        test-probe       1524430544
    21:55:45 13450   13450   simple-c        test-probe       1524430545
    21:55:46 13450   13450   simple-c        test-probe       1524430546
The final column in the output shows the current time, in seconds, when the probe fires; this is the information passed as the first argument in the probe declaration. You do need to be aware of the data types of the probe arguments, since they are reflected in the printf()-style format string used in the probe specifier.
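For instance, since tplist reported the argument as eight signed bytes, a signed format specifier is arguably a better fit than the "%u" used above; this is a minor variation on the earlier command:

    # trace.py 'u:./simple-c:test-probe "%d", arg1' -T -p $(pidof simple-c)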
Conclusion
USDT probes bring the flexibility of kernel tracepoints to user-space applications. Thanks to the rise of DTrace, many popular applications and high-level programming languages gained support for USDT probes. BCC provides simple tools for working with probes that allow developers to list available probes in libraries and applications, and to trace them to print diagnostic data. Adding probes to your own code is possible with SystemTap's API and the collection of DTRACE_PROBEn() macros. USDT probes can help you troubleshoot your applications in production with minimal run-time overhead.
Posted May 16, 2018 2:20 UTC (Wed) by _joel_ (subscriber, #112763)
Btw, one of the drawbacks brought up about USDT is the fact that there is a context switch to kernel mode when the tracepoint is hit, which can affect performance-sensitive applications. At my workplace, there happens to be more interest in letting user-space tracepoints be traced within user space itself. The other drawback that's brought up is that USDT is tightly coupled with ELF, thus limiting its ability to trace dynamic languages. Neither of these drawbacks should affect users of USDT who do find it useful and don't fall into the above categories.
Posted May 16, 2018 12:30 UTC (Wed) by zse (guest, #120483)
Though at least for tracing (and dumping the trace into a file), LTTng should be a lot more efficient with its high-performance buffering of tracing events, as far as I understand. Is this correct? Are there advantages to using this over LTTng?
Posted May 18, 2018 18:00 UTC (Fri) by compudj (subscriber, #43335)
LTTng-UST relies on buffering, which is done purely in user space, so the runtime overhead when instrumentation is hit is very small. LTTng-UST writes trace data into per-CPU buffers. This provides a detailed log to look into for further analysis, and is very useful in cases where the problem you are tackling is hard to reproduce. In those situations, you really want to capture as much data as possible surrounding the problem, and then look at the trace with various analyses and visualize them in a graphical tool.
We can also envision use cases where both types of tools are used together: eBPF scripts can be used as triggers to identify outlier situations (problems), each of which in turn gathers a snapshot of the LTTng-UST buffers. Those buffers contain the detailed trace data from the time frame just before the issue was encountered. With this kind of integration, it's then possible to extract detailed, relevant trace data without wasting disk or network I/O needlessly.
I personally see both tools as complementary.
Thanks,
Mathieu (LTTng-UST maintainer)
Posted May 28, 2018 10:34 UTC (Mon) by myllynen (guest, #55412)
FWIW, in PCP (Performance Co-Pilot, http://pcp.io/) we're adding support for providing metrics on hit counts for any kernel tracepoint, uprobe, or USDT/dtrace probe, based on a simple text-file configuration. This will be available in PCP 4.1.0, to be released in mid-June.
Here's a quick screenshot of the configuration and the results on the command line (the results could also be forwarded to or displayed with various external tools like Elasticsearch, Grafana, Vector, and Zabbix; see, e.g., the references in the pmrep(1) manual page):
    # cat bcc-tracepoint.conf
    #
    # BCC PMDA tracepoint hits module config file - see pmdabcc(1) and perf-list(1)
    #
    random:urandom_read
    tlb:tlb_flush

    # cat bcc-uprobe.conf
    #
    # BCC PMDA uprobe hits module config file - see pmdabcc(1) and perf-probe(1)
    #
    c:malloc
    c:strlen

    # cat bcc-usdt.conf
    #
    # BCC PMDA USDT hits module config file - see pmdabcc(1) and tplist(8)
    #
    /etc/alternatives/jre/lib/server/libjvm.so:gc__begin
    /etc/alternatives/jre/lib/server/libjvm.so:thread__start

    # pminfo -f bcc

    bcc.uprobe.hits
        inst [0 or "c::strlen"] value 309
        inst [1 or "c::malloc"] value 48673

    bcc.usdt.hits
        inst [1 or "/etc/alternatives/jre/lib/server/libjvm.so::gc__begin"] value 4

    bcc.tracepoint.hits
        inst [0 or "tlb:tlb_flush"] value 46613
        inst [1 or "random:urandom_read"] value 355