Using user-space tracepoints with BPF
Much has been written on LWN about dynamically instrumenting kernel code. The same capability is available to user-space code with a special kind of probe known as a User Statically-Defined Tracing (USDT) probe. These probes offer a low-overhead way of instrumenting user-space code and a convenient way to debug applications running in production. In this final article of the BPF and BCC series we'll look at where USDT probes come from and how you can use them to understand the behavior of your own applications.
The origins of USDT probes can be found in Sun's DTrace utility. While DTrace can't claim to have invented static tracepoints (various implementations are described in the "related work" section of the original DTrace paper), it certainly made them much more popular. With the emergence of DTrace, many applications began adding USDT probes to important functions to aid with tracing and diagnosing run-time behavior. Given that, it's perhaps not surprising that these probes are usually enabled (as part of configuring the build) with the --enable-dtrace switch.
For example, MySQL provides a number of probes to aid database administrators in understanding who is connecting to the database, which SQL commands they're running, and low-level details of the data transferred between clients and servers. Other popular projects, such as Java, PostgreSQL, Node.js, and even the GNU C Library, also come with the option of enabling probes. These probes cover a wide range of activities, from memory allocation to garbage-collection events.
There is a variety of tools on Linux to work with USDT probes. SystemTap is a popular choice and an alternative to DTrace since it's only recently that DTrace has been supported on Linux. Support for USDT probes (termed "statically defined traces" inside the kernel) for perf was merged in v4.8-rc1, and even LTTng has been able to emit USDT-compatible probes since 2012. The most recent additions to the developer's USDT tool chest — and arguably the most user-friendly — are the tools and scripts in the BPF Compiler Collection (BCC).
BCC tools for working with USDT probes
BCC has had support for USDT probes since March 2016, when Sasha Goldshtein sent a pull request to the GitHub project adding support to the existing tplist and trace tools.
The tplist tool allows you to see which probes (if any) are available for the kernel, an application, or a library, and it can be used to discover names of probes to enable with trace. Running it on a version of the C library compiled with SDT support shows:
    # tplist.py -l /lib64/libc-2.27.so
    /lib64/libc-2.27.so libc:setjmp
    /lib64/libc-2.27.so libc:longjmp
    /lib64/libc-2.27.so libc:longjmp_target
    /lib64/libc-2.27.so libc:memory_mallopt_arena_max
    /lib64/libc-2.27.so libc:memory_mallopt_arena_test
    /lib64/libc-2.27.so libc:memory_tunable_tcache_max_bytes
    /lib64/libc-2.27.so libc:memory_tunable_tcache_count
    /lib64/libc-2.27.so libc:memory_tunable_tcache_unsorted_limit
    /lib64/libc-2.27.so libc:memory_mallopt_trim_threshold
    /lib64/libc-2.27.so libc:memory_mallopt_top_pad
    [ ... ]
The -l parameter tells tplist that the file argument is a library or executable. Omitting -l instructs tplist to print the list of kernel tracepoints.
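For example, the full kernel tracepoint list (which can run to well over a thousand entries on a modern kernel; output omitted here) is printed with:

    # tplist.py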
A filter can be applied to the list of tracepoints and probes, which helps to shorten the (potentially very long) default output of tplist. For example, using the filter sbrk prints only those probes with the string "sbrk" in their name. And using the -vv parameter prints the arguments available at the probe site. For example:
    # tplist.py -vv -l /lib64/libc-2.27.so sbrk
    /lib64/libc-2.27.so libc:memory_sbrk_less [sema 0x0]
      location #1 0x816dd
        argument #1 8 unsigned bytes @ ax
        argument #2 8 signed bytes @ bp
    /lib64/libc-2.27.so libc:memory_sbrk_more [sema 0x0]
      location #1 0x826af
        argument #1 8 unsigned bytes @ ax
        argument #2 8 signed bytes @ r12
The argument details are necessary to understand which registers hold the probe arguments. Knowing where the arguments live allows us to print their contents using the BCC trace tool, with a command like:
    # trace.py 'u:/lib64/libc-2.27.so:memory_sbrk_more "%u", arg1' -T
    TIME     PID     TID     COMM            FUNC             -
    21:46:51 12781   12781   ls              memory_sbrk_more 114974720
The trace utility takes a number of arguments and accepts a probe specifier, an example of which was used above. Probe specifiers allow users to describe exactly what they want to be printed when the probe fires. A list of examples (and a more thorough explanation of the format) is provided in the trace_example.txt file in the BCC repository. The output above shows one hit when a process running ls hit the memory_sbrk_more probe.
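Probe specifiers can also carry a predicate in parentheses that must be true for an event to be printed. As a sketch (the 100MB threshold is arbitrary), the following variant of the earlier command would only report memory_sbrk_more events whose first argument exceeds that value:

    # trace.py 'u:/lib64/libc-2.27.so:memory_sbrk_more (arg1 > 104857600) "%u", arg1' -T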
Additional tools in BCC enable USDT probes for popular high-level languages like Java, Python, Ruby, and PHP:

- lib/calls.py summarizes method calls
- lib/uflow.py traces function entry and exit and prints a visual flow graph
- lib/ugc.py traces garbage-collection events
- lib/uobjnew.py prints summary statistics for new object allocations
- lib/uthreads.py prints details on thread creation

lib/ustat.py is a monitoring tool that pulls all of these together and displays their events with a top-like interface:
    # ustat.py
    Tracing... Output every 10 secs. Hit Ctrl-C to end
    12:17:17 loadavg: 0.33 0.08 0.02 5/211 26284

    PID    CMDLINE        METHOD/s   GC/s   OBJNEW/s   CLOAD/s   EXC/s   THR/s
    3018   node/node      0          3      0          0         0       0
The output above shows that pid 3018, a node process, generated three garbage collection events within a ten-second period. Like most of these scripts, ustat.py runs until interrupted by the user.
In addition to the language-specific tools, BCC also includes specialized scripts for specific applications. For example, bashreadline.py prints commands from all running bash shells:
    # bashreadline.py
    TIME      PID    COMMAND
    05:28:25  21176  ls -l
    05:28:28  21176  date
    05:28:35  21176  echo hello world
    05:28:43  21176  foo this command failed
    05:28:45  21176  df -h
    05:29:04  3059   echo another shell
    05:29:13  21176  echo first shell again
dbslower.py prints database (MySQL or PostgreSQL) operations with a latency that exceeds a specified threshold:
    # dbslower.py mysql -m 1000
    Tracing database queries for pids 25776 slower than 1000 ms...
    TIME(s)        PID          MS QUERY
    1.421264       25776  2002.183 call getproduct(97)
    3.572617       25776  2001.381 call getproduct(97)
    5.661411       25776  2001.867 call getproduct(97)
    7.748296       25776  2001.329 call getproduct(97)
Adding USDT probes to your application
SystemTap provides an API for adding static probes to an application. To create them, you'll need the systemtap-sdt-devel package, which provides the sys/sdt.h header file. The documentation for the SystemTap project provides an example of adding a probe, but we'll add one to a simple C program and use the BCC tools to list and enable the probe:
    #include <sys/sdt.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        struct timeval tv;

        while (1) {
            gettimeofday(&tv, NULL);
            DTRACE_PROBE1(test-app, test-probe, tv.tv_sec);
            sleep(1);
        }

        return 0;
    }
This simple program runs until interrupted. On each loop iteration it fires a probe and then calls sleep() to wait for one second. The DTRACE_PROBE1() macro creates a probe point at the desired location, in this case immediately before sleeping. The macro takes a provider name, a probe name, and arguments as parameters; there is a separate DTRACE_PROBEn() macro for each argument count. For example, if your probe has three arguments, you need to use DTRACE_PROBE3(), as shown in the sketch below.
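A hypothetical probe with three arguments might look like this (the probe name and variables are illustrative and not part of the example program above):

    /* Hypothetical: report a request ID, its latency, and a status code. */
    DTRACE_PROBE3(test-app, request-done, req_id, latency_usec, status);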
The DTRACE_PROBEn() macros are implemented by placing a no-op assembly instruction at the probe site and writing an ELF note into the application image that includes things like the probe address and name. Since the run-time overhead of an inactive probe is just the cost of executing a no-op instruction, and the ELF note isn't loaded into memory, the impact on performance is minimal.
The provider name allows you to create a namespace for your probe. The most common choice (and the one suggested in the SystemTap example) is the name of your application or library. In the example above I've used test-app, and the probe was imaginatively named test-probe. The probe's single argument is the current time in seconds.
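Assuming the source above is saved as simple-c.c, no special compiler flags are needed to build it; the probe machinery comes entirely from the sys/sdt.h header:

    $ gcc -o simple-c simple-c.c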
Using tplist, we can see the probe and its argument:
    # tplist.py -vv -l ./simple-c
    simple-c test-app:test-probe [sema 0x0]
      location #1 0x40057b
        argument #1 8 signed bytes @ ax
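The same probe metadata is recorded in the binary's .note.stapsdt ELF section, so it can also be inspected with standard binutils (output omitted here; the notes list each probe's provider, name, location, and arguments):

    $ readelf -n ./simple-c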
We can then construct a probe specifier to print the first argument with trace, assuming the simple-c program above is running:
    # trace.py 'u:./simple-c:test-probe "%u", arg1' -T -p $(pidof simple-c)
    TIME     PID     TID     COMM            FUNC             -
    21:55:44 13450   13450   simple-c        test-probe       1524430544
    21:55:45 13450   13450   simple-c        test-probe       1524430545
    21:55:46 13450   13450   simple-c        test-probe       1524430546
The final column in the output shows the current time, in seconds, when the probe fires; this is the information passed as the first argument in the probe declaration. You do need to be aware of the data types of the probe arguments, since they are reflected in the printf()-style format string used in the probe specifier.
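For instance, since tplist reported the argument as eight signed bytes, a signed format specifier is arguably a better fit than the "%u" used above; this is a minor variation on the earlier command:

    # trace.py 'u:./simple-c:test-probe "%d", arg1' -T -p $(pidof simple-c)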
Conclusion
USDT probes bring the flexibility of kernel tracepoints to user-space applications. Thanks to the rise of DTrace, many popular applications and high-level programming languages gained support for USDT probes. BCC provides simple tools for working with probes that allow developers to list available probes in libraries and applications, and to trace them to print diagnostic data. Adding probes to your own code is possible with SystemTap's API and the collection of DTRACE_PROBEn() macros. USDT probes can help you troubleshoot your applications in production with minimal run-time overhead.
Posted May 16, 2018 2:20 UTC (Wed) by _joel_ (subscriber, #112763)
Btw, one of the drawbacks brought up about USDT is the fact that there is a context switch to kernel mode when the tracepoint is hit, which can affect performance-sensitive applications. At my workplace, there happens to be more interest in letting user-space tracepoints be traced within user space itself. The other drawback that's brought up is that USDT is tightly coupled with ELF, thus limiting its ability to trace dynamic languages. Neither of these drawbacks should affect users of USDT who do find it useful and don't fall into the above categories.
Posted May 16, 2018 12:30 UTC (Wed) by zse (guest, #120483)
Though at least for tracing (and dumping the trace into a file), LTTng should be a lot more efficient with its high-performance buffering of tracing events, as far as I understand. Is this correct? Are there advantages to using this over LTTng?
Posted May 18, 2018 18:00 UTC (Fri) by compudj (subscriber, #43335)
LTTng-UST relies on buffering, which is done purely in user space, so the runtime overhead when instrumentation is hit is very small. LTTng-UST writes trace data into per-CPU buffers. This provides a detailed log to look into for further analysis, and is very useful in cases where the problem you are tackling is hard to reproduce. In those situations, you really want to capture as much data as possible surrounding the problem, and then look at the trace with various analyses and visualize them in a graphical tool.
We can also envision use cases where both types of tools are used together: eBPF scripts can be used as triggers to identify outlier situations (problems), each of which in turn gathers a snapshot of the LTTng-UST buffers. Those buffers contain the detailed trace data from the time frame just before the issue was encountered. With this kind of integration, it's then possible to extract detailed, relevant trace data without wasting disk or network I/O needlessly.
I personally see both tools as complementary.
Thanks,
Mathieu (LTTng-UST maintainer)
Posted May 28, 2018 10:34 UTC (Mon) by myllynen (guest, #55412)
FWIW, in PCP (Performance Co-Pilot, http://pcp.io/) we're adding support for providing metrics on hit counts for any kernel tracepoint, uprobe, or USDT/dtrace probe, based on a simple text-file configuration. This will be available in PCP 4.1.0, to be released in mid-June.
Here's a quick screenshot of the configuration and the results on the command line (the results could also be forwarded to or displayed with various external tools like Elasticsearch, Grafana, Vector, and Zabbix; see, e.g., the references in the pmrep(1) manual page):
    # cat bcc-tracepoint.conf
    #
    # BCC PMDA tracepoint hits module config file - see pmdabcc(1) and perf-list(1)
    #
    random:urandom_read
    tlb:tlb_flush

    # cat bcc-uprobe.conf
    #
    # BCC PMDA uprobe hits module config file - see pmdabcc(1) and perf-probe(1)
    #
    c:malloc
    c:strlen

    # cat bcc-usdt.conf
    #
    # BCC PMDA USDT hits module config file - see pmdabcc(1) and tplist(8)
    #
    /etc/alternatives/jre/lib/server/libjvm.so:gc__begin
    /etc/alternatives/jre/lib/server/libjvm.so:thread__start

    # pminfo -f bcc

    bcc.uprobe.hits
        inst [0 or "c::strlen"] value 309
        inst [1 or "c::malloc"] value 48673

    bcc.usdt.hits
        inst [1 or "/etc/alternatives/jre/lib/server/libjvm.so::gc__begin"] value 4

    bcc.tracepoint.hits
        inst [0 or "tlb:tlb_flush"] value 46613
        inst [1 or "random:urandom_read"] value 355