An introduction to KProbes

April 18, 2005

This article was contributed by Sudhanshu Goswami

Introduction

KProbes is a debugging mechanism for the Linux kernel which can also be used for monitoring events inside a production system. You can use it to weed out performance bottlenecks, log specific events, trace problems etc. KProbes was developed by IBM as an underlying mechanism for another higher level tracing tool called DProbes. DProbes adds a number of features, including its own scripting language for the writing of probe handlers. However, only KProbes has been merged into the standard kernel.

In this article I will describe the implementation of KProbes as present in the 2.6.11.7 kernel. KProbes depends heavily on architecture-specific processor features and uses slightly different mechanisms on each architecture it supports. It is available on the following architectures: ppc64, x86_64, sparc64 and i386. The following discussion pertains only to the x86 (i386) architecture and assumes a certain familiarity with x86 interrupt and exception handling.

A kernel probe is a set of handlers placed on a certain instruction address. There are currently two types of probes in the kernel, called "KProbes" and "JProbes." A KProbe is defined by a pre-handler and a post-handler. When a KProbe is installed at a particular instruction and that instruction is executed, the pre-handler runs just before the probed instruction and the post-handler runs just after it. JProbes are used to get access to a kernel function's arguments at runtime. A JProbe is defined by a JProbe handler with the same prototype as the function whose arguments are to be accessed. When the probed function is called, control is first transferred to the user-defined JProbe handler and then to the original function. The KProbes package has been designed so that tools for debugging, tracing and logging can be built by extending it.

[KProbes architecture] The figure to the right describes the architecture of KProbes. On the x86, KProbes makes use of the exception handling mechanisms and modifies the standard breakpoint, debug and a few other exception handlers for its own purpose. Most of the handling of the probes is done in the context of the breakpoint and the debug exception handlers which make up the KProbes architecture dependent layer. The KProbes architecture independent layer is the KProbes manager which is used to register and unregister probes. Users provide probe handlers in kernel modules which register probes through the KProbes manager.

KProbes Interface

The data structures and functions implementing the KProbes interface have been defined in the file <linux/kprobes.h>. The following data structure describes a KProbe.

struct kprobe {
    struct hlist_node hlist;                    /* Internal */
    kprobe_opcode_t *addr;                      /* Address of probe */
    kprobe_pre_handler_t pre_handler;           /* Address of pre-handler */
    kprobe_post_handler_t post_handler;         /* Address of post-handler */
    kprobe_fault_handler_t fault_handler;       /* Address of fault handler */
    kprobe_break_handler_t break_handler;       /* Internal */
    kprobe_opcode_t opcode;                     /* Internal */        
    kprobe_opcode_t insn[MAX_INSN_SIZE];        /* Internal */
};

Let's first talk about registering a KProbe. Users can insert their own probe inside a running kernel by writing a kernel module which implements the pre-handler and the post-handler for the probe. In case a fault occurs while executing a probe handler function, the user can handle the fault by defining a fault-handler and passing its address in struct kprobe. The prototypes for these are defined as below.

typedef int (*kprobe_pre_handler_t)(struct kprobe*, struct pt_regs*);
typedef void (*kprobe_post_handler_t)(struct kprobe*, struct pt_regs*, 
              unsigned long flags);
typedef int (*kprobe_fault_handler_t)(struct kprobe*, struct pt_regs*, 
             int trapnr);

As can be seen, the pre-handler and the post-handler both receive a reference to the probe as well as the registers saved for the context in which the probe was hit. These values can be used in the pre-handler or post-handler or, if required, modified before control returns to the subsequent instruction. Passing the probe reference also means that the same handlers can be used for multiple probe locations. The flags parameter is currently unused. The trapnr parameter (for the fault handler function) contains the number of the exception which occurred while handling the KProbe. A user-defined fault handler returns 0 to let KProbes and the kernel handle the fault further; it returns 1 if it has handled the fault itself and wants execution of the probe handler to continue.
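To illustrate how one handler can serve several probe locations, here is a minimal user-space sketch. It mirrors the handler prototypes above, but struct pt_regs and struct kprobe are cut-down stand-ins rather than the kernel definitions, and counted_probe, count_pre(), set_eax_post() and fire_probe() are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel types; only the fields this sketch needs. */
struct pt_regs { unsigned long eip; unsigned long eax; };
struct kprobe;
typedef int  (*kprobe_pre_handler_t)(struct kprobe *, struct pt_regs *);
typedef void (*kprobe_post_handler_t)(struct kprobe *, struct pt_regs *,
                                      unsigned long flags);
struct kprobe {
    unsigned char *addr;                 /* probed address (unused here) */
    kprobe_pre_handler_t pre_handler;
    kprobe_post_handler_t post_handler;
};

/* Hypothetical wrapper: embedding the probe lets a shared handler find
 * its per-probe state from the struct kprobe pointer it receives. */
struct counted_probe {
    struct kprobe kp;
    unsigned long hits;
};

/* Shared pre-handler: recovers the enclosing structure and counts. */
int count_pre(struct kprobe *p, struct pt_regs *regs)
{
    struct counted_probe *cp = (struct counted_probe *)
        ((char *)p - offsetof(struct counted_probe, kp));
    (void)regs;
    cp->hits++;
    return 0;
}

/* Post-handler that modifies the saved registers. */
void set_eax_post(struct kprobe *p, struct pt_regs *regs, unsigned long flags)
{
    (void)p;
    (void)flags;                         /* currently unused */
    regs->eax = 42;
}

/* Hypothetical stand-in for the probe-hit path. */
void fire_probe(struct kprobe *p, struct pt_regs *regs)
{
    assert(p != NULL && regs != NULL);
    if (p->pre_handler)
        p->pre_handler(p, regs);
    /* ... the probed instruction would be single-stepped here ... */
    if (p->post_handler)
        p->post_handler(p, regs, 0);
}
```

Two counted_probe objects can point at the same count_pre() and still keep separate counters, because the handler derives its state from the probe pointer it is given.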

Note that currently the pre-handler cannot be NULL for a probe, although the use of post-handler is optional. This is considered a bug since there may be cases where the pre-handler may not be required but a post-handler is needed. In such situations the user will still have to define a pre-handler. Another bug (which can oops the kernel) is related to probes which are activated on the ret/lret instructions. Yet another bug is related to probes activated on int3 instructions. All of these problems should be fixed in the 2.6.12 release of the kernel. However, these bugs can be easily avoided so they do not present any serious issues for someone who wants to use KProbes immediately without applying patches.

The KProbe registration functions are defined as shown below.

int register_kprobe(struct kprobe *p);
void unregister_kprobe(struct kprobe *p);

The registration function takes a reference to the KProbe structure describing the probe. Note that the user's module which registers the probe should keep a reference to the structure until the probe is unregistered. Since access to KProbes is serialized, a probe can be registered or unregistered anytime except from inside the probe handlers themselves, which will deadlock the system. This is because probe handlers execute after the spinlock used for locking KProbes has been acquired. The same spinlock is locked just before unregistering the probe. So if an attempt is made to unregister a probe inside a probe handler the same path will try to lock the spinlock twice.
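The self-deadlock described above can be modelled in user space. In this sketch a plain flag stands in for the KProbes spinlock, and a second acquisition reports -EDEADLK instead of spinning forever the way a real spinlock would. All names (model_lock(), model_hit(), bad_pre_handler() and so on) are hypothetical and not part of the kernel interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* 1 while the modelled lock is held. A real spinlock would spin
 * forever on a second acquire; this model returns an error instead. */
static int model_lock_held;

static int model_lock(void)
{
    if (model_lock_held)
        return -EDEADLK;
    model_lock_held = 1;
    return 0;
}

static void model_unlock(void)
{
    model_lock_held = 0;
}

struct model_probe {
    int registered;
    int (*pre_handler)(struct model_probe *);
    int handler_result;          /* what the handler returned, for the demo */
};

/* Unregistration takes the lock, as the text describes. */
int model_unregister(struct model_probe *p)
{
    int err;

    assert(p != NULL);
    err = model_lock();
    if (err)
        return err;
    p->registered = 0;
    model_unlock();
    return 0;
}

/* Probe hit: the handler runs with the lock already held. */
int model_hit(struct model_probe *p)
{
    int err;

    assert(p != NULL && p->pre_handler != NULL);
    err = model_lock();
    if (err)
        return err;
    p->handler_result = p->pre_handler(p);
    model_unlock();
    return 0;
}

/* A misbehaving handler that tries to unregister its own probe. */
int bad_pre_handler(struct model_probe *p)
{
    return model_unregister(p);
}
```

Running model_hit() on a probe whose handler is bad_pre_handler() leaves the probe registered and records -EDEADLK; in the kernel the same call sequence would hang instead.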

Multiple probes cannot be placed on the same address as of now. However, a patch has been submitted to the kernel mailing list which allows multiple probes to be registered at the same address through another interface. It might be included in the next release of the kernel. Until then, if such an attempt is made register_kprobe() returns -EEXIST.

JProbes are used to give access to a function's arguments at runtime. This is achieved by providing a JProbe handler with the same prototype as that of the function being probed. At runtime, when the original function is executed, control is transferred to the JProbe handler after copying the process's context. On return from the JProbe handler, the context - consisting of the process's registers and the stack - is restored, so any modifications to the context of the process in the JProbe handler are lost. The execution continues from the point at which the probe was placed with the original saved state. A JProbe is represented by the structure given below.

struct jprobe {
    struct kprobe kp;
    kprobe_opcode_t *entry; 	/* user-defined JProbe handler address */
};

The user places the address of the function which will handle this probe in the entry field. The addr field in struct kprobe should be populated with the address of the function whose arguments are to be accessed. The functions used to register and unregister a JProbe are given below.

int register_jprobe(struct jprobe *p);
void unregister_jprobe(struct jprobe *p);

The user-written JProbe handler must finish by calling jprobe_return() rather than by executing a normal return statement.

KProbes Manager

The KProbes Manager is responsible for registering and unregistering KProbes and JProbes. The file kernel/kprobes.c implements the KProbes manager. Each probe is described by a struct kprobe and stored in a hash table keyed by the address at which the probe is placed. Access to this hash table is serialized by the spinlock kprobe_lock. This spinlock is taken before a new probe is registered, before an existing probe is unregistered and when a probe is hit, which prevents these operations from executing simultaneously on an SMP machine. Whenever a probe is hit, the probe handler is called with interrupts disabled. Interrupts are disabled because handling a probe is a multi-step process which involves breakpoint handling and single-step execution of the probed instruction. There is no easy way to save the state between these operations, so interrupts are kept disabled throughout probe handling.

The manager is composed of these functions which are followed by a simplified description of what they do. These functions are architecture independent. A side-by-side reading of the code in kernel/kprobes.c and these steps will clarify the whole implementation.

void lock_kprobes(void): Locks KProbes and records the CPU on which it was locked
void unlock_kprobes(void): Resets the recorded CPU and unlocks KProbes
struct kprobe *get_kprobe(void *addr): Using the address of the probed instruction, returns the probe from hash table
int register_kprobe(struct kprobe *p): This function registers a probe at a given address. Registration involves copying the instruction at the probe address into a probe-specific buffer. On x86 this buffer holds MAX_INSN_SIZE (16) bytes, so 16 bytes are copied from the given address. The function then replaces the instruction at the probed address with the breakpoint instruction.
void unregister_kprobe(struct kprobe *p): This function unregisters a probe. It restores the original instruction at the address and removes the probe structure from the hash table.
int register_jprobe(struct jprobe *jp): This function registers a JProbe at a function address. JProbes use the KProbes mechanism. In the KProbe pre_handler it stores its own handler setjmp_pre_handler and in the break_handler stores the address of longjmp_break_handler. Then it registers struct kprobe jp->kp by calling register_kprobe()
void unregister_jprobe(struct jprobe *jp): Unregisters the struct kprobe used by this JProbe
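The bookkeeping described in the list above can be sketched in user space as follows. This is a simplified model, not the code in kernel/kprobes.c: it saves only the first byte of the instruction instead of MAX_INSN_SIZE bytes, uses a singly linked chain where the kernel uses an hlist, and omits locking. The model_ names and the table size are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_TABLE_SIZE 64       /* assumption: arbitrary bucket count */
#define MODEL_BREAKPOINT 0xCC     /* x86 int3 opcode */

struct model_kprobe {
    struct model_kprobe *next;    /* hash chain */
    uint8_t *addr;                /* probed address */
    uint8_t opcode;               /* saved first byte of the instruction */
};

static struct model_kprobe *model_table[MODEL_TABLE_SIZE];

static size_t model_hash(const uint8_t *addr)
{
    return (size_t)((uintptr_t)addr % MODEL_TABLE_SIZE);
}

/* Look up the probe registered at addr, or NULL if there is none. */
struct model_kprobe *model_get_kprobe(const uint8_t *addr)
{
    struct model_kprobe *p;

    for (p = model_table[model_hash(addr)]; p; p = p->next)
        if (p->addr == addr)
            return p;
    return NULL;
}

/* Save the original byte, arm the breakpoint, add to the table. */
int model_register_kprobe(struct model_kprobe *p)
{
    size_t h;

    assert(p != NULL && p->addr != NULL);
    if (model_get_kprobe(p->addr))
        return -EEXIST;           /* one probe per address */
    h = model_hash(p->addr);
    p->opcode = *p->addr;
    *p->addr = MODEL_BREAKPOINT;
    p->next = model_table[h];
    model_table[h] = p;
    return 0;
}

/* Remove from the table and restore the original byte. */
void model_unregister_kprobe(struct model_kprobe *p)
{
    struct model_kprobe **link;

    assert(p != NULL && p->addr != NULL);
    link = &model_table[model_hash(p->addr)];
    while (*link && *link != p)
        link = &(*link)->next;
    if (!*link)
        return;                   /* was not registered */
    *link = p->next;
    *p->addr = p->opcode;
}
```

A second registration at an address that is already probed fails with -EEXIST in this model, matching the behaviour described for register_kprobe().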

What happens when a KProbe is hit?

[Kprobe execution diagram] The steps involved in handling a probe are architecture dependent; they are handled by the functions defined in the file arch/i386/kernel/kprobes.c. After the probes are registered, the addresses at which they are active contain the breakpoint instruction (int3 on x86). As soon as execution reaches a probed address the int3 instruction is executed, transferring control to the breakpoint handler do_int3() in arch/i386/kernel/traps.c. do_int3() is called through an interrupt gate, so interrupts are disabled when control reaches it. This handler notifies KProbes that a breakpoint occurred; KProbes then checks whether the breakpoint was set by its own registration function. If no probe is registered at the address where the breakpoint was hit, KProbes simply returns 0. Otherwise the registered probe function is called.
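The lookup-and-dispatch decision described above might be modelled like this. It is a user-space sketch with hypothetical names (model_kprobe_handler(), model_probes and so on), not the code in arch/i386/kernel/kprobes.c. It assumes that the saved eip points just past the one-byte int3 when the trap is taken, so the probed address is eip - 1.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_PROBES 8          /* assumption: tiny fixed table */

struct model_regs { unsigned long eip; };

struct model_kprobe {
    unsigned long addr;             /* address of the probed instruction */
    int (*pre_handler)(struct model_kprobe *, struct model_regs *);
    int pre_calls;                  /* hypothetical counter for the demo */
};

struct model_kprobe *model_probes[MODEL_MAX_PROBES];

static struct model_kprobe *model_get_kprobe(unsigned long addr)
{
    size_t i;

    for (i = 0; i < MODEL_MAX_PROBES; i++)
        if (model_probes[i] && model_probes[i]->addr == addr)
            return model_probes[i];
    return NULL;
}

/* Returns 0 if the breakpoint does not belong to a registered probe,
 * 1 if a probe was found and its pre-handler was called. */
int model_kprobe_handler(struct model_regs *regs)
{
    struct model_kprobe *p;

    assert(regs != NULL);
    /* int3 is one byte long; eip already points past it. */
    p = model_get_kprobe(regs->eip - 1);
    if (!p)
        return 0;
    if (p->pre_handler)
        p->pre_handler(p, regs);
    /* ... single-stepping and the post-handler would follow ... */
    return 1;
}

/* Example pre-handler: just counts how often it ran. */
int model_count_pre(struct model_kprobe *p, struct model_regs *regs)
{
    (void)regs;
    p->pre_calls++;
    return 0;
}
```

A breakpoint at an address with no registered probe returns 0 and runs no handler, which corresponds to the "simply returns 0" case in the text.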

What happens when a JProbe is hit?

[JProbe execution diagram] A JProbe has to transfer control to another function which has the same prototype as the function on which the probe was placed and then give back control to the original function with the same state as there was before the JProbe was executed. A JProbe leverages the mechanism used by a KProbe. Instead of calling a user-defined pre-handler a JProbe specifies its own pre-handler called setjmp_pre_handler() and uses another handler called a break_handler. This is a three-step process.

In the first step, when the breakpoint is hit control reaches kprobe_handler() which calls the JProbe pre-handler (setjmp_pre_handler()). This saves the stack contents and the registers before changing the eip to the address of the user-defined function. Then it returns 1 which tells kprobe_handler() to simply return instead of setting up single-stepping as for a KProbe. On return control reaches the user-defined function to access the arguments of the original function. When the user defined function is done it calls jprobe_return() instead of doing a normal return.

In the second step jprobe_return() truncates the current stack frame and generates a breakpoint which transfers control to kprobe_handler() through do_int3(). kprobe_handler() finds that the address of the generated breakpoint (the int3 instruction in jprobe_return()) does not have a registered probe, but that KProbes is active on the current CPU. It therefore assumes that the breakpoint was generated by JProbes and calls the break_handler of the current_kprobe which it saved earlier. The break_handler restores the stack contents and the registers that were saved before transferring control to the user-defined function and returns.

In the third step kprobe_handler() then sets up single-stepping of the instruction at which the JProbe was set and the rest of the sequence is the same as that of a KProbe.
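The save, redirect and restore sequence of the first two steps can be sketched in user space. The types below are a toy model of the CPU context (one register and a four-word stack), and the model_ function names are hypothetical. Only the idea is taken from the text: the pre-handler snapshots registers and stack and points eip at the user's handler, and the break handler puts the snapshot back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_STACK_WORDS 4         /* assumption: toy stack size */

struct model_regs { unsigned long eip; };

struct model_cpu {
    struct model_regs regs;
    unsigned long stack[MODEL_STACK_WORDS];
};

/* Snapshot taken in step one and consumed in step two. */
struct model_jprobe_state {
    struct model_regs saved_regs;
    unsigned long saved_stack[MODEL_STACK_WORDS];
};

/* Step 1: save the context, redirect eip to the user's handler and
 * return 1 so the caller skips single-stepping for now. */
int model_setjmp_pre_handler(struct model_cpu *cpu,
                             struct model_jprobe_state *st,
                             unsigned long entry)
{
    assert(cpu != NULL && st != NULL);
    st->saved_regs = cpu->regs;
    memcpy(st->saved_stack, cpu->stack, sizeof(st->saved_stack));
    cpu->regs.eip = entry;
    return 1;
}

/* Step 2: restore the saved context, discarding whatever the user's
 * handler did to the registers and the stack. */
void model_longjmp_break_handler(struct model_cpu *cpu,
                                 const struct model_jprobe_state *st)
{
    assert(cpu != NULL && st != NULL);
    cpu->regs = st->saved_regs;
    memcpy(cpu->stack, st->saved_stack, sizeof(cpu->stack));
}
```

Anything the user's handler writes to the modelled stack between the two calls is gone after the restore, which is why modifications made in a JProbe handler do not reach the original function.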

Possible problems

Several problems can occur when a probe is handled by KProbes. The first is that several probes may be hit in parallel on an SMP system, while all probes share a common hash table which must be protected against corruption. In this case kprobe_lock serializes probe handling across processors.

Another problem occurs if a probe is placed inside KProbes code, causing KProbes to enter probe handling code recursively. This problem is taken care of in kprobe_handler() by checking if KProbes is already running on the current CPU. In this case the recursing probe is disabled silently and control returns back to the previous probe handling code.
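The recursion check can be illustrated with a small user-space model. A single flag stands in for the per-CPU "KProbes is running" state, and a nested hit disarms the recursing probe by restoring its saved opcode. The names and the single-CPU simplification are assumptions of this sketch, not the logic of kprobe_handler() itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_BREAKPOINT 0xCC       /* x86 int3 opcode */

struct model_kprobe {
    uint8_t *addr;                  /* probed byte, holds 0xCC while armed */
    uint8_t opcode;                 /* original byte saved at registration */
    void (*pre_handler)(struct model_kprobe *);
    int pre_calls;                  /* hypothetical counter for the demo */
};

/* Per-CPU in the kernel; this model has a single CPU. */
static int model_kprobe_running;

/* Probe that the nesting handler below will trip over. */
struct model_kprobe *model_nested_target;

int model_handle_hit(struct model_kprobe *p)
{
    assert(p != NULL && p->addr != NULL);
    if (model_kprobe_running) {
        /* Recursion: silently disarm the new probe and go back. */
        *p->addr = p->opcode;
        return 1;
    }
    model_kprobe_running = 1;
    p->pre_calls++;
    if (p->pre_handler)
        p->pre_handler(p);
    model_kprobe_running = 0;
    return 1;
}

/* Handler that hits another probe while probe handling is active. */
void model_nesting_pre(struct model_kprobe *p)
{
    (void)p;
    if (model_nested_target)
        model_handle_hit(model_nested_target);
}
```

After the outer probe has been handled, the inner probe's byte is back to its original opcode and its handler has never run, while the outer probe stays armed.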

If preemption occurs when KProbes is executing it can context switch to another process while a probe is being handled. The other process could cause another probe to fire which will cause control to reach kprobe_handler() again while the previous probe was not handled completely. This may result in disarming the new probe when KProbes discovers it's recursing. To avoid this problem, preemption is disabled when probes are handled.

Similarly, interrupts are disabled by causing the breakpoint handler and the debug handler to be invoked through interrupt gates rather than trap gates. This disables interrupts as soon as control is transferred to the breakpoint or debug handler. These changes are made in the file arch/i386/kernel/traps.c.

A fault might occur during the handling of a probe. In this case, if the user has defined a fault handler for the probe, control is transferred to the fault handler. If the user-defined fault handler returns 0 the fault is handled by the kernel. Otherwise, it's assumed that the fault was handled by the fault handler and control reaches back to the probe handlers.
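The return-value convention for fault handlers can be captured in a few lines. This is a user-space sketch with hypothetical names. The value 14 used in the example handler is the x86 page-fault vector; everything else only models the decision described in the paragraph above.

```c
#include <assert.h>
#include <stddef.h>

struct model_kprobe;
typedef int (*model_fault_handler_t)(struct model_kprobe *, int trapnr);

struct model_kprobe {
    model_fault_handler_t fault_handler;    /* may be NULL */
};

enum model_fault_outcome {
    MODEL_FAULT_TO_KERNEL,   /* no handler, or handler returned 0 */
    MODEL_FAULT_HANDLED      /* handler returned non-zero */
};

/* Decide who deals with a fault raised while a probe is handled. */
enum model_fault_outcome model_kprobe_fault(struct model_kprobe *p, int trapnr)
{
    assert(p != NULL);
    if (p->fault_handler && p->fault_handler(p, trapnr))
        return MODEL_FAULT_HANDLED;
    return MODEL_FAULT_TO_KERNEL;
}

/* Example handler: claims page faults, declines everything else. */
int model_fix_page_fault(struct model_kprobe *p, int trapnr)
{
    (void)p;
    return trapnr == 14;     /* 14 is the x86 page-fault vector */
}
```

A probe without a fault handler always defers to the kernel in this model, as does a handler that returns 0 for an exception it does not recognise.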

Conclusion

KProbes is an excellent tool for debugging and tracing. Developers can use it to trace the path of their programs inside the kernel for debugging purposes. System administrators can use it to trace events inside the kernel on production systems. KProbes can also be used for non-critical performance measurements. The current KProbes implementation, however, introduces some latency of its own in handling probes. One cause of this latency is the single kprobe_lock which serializes the execution of probes across all CPUs on an SMP machine. Another is that KProbes uses multiple exceptions to handle a single probe, and exception handling is an expensive operation which causes its own delays. Work needs to be done in this area to improve SMP scalability and to reduce the probe handling time before KProbes becomes a viable performance measuring tool.

KProbes is not convenient to use directly for these purposes, however. In its raw form a user has to write a kernel module implementing the probe handlers. Higher-level tools are necessary to make it more convenient to use. Such tools could contain standard probe handlers implementing the desired features, or they could provide a means to generate probe handlers from simple descriptions written in a scripting language, as DProbes does.

Acknowledgements

The author would like to thank his editor Jonathan Corbet, Kalyan T.B. (HP), Siddharth Seth (IIITB) and Bharata B. Rao (HP) for going through this article, for their feedback, comments and suggestions, and for helping to improve it.

Index entries for this article
Kernel	KProbes
GuestArticles	Goswami, Sudhanshu

Wow.

Posted Apr 21, 2005 20:35 UTC (Thu) by ncm (guest, #165) [Link] (1 responses)

What a remarkably thorough article.

Wow.

Posted Apr 21, 2005 22:12 UTC (Thu) by munozga (subscriber, #26290) [Link]

I second that Wow. Great article, especially considering it's Sudhanshu's first tech article.

An introduction to KProbes

Posted Apr 22, 2005 16:11 UTC (Fri) by melevittfl (guest, #5409) [Link]

So, which part of SCO's SVR4 was this taken from then? (JOKE)

Too long...

Posted Apr 22, 2005 16:16 UTC (Fri) by mmutz (guest, #5642) [Link] (3 responses)

Thorough as it might be, this article is simply too long for LWN. It would
have been better placed on developerworks. An article that starts with
"This article assumes that you are familiar with..." doesn't belong in
LWN, IMO.

Note: I didn't say the article was bad or something. All I say is that it
appeared in the wrong publication.

Marc

Too long...

Posted Apr 22, 2005 23:30 UTC (Fri) by lolster (guest, #29209) [Link]

On the one hand I have to agree with you. On the other hand LWN's kernel page usually assumes the reader is quite familiar with the Linux kernel's source - just without saying so.

Too long...

Posted Apr 24, 2005 11:50 UTC (Sun) by pkolloch (subscriber, #21709) [Link]

I don't agree. I think that is a question of personal preference. I am not familiar with the kernel (at least not with implementation details). Still, I very much prefer to superficially glance over an in-depth article than to read a superficial article in depth.

Too long...

Posted Apr 26, 2005 18:57 UTC (Tue) by daniel (guest, #3181) [Link]

"Thorough as it might be, this article is simply too long for LWN"

I don't agree. If an article needs to be long in order to cover the subject in the depth LWN readers have come to expect and appreciate, then so be it. If your attention span is shorter than the article, just move on to the next article, it's your choice.

Regards,

Daniel

An introduction to KProbes

Posted Oct 3, 2016 16:59 UTC (Mon) by rvk (guest, #111525) [Link]

Great article! Regarding "To avoid this problem, preemption is disabled when probes are handled.", to save the function parameters from a jprobe handler and use them in a kretprobe does this mean that the kretprobe handle is guaranteed to always be called right after the jprobe handler, without any interruption? Ie. Register a jprobe(sys_symlink) -> handle_pre_symlink and a kretprobe(sys_symlink) -> handle_post_symlink. If not, is it possible to match a jprobe handler to a kretprobe handler?

jprobe 1
jprobe 2
kretprobe 2
kretprobe 1

An introduction to KProbes

Posted Nov 19, 2018 19:22 UTC (Mon) by rkv (guest, #128587) [Link]

Are you able to copy_from_user within a jprobe entry handler (where you have access to the system call's arguments)? If not, are there any techniques to bypass this restriction (ie. disabling interrupts) or another way to get memory from userspace while in this context?

An introduction to KProbes

Posted Jan 22, 2021 11:59 UTC (Fri) by SandeshKa (guest, #142172) [Link]

A great article.
