kprobes tutorial
This tutorial was developed for the 2006 Ottawa Linux Symposium. I’m hoping it will be useful as a general resource.
There is documentation in Documentation/kprobes.txt in the kernel source.
Why?kprobes is both useful and cool. Unfortunately most of the kernel developers I’ve talked to are confused both about why its useful and how to use it. I’m hoping to answer both of those here. I should add that I have no skin in the game personally, I’m not a kprobes developer. I should probably turn this into a HOWTO…
What is kprobes?Simply, kprobes allows you to write modules that can add debug information to the kernel. It is an alternative to building custom kernels or custom modules. I think the most useful case is when you are dealing with some remote machine (your grandmother or a tester) who hits some bug that you can’t figure out by looking at /var/log/messages. Build a kprobes module and have them insmod it on their kernel.
This tutorial deals with kernel kprobing. There is additional work around user-land kprobing that will not be discussed here.
There are three kinds of kprobes:
- jprobes
- Call a function on the entry to a routine. All the arguments to the routine are passed.
- kretprobes
- Call a function on the exit from a routine. The registers are passed
- kprobe
- The guts of kprobes. Any arbitrary kernel instruction can be probed. A function is called passing the registers.
kprobes has been in mainline since 2.6.9. There are some kernel configuration flags that need to be set to use kprobes. Current enterprise kernels (such as SLES 10) have these turned on, and so does FC5. Some others (cough, Ubuntu, cough) do not. The flags that are required are:
- CONFIG_KPROBES
- duh
- CONFIG_MODULES and CONFIG_MODULE_UNLOAD
- kprobes do not require any code changes to the source kernel (thats kind of the idea.) They are loaded into existing kernels as modules. Obviously you need modules configured. You don’t actually require MODULE_UNLOAD, but it makes life easier.
- CONFIG_KALLSYMS and CONFIG_KALSYMS_ALL
- You can set a kprobe by using an address from System.map, but it is easier to code kallsyms_lookup_name().
To build a module (any module) you need to have access to the kernel headers and a suitable compiler.
Simple Case #1The simplest case, useful in 99% of cases is the jprobe case, where your function gains control on the entry to some arbitrary routine in the kernel.
In this example we will trace do_execve in the kernel. Start from the makefile and source in Documentation/kprobes.txt.
1: / Trace do_execv. Taken basically from Documentation/kprobes.txt /2: #include <linux/kernel.h>
3: #include <linux/module.h>
4: #include <linux/sched.h>
5: #include <linux/kprobes.h>
6: #include <linux/kallsyms.h>
7:
8: /
9: Pre-entry point for do_execve.
10: /
11: static int my_do_execve(char filename,
12: char user *user argv,
13: char __user user envp,
14: struct pt_regs regs)
15: {
16: printk("do_execve for %s from %sn", filename, current->comm);
17: / Always end with a call to jprobe_return(). /
18: jprobe_return();
19: /NOTREACHED/
20: return 0;
21: }
22:
23: static struct jprobe my_jprobe = {
24: .entry = (kprobe_opcode_t ) my_do_execve
25: };
26:
27: int init_module(void)
28: {
29: int ret;
30: my_jprobe.kp.addr =
31: (kprobe_opcode_t ) kallsyms_lookup_name("do_execve");
32: if (!my_jprobe.kp.addr) {
33: printk("Couldn’t find %s to plant jproben", "do_execve");
34: return -1;
35: }
36:
37: if ((ret = register_jprobe(&my_jprobe)) <0) {
38: printk("register_jprobe failed, returned %dn", ret);
39: return -1;
40: }
41: printk("Planted jprobe at %p, handler addr %pn",
42: my_jprobe.kp.addr, my_jprobe.entry);
43: return 0;
44: }
45:
46: void cleanup_module(void)
47: {
48: unregister_jprobe(&my_jprobe);
49: printk("jprobe unregisteredn");
50: }
51:
52: MODULE_LICENSE("GPL");
Note line 11 Give YOUR routine a different name than the one you are tracing, otherwise kallsyms_lookup_name will get confused.
Note line 18 where the jprobe handler calls jprobe_return(). I mean REALLY note that (don’t just return from a jprobe.)
The Makefile (also straight out of Documentation/kprobes.txt) is
# This is taken straight from Documentation/kprobes.txtobj-m := trace-exec.o
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
default:
$(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules
clean:
rm -f .mod.c .ko .o
Easy. insmod the module and away you go! Here’s the output on my thinkad
Jul 16 19:20:46 hound kernel: [17213292.188000] do_execve for /bin/sh from wcstatusdJul 16 19:20:46 hound kernel: [17213292.204000] do_execve for /sbin/ifconfig from sh
Jul 16 19:20:46 hound kernel: [17213292.216000] do_execve for /bin/sh from wcstatusd
Jul 16 19:20:46 hound kernel: [17213292.232000] do_execve for /sbin/ifconfig from sh
Jul 16 19:20:46 hound kernel: [17213292.248000] do_execve for /bin/sh from wcstatusd
Jul 16 19:20:46 hound kernel: [17213292.264000] do_execve for /sbin/ifconfig from sh
Jul 16 19:20:46 hound kernel: [17213292.296000] do_execve for /bin/sh from wcstatusd
Jul 16 19:20:46 hound kernel: [17213292.376000] do_execve for /sbin/ifconfig from sh
Jul 16 19:20:46 hound kernel: [17213292.392000] do_execve for /bin/sh from wcstatusd
Jul 16 19:20:46 hound kernel: [17213292.456000] do_execve for /sbin/ifconfig from sh
Jul 16 19:20:46 hound kernel: [17213292.472000] do_execve for /bin/sh from wcstatusd
Hmm…what the heck is wcstatusd doing all the time…
powerpc and ia64 problemsIf you try the above on powerpc and (I think) ia64 it will fail. And you will bang your head. The reason is that when you take the address of a routine you get the address in a jump table, not the actual code address.
You can fix this by something like the following:
38: if ((ret = register_jprobe(&my_jprobe)) <0) {39: printk("register_jprobe failed, returned %d. "
40: "trying dereferenced addressn", ret);
41: my_jprobe.kp.addr = ((kprobe_opcode_t )my_jprobe.kp.addr);
42: if ((ret = register_jprobe(&my_jprobe)) <0) {
43: printk("registerjprobe failed again, returned %dn",
44: ret);
45: return -1;
46: }
47: }
Also beware of the 32-bit compat
functions. In the above example, do_execve is actually never called on powerpc unless you are running 64-bit processes. compat_do_execve is called instead.When good traces go badLimiting your outputI usually do something like the following to limit the amount of data that comes out (depending on what I am tracing)
…8: static atomic_t trace_limit;
…
15: if (atomic_inc_return(&trace_limit) > 100)
16: jprobe_return();
…Handling Faults
It can help to have a fault handler in your kprobes. Unless you plan to always write bug-less kprobes.
…10: / fault_handler: this is called if an exception is generated for any
11: instruction within the pre- or post-handler, or when Kprobes
12: single-steps the probed instruction.
13: /
14: int handler_fault(struct kprobe p, struct pt_regs regs, int trapnr)
15: {
16: printk("fault_handler: p->addr=0x%p, trap #%dn",
17: p->addr, trapnr);
18: / Return 0 because we don’t handle the fault. */
19: return 0;
20: }
…
54: my_jprobe.kp.fault_handler = handler_fault;
…Simple Case #2
My Thinkpad has a Centrino processor with speedstep CPU frequency controls.  CPU frequency is controlled by the speedstep_centrino module.  Lets trace all the calls to centrino_target in that module.
First, here is the routine we want to trace from arch/i386/kernel/cpu/cpufreq/speedstep-centrino.c.
584: /
585: centrino_setpolicy - set a new CPUFreq policy
586: @policy: new policy
587: @target_freq: the target frequency
588: @relation: how that frequency relates to achieved frequency (CPUFREQ_RELATION_L or CPUFREQ_RELATION_H)
589:
590: Sets a new CPUFreq policy.
591: /
592: static int centrino_target (struct cpufreq_policy policy,
593: unsigned int target_freq,
594: unsigned int relation)
595: {
596: unsigned int newstate = 0;
597: unsigned int msr, oldmsr, h, cpu = policy->cpu;
598: struct cpufreq_freqs freqs;
599: cpumask_t saved_mask;
600: int retval;
601:
602: if (centrino_model[cpu] == NULL)
603: return -ENODEV;
604:
…
Some things to note. First, this routine is in a module, not in the kernel proper. Secondly, the routine is marked static. No problem. if KALLSYMS_ALL is configured, we will still get the address for it.
1: / Trace do_execv. Taken basically from Documentation/kprobes.txt /2: #include <linux/kernel.h>
3: #include <linux/module.h>
4: #include <linux/cpufreq.h>
5: #include <linux/kprobes.h>
6: #include <linux/kallsyms.h>
7:
8: static atomic_t trace_limit;
9:
10: / fault_handler: this is called if an exception is generated for any
11: instruction within the pre- or post-handler, or when Kprobes
12: single-steps the probed instruction.
13: /
14: int handler_fault(struct kprobe p, struct pt_regs regs, int trapnr)
15: {
16: printk("fault_handler: p->addr=0x%p, trap #%dn",
17: p->addr, trapnr);
18: / Return 0 because we don’t handle the fault. /
19: return 0;
20: }
21:
22: /
23: Pre-entry point for centrino_target.
24: /
25: static int my_centrino_target (struct cpufreq_policy policy,
26: unsigned int target_freq,
27: unsigned int relation)
28: {
29: if (atomic_inc_return(&trace_limit) > 100)
30: jprobe_return();
31:
32: printk("centrino_target %u relation %un", target_freq, relation);
33: dump_stack();
34:
35: / Always end with a call to jprobe_return(). /
36: jprobe_return();
37: /NOTREACHED/
38: return 0;
39: }
40:
41: static struct jprobe my_jprobe = {
42: .entry = (kprobe_opcode_t ) my_centrino_target,
43: };
44:
45: int init_module(void)
46: {
47: int ret;
48: my_jprobe.kp.addr =
49: (kprobe_opcode_t ) kallsyms_lookup_name("centrino_target");
50: if (!my_jprobe.kp.addr) {
51: printk("Couldn’t find %s to plant jproben", "centrino_target");
52: return -1;
53: }
54:
55: my_jprobe.kp.fault_handler = handler_fault;
56:
57: if ((ret = register_jprobe(&my_jprobe)) <0) {
58: printk("register_jprobe failed, returned %dn", ret);
59: return -1;
60: }
61: printk("Planted jprobe at %p, handler addr %pn",
62: my_jprobe.kp.addr, my_jprobe.entry);
63: return 0;
64: }
65:
66: void cleanup_module(void)
67: {
68: unregister_jprobe(&my_jprobe);
69: printk("jprobe unregisteredn");
70: }
71:
72: MODULE_LICENSE("GPL");
Note the use of dump_stack().  The output in /var/log/messages
Jul 16 19:43:01 hound kernel: [17179786.852000] [pg0+949055590/1069523968] my_centrino_target+0x36/0x50 [trace_speedstep]
Jul 16 19:43:01 hound kernel: [17179786.852000] [pg0+946753796/1069523968] store_speed+0xb4/0xd0 [cpufreq_userspace]
Jul 16 19:43:01 hound kernel: [17179786.852000] [neigh_get_next+135/160] store+0x47/0x60
Jul 16 19:43:01 hound kernel: [17179786.852000] [ldm_partition+4123/4304] sysfs_write_file+0x9b/0xf0
Jul 16 19:43:01 hound kernel: [17179786.852000] [vfs_read+206/432] vfs_write+0xde/0x1b0
Jul 16 19:43:01 hound kernel: [17179786.852000] [do_sendfile+699/752] sys_write+0x4b/0x80
Jul 16 19:43:01 hound kernel: [17179786.852000] [need_resched+12/33] sysenter_past_esp+0x54/0x75
Jul 16 19:43:11 hound kernel: [17179797.328000] centrino_target 1400000 relation 0
Jul 16 19:43:11 hound kernel: [17179797.328000] [pg0+949055590/1069523968] my_centrino_target+0x36/0x50 [trace_speedstep]
Jul 16 19:43:11 hound kernel: [17179797.328000] [pg0+946753796/1069523968] store_speed+0xb4/0xd0 [cpufreq_userspace]
Jul 16 19:43:11 hound kernel: [17179797.328000] [neigh_get_next+135/160] store+0x47/0x60
Jul 16 19:43:11 hound kernel: [17179797.328000] [ldm_partition+4123/4304] sysfs_write_file+0x9b/0xf0
Jul 16 19:43:11 hound kernel: [17179797.328000] [vfs_read+206/432] vfs_write+0xde/0x1b0
Jul 16 19:43:11 hound kernel: [17179797.328000] [do_sendfile+699/752] sys_write+0x4b/0x80
Jul 16 19:43:11 hound kernel: [17179797.328000] [need_resched+12/33] sysenter_past_esp+0x54/0x75Inlines
kprobes patches a single address (the address used above in kallsyms_lookup_name().) This doesn’t handle inlines. There is no automatic way to handle inlines.
Return values (kretprobes)
The second most useful thing to do is track the return value of a routine.  kprobes uses kretprobes to handle that.  A classic case is tracking memory allocations (kmalloc.)
kprobes is smart enough to trap on any return from a routine (it actually places a trap at the place the routine was called from rather than on every return instruction
kprobes can miss kretprobes.  You can check my_kretprobe.nmissed to see if you missed any retprobes.
2: #include <linux/kernel.h>
3: #include <linux/module.h>
4: #include <linux/cpufreq.h>
5: #include <linux/kprobes.h>
6: #include <linux/kallsyms.h>
7:
8: static atomic_t trace_limit;
9:
10: /
11: Pre-entry point for kmalloc_target.
12: */
13: <a name=”mykmalloc”>static void my__kmalloc(size_t size, gfp_t flags)
14: {
15: if (atomic_inc_return(&trace_limit) > 100)
16: jprobe_return();
17:
18: printk("kmalloc call: %u in %sn", size, current->comm);
19:
20: / Always end with a call to jprobe_return(). /
21: jprobe_return();
22: /NOTREACHED/
23: return 0;
24: }
25:
26: static struct jprobe my_jprobe = {
27: .entry = (kprobe_opcode_t ) mykmalloc,
28: };
29:
30: / Return-probe handler: If the probed function fails, log the return value. /
31: static int ret_handler(struct kretprobe_instance ri, struct pt_regs regs)
32: {
33: if (atomic_inc_return(&trace_limit) > 100)
34: return 0;
35: printk("kmalloc returns %lxn", regs->eax);
36: return 0;
37: }
38:
39: static struct kretprobe my_kretprobe = {
40: .handler = ret_handler,
41: / Probe up to 20 instances concurrently. /
42: .maxactive = 20
43: };
44:
45: int init_module(void)
46: {
47: int ret;
48: my_jprobe.kp.addr = my_kretprobe.kp.addr =
49: (kprobe_opcode_t *) kallsyms_lookup_name("kmalloc");
50: if (!my_jprobe.kp.addr) {
51: printk("Couldn’t find %s to plant jproben", "__kmalloc");
52: return -1;
53: }
54:
55: if ((ret = register_jprobe(&my_jprobe)) <0) {
56: printk("register_jprobe failed, returned %dn", ret);
57: return -1;
58: }
59: if ((ret = register_kretprobe(&my_kretprobe)) < 0) {
60: printk("register_kretprobe failed, returned %dn", ret);
61: unregister_jprobe(&my_jprobe);
62: return -1;
63: }
64: printk("Planted jprobe at %p, handler addr %pn",
65: my_jprobe.kp.addr, my_jprobe.entry);
66: return 0;
67: }
68:
69: void cleanup_module(void)
70: {
71: unregister_jprobe(&my_jprobe);
72: unregister_kretprobe(&my_kretprobe);
73: printk("jprobe unregisteredn");
74: }
75:
76: MODULE_LICENSE("GPL");
Note that you just return() from kretprobes.
Given SMP, preemption, etc. there isn’t an easy way to correlate a jprobe and a kretprobe.
Raw kprobesjprobes and kretprobes are based on raw kprobes. A raw kprobe just takes an arbitrary address and sets a breakpoint there. Frequently useful in debugging a panic (where you know the faulting address) and inassociation with a disassembled kernel (objdump -D)
kprobes provide "pre" handlers tha run before the specific instruction, and "post" handlers that run afterwards. The following is a generic kprobe that dumps state when an arbitrary address is hit.
1: / generic kprobe /2: #include <linux/kernel.h>
3: #include <linux/module.h>
4: #include <linux/kprobes.h>
5: #include <linux/kallsyms.h>
6: #include <linux/sched.h>
7:
8: static unsigned long probe_addr;
9:
10: module_param_named(probe_addr, probe_addr, ulong, S_IRUGO | S_IWUSR);
11:
12: /For each probe you need to allocate a kprobe structure/
13: static struct kprobe kp;
14:
15: static void dump_state(struct pt_regs regs)
16: {
17: print_symbol(KERN_INFO "EIP is at %sn", regs->eip);
18: printk(KERN_INFO "eax: %08lx ebx: %08lx ecx: %08lx edx: %08lxn",
19: regs->eax, regs->ebx, regs->ecx, regs->edx);
20: printk(KERN_INFO "esi: %08lx edi: %08lx ebp: %08lx esp: %08lxn",
21: regs->esi, regs->edi, regs->ebp, regs->esp);
22: printk(KERN_INFO "ds: %04x es: %04xn",
23: regs->xds & 0xffff, regs->xes & 0xffff);
24: printk(KERN_INFO "Process %s (pid: %d, threadinfo=%p task=%p)",
25: current->comm, current->pid, current_thread_info(), current);
26:
27: }
28:
29: /kprobe pre_handler: called just before the probed instruction is executed/
30: int handler_pre(struct kprobe p, struct pt_regs regs)
31: {
32:
33: printk("pre_handler: p->addr=0x%pn",
34: p->addr);
35: dump_state(regs);
36: dump_stack();
37: return 0;
38: }
39:
40: /kprobe post_handler: called after the probed instruction is executed/
41: void handler_post(struct kprobe p, struct pt_regs regs, unsigned long flags)
42: {
43: printk("post_handler: p->addr=0x%pn",
44: p->addr);
45: dump_state(regs);
46: return;
47: }
48:
49: / fault_handler: this is called if an exception is generated for any
50: instruction within the pre- or post-handler, or when Kprobes
51: single-steps the probed instruction.
52: /
53: int handler_fault(struct kprobe p, struct pt_regs regs, int trapnr)
54: {
55: printk("fault_handler: p->addr=0x%p, trap #%dn",
56: p->addr, trapnr);
57: / Return 0 because we don’t handle the fault. /
58: return 0;
59: }
60:
61: int init_module(void)
62: {
63: int ret;
64:
65: if (!probe_addr) {
66: printk("trace-generic: provide probe_addr paramtern");
67: return -1;
68: }
69:
70: kp.pre_handler = handler_pre;
71: kp.post_handler = handler_post;
72: kp.fault_handler = handler_fault;
73: kp.addr = (kprobe_opcode_t) probe_addr;
74:
75: printk("setting probe at address %pn",kp.addr);
76:
77: / register the kprobe now /
78: if (!kp.addr) {
79: printk("Couldn’t find %s to plant kproben", "do_fork");
80: return -1;
81: }
82: if ((ret = register_kprobe(&kp) < 0)) {
83: printk("register_kprobe failed, returned %dn", ret);
84: return -1;
85: }
86: printk("kprobe registeredn");
87: return 0;
88: }
89:
90: void cleanup_module(void)
91: {
92: unregister_kprobe(&kp);
93: printk("kprobe unregisteredn");
94: }
95:
96: MODULE_LICENSE("GPL");initrd
There’s no reason you can’t add a kprobes module to your initrds to trace things happening reasonably early during the boot process. Similarly, you can actually add a kprobe to a distro installation if the installer allows you to load extra modules before doing the install.
relayfsrelayfs (http://relayfs.sourceforge.net) provides examples of tying kprobes and relayfs together. Rather than dumping all your trace data into dmesg.