Finding Realtime Linux Kernel Latencies (for RT linux kernels >= 2.6.33) --------------------------------------- Unexplained delays while processing events are one of the biggest problems in the Realtime Linux kernel (or any realtime system for that matter). This document describes one mechanism for locating delays in the kernel code. It presumes some moderate knowledge of how to configure and build Linux kernels and kernel terminology. 1. What is "latency"? ------------------------------------------------------------------------------ The term latency, when used in the context of the RT Kernel, is the time interval between the occurance of an event and the time when that event is "handled" (typically "handled" means running some thread as a result of the event). Latencys that are of interest to kernel programmers (and application programmers) are: - the time between when an interrupt occurs and the thread waiting for that interrupt is run - the time between a timer expiration and the thread waiting for that timer to run - The time between the receipt of a network packet and when the thread waiting for that packet runs Yes, the timer and network example above are usually examples of the more general interrupt case (most timers signal expiration with an interrupt and most network interface cards signal packet arrival with an interrupt as well), but the main idea is that an "event" occurs and there is some elapsed time interval which concludes with the kernel successfully handling the event. So, latency in and of itself is not a bad thing; there is always a delay between occurance and completion of an event. What is bad is when latency becomes excessive, meaning that the delay exceeds some arbitrary threshold. What is this threshold? That's for each application to define. A threshold or "deadline" is what defines a real time application: meeting deadlines means success, missing deadlines (exceeding the threshold) means failing to be real time. 2. Tools for measuring latency ------------------------------------------------------------------------------ How do you measure event latency in the Linux kernel? Since there are many types of events that occur in the kernel, there's no one way to measure event latency. Usually a program is written to measure a specific type of latency. One such tool is called "cyclictest". Cyclictest was written by Thomas Gleixner and is now maintained by Clark Williams (with assistance from John Kacur, Carsten Emde, Uwe Klein-Koeig and the rest of the gang on the linux-rt-users mailing list) as part of the rt-tests package. This group of test programs may be found at the git repository: git://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git or as a tarfile at: https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git Cyclictest measures the amount of time that passes between when a timer expires and when the thread which set the timer actually runs. It does this by taking a time snapshot just prior to waiting for a specific time interval (t1), then taking another time snapshot after the timer finishes (t2), then comparing the theoretical wakeup time with the actual wakeup time (t2 -(t1 + sleep_time)). This value is the latency for that timer wakeup. Here is some sample output from a typical cyclictest run: $ cyclictest --smp -p95 -m policy: fifo: loadavg: 0.04 0.01 0.00 1/338 31976 T: 0 (31974) P:95 I:1000 C: 4990 Min: 9 Act: 37 Avg: 31 Max: 59 T: 1 (31975) P:95 I:1500 C: 3322 Min: 10 Act: 68 Avg: 57 Max: 90 ^C This shows cyclictest being run on a system with two cpu cores, with one measurement thread per core, each running as SCHED_FIFO priority 95, and with all memory locked. In the above ouput, cpu0 has had a max latency of 59 microseconds with an average latency of 31 microseconds, while cpu1 has a max of 90 microseconds with an average of 57 microseconds. Cyclictest has many options for adjusting how measurements are made, such as how many measurement threads are run, scheduling policies for measurement threads, etc., which we're going to use in the next section to try and pinpoint excessive latency sources within the kernel. 3. Tracing ------------------------------------------------------------------------------ The RT kernel contains a subsystem that is useful for locating timing problems, called the "ftrace" system. You can find more information on ftrace at the rt wiki site: http://rt.wiki.kernel.org/index.php/Ftrace There is also an interesting article on the Linux Weekly News (LWN) site: http://lwn.net/Articles/322666/ For now, rather than try to explain the ins and outs of ftrace, we'll just use it. To make use of frace, it must be configured into your kernel. The following configuration options will get you the features we'll use in the rest of this howto: CONFIG_FTRACE=y CONFIG_FUNCTION_TRACER=y CONFIG_FUNCTION_GRAPH_TRACER=y CONFIG_SCHED_TRACER=y CONFIG_FTRACE_SYSCALLS=y CONFIG_STACK_TRACER=y CONFIG_DYNAMIC_FTRACE=y CONFIG_FUNCTION_PROFILER=y CONFIG_DEBUG_FS=y Once you have an RT kernel with ftace enable and that kernel is booted, you must mount the DebugFS filesystem. This is done by convention at /sys/kernel/debug, like so: # mount -t debugfs debugfs /sys/kernel/debug Nothing prevents you from mounting it somwhere with a shorter path, such as /debug. Be aware that not all tools are smart enough to discover where the DebugFS has been mounted if it's not at it's traditional place (cyclictest is though). Once the DebugFS has been mounted, you should see a 'tracing' subdirectory in /sys/kernel/debug. This directory is where all the interface files to frace reside. # ls /sys/kernel/debug/tracing available_events ksym_profile sysprof_sample_period available_filter_functions ksym_trace_filter trace available_tracers options trace_clock buffer_size_kb per_cpu trace_marker current_tracer printk_formats trace_options dyn_ftrace_total_info README trace_pipe events saved_cmdlines trace_stat failures set_event tracing_cpumask function_profile_enabled set_ftrace_filter tracing_enabled kprobe_events set_ftrace_notrace tracing_on kprobe_profile set_ftrace_pid Now you're ready to run cyclictest with some tracing options. The easiest way to do that is like this: # cyclictest --smp -p95 -f -b 200 That tells cyclictest to run with one measurement thread per core, each at realtime priority (SCHED_FIFO) 95 and to run the function tracer, with a 'break threshold' of 200 microseconds. What this means is that cyclictest will setup for tracing by writing values into the tracing control files: tracing/current_tracer <- "function" tracing/tracing_thres <- "200" tracing/tracing_enabled <- "1" It will then start its normal measurement run, displaying the latency values as shown above, but when a latency value exceeds the break threshold, cyclictest will stop the trace (by writing a "0" into tracing/tracing_enabled) and then stop measuring, reporting the id of the thread that first exceeded the threshold. # cyclictest --smp -p95 -f -b 200 could not set latency_hist/wakeup_latency/reset to 1 policy: fifo: loadavg: 0.16 0.23 0.18 3/336 4961 T: 0 ( 4960) P:95 I:1000 C: 0 Min:1000000 Act: 0 Avg: 0 Max:-1000000 T: 1 ( 4961) P:95 I:1500 C: 0 Min:1000000 Act: 0 Avg: 0 Max:-1000000 # Thread Ids: 04960 04961 # Break thread: 4960 The above run shows that the specified threshold (200us) was hit by thread 4960 even before the display code could update the screen. The next step is to retrieve the trace data, which is done by reading the contents of tracing/trace: # cat /sys/kernel/debug/tracing/trace >tracedata.txt This output could be many megabytes in size, so make sure you send the output to a filesystem with plenty of space.