Tuning with OProfile on Fedora 12
William Cohen <wcohen@redhat.com>
3/16/2010

INTRODUCTION

OProfile is a low-overhead, system-wide profiler for Linux that uses the performance monitoring hardware on the processor to find the locations and causes of performance bottlenecks. OProfile is available for Fedora 12. This document describes how to determine whether OProfile is installed on the system and how to use OProfile to search for common performance problems in code. The software packages used for this tutorial were oprofile-0.9.5-4.fc12.x86_64 and kernel-2.6.32.9-70.fc12.x86_64.

OPROFILE INSTALLATION

OProfile is closely tied to the Linux kernel and the processor architecture. Currently, OProfile supports the processors listed in Table 1. The TIMER_INT mechanism is a fallback for processors without supported performance monitoring hardware. It may also be used in environments where the performance counters are not available, such as virtualized guests running under QEMU or KVM.

First, check that the appropriate Linux kernel and OProfile packages are installed:

# check that oprofile is installed
rpm -q oprofile
# check the kernel
rpm -q kernel

For this tutorial oprofile-0.9.5-4.fc12.x86_64 and kernel-2.6.32.9-70.fc12.x86_64 were used. If OProfile is not on the machine, it can be installed with the following command:

yum install oprofile

If you want to get detailed information about the kernel, you will also need to install the kernel-debuginfo package with the following command (you may need to change "kernel-debuginfo" to "kernel-PAE-debuginfo" or whatever matches your currently installed kernel):

debuginfo-install kernel

At this point you should be able to load the OProfile module and set up the oprofile filesystem by running the following command as root:

opcontrol --init

You should be able to verify that the oprofile module is loaded with "/sbin/lsmod". The oprofile file system should be available as "/dev/oprofile".
The contents of "/dev/oprofile" should look similar to the following:

ls /dev/oprofile
0   12  16  2   23  27  30  6  backtrace_depth   cpu_buffer_size  pointer_size
1   13  17  20  24  28  31  7  buffer            cpu_type         stats
10  14  18  21  25  29  4   8  buffer_size       dump             time_slice
11  15  19  22  26  3   5   9  buffer_watershed  enable

The file "/dev/oprofile/cpu_type" contains a string indicating the processor type oprofile is going to use. You can "cat /dev/oprofile/cpu_type" to get the string. Table 1 lists the possible strings and the number of counters supported by each processor.

Table 1. OProfile processors

Processor                   cpu_type             #counters
Intel Arch. Perfmon         i386/arch_perfmon    2
Intel Atom                  i386/atom            2
Intel Core                  i386/core            2
Intel Core 2                i386/core_2          2
Intel Core i7               i386/core_i7         4
Intel Pentium 4 (non-HT)    i386/p4              8
Intel Pentium 4 (HT)        i386/p4-ht           4
Intel Pentium M             i386/p6_mobile       2
AMD Athlon                  i386/athlon          4
AMD Family 10h              x86-64/family10      4
AMD Family 11h              x86-64/family11h     4
AMD AMD64                   x86-64/hammer        4
IBM PowerPC 970             ppc64/970            8
IBM PowerPC 970MP           ppc64/970MP          8
Cell Broadband Engine       ppc64/cell-be        4
IBM Power 4                 ppc64/power4         8
IBM Power 5                 ppc64/power5         6
IBM Power 5+                ppc64/power5+        6
IBM Power 5++               ppc64/power5++       4
IBM Power 6                 ppc64/power6         6
IBM Power 7                 ppc64/power7         6
IBM power compat ver. 1     ppc64/ibm-compat-v1  4
TIMER_INT                   timer                1 (not user settable)
IBM iSeries                 timer                1 (not user settable)
IBM pSeries                 timer                1 (not user settable)
IBM s390                    timer                1 (not user settable)
IBM s390x                   timer                1 (not user settable)

After verifying that the OProfile module is installed and the OProfile file system is mounted, the next step is to configure OProfile to collect data.

OPROFILE CONFIGURATION

The configuration of oprofile for monitoring user-space code consists of indicating which events to monitor. For TIMER_INT (/dev/oprofile/cpu_type "timer") there is no choice of the events to monitor or the rate.
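For scripting around these limits, Table 1 can be expressed as a small shell lookup. This is an illustrative sketch covering only a few of the rows above; the function name counters_for is hypothetical, and on a real system the argument would come from /dev/oprofile/cpu_type.

```shell
# Hypothetical helper: map an oprofile cpu_type string to its counter count,
# for a handful of the Table 1 entries (not an exhaustive list).
counters_for() {
    case "$1" in
        i386/arch_perfmon|i386/atom|i386/core|i386/core_2) echo 2 ;;
        i386/core_i7|i386/p4-ht|i386/athlon|x86-64/hammer) echo 4 ;;
        i386/p4|ppc64/970|ppc64/power4)                    echo 8 ;;
        timer)                                             echo 1 ;;
        *)                                                 echo "unknown" ;;
    esac
}

# On a live system: counters_for "$(cat /dev/oprofile/cpu_type)"
counters_for x86-64/hammer   # prints 4
```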
The following will set up the profiling of user-space code with TIMER_INT:

opcontrol --setup --no-vmlinux

The Intel, AMD, and IBM processors that have support for the performance monitoring hardware can be set up in the same way as the TIMER_INT above. In the default setup a time-based event set is selected to produce about 100,000 time-based events per second. Once OProfile has been initialized with the "opcontrol --init" command, the default event can be displayed with the following command:

ophelp --get-default-event

Events and sample rates can also be set explicitly. The following is an example for the AMD64 to get time-based measurements for both kernel-space and user-space:

opcontrol --setup --no-vmlinux --event=CPU_CLK_UNHALTED:1000000:0x0:1:1

This command selects AMD64 time-based sampling with a sample generated every 1,000,000 clock cycles. Smaller values for the count produce more frequent sampling; larger values produce less frequent sampling. For rare events a lower count may be needed to get a sufficient number of samples. For the Pentium 4 time-based event, GLOBAL_POWER_EVENTS, the unit mask (the 0x0 in the AMD64 example above) will need to be set to "0x1". Thus, for a Pentium 4 based machine the command would be:

opcontrol --setup --no-vmlinux --event=GLOBAL_POWER_EVENTS:1000000:0x1:1:1

Table 2 lists the time-based events for other cpu types. The "ophelp" command lists the events that are available on the machine. The "--setup" option causes OProfile to store the configuration information in root's ~/.oprofile/daemonrc, but it does not start the daemon.

Table 2. Time-based events

Processor                     event                unit-mask
Intel Core, Core 2, Core i7   CPU_CLK_UNHALTED
Pentium 4 (HT and non-HT)     GLOBAL_POWER_EVENTS  1
AMD processors                CPU_CLK_UNHALTED
PowerPC 970/Power 4           CYCLES
Power 5/Power 5+/Power 6      CYCLES
TIMER_INT                     (nothing to set)

In some cases it is useful to associate the samples for shared libraries with the executable that called the shared library.
Adding the "--separate=library" option to the opcontrol setup line will cause the OProfile daemon to prepend the application executable's name to the sample file name and allow the analysis tools to determine which application executable those samples are associated with.

Often users are interested in getting detailed profiles of the kernel. The setup will need to include a pointer to an uncompressed and unstripped version of the kernel running on the machine. Fedora packages the vmlinux file with that information in a separate debuginfo RPM, kernel-debuginfo-2.6.32.9-70.fc12.x86_64.rpm. The debuginfo RPMs are very specific. If the running kernel came from kernel-2.6.32.9-70.fc12.x86_64.rpm, then the kernel-debuginfo-2.6.32.9-70.fc12.x86_64.rpm from that same build will be needed; kernel-debuginfo RPMs from other architectures or other builds will not work. Note that the installed kernel debuginfo files can require a significant amount of space.

Assuming the matching debuginfo RPM for the kernel is installed, the AMD64 can be configured with:

opcontrol --setup \
  --vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux \
  --event=CPU_CLK_UNHALTED:1000000:0x0:1:1

Once OProfile is configured, the daemon is started with:

opcontrol --start

The end of the output from the "opcontrol --start" command states the following:

Using 2.6+ OProfile kernel interface.
Reading module info.
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.

Check that the oprofile daemon is running:

ps aux | grep oprofiled

The daemon can be shut down with:

opcontrol --shutdown

One can also use the GUI program oprof_start from the oprofile-gui RPM to control and configure oprofile.

COLLECTING AND ANALYZING DATA

Once the daemon is running, OProfile will collect data on all executables run on the machine. You can run the code that you want to measure. OProfile writes the profiling data to /var/lib/oprofile/samples at its convenience.
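As a quick scripted sanity check that the daemon came up, one can look for the "Profiler running." line in the log. This is a sketch: the log text is embedded below for illustration (copied from the startup output shown above); on a real system you would read /var/lib/oprofile/oprofiled.log instead.

```shell
# Illustrative check: scan (embedded) daemon startup output for the line
# indicating the profiler is collecting samples.
log="Using 2.6+ OProfile kernel interface.
Reading module info.
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running."

# On a live system: grep 'Profiler running' /var/lib/oprofile/oprofiled.log
printf '%s\n' "$log" | grep -q 'Profiler running' && echo "oprofiled is up"
```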
To ensure that all the current profiling data is flushed to the sample files before each analysis of profiling data, type:

opcontrol --dump

You can check for sample files in /var/lib/oprofile/samples/current. Each executable that has samples should appear in that directory; using "find" you should be able to see files named after the executables that ran.

For this OProfile tutorial a Fortran Linpack benchmark was downloaded from http://netlib.org/benchmark/1000d. The program was saved as linpack1000d.f. An "external ran" declaration had to be added to the subroutine matgen to compile with gfortran. The program was compiled with the following command:

gfortran -g -O3 linpack1000d.f -o linpack1000d

To start with a clean slate we can shut down oprofile, save the previously collected samples into /var/lib/oprofile/samples/junk, and start oprofile again:

opcontrol --shutdown
opcontrol --save=junk
opcontrol --start

OProfile automatically collects data on any program that runs while oprofile is operating. Thus, the application is run normally:

./linpack1000d

VIEWING OVERALL SYSTEM PROFILE

Once the profiling data is dumped to the sample files with "opcontrol --dump", you can analyze the data to see which executables are being run and a breakdown of the number of samples. The following uses "opreport" to show everything with 1% or more of the samples:

opreport --long-filenames --threshold 1

Produces:

CPU: AMD64 processors, speed 2000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 1000000
CPU_CLK_UNHALT...|
  samples|      %|
------------------
     5049 92.8125 /root/linpack1000d
      293  5.3860 /usr/lib/debug/lib/modules/2.6.32.9-70.fc12.x86_64/vmlinux

We see that there are 5049 samples for /root/linpack1000d, the application executable. The second column shows the relative amount of time the system spent in the various executables.
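The clean-slate measurement cycle above (shut down, save old samples, restart, run the workload, dump) can be collected into one small script. This is a sketch: opcontrol must be run as root, so the run="echo" guard below only prints each step; set run="" to execute it for real.

```shell
#!/bin/sh
# Dry-run sketch of the measurement cycle described above.
run="echo"                     # set run="" to actually execute (as root)

$run opcontrol --shutdown      # stop the daemon
$run opcontrol --save=junk     # archive previously collected samples
$run opcontrol --start         # start collecting again
$run ./linpack1000d            # run the workload being measured
$run opcontrol --dump          # flush samples before analysis
```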
EXAMINING A SINGLE EXECUTABLE'S PROFILE

In most cases a more detailed analysis of an application executable is desired. This requires debugging information to map addresses in the program back to the source code. Thus, GCC's "-g" option should be used when producing executables. Note that for GCC, "-g" does not change the generated code; it just adds the mapping information needed by the debugger to map addresses back to line numbers. Most RPMs are distributed without debugging information to save space, so you will not be able to get detailed profiling information from just the RPM. As part of the build process, debuginfo RPMs are produced for each RPM. The OProfile supplied by Red Hat can read the information installed by the debuginfo RPMs to allow better characterization of the executables distributed in RPMs. All the debuginfo RPMs related to coreutils (the RPM that contains ls, cp, dd, and many other commands) can be installed with the following command from the yum-utils RPM:

debuginfo-install -y coreutils

For the example linpack1000d program we can find where it spends its time:

opreport -l image:/root/linpack1000d --threshold .1

CPU: AMD64 processors, speed 2000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 1000000
samples  %        symbol name
4901     97.0687  daxpy_
92        1.8221  ran_
27        0.5348  matgen_
11        0.2179  idamax_
9         0.1783  dmxpy_
6         0.1188  dgefa_

We can see that most of the time is spent in the "daxpy_" subroutine.
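Because opreport's symbol listing is plain text, it is easy to post-process in scripts. The sketch below extracts the hottest symbol from such a listing; the data lines are embedded here for illustration (taken from the output above), whereas a real script would pipe "opreport -l ..." directly into the same sort | awk step.

```shell
# Illustrative data: the "samples % symbol" columns from an opreport listing.
cat > /tmp/opreport.txt <<'EOF'
4901 97.0687 daxpy_
92 1.8221 ran_
27 0.5348 matgen_
11 0.2179 idamax_
EOF

# Sort numerically by sample count and print the hottest symbol.
sort -k1,1nr /tmp/opreport.txt | awk 'NR==1 {print $3}'   # prints daxpy_
```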
To get a detailed view of the addresses in a particular function in the executable:

opreport -l image:/root/linpack1000d --include-symbols=daxpy_ --details | more

CPU: AMD64 processors, speed 2000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 1000000
vma      samples  %        symbol name
00400880 4901     100.000  daxpy_
  00400880 1        0.0204
  00400881 7        0.1428
  00400893 1        0.0204
  00400895 1        0.0204
  00400928 1        0.0204
  00400934 1        0.0204
  00400938 30       0.6121
  ...
  00400a0a 1        0.0204
  00400a10 154      3.1422
  00400a15 906     18.4860
  00400a1c 773     15.7723
  00400a22 3        0.0612
  00400a2c 326      6.6517
  00400a35 802     16.3640
  00400a3a 3        0.0612
  00400a3f 400      8.1616
  00400a4f 199      4.0604
  ...

These samples can be related back to the original source code with the opannotate command:

opannotate --source /root/linpack1000d -o linpack_annote

For each file used to build the executable, an equivalent file is annotated with the sample counts for each line. The created file also has information at the end describing the processor and the events that the counters measured. Below is the daxpy_ subroutine from the created file. We can see that most of the samples are for the inner loop at the bottom of the function.

----daxpy----
    3  0.0594 :      subroutine daxpy(n,da,dx,incx,dy,incy)  /* daxpy_ total: 4901 97.0687 */
              :c
              :c     constant times a vector plus a vector.
              :c     uses unrolled loops for increments equal to one.
              :c     jack dongarra, linpack, 3/11/78.
              :c
              :      double precision dx(1),dy(1),da
              :      integer i,incx,incy,ix,iy,m,mp1,n
              :c
    7  0.1386 :      if(n.le.0)return
    2  0.0396 :      if (da .eq. 0.0d0) return
              :      if(incx.eq.1.and.incy.eq.1)go to 20
              :c
              :c     code for unequal increments or equal increments
              :c     not equal to 1
              :c
              :      ix = 1
              :      iy = 1
              :      if(incx.lt.0)ix = (-n+1)*incx + 1
              :      if(incy.lt.0)iy = (-n+1)*incy + 1
              :      do 10 i = 1,n
              :        dy(iy) = dy(iy) + da*dx(ix)
              :        ix = ix + incx
              :        iy = iy + incy
              :   10 continue
              :      return
              :c
              :c     code for both increments equal to 1
              :c
              :c     clean-up loop
              :c
              :   20 m = mod(n,4)
   70  1.3864 :      if( m .eq. 0 ) go to 40
              :      do 30 i = 1,m
   59  1.1685 :        dy(i) = dy(i) + da*dx(i)
              :   30 continue
              :      if( n .lt. 4 ) return
              :   40 mp1 = m + 1
              :      do 50 i = mp1,n,4
  930 18.4195 :        dy(i) = dy(i) + da*dx(i)
 1128 22.3411 :        dy(i + 1) = dy(i + 1) + da*dx(i + 1)
  403  7.9818 :        dy(i + 2) = dy(i + 2) + da*dx(i + 2)
 2237 44.3058 :        dy(i + 3) = dy(i + 3) + da*dx(i + 3)
              :   50 continue
              :      return
   62  1.2280 :      end
----daxpy----

SAVING PROFILING DATA

The "--save=xxx" option for opcontrol allows you to save the data for future reference, keeping the data from multiple experiments separate. However, you will need to avoid modifying the executables: OProfile cannot perform the analysis if an executable is modified, because it needs both the saved sample files and the unchanged executable. For the previous linpack example the profiling data could be saved with the following command:

opcontrol --save=linpack

Adding "session:linpack" to the analysis commands selects the session saved by the opcontrol command above. Thus, the commands used earlier to analyze linpack would now be the following:

opreport session:linpack --long-filenames | more
opreport session:linpack -l image:/root/linpack1000d | more
opreport -l session:linpack image:/root/linpack1000d --include-symbols=daxpy_ --details | more
opannotate session:linpack --source /root/linpack1000d

OPTIMIZING CODE

The general procedure for optimization is to find the hotspots in the code and revise the code to avoid the performance problems in those areas.
Optimizing rarely executed code does little to improve overall performance. The default event OProfile uses shows the hotspots in the code. You can use the "ophelp --get-default-event" command or refer to Table 2 for the specific event name for identifying code hotspots using time-based sampling. For many processors CPU_CLK_UNHALTED is the event to get time-based measurements. The TIMER_INT provides time-based sampling using a periodic interrupt on systems lacking performance monitoring hardware support.

The hotspots can be checked for common performance problems. Both Intel and AMD have produced documents that describe code optimization techniques for their processors in much greater detail; those documents are listed in the references. A short (and incomplete) list of things to look for:

- Memory references and data cache misses
- Misaligned memory accesses
- Branch misprediction
- Instruction cache misses

MEMORY REFERENCES AND DATA CACHE MISSES

Processor speed has increased at a much greater rate than memory access speed. As a result, each memory cache miss is relatively much more expensive on current processors than on older processors. A cache miss that requires access to main memory on a 3GHz Pentium 4 costs on the order of a hundred clock cycles. To get good processor performance the cache miss rate needs to be low. Tables 3 and 4 show the processor events related to memory references and cache misses, respectively.

Table 3. Memory reference events

Processor                     event                unit-mask
Intel Atom and Arch. Perfmon  LLC_REFS
Intel Core                    DATA_MEM_REFS
Intel Core 2                  L1D_ALL_CACHE_REF
Intel Core i7                 L1D_ALL_REF          0x01
Pentium 4 (HT and non-HT)     --
AMD processors                DATA_CACHE_ACCESSES
TIMER_INT                     (nothing available)

Table 4. Data Cache Miss Events

Processor                     event                unit-mask
Intel Atom and Arch. Perfmon  LLC_MISSES
Intel Core                    L2_LINES_IN
Intel Core 2                  L2_LINES_IN          0x40
Intel Core i7                 LLC_MISSES
Pentium 4 (HT and non-HT)     BSQ_CACHE_REFERENCE  0x10f
AMD processors                DATA_CACHE_MISSES
TIMER_INT                     (nothing available)

Similarly, one can look for sections in the code where there is significant bus traffic. The goal is to reduce the amount of bus traffic in these problem code sections. Table 5 lists events for sampling bus activity.

Table 5. System Bus Traffic

Processor                     event                           unit-mask
Intel Atom                    BUS_DRDY_CLOCKS
Intel Core                    BUS_DRDY_CLOCKS
Intel Core 2                  BUS_DRDY_CLOCKS
Intel Core i7                 --
Pentium 4 (HT and non-HT)     FSB_DATA_ACTIVITY               11
AMD processors                DATA_CACHE_REFILLS_FROM_SYSTEM
TIMER_INT                     (nothing available)

UNALIGNED/PARTIAL MEMORY ACCESSES

Some architectures allow memory accesses to be unaligned. This can lead to data that spans two cache lines, which is more expensive to access. If possible, organize data structures with the elements with the strictest alignment constraints at the beginning of the data structure.

The Pentium 4 has forwarding logic to speed up load/store and store/load sequences. However, for situations where a series of small writes are grouped together to form a larger value, e.g. writing two 32-bit words to form a 64-bit double, the logic will fail and the stores will have to complete before the load can be started. In a similar manner a small load extracting data out of a larger store can stall waiting for the larger store to complete. Thus, using C unions to access parts of larger items can have performance penalties.

Table 6. Unaligned memory references

Processor                     event                 unit-mask
Intel Atom                    ---
Intel Core                    MISALIGN_MEM_REF
Intel Core 2                  MISALIGN_MEM_REF
Intel Core i7                 MISALIGN_MEM_REF      0x3
Pentium 4 (HT and non-HT)     ---
AMD processors                MISALIGNED_DATA_REFS
TIMER_INT                     (nothing available)

BRANCH MISPREDICTION

Much of the speed of current processors is due to the deep pipelines running at high clock rates.
However, when a conditional branch is encountered the processor may be required to wait until a value is available and the next instruction to fetch and process can be determined. Most processors now have branch prediction circuitry allowing the processor to start executing instructions on the expected path before the outcome of the branch instruction is known. If the prediction is correct, a significant improvement in performance is obtained because the pipeline sits idle for less time. However, if the prediction is incorrect, the processor must nullify the work done for the incorrect branch. On a modern processor a branch misprediction costs 10 to 20 clock cycles; on processors with deeper pipelines the cost is higher.

Table 7 shows the performance monitoring events for sampling branches and Table 8 shows the events for sampling branch mispredictions. Because branches do not occur every clock cycle, it is recommended to set the count interval for branch events to about 1/5 of what is used for time-based sampling, e.g. (processor clock rate)/10,000. If the code has good branch prediction behavior (less than 10% of the branches mispredicted), set the counter interval for branch mispredictions to the same rate used for branches. With the exception of the hyperthreaded Pentium 4, OProfile can monitor both branches and branch mispredictions concurrently. Configure OProfile to sample on both events, then analyze the sample files with opreport and opannotate to look for places where there are a large number of branch misprediction samples relative to the number of branch samples.

Table 7. Branches

Processor                     event                 unit-mask
Intel Atom                    BR_INST_RETIRED
Intel Core                    BR_INST_RETIRED
Intel Core 2                  BR_INST_RETIRED
Intel Core i7                 BR_INST_RETIRED
Pentium 4 (HT and non-HT)     RETIRED_BRANCH_TYPE   15
AMD processors                RETIRED_BRANCHES
TIMER_INT                     (nothing available)

Table 8. Branch mispredictions

Processor                     event                          unit-mask
Intel Atom                    BR_MISS_PRED_RETIRED
Intel Core                    BR_MISS_PRED_RETIRED
Intel Core 2                  BR_MISS_PRED_RETIRED
Intel Core i7                 BR_MISS_PRED_RETIRED
Pentium 4 (HT and non-HT)     RETIRED_MISPRED_BRANCH_TYPE    15
AMD processors                RETIRED_BRANCHES_MISPREDICTED
TIMER_INT                     (nothing available)

When compiling, GCC attempts to order the code to help the static prediction of branches: forward branches not taken (e.g. error handling code) and backward branches taken (e.g. loops). However, GCC's static estimates of branches' relative frequency may be incorrect. GCC can instead use information collected with the "-fprofile-arcs" option, which instruments the generated code to collect data on how often the various branches are taken. The instrumented program is run with typical input; when the program completes execution it writes the data out to ".da" files in the build directory. The program is then compiled a second time with the same options, except replacing "-fprofile-arcs" with "-fbranch-probabilities". GCC will use the ".da" files to reorder the basic blocks in functions to help the static prediction, and infrequently used blocks of code will be moved to the end of the function to improve cache performance.

GCC can also eliminate branches by inlining functions. This occurs when the "-O3" or "-finline-functions" options are used to compile code. Inlining functions can also reduce the overhead of the function prologue and epilogue.

INSTRUCTION CACHE MISSES

The instruction cache miss rate can also influence performance. Excessive cache miss rates can be found by comparing "Instruction Fetch" (event in Table 9) to "Instruction Fetch Miss" (event in Table 10). For microbenchmarks this is unlikely to be a problem, but for large programs it could be a concern. The Pentium 4 does not have a traditional instruction cache; it has a trace cache that holds decoded micro-ops.
There is an event that can be used to determine the time that the trace cache is in build mode (decoding a single instruction at a time and placing the micro-ops in the trace cache) and the time the trace cache is in deliver mode (providing up to three micro-ops per cycle to the execution unit). A unit mask must be specified for TC_DELIVER_MODE to indicate what to measure on the Pentium 4.

Table 9. Instruction Fetch

Processor                     event               unit-mask
Intel Atom                    INST_RETIRED
Intel Core                    INST_RETIRED
Intel Core 2                  INST_RETIRED_ANY_P
Intel Core i7                 INST_RETIRED        0
Pentium 4 (non-HT)            TC_DELIVER_MODE     7
AMD processors                ICACHE_FETCHES
TIMER_INT                     (nothing available)

Table 10. Instruction Cache Miss

Processor                     event               unit-mask
Intel Atom                    ICACHE              0x02
Intel Core                    ICACHE_MISSES
Intel Core 2                  L1I_MISSES
Intel Core i7                 L1I                 0x02
Pentium 4 (non-HT)            TC_DELIVER_MODE     0x38
AMD processors                ICACHE_MISSES
TIMER_INT                     (nothing available)

CAUTIONS ABOUT OPROFILE

OProfile is a useful tool, but there are some limitations one should be aware of when using it:

- Use of shared libraries: samples for code in a shared library are not attributed to the particular application unless the "--separate=library" option is used.

- Performance monitoring samples are inexact: when a performance monitoring register triggers a sample interrupt, the interrupt is not precise like a divide-by-zero exception. Due to the out-of-order execution of instructions by the processor, the sample may be recorded on a nearby instruction.

- opreport misattributes inlined functions' samples: opreport uses a simple address range mechanism to determine which function an address is in. Samples from inlined functions are attributed not to the inlined function but to the function the inlined code was inserted into.

- OProfile accumulates data from multiple runs: OProfile is a system-wide profiler and expects processes to start up and shut down multiple times. Thus, samples from multiple runs will accumulate.
One can use the command "opcontrol --reset" to clear out the samples.

- Non-CPU-limited performance problems: OProfile is oriented to finding problems with CPU-limited processes. It does not identify processes that are asleep because they are waiting on locks or for some other event to occur, e.g. an I/O device to finish an operation.

REFERENCES

AMD Athlon Processor x86 Code Optimization Guide, February 2002, AMD Publication 22007, Revision K. http://support.amd.com/us/Processor_TechDocs/22007.pdf

Software Optimization Guide for AMD64 Processors, October 2005, AMD Publication 25112, Revision 3.06. http://support.amd.com/us/Processor_TechDocs/25112.PDF

Software Optimization Guide for AMD Family 10h Processors, May 2009, AMD Publication 40546, Revision 3.11. http://support.amd.com/us/Processor_TechDocs/40546-PUB-Optguide_3-11_5-21-09.pdf

GCC homepage, March 2010. http://gcc.gnu.org/

Intel® 64 and IA-32 Architectures Optimization Reference Manual, November 2009, Intel Order Number 248966-020. http://www.intel.com/Assets/PDF/manual/248966.pdf

OProfile, March 2010. http://oprofile.sourceforge.net/