Tuning with OProfile
William Cohen <wcohen@redhat.com>
7/10/2003

INTRODUCTION

OProfile is a low-overhead, system-wide profiler for Linux that uses the performance monitoring hardware on the processor to find the locations and causes of performance bottlenecks. OProfile is available for the Taroon Beta of Red Hat Enterprise Linux. This document describes how to determine whether OProfile is installed on the system and how to use OProfile to search for common performance problems in code. The software packages used for this tutorial were oprofile-0.5.4-3 and kernel-smp-2.4.21-1.1931.2.317.ent.

OPROFILE INSTALLATION

OProfile is closely tied to the Linux kernel and the processor architecture. Currently, OProfile on the Red Hat Taroon Beta supports the processors listed in Table 1. The TIMER_INT is a fallback mechanism for processors without supported performance monitoring hardware. The TIMER_INT mechanism may also be used on machines that have problems with the interrupt handling hardware, e.g. laptop computers.

For the Red Hat supplied kernels, only the Symmetric Multi-Processor (SMP) kernels have OProfile support enabled. The uniprocessor kernels do not have OProfile support enabled because some laptops have problems with OProfile. However, if the SMP kernel runs on the laptop, OProfile can be used.

First, check that the appropriate Linux kernel and OProfile packages are installed:

  # check that oprofile is installed
  rpm -qa | grep oprofile
  # check the kernel
  rpm -qa | grep kernel

For this tutorial oprofile-0.5.4-3 and kernel-smp-2.4.21-1.1931.2.317.ent were used. To use OProfile, the SMP kernel must be running. Verify that an SMP kernel is running with:

  # check which kernel is running
  uname -a

At this point you should be able to load the OProfile module and set up the oprofile filesystem by running the following command as root:

  opcontrol --init

You should be able to verify that the oprofile module is loaded with "/sbin/lsmod".
The oprofile file system should be available as "/dev/oprofile". The contents of "/dev/oprofile" should look similar to the following:

  ls /dev/oprofile
  0  2  4  6  buffer       buffer_watershed  cpu_type  enable  stats
  1  3  5  7  buffer_size  cpu_buffer_size   dump      kernel_only

The file "/dev/oprofile/cpu_type" contains a string indicating the processor type oprofile is going to use; "cat /dev/oprofile/cpu_type" prints the string. Table 1 lists the possible strings, the number of events that can be monitored concurrently, and the number of numerical directories you should see in /dev/oprofile.

Table 1. OProfile processors

  Processor            cpu_type        #counters
  Pentium Pro          i386/ppro       2
  Pentium II           i386/pii        2
  Pentium III          i386/piii       2
  Pentium 4 (non-HT)   i386/p4         8
  Pentium 4 (HT)       i386/p4-ht      4
  Athlon               i386/athlon     4
  Hammer               x86-64/hammer   4
  Itanium              ia64/itanium    4
  Itanium 2            ia64/itanium2   4
  TIMER_INT            timer           1 (not user settable)
  IBM iseries          timer           1 (not user settable)
  IBM pseries          timer           1 (not user settable)
  IBM s390             timer           1 (not user settable)
  IBM s390x            timer           1 (not user settable)

After verifying that the OProfile module is installed and the OProfile file system is mounted, the next step is to configure OProfile to collect data.

OPROFILE CONFIGURATION

Configuring oprofile consists of indicating the kernel that is running and which events to monitor. For TIMER_INT (/dev/oprofile/cpu_type "timer") there is no choice of events to monitor or sampling rate. The following sets up profiling for the TIMER_INT case:

  opcontrol --setup --vmlinux=/boot/vmlinux-`uname -r`

The Pentium, Athlon, Hammer, and ia64 processors, which have support for performance monitoring hardware, can be set up in the same way as the TIMER_INT above. In the default setup, a time-based event set is selected to produce about 2,000 samples per second per processor. Events and sample rates can also be set explicitly.
The following is an example for the Intel Pentium 4 to get time-based measurements:

  opcontrol --setup --vmlinux=/boot/vmlinux-`uname -r` \
    --ctr0-event=GLOBAL_POWER_EVENTS --ctr0-count=240000 --ctr0-unit-mask=1

This command selects the currently running kernel and Pentium 4 time-based sampling, with a sample generated every 240,000 clock cycles. Smaller values for --ctr0-count produce more frequent sampling; larger values produce less frequent sampling. For rarer events a lower count may be needed to get a sufficient number of samples. The "--ctr0-unit-mask=1" option tells oprofile to count clock cycles only while the processor is unhalted. Table 2 lists time-based events for other cpu_types. The "op_help" command lists the events available on the machine. The "--setup" option causes oprofile to store the configuration information in /home/root/.oprofile/daemonrc, but it does not start the daemon.

Table 2. Time-based events

  Processor                  --ctr0-event=         --ctr0-unit-mask=
  Pentium Pro/PII/PIII       CPU_CLK_UNHALTED
  Pentium 4 (HT and non-HT)  GLOBAL_POWER_EVENTS   1
  Athlon/Hammer              CPU_CLK_UNHALTED
  Itanium 2                  CPU_CYCLES
  TIMER_INT                  (nothing to set)

In some cases it is useful to associate the samples for shared libraries with the executable that called the shared library. Adding the "--separate=library" option to the opcontrol setup line will cause the oprofile daemon to prepend the application executable name to the sample file name, allowing the analysis tools to determine which application executable those samples are associated with.

Once oprofile is configured, the daemon is started with:

  opcontrol --start

The end of the output from "opcontrol --start" states the following:

  Using log file /var/lib/oprofile/oprofiled.log
  Profiler running.

Check that the oprofile daemon is running:

  ps aux | grep oprofiled

The daemon can be shut down with:

  opcontrol --shutdown

One can also use the GUI program oprof_start to control and configure oprofile.
COLLECTING AND ANALYZING DATA

Once the daemon is running, OProfile collects data on all executables run on the machine. You can run the code that you want to measure. OProfile writes the profiling data to /var/lib/oprofile/samples at its convenience. To ensure that all the current profiling data is flushed to the sample files before each analysis, type the following (as root):

  opcontrol --dump

You can check for sample files in /var/lib/oprofile/samples. Each executable that has samples should be listed there. Each sample file name encodes the path to the executable, with "}" replacing "/", followed by "#" and the number of the counter used for that sample file.

For this OProfile tutorial a Fortran Linpack benchmark was downloaded from http://netlib.org/benchmark/1000d. The program was saved as linpack1000d.f and compiled with the following command:

  g77 -g -O3 linpack1000d.f -o linpack1000d

To start with a clean slate, the previously collected samples can be saved into /var/lib/oprofile/samples/junk with:

  opcontrol --save=junk

OProfile automatically collects data on any program that runs while oprofile is operating, so the application is run normally:

  ./linpack1000d

VIEWING OVERALL SYSTEM PROFILE

Once the profiling data is dumped to the sample files, you can analyze the data with "op_time" to see which executables are being run and a breakdown of the number of samples in each:

  op_time -r --counter 0 | more

produces:

  7310 76.3845 0.0000 /root/linpack1000d
   882  9.2163 0.0000 /boot/vmlinux-2.4.21-1.1931.2.317.entsmp
   691  7.2205 0.0000 /usr/lib/gcc-lib/i386-redhat-linux/3.2.3/f771
   327  3.4169 0.0000 /bin/bash
    85  0.8882 0.0000 /usr/bin/gdmgreeter
    62  0.6479 0.0000 /usr/bin/ld
    24  0.2508 0.0000 /usr/bin/as
  ...

We see that there are 7310 samples for /root/linpack1000d, the application executable. Because GLOBAL_POWER_EVENTS is being measured by counter 0, this output shows the relative amount of time the system spent in the various executables.
On the Pentium 4 and other processors there are restrictions on which counters can measure specific events. Thus, you may need to use a counter other than counter 0 to collect and analyze data.

EXAMINING A SINGLE EXECUTABLE'S PROFILE

In most cases a more detailed analysis of an application executable is desired. This requires debugging information to map addresses in the program back to the source code, so GCC's "-g" option should be used when producing executables. Note that for GCC, "-g" does not change the executable code; it just adds the mapping information needed by the debugger to map addresses back to line numbers. Most RPMs are distributed without debugging information to save space, so you will not be able to get detailed profiling information from just the RPM. Recently, the Red Hat build system started building debuginfo RPMs for each normal RPM. The OProfile supplied by Red Hat can read the information installed by the debuginfo RPMs to allow better characterization of the executables distributed in RPMs.

For the example linpack1000d program we can find where it spends its time:

  oprofpp --counter 0 -l -r /root/linpack1000d | more

  Cpu type: P4 / Xeon
  Cpu speed was (MHz estimation) : 2386.61
  Counter 0 counted GLOBAL_POWER_EVENTS events (time during which processor is
  not stopped) with a unit mask of 0x01 (count cycles when processor is active)
  count 1193500
  vma      samples  %          symbol name
  08048fa0 6591     90.1888    daxpy_
  08048b68 470      6.43131    dgefa_
  08049b64 152      2.07991    ran_
  08049ce0 63       0.862069   matgen_
  08049ddc 16       0.218938   idamax_
  08049330 15       0.205255   dmxpy_
  08049254 1        0.0136836  dscal_
  ...

We can see that most of the time is spent in the "daxpy_" subroutine.
To get a detailed view of the addresses in a particular function in the executable:

  oprofpp --counter 0 -s daxpy_ /root/linpack1000d | more

  Cpu type: P4 / Xeon
  Cpu speed was (MHz estimation) : 2386.61
  Counter 0 counted GLOBAL_POWER_EVENTS events (time during which processor is
  not stopped) with a unit mask of 0x01 (count cycles when processor is active)
  count 1193500
  vma      samples  %          symbol name
  08048fa0 6591     100        daxpy_
  08048fa0 1        0.0151722
  08048fa4 1        0.0151722
  08048fb5 2        0.0303444
  08048fc0 1        0.0151722
  08048fc2 2        0.0303444
  08048fc4 2        0.0303444
  08048fca 1        0.0151722
  08048fd3 1        0.0151722
  08048fd5 1        0.0151722
  08048fd8 2        0.0303444
  0804902e 1        0.0151722
  ...
  080490b9 68       1.03171
  080490bb 71       1.07723
  080490be 25       0.379305
  080490c1 1371     20.8011
  080490c4 132      2.00273
  080490c8 72       1.0924
  080490ca 1460     22.1514
  080490cc 143      2.16963
  080490d1 92       1.39584
  080490d4 1408     21.3625
  080490d7 74       1.12274
  080490d9 63       0.955849
  080490dc 15       0.227583
  080490df 1356     20.5735
  080490e2 133      2.0179
  080490e8 64       0.971021
  080490ed 6        0.0910332

These samples can be related back to the original source code with the op_to_source command:

  op_to_source --source-dir /root /root/linpack1000d

For each file used to build the executable, an equivalent file is annotated with the sample counts for each line. The created file also has information at the end describing the processor and the events that the counters measured. Below is the daxpy_ subroutine from the created file. We can see that most of the samples are for the inner loop at the bottom of the function.

----daxpy----
     2  0.027% :      subroutine daxpy(n,da,dx,incx,dy,incy) /* daxpy_ total: 6591 90.18% */
               :c
               :c     constant times a vector plus a vector.
               :c     uses unrolled loops for increments equal to one.
               :c     jack dongarra, linpack, 3/11/78.
               :c
               :      double precision dx(1),dy(1),da
               :      integer i,incx,incy,ix,iy,m,mp1,n
               :c
               :      if(n.le.0)return
     8  0.109% :      if (da .eq. 0.0d0) return
     4  0.054% :      if(incx.eq.1.and.incy.eq.1)go to 20
               :c
               :c        code for unequal increments or equal increments
               :c          not equal to 1
               :c
               :      ix = 1
               :      iy = 1
               :      if(incx.lt.0)ix = (-n+1)*incx + 1
               :      if(incy.lt.0)iy = (-n+1)*incy + 1
               :      do 10 i = 1,n
               :        dy(iy) = dy(iy) + da*dx(ix)
               :        ix = ix + incx
               :        iy = iy + incy
     3  0.041% :   10 continue
               :      return
               :c
               :c        code for both increments equal to 1
               :c
               :c        clean-up loop
               :c
               :   20 m = mod(n,4)
               :      if( m .eq. 0 ) go to 40
     2  0.027% :      do 30 i = 1,m
    13  0.177% :        dy(i) = dy(i) + da*dx(i)
     3  0.041% :   30 continue
               :      if( n .lt. 4 ) return
               :   40 mp1 = m + 1
     3  0.041% :      do 50 i = mp1,n,4
  1535  21.00% :        dy(i) = dy(i) + da*dx(i)
  1664  22.76% :        dy(i + 1) = dy(i + 1) + da*dx(i + 1)
  1643  22.48% :        dy(i + 2) = dy(i + 2) + da*dx(i + 2)
  1508  20.63% :        dy(i + 3) = dy(i + 3) + da*dx(i + 3)
   203  2.777% :   50 continue
               :      return
----daxpy----

OPTIMIZING CODE

The general procedure for optimization is to find the hotspots in the code and revise the code to avoid those performance problems. Optimizing rarely executed code does little to improve overall performance. For the Intel Pentium Pro, Pentium II, and Pentium III, CPU_CLK_UNHALTED can be used to get time-based measurements. For the Pentium 4, GLOBAL_POWER_EVENTS provides time-based measurements. The TIMER_INT provides time-based sampling using a periodic interrupt on processors without supported performance monitoring hardware. Refer to Table 2 for the events to monitor for time-based sampling on the various processors.

The hotspots can then be checked for common performance problems. Both Intel and AMD have produced documents that describe techniques for optimizing code for their processors in much greater detail. These documents are listed in the references.
A short (and incomplete) list of things to look for:

  - memory references and data cache misses
  - misaligned memory accesses
  - branch mispredictions
  - instruction cache misses

MEMORY REFERENCES AND DATA CACHE MISSES

Processor speed has increased at a much greater rate than memory access speed. As a result, each data cache miss is relatively much more expensive on current processors than on older processors. A cache miss that requires access to main memory on a 3GHz Pentium 4 costs on the order of a hundred clock cycles. To get good processor performance the cache miss rate needs to be low. Tables 3 and 4 show the processor events related to memory references and cache misses, respectively.

Table 3. Memory reference events

  Processor                  event=
  Pentium Pro/PII/PIII       DATA_MEM_REFS
  Pentium 4 (HT and non-HT)  --
  Athlon/Hammer              DATA_CACHE_ACCESSES
  Itanium 2                  DATA_REFERENCES_SET0, DATA_REFERENCES_SET1
  TIMER_INT                  (nothing available)

Table 4. Data cache miss events

  Processor                  event=
  Pentium Pro/PII/PIII       DCU_MISS_OUTSTANDING
  Pentium 4 (HT and non-HT)  BSQ_CACHE_REFERENCE  unit-mask=0x10f
  Athlon/Hammer              DATA_CACHE_MISSES
  Itanium 2                  L1D_READ_MISSES
  TIMER_INT                  (nothing available)

Similarly, one can look for sections in the code where there is significant bus traffic. The goal is to reduce the amount of bus traffic in these problem code sections. Table 5 lists events for sampling bus activity.

Table 5. System bus traffic

  Processor                  event=
  Pentium Pro/PII/PIII       BUS_DRDY_CLOCKS  unit-mask=0
  Pentium 4 (HT and non-HT)  FSB_DATA_ACTIVITY  unit-mask=11
  Athlon/Hammer              DATA_CACHE_REFILLS_FROM_SYSTEM
  Itanium 2                  BUS_DATA_CYCLE
  TIMER_INT                  (nothing available)

UNALIGNED/PARTIAL MEMORY ACCESSES

Some architectures allow memory accesses to be unaligned. This can lead to data that spans two cache lines, which is more expensive to access. If possible, organize data structures with the elements with the strictest alignment constraints at the beginning of the data structure. The Pentium 4 has forwarding logic to speed up load/store and store/load sequences.
However, for situations where a series of small writes is grouped together to form a larger value, e.g. writing two 32-bit words to form a 64-bit double, the forwarding logic will fail and the stores will have to complete before the load can be started. In a similar manner, a small load extracting data out of a larger store can stall waiting for the larger store to complete. Thus, using C unions to access parts of larger items can have performance penalties.

BRANCH MISPREDICTION

Much of the speed of current processors is due to deep pipelines running at high clock rates. However, when a conditional branch is encountered the processor may be required to wait until a value is available and the next instruction to fetch and process can be determined. Most processors now have branch prediction circuitry allowing the processor to start executing instructions on the expected path before the outcome of the branch instruction is known. If the prediction is correct, a significant improvement in performance is obtained because the pipeline sits idle for less time. However, if the prediction is incorrect, the processor must nullify the work done for the incorrect branch.

On the Intel Pentium III a branch misprediction costs about 10 clock cycles; on newer processors with deeper pipelines the cost is even higher. Table 6 shows the performance monitoring events for sampling branches, and Table 7 shows the events for sampling branch mispredictions. Because branches do not occur every clock cycle, it is recommended to set the count interval for branches to about 1/5 of what is used for time-based sampling, e.g. (processor clock rate)/10,000. If the code has good branch prediction behavior, less than 10% of the branches are mispredicted; set the counter interval for branch mispredictions to the same rate used for branches. With the exception of the hyper-threaded Pentium 4, OProfile can monitor both branches and branch mispredictions concurrently.
Configure OProfile to sample on both branches and branch mispredictions. Then analyze the sample files with oprofpp and op_to_source to look for places where there is a significant number of branch misprediction samples compared to the number of branch samples.

Table 6. Branches

  Processor                  event=
  Pentium Pro/PII/PIII       BR_INST_RETIRED
  Pentium 4 (HT and non-HT)  RETIRED_BRANCH_TYPE  unit-mask=15
  Athlon/Hammer              RETIRED_BRANCHES
  Itanium 2                  BR_MISPRED_DETAIL  unit-mask=0
  TIMER_INT                  (nothing available)

Table 7. Branch mispredictions

  Processor                  event=
  Pentium Pro/PII/PIII       BR_MISS_PRED_RETIRED
  Pentium 4 (HT and non-HT)  RETIRED_MISPRED_BRANCH_TYPE  unit-mask=15
  Athlon/Hammer              RETIRED_BRANCHES_MISPREDICTED
  Itanium 2                  BR_MISPRED_DETAIL  unit-mask=2 (path)
  Itanium 2                  BR_MISPRED_DETAIL  unit-mask=3 (target)
  TIMER_INT                  (nothing available)

When compiling, GCC attempts to order the code to help the static prediction of branches: forward branches not taken (e.g. error handling code) and backward branches taken (e.g. loops). However, GCC's static estimates of the branches' relative frequencies may be incorrect. GCC can instead use information collected with the "-fprofile-arcs" option, which instruments the generated code to collect data on how often the various branches are taken. The instrumented program is run with typical input; when the program completes execution, it writes the data out to ".da" files in the build directory. The program is then compiled a second time with the same options, except "-fprofile-arcs" is replaced with "-fbranch-probabilities". GCC will use the ".da" files to reorder the basic blocks in functions to help the static prediction. Infrequently used blocks of code are moved to the end of the function to improve cache performance.

GCC can also eliminate branches by inlining functions. This occurs when the "-O3" or "-finline-functions" options are used to compile code. Inlining functions also reduces the overhead of function prologue and epilogue.
INSTRUCTION CACHE MISSES

The instruction cache miss rate can also influence performance. An excessive miss rate can be found by comparing "instruction fetch" (event in Table 8) to "instruction fetch miss" (event in Table 9). For microbenchmarks this is unlikely to be a problem; for large programs it can be a concern. The Pentium 4 does not have a traditional instruction cache; it has a trace cache that holds decoded micro-ops. There is an event that can be used to determine the time the trace cache is in build mode (decoding one instruction at a time and placing the micro-ops in the trace cache) and the time it is in deliver mode (providing up to three micro-ops per cycle to the execution unit). A unit mask must be specified for TC_DELIVER_MODE to indicate what to measure on the Pentium 4.

Table 8. Instruction fetch

  Processor                  event=
  Pentium Pro/PII/PIII       IFU_IFETCH
  Pentium 4 (non-HT)         TC_DELIVER_MODE  unit-mask=7
  Athlon/Hammer              ICACHE_FETCHES
  Itanium 2                  INST_DISPERSED
  TIMER_INT                  (nothing available)

Table 9. Instruction fetch miss

  Processor                  event=
  Pentium Pro/PII/PIII       IFU_IFETCH_MISS
  Pentium 4 (non-HT)         TC_DELIVER_MODE  unit-mask=0x38
  Athlon/Hammer              ICACHE_MISSES
  Itanium 2                  L1I_FILLS
  TIMER_INT                  (nothing available)

CAUTIONS ABOUT OPROFILE

OProfile is a useful tool, but there are some limitations to be aware of when using it:

-use of shared libraries: samples for code in a shared library are not attributed to the particular application unless the "--separate=library" option is used.

-performance monitoring samples are inexact: when a performance monitoring register triggers a sample, the interrupt is not precise like a divide-by-zero exception. Due to the out-of-order execution of instructions by the processor, the sample may be recorded on a nearby instruction.

-oprofpp misattributes inlined functions' samples: oprofpp uses a simple address range mechanism to determine which function an address is in.
Samples in inlined functions are not attributed to the inlined function but rather to the function the inlined code was inserted into.

-OProfile accumulates data across runs: OProfile is a system-wide profiler and expects processes to start up and shut down multiple times. Thus, samples from multiple runs accumulate. One can use the command "opcontrol --reset" to clear out the samples.

-non-CPU-limited performance problems: OProfile is oriented toward finding problems with CPU-limited processes. OProfile does not identify processes that are asleep because they are waiting on locks or for some other event to occur, e.g. an I/O device to finish an operation.

REFERENCES

AMD Athlon Processor x86 Code Optimization Guide, February 2002, AMD Publication 22007. http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf

Software Optimization Guide for AMD Athlon 64 and AMD Opteron Processors, April 2003, AMD Publication 25112. http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.pdf

GCC homepage, July 2003. http://gcc.gnu.org/

IA-32 Intel Architecture Optimization Reference Manual, 2003, Intel Order Number 248966-09. http://www.intel.com/design/pentium4/manuals/248966.htm

Intel Itanium 2 Processor Reference Manual for Software Development and Optimization, June 2002, Intel Document 251110-002. http://www.intel.com/design/itanium2/manuals/251110.htm

OProfile, http://oprofile.sourceforge.net/