Tuning with OProfile on Fedora 12
William Cohen <wcohen@redhat.com>
3/16/2010

INTRODUCTION

OProfile is a low-overhead, system-wide profiler for Linux that uses the performance monitoring hardware on the processor to find the locations and causes of performance bottlenecks. OProfile is available for Fedora 12. This document describes how to determine whether OProfile is installed on the system and how to use OProfile to search for common performance problems in code. The software packages used for this tutorial were oprofile-0.9.5-4.fc12.x86_64 and kernel-2.6.32.9-70.fc12.x86_64.

OPROFILE INSTALLATION

OProfile is closely tied to the Linux kernel and the processor architecture. Currently, OProfile supports the processors listed in Table 1. The TIMER_INT mechanism is a fallback for processors without supported performance monitoring hardware. It may also be used in environments where the performance counters are not available, such as virtualized guests running under QEMU or KVM.

First, check that the appropriate Linux kernel and OProfile packages are installed:

# check that oprofile is installed
rpm -q oprofile
# check the kernel
rpm -q kernel

For this tutorial oprofile-0.9.5-4.fc12.x86_64 and kernel-2.6.32.9-70.fc12.x86_64 were used. If OProfile is not on the machine, it can be installed with the following command:

yum install oprofile

If you want to get detailed information about the kernel, you will also need to install the kernel-debuginfo package with the following command (you may need to change "kernel-debuginfo" to "kernel-PAE-debuginfo" or whatever matches your currently installed kernel):

debuginfo-install kernel

At this point you should be able to load the OProfile module and set up the oprofile filesystem by running the following command as root:

opcontrol --init

You should be able to verify that the oprofile module is loaded with "/sbin/lsmod". The oprofile file system should be available as "/dev/oprofile".
The contents of "/dev/oprofile" should look similar to the following:

ls /dev/oprofile
0   12  16  2   23  27  30  6  backtrace_depth   cpu_buffer_size  pointer_size
1   13  17  20  24  28  31  7  buffer            cpu_type         stats
10  14  18  21  25  29  4   8  buffer_size       dump             time_slice
11  15  19  22  26  3   5   9  buffer_watershed  enable

The file "/dev/oprofile/cpu_type" contains a string indicating the processor type oprofile is going to use. You can "cat /dev/oprofile/cpu_type" to get the string. Table 1 lists the possible strings and the number of counters supported by each processor.

Table 1. OProfile processors

Processor                   cpu_type             #counters
Intel Arch. Perfmon         i386/arch_perfmon    2
Intel Atom                  i386/atom            2
Intel Core                  i386/core            2
Intel Core 2                i386/core_2          2
Intel Core i7               i386/core_i7         4
Intel Pentium 4 (non-HT)    i386/p4              8
Intel Pentium 4 (HT)        i386/p4-ht           4
Intel Pentium M             i386/p6_mobile       2
AMD Athlon                  i386/athlon          4
AMD Family 10h              x86-64/family10      4
AMD Family 11h              x86-64/family11h     4
AMD AMD64                   x86-64/hammer        4
IBM PowerPC 970             ppc64/970            8
IBM PowerPC 970MP           ppc64/970MP          8
Cell Broadband Engine       ppc64/cell-be        4
IBM Power 4                 ppc64/power4         8
IBM Power 5                 ppc64/power5         6
IBM Power 5+                ppc64/power5+        6
IBM Power 5++               ppc64/power5++       4
IBM Power 6                 ppc64/power6         6
IBM Power 7                 ppc64/power7         6
IBM power compat ver. 1     ppc64/ibm-compat-v1  4
TIMER_INT                   timer                1 (not user settable)
IBM iSeries                 timer                1 (not user settable)
IBM pSeries                 timer                1 (not user settable)
IBM s390                    timer                1 (not user settable)
IBM s390x                   timer                1 (not user settable)

After verifying that the OProfile module is installed and the OProfile file system is mounted, the next step is to configure OProfile to collect data.

OPROFILE CONFIGURATION

The configuration of oprofile for monitoring user-space code consists of indicating which events to monitor. For TIMER_INT (/dev/oprofile/cpu_type "timer") there is no choice of the events to monitor or the rate.
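For scripting around these limits, Table 1 can be expressed as a small shell lookup. This is an illustrative sketch covering only a few of the rows above; the function name counters_for is hypothetical, and on a real system the argument would come from /dev/oprofile/cpu_type.

```shell
# Hypothetical helper: map an oprofile cpu_type string to its counter count,
# for a handful of the Table 1 entries (not an exhaustive list).
counters_for() {
    case "$1" in
        i386/arch_perfmon|i386/atom|i386/core|i386/core_2) echo 2 ;;
        i386/core_i7|i386/p4-ht|i386/athlon|x86-64/hammer) echo 4 ;;
        i386/p4|ppc64/970|ppc64/power4)                    echo 8 ;;
        timer)                                             echo 1 ;;
        *)                                                 echo "unknown" ;;
    esac
}

# On a live system: counters_for "$(cat /dev/oprofile/cpu_type)"
counters_for x86-64/hammer   # prints 4
```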
The following will set up the profiling of user-space code with TIMER_INT:

opcontrol --setup --no-vmlinux

The Intel, AMD, and IBM processors that have support for the performance monitoring hardware can be set up in the same way as the TIMER_INT above. In the default setup a time-based event set is selected to produce about 100,000 time-based events per second. Once OProfile has been initialized with the "opcontrol --init" command, the default event can be displayed with the following command:

ophelp --get-default-event

Events and sample rates can also be set explicitly. The following is an example for the AMD64 to get time-based measurements for both kernel-space and user-space:

opcontrol --setup --no-vmlinux --event=CPU_CLK_UNHALTED:1000000:0x0:1:1

This command selects AMD64 time-based sampling with a sample generated every 1,000,000 clock cycles. Smaller values for the count produce more frequent sampling; larger values produce less frequent sampling. For rare events a lower count may be needed to get a sufficient number of samples. For the Pentium 4 time-based event, GLOBAL_POWER_EVENTS, the unit mask (the 0x0 in the AMD64 example above) will need to be set to "0x1". Thus, for a Pentium 4 based machine the command would be:

opcontrol --setup --no-vmlinux --event=GLOBAL_POWER_EVENTS:1000000:0x1:1:1

Table 2 lists the time-based events for other cpu types. The "ophelp" command lists the events that are available on the machine. The "--setup" option causes OProfile to store the configuration information in root's ~/.oprofile/daemonrc, but it does not start the daemon.

Table 2. Time-based events

Processor                     event                unit-mask
Intel Core, Core 2, Core i7   CPU_CLK_UNHALTED
Pentium 4 (HT and non-HT)     GLOBAL_POWER_EVENTS  1
AMD processors                CPU_CLK_UNHALTED
PowerPC 970/Power 4           CYCLES
Power 5/Power 5+/Power 6      CYCLES
TIMER_INT                     (nothing to set)

In some cases it is useful to associate the samples for shared libraries with the executable that called the shared library.
Adding the "--separate=library" option to the opcontrol setup line will cause the OProfile daemon to prepend the application executable's name to the sample file name and allow the analysis tools to determine which application executable those samples are associated with.

Often users are interested in getting detailed profiles of the kernel. The setup will need to include a pointer to an uncompressed and unstripped version of the kernel running on the machine. Fedora packages the vmlinux file with that information in a separate debuginfo RPM, kernel-debuginfo-2.6.32.9-70.fc12.x86_64.rpm. The debuginfo RPMs are very specific. If the running kernel came from kernel-2.6.32.9-70.fc12.x86_64.rpm, then the kernel-debuginfo-2.6.32.9-70.fc12.x86_64.rpm from that same build will be needed; kernel-debuginfo RPMs from other architectures or other builds will not work. Note that the installed kernel debuginfo files can require a significant amount of space.

Assuming the matching debuginfo RPM for the kernel is installed, the AMD64 can be configured with:

opcontrol --setup \
  --vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux \
  --event=CPU_CLK_UNHALTED:1000000:0x0:1:1

Once OProfile is configured, the daemon is started with:

opcontrol --start

The end of the output from the "opcontrol --start" command states the following:

Using 2.6+ OProfile kernel interface.
Reading module info.
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.

Check that the oprofile daemon is running:

ps aux | grep oprofiled

The daemon can be shut down with:

opcontrol --shutdown

One can also use the GUI program oprof_start from the oprofile-gui RPM to control and configure oprofile.

COLLECTING AND ANALYZING DATA

Once the daemon is running, OProfile will collect data on all executables run on the machine. You can run the code that you want to measure. OProfile writes the profiling data to /var/lib/oprofile/samples at its convenience.
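As a quick scripted sanity check that the daemon came up, one can look for the "Profiler running." line in the log. This is a sketch: the log text is embedded below for illustration (copied from the startup output shown above); on a real system you would read /var/lib/oprofile/oprofiled.log instead.

```shell
# Illustrative check: scan (embedded) daemon startup output for the line
# indicating the profiler is collecting samples.
log="Using 2.6+ OProfile kernel interface.
Reading module info.
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running."

# On a live system: grep 'Profiler running' /var/lib/oprofile/oprofiled.log
printf '%s\n' "$log" | grep -q 'Profiler running' && echo "oprofiled is up"
```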
To ensure that all the current profiling data is flushed to the sample files before each analysis of profiling data, type:

opcontrol --dump

You can check for sample files in /var/lib/oprofile/samples/current. Each executable that has samples should appear in that directory; using "find" you should be able to see files named after the executables that ran.

For this OProfile tutorial a Fortran Linpack benchmark was downloaded from http://netlib.org/benchmark/1000d. The program was saved as linpack1000d.f. An "external ran" declaration had to be added to the subroutine matgen to compile with gfortran. The program was compiled with the following command:

gfortran -g -O3 linpack1000d.f -o linpack1000d

To start with a clean slate we can shut down oprofile, save the previously collected samples into /var/lib/oprofile/samples/junk, and start oprofile again:

opcontrol --shutdown
opcontrol --save=junk
opcontrol --start

OProfile automatically collects data on any program that runs while oprofile is operating. Thus, the application is run normally:

./linpack1000d

VIEWING OVERALL SYSTEM PROFILE

Once the profiling data is dumped to the sample files with "opcontrol --dump", you can analyze the data to see which executables are being run and a breakdown of the number of samples. The following uses "opreport" to show everything with 1% or more of the samples:

opreport --long-filenames --threshold 1

Produces:

CPU: AMD64 processors, speed 2000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 1000000
CPU_CLK_UNHALT...|
  samples|      %|
------------------
     5049 92.8125 /root/linpack1000d
      293  5.3860 /usr/lib/debug/lib/modules/2.6.32.9-70.fc12.x86_64/vmlinux

We see that there are 5049 samples for /root/linpack1000d, the application executable. The second column shows the relative amount of time the system spent in the various executables.
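The clean-slate measurement cycle above (shut down, save old samples, restart, run the workload, dump) can be collected into one small script. This is a sketch: opcontrol must be run as root, so the run="echo" guard below only prints each step; set run="" to execute it for real.

```shell
#!/bin/sh
# Dry-run sketch of the measurement cycle described above.
run="echo"                     # set run="" to actually execute (as root)

$run opcontrol --shutdown      # stop the daemon
$run opcontrol --save=junk     # archive previously collected samples
$run opcontrol --start         # start collecting again
$run ./linpack1000d            # run the workload being measured
$run opcontrol --dump          # flush samples before analysis
```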
EXAMINING A SINGLE EXECUTABLE'S PROFILE

In most cases a more detailed analysis of an application executable is desired. This requires debugging information to map addresses in the program back to the source code. Thus, GCC's "-g" option should be used when producing executables. Note that for GCC, "-g" does not change the generated code; it just adds the mapping information needed by the debugger to map addresses back to line numbers. Most RPMs are distributed without debugging information to save space, so you will not be able to get detailed profiling information from just the RPM. As part of the build process, debuginfo RPMs are produced for each RPM. The OProfile supplied by Red Hat can read the information installed by the debuginfo RPMs to allow better characterization of the executables distributed in RPMs. All the debuginfo RPMs related to coreutils (the RPM that contains ls, cp, dd, and many other commands) can be installed with the following command from the yum-utils RPM:

debuginfo-install -y coreutils

For the example linpack1000d program we can find where it spends its time:

opreport -l image:/root/linpack1000d --threshold .1

CPU: AMD64 processors, speed 2000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 1000000
samples  %        symbol name
4901     97.0687  daxpy_
92        1.8221  ran_
27        0.5348  matgen_
11        0.2179  idamax_
9         0.1783  dmxpy_
6         0.1188  dgefa_

We can see that most of the time is spent in the "daxpy_" subroutine.
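Because opreport's symbol listing is plain text, it is easy to post-process in scripts. The sketch below extracts the hottest symbol from such a listing; the data lines are embedded here for illustration (taken from the output above), whereas a real script would pipe "opreport -l ..." directly into the same sort | awk step.

```shell
# Illustrative data: the "samples % symbol" columns from an opreport listing.
cat > /tmp/opreport.txt <<'EOF'
4901 97.0687 daxpy_
92 1.8221 ran_
27 0.5348 matgen_
11 0.2179 idamax_
EOF

# Sort numerically by sample count and print the hottest symbol.
sort -k1,1nr /tmp/opreport.txt | awk 'NR==1 {print $3}'   # prints daxpy_
```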
To get a detailed view of the addresses in a particular function in the executable:

opreport -l image:/root/linpack1000d --include-symbols=daxpy_ --details | more

CPU: AMD64 processors, speed 2000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 1000000
vma      samples  %        symbol name
00400880 4901     100.000  daxpy_
  00400880 1        0.0204
  00400881 7        0.1428
  00400893 1        0.0204
  00400895 1        0.0204
  00400928 1        0.0204
  00400934 1        0.0204
  00400938 30       0.6121
  ...
  00400a0a 1        0.0204
  00400a10 154      3.1422
  00400a15 906     18.4860
  00400a1c 773     15.7723
  00400a22 3        0.0612
  00400a2c 326      6.6517
  00400a35 802     16.3640
  00400a3a 3        0.0612
  00400a3f 400      8.1616
  00400a4f 199      4.0604
  ...

These samples can be related back to the original source code with the opannotate command:

opannotate --source /root/linpack1000d -o linpack_annote

For each file used to build the executable, an equivalent file is annotated with the sample counts for each line. The created file also has information at the end describing the processor and the events that the counters measured. Below is the daxpy_ subroutine from the created file. We can see that most of the samples are for the inner loop at the bottom of the function.

----daxpy----
    3  0.0594 :      subroutine daxpy(n,da,dx,incx,dy,incy)  /* daxpy_ total: 4901 97.0687 */
              :c
              :c     constant times a vector plus a vector.
              :c     uses unrolled loops for increments equal to one.
              :c     jack dongarra, linpack, 3/11/78.
              :c
              :      double precision dx(1),dy(1),da
              :      integer i,incx,incy,ix,iy,m,mp1,n
              :c
    7  0.1386 :      if(n.le.0)return
    2  0.0396 :      if (da .eq. 0.0d0) return
              :      if(incx.eq.1.and.incy.eq.1)go to 20
              :c
              :c     code for unequal increments or equal increments
              :c     not equal to 1
              :c
              :      ix = 1
              :      iy = 1
              :      if(incx.lt.0)ix = (-n+1)*incx + 1
              :      if(incy.lt.0)iy = (-n+1)*incy + 1
              :      do 10 i = 1,n
              :        dy(iy) = dy(iy) + da*dx(ix)
              :        ix = ix + incx
              :        iy = iy + incy
              :   10 continue
              :      return
              :c
              :c     code for both increments equal to 1
              :c
              :c     clean-up loop
              :c
              :   20 m = mod(n,4)
   70  1.3864 :      if( m .eq. 0 ) go to 40
              :      do 30 i = 1,m
   59  1.1685 :        dy(i) = dy(i) + da*dx(i)
              :   30 continue
              :      if( n .lt. 4 ) return
              :   40 mp1 = m + 1
              :      do 50 i = mp1,n,4
  930 18.4195 :        dy(i) = dy(i) + da*dx(i)
 1128 22.3411 :        dy(i + 1) = dy(i + 1) + da*dx(i + 1)
  403  7.9818 :        dy(i + 2) = dy(i + 2) + da*dx(i + 2)
 2237 44.3058 :        dy(i + 3) = dy(i + 3) + da*dx(i + 3)
              :   50 continue
              :      return
   62  1.2280 :      end
----daxpy----

SAVING PROFILING DATA

The "--save=xxx" option for opcontrol allows you to save the data for future reference, keeping the data from multiple experiments separate. However, you will need to avoid modifying the executables: OProfile cannot perform the analysis if an executable is modified, because it needs both the saved sample files and the unchanged executable. For the previous linpack example the profiling data could be saved with the following command:

opcontrol --save=linpack

Adding "session:linpack" to the analysis commands selects the session saved by the opcontrol command above. Thus, the commands used earlier to analyze linpack would now be the following:

opreport session:linpack --long-filenames | more
opreport session:linpack -l image:/root/linpack1000d | more
opreport -l session:linpack image:/root/linpack1000d --include-symbols=daxpy_ --details | more
opannotate session:linpack --source /root/linpack1000d

OPTIMIZING CODE

The general procedure for optimization is to find the hotspots in the code and revise the code to avoid the performance problems in those areas.
Optimizing rarely executed code does little to improve overall performance. The default event OProfile uses shows the hotspots in the code. You can use the "ophelp --get-default-event" command or refer to Table 2 for the specific event name for identifying code hotspots using time-based sampling. For many processors CPU_CLK_UNHALTED is the event to get time-based measurements. The TIMER_INT provides time-based sampling using a periodic interrupt on systems lacking performance monitoring hardware support.

The hotspots can be checked for common performance problems. Both Intel and AMD have produced documents that describe code optimization techniques for their processors in much greater detail; those documents are listed in the references. A short (and incomplete) list of things to look for:

- Memory references and data cache misses
- Misaligned memory accesses
- Branch misprediction
- Instruction cache misses

MEMORY REFERENCES AND DATA CACHE MISSES

Processor speed has increased at a much greater rate than memory access speed. As a result, each memory cache miss is relatively much more expensive on current processors than on older processors. A cache miss that requires access to main memory on a 3GHz Pentium 4 costs on the order of a hundred clock cycles. To get good processor performance the cache miss rate needs to be low. Tables 3 and 4 show the processor events related to memory references and cache misses, respectively.

Table 3. Memory reference events

Processor                     event                unit-mask
Intel Atom and Arch. Perfmon  LLC_REFS
Intel Core                    DATA_MEM_REFS
Intel Core 2                  L1D_ALL_CACHE_REF
Intel Core i7                 L1D_ALL_REF          0x01
Pentium 4 (HT and non-HT)     --
AMD processors                DATA_CACHE_ACCESSES
TIMER_INT                     (nothing available)

Table 4. Data Cache Miss Events

Processor                     event                unit-mask
Intel Atom and Arch. Perfmon  LLC_MISSES
Intel Core                    L2_LINES_IN
Intel Core 2                  L2_LINES_IN          0x40
Intel Core i7                 LLC_MISSES
Pentium 4 (HT and non-HT)     BSQ_CACHE_REFERENCE  0x10f
AMD processors                DATA_CACHE_MISSES
TIMER_INT                     (nothing available)

Similarly, one can look for sections in the code where there is significant bus traffic. The goal is to reduce the amount of bus traffic in these problem code sections. Table 5 lists events for sampling bus activity.

Table 5. System Bus Traffic

Processor                     event                           unit-mask
Intel Atom                    BUS_DRDY_CLOCKS
Intel Core                    BUS_DRDY_CLOCKS
Intel Core 2                  BUS_DRDY_CLOCKS
Intel Core i7                 --
Pentium 4 (HT and non-HT)     FSB_DATA_ACTIVITY               11
AMD processors                DATA_CACHE_REFILLS_FROM_SYSTEM
TIMER_INT                     (nothing available)

UNALIGNED/PARTIAL MEMORY ACCESSES

Some architectures allow memory accesses to be unaligned. This can lead to data that spans two cache lines, which is more expensive to access. If possible, organize data structures with the elements with the strictest alignment constraints at the beginning of the data structure.

The Pentium 4 has forwarding logic to speed up load/store and store/load sequences. However, for situations where a series of small writes are grouped together to form a larger value, e.g. writing two 32-bit words to form a 64-bit double, the logic will fail and the stores will have to complete before the load can be started. In a similar manner a small load extracting data out of a larger store can stall waiting for the larger store to complete. Thus, using C unions to access parts of larger items can have performance penalties.

Table 6. Unaligned memory references

Processor                     event                 unit-mask
Intel Atom                    ---
Intel Core                    MISALIGN_MEM_REF
Intel Core 2                  MISALIGN_MEM_REF
Intel Core i7                 MISALIGN_MEM_REF      0x3
Pentium 4 (HT and non-HT)     ---
AMD processors                MISALIGNED_DATA_REFS
TIMER_INT                     (nothing available)

BRANCH MISPREDICTION

Much of the speed of current processors is due to the deep pipelines running at high clock rates.
However, when a conditional branch is encountered the processor may be required to wait until a value is available and the next instruction to fetch and process can be determined. Most processors now have branch prediction circuitry allowing the processor to start executing instructions on the expected path before the outcome of the branch instruction is known. If the prediction is correct, a significant improvement in performance is obtained because the pipeline sits idle for less time. However, if the prediction is incorrect, the processor must nullify the work done for the incorrect branch. On a modern processor a branch misprediction costs 10 to 20 clock cycles; on processors with deeper pipelines the cost is higher.

Table 7 shows the performance monitoring events for sampling branches and Table 8 shows the events for sampling branch mispredictions. Because branches do not occur every clock cycle, it is recommended to set the count interval for branch events to about 1/5 of what is used for time-based sampling, e.g. (processor clock rate)/10,000. If the code has good branch prediction behavior (less than 10% of the branches mispredicted), set the counter interval for branch mispredictions to the same rate used for branches. With the exception of the hyperthreaded Pentium 4, OProfile can monitor both branches and branch mispredictions concurrently. Configure OProfile to sample on both events, then analyze the sample files with opreport and opannotate to look for places where there are a large number of branch misprediction samples relative to the number of branch samples.

Table 7. Branches

Processor                     event                 unit-mask
Intel Atom                    BR_INST_RETIRED
Intel Core                    BR_INST_RETIRED
Intel Core 2                  BR_INST_RETIRED
Intel Core i7                 BR_INST_RETIRED
Pentium 4 (HT and non-HT)     RETIRED_BRANCH_TYPE   15
AMD processors                RETIRED_BRANCHES
TIMER_INT                     (nothing available)

Table 8. Branch mispredictions

Processor                     event                          unit-mask
Intel Atom                    BR_MISS_PRED_RETIRED
Intel Core                    BR_MISS_PRED_RETIRED
Intel Core 2                  BR_MISS_PRED_RETIRED
Intel Core i7                 BR_MISS_PRED_RETIRED
Pentium 4 (HT and non-HT)     RETIRED_MISPRED_BRANCH_TYPE    15
AMD processors                RETIRED_BRANCHES_MISPREDICTED
TIMER_INT                     (nothing available)

When compiling, GCC attempts to order the code to help the static prediction of branches: forward branches not taken (e.g. error handling code) and backward branches taken (e.g. loops). However, GCC's static estimates of branches' relative frequency may be incorrect. GCC can instead use information collected with the "-fprofile-arcs" option, which instruments the generated code to collect data on how often the various branches are taken. The instrumented program is run with typical input; when the program completes execution it writes the data out to ".da" files in the build directory. The program is then compiled a second time with the same options, except replacing "-fprofile-arcs" with "-fbranch-probabilities". GCC will use the ".da" files to reorder the basic blocks in functions to help the static prediction, and infrequently used blocks of code will be moved to the end of the function to improve cache performance.

GCC can also eliminate branches by inlining functions. This occurs when the "-O3" or "-finline-functions" options are used to compile code. Inlining functions can also reduce the overhead of the function prologue and epilogue.

INSTRUCTION CACHE MISSES

The instruction cache miss rate can also influence performance. Excessive cache miss rates can be found by comparing "Instruction Fetch" (event in Table 9) to "Instruction Fetch Miss" (event in Table 10). For microbenchmarks this is unlikely to be a problem, but for large programs it could be a concern. The Pentium 4 does not have a traditional instruction cache; it has a trace cache that holds decoded micro-ops.
There is an event that can be used to determine the time that the trace cache is in build mode (decoding a single instruction at a time and placing the micro-ops in the trace cache) and the time the trace cache is in deliver mode (providing up to three micro-ops per cycle to the execution unit). A unit mask must be specified for TC_DELIVER_MODE to indicate what to measure on the Pentium 4.

Table 9. Instruction Fetch

Processor                     event               unit-mask
Intel Atom                    INST_RETIRED
Intel Core                    INST_RETIRED
Intel Core 2                  INST_RETIRED_ANY_P
Intel Core i7                 INST_RETIRED        0
Pentium 4 (non-HT)            TC_DELIVER_MODE     7
AMD processors                ICACHE_FETCHES
TIMER_INT                     (nothing available)

Table 10. Instruction Cache Miss

Processor                     event               unit-mask
Intel Atom                    ICACHE              0x02
Intel Core                    ICACHE_MISSES
Intel Core 2                  L1I_MISSES
Intel Core i7                 L1I                 0x02
Pentium 4 (non-HT)            TC_DELIVER_MODE     0x38
AMD processors                ICACHE_MISSES
TIMER_INT                     (nothing available)

CAUTIONS ABOUT OPROFILE

OProfile is a useful tool, but there are some limitations one should be aware of when using it:

- Use of shared libraries: samples for code in a shared library are not attributed to the particular application unless the "--separate=library" option is used.

- Performance monitoring samples are inexact: when a performance monitoring register triggers a sample interrupt, the interrupt is not precise like a divide-by-zero exception. Due to the out-of-order execution of instructions by the processor, the sample may be recorded on a nearby instruction.

- opreport misattributes inlined functions' samples: opreport uses a simple address range mechanism to determine which function an address is in. Samples from inlined functions are attributed not to the inlined function but to the function the inlined code was inserted into.

- OProfile accumulates data from multiple runs: OProfile is a system-wide profiler and expects processes to start up and shut down multiple times. Thus, samples from multiple runs will accumulate.
One can use the command "opcontrol --reset" to clear out the samples.

- Non-CPU-limited performance problems: OProfile is oriented to finding problems with CPU-limited processes. It does not identify processes that are asleep because they are waiting on locks or for some other event to occur, e.g. an I/O device to finish an operation.

REFERENCES

AMD Athlon Processor x86 Code Optimization Guide, February 2002, AMD Publication 22007, Revision K. http://support.amd.com/us/Processor_TechDocs/22007.pdf

Software Optimization Guide for AMD64 Processors, October 2005, AMD Publication 25112, Revision 3.06. http://support.amd.com/us/Processor_TechDocs/25112.PDF

Software Optimization Guide for AMD Family 10h Processors, May 2009, AMD Publication 40546, Revision 3.11. http://support.amd.com/us/Processor_TechDocs/40546-PUB-Optguide_3-11_5-21-09.pdf

GCC homepage, March 2010. http://gcc.gnu.org/

Intel® 64 and IA-32 Architectures Optimization Reference Manual, November 2009, Intel Order Number 248966-020. http://www.intel.com/Assets/PDF/manual/248966.pdf

OProfile, March 2010. http://oprofile.sourceforge.net/