This is a guide to basic, and not so basic, troubleshooting and
debugging on Red Hat Linux systems. Goals include describing
common tools and their usage, how to find information, etc.
Basically, info that may be helpful to someone diagnosing
a problem. Emphasis will be on software issues, but
hardware might be covered as well.



Environment settings
   - Allowing Core Files
                                                                                                  
        "core" files are dumps of a processes memory. When a program crashes
    it can leave behind a core file that can help determine what
    was the cause of the crash by loading the core file in a debugger.
                                                                                                  
        By default, most linuxes turn off core file support by setting the
        maximum allowed core file size to 0.
                                                                                                  
        In order to allow a segfaulting application to leave a core, you need
        to raise this limit. This is done via `ulimit`. To allow core
    files to be of an unlimitted size, issue:
                                                                                                  
                ulimit -c unlimited
                                                                                                  
        See the section on GDB for more information on what to do
        with core files.
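
        For example, a quick sanity check that cores will actually be
        written (the program name here is just a placeholder):

                ulimit -c               # prints 0 if core files are disabled
                ulimit -c unlimited
                ./some-crashing-app     # on a segfault, leaves a "core" or core.<pid> file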

  - LD_ASSUME_KERNEL
                                                                                                  
        LD_ASSUME_KERNEL is an environment variable used by the dynamic
        linker to decide which implementation of libraries is used. For
        most cases, the most important lib is the C library, or "libc" or
        "glibc".
                                                                                                  
        The reason "glibc" is important is because it contains the
        thread implentation for a system.
                                                                                                  
        The values you can set LD_ASSUME_KERNEL to equate to Linux
        kernel versions. Since glibc and the kernel are tightly bound,
        it's necessary for glibc to change its behavior based on
        what kernel version is installed.
                                                                                                  
        For properly written apps, there should be no reason to use
        this setting. However, for some legacy apps that depend
        on a particular thread implementation in glibc, LD_ASSUME_KERNEL
        can be used to force the app to use an older implementation.
                                                                                                  
        The primary targets for LD_ASSUME_KERNEL: a value of 2.4.20 or
        higher selects the NPTL thread library. LD_ASSUME_KERNEL=2.4.1
        uses the implementation in /lib/i686 (newer LinuxThreads).
        LD_ASSUME_KERNEL=2.2.5 or older uses the implementation
        in /lib (old LinuxThreads).
                                                                                                  
        For an app that requires the old thread implementation, it
        can be launched as:
                                                                                                  
                LD_ASSUME_KERNEL=2.2.5 ./some-old-app
                                                                                                  
        See http://people.redhat.com/drepper/assumekernel.html for
        more details.
                                                                                                  
     - glibc environment variables

                There's a wide variety of environment variables that
        glibc uses to alter its behavior, many of which are
        useful for debugging or troubleshooting purposes.

        A good reference on these variables is at:
                http://www.scratchbox.org/documentation/general/tutorials/glibcenv.html

        Some interesting ones:
                                                                                                  
                LANG and LANGUAGE

                        LANG sets what message catalog to use, while LANGUAGE
                        sets LANG and all the LC_* variables. These control
                        the locale specific parts of glibc.

                        Lots of programs are written expecting to run in one
                        locale, and can break in other locales. Since locale
                        settings can change things like sort order (LC_COLLATE)
                        and time formats (LC_TIME), shell scripts are
                        particularly prone to problems from this.

                        A script that assumes the sort order of something is
                        a good example.

                        A common way to test this is to try running the
                        troublesome app with the locale set to "C", the
                        default locale.

                                LANGUAGE=C ls -al

                        If the app starts behaving when run that way, there
                        is probably something in the code that is assuming the
                        "C" locale (sorted lists and time formats are strong
                        candidates).
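
                        For example, sort order differs between locales, so a
                        quick way to see if a script's assumptions hold (the
                        file name here is hypothetical):

                                LC_ALL=en_US sort wordlist.txt
                                LC_ALL=C sort wordlist.txt

                        If the two runs order the lines differently, any script
                        parsing that output is locale sensitive.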


   - glibc malloc stuff
     - all the glibc env variable stuff


Tools
    Efficiently debugging and troubleshooting is often a matter
    of knowing the right tools for the job, and how to use
    them.
                                                                                                 
   - strace
	- simple usage
	- filtering output
	- examples
	- use as profiling
	- see what files are open
	- network connections

                Strace is one of the most powerful tools available
        for troubleshooting. It allows you to see what an application
        is doing, to some degree.

                strace displays all the system calls that an application
        is making, what arguments it passes to them, and what the
        return code is. A system call is generally something that
        requires the kernel to do something. This generally means
        I/O of all sorts, process management, shared memory and
        IPC usage, memory allocation, and network usage.
                                                                                                  
                - examples
                                                                                                  
                The simplest example of using strace is as follows:
                        strace ls -al
                                                                                                  
                This starts the strace process, which then starts `ls -al`
        and shows every system call. For `ls -al` this is mostly
        I/O related calls. You can see it stat'ing files, opening
        config files, opening the libs it is linked against, allocating
        memory, and write()'ing out the contents to the screen.
                                                                                                  
                                                                                                  
                - what files is this thing trying to open
                                                                                                  
                  A common troubleshooting technique is to see
        what files an app is reading. You might want to make
        sure it's reading the proper config file, or looking
        at the correct cache, etc. strace by default shows
        all file I/O operations.

                But to make it a bit easier, you can filter
        strace output. To see just file open()'s:
                                                                                                  
                strace -eopen ls -al
                                                                                                  
                - what's this thing doing to the network

                To see all network related system calls (name
        resolution, opening sockets, writing/reading to sockets, etc):
                                                                                                  
                strace -e trace=network curl --head http://www.redhat.com
                                                                                                  
                - rudimentary profiling
                                                                                                  
                One thing strace can be used for, which is useful when
        debugging performance problems, is some simple profiling.
                                                                                                  
                strace -c  ls -la
                                                                                                  
        Invoking strace with '-c' will cause a cumulative report of
        system call usage to be printed. This includes the approximate
        amount of time spent in each call, and how many times a
        system call is made.

        This can sometimes help pinpoint performance issues, especially
        if an app is doing something like repeatedly opening/closing
        the same files.
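
        The report is a small table, roughly of this shape (the numbers
        here are made up, just to show the format):

                % time     seconds  usecs/call     calls    errors syscall
                ------ ----------- ----------- --------- --------- ------------
                 35.10    0.000210          12        17           read
                 22.30    0.000133           4        30         5 open
                ...
                100.00    0.000598                   112         8 total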
                                                                                                  
                strace -tt ls -al
                                                                                                  
        the -tt option causes strace to print a timestamp, with microsecond
        resolution, on each call.
                                                                                                  
                strace -r ls -al
                                                                                                  
        the -r option causes strace to print out the time since the
        last system call. This can be used to spot where a process
        is spending large amounts of time in user space, or especially
        slow syscalls.
                                                                                                  
                - following forks and attaching to running processes
                                                                                                  
        Often it is difficult or impossible to run a command under
        strace (an apache httpd for instance). In this case, it's
        possible to attach to an already running process.

                strace -p 12345

        where 12345 is the PID of a process. This is very handy
        for trying to determine why a process has stalled. Many
        times a process might be blocking while waiting for I/O.
        With strace -p, this is easy to detect.
                                                                                                  
        Lots of processes start other processes. It is often desirable
        to see a strace of all the processes.
                                                                                                  
                strace -f /etc/init.d/httpd start
                                                                                                  
        will strace not just the bash process that runs the script, but
        any helper utilities executed by the script, and httpd itself.

	Since strace output is often a handy way to help a developer
	solve a problem, it's useful to be able to write it to a file.
	The easiest way to do this is with the -o option.

		strace -o /tmp/strace.out program
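
	If you are also following forks with -f, adding -ff makes strace
	write one file per process, named with the pid appended (the
	output file name here is just an example):

		strace -f -ff -o /tmp/strace.out /etc/init.d/httpd start

	That leaves files like /tmp/strace.out.1234, one per process traced.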

	Being somewhat familiar with the common syscalls for Linux
	is helpful in understanding strace output. But most of the common
	ones are simple enough to be able to figure out from context.

	A line in strace output is essentially the system call name,
	the arguments to the call in parens (sometimes truncated...), and then
	the return status. A return status for error is typically -1, but
	varies sometimes. For more information about the return status of a
	typical system call, see `man 2 syscallname`. Usually the return status
	will be documented in the "RETURN VALUE" section.

	Another thing to note is that strace often shows "errno"
	status. If you're not familiar with unix system programming, errno is a
	global variable that gets set to specific values when some calls fail.
	The variable gets set to different values based on the error mode of
	the call. More info on this can be found in `man errno`. But typically,
	strace will show the brief description for any errno values it gets. ie:

		 open("/foo/bar", O_RDONLY) = -1 ENOENT (No such file or directory)


	strace -s X

	the -s option tells strace to show the first X characters of strings.
	The default is 32 characters, which sometimes isn't enough. Raising it
	increases the info available to the user.

	
   - ltrace
	- simple usage
	- filtering output

        ltrace is very similar to strace, except ltrace focuses on
        tracing library calls.
                                                                                                  
        For apps that use a lot of libs, this can be a very powerful
        debugging tool. However, because most modern apps use libs
        very heavily, the output from ltrace can sometimes be
        painfully verbose.
        
	There is a distinction between what makes a "system call"
	and a call to a library function. Sometimes the line between the
	two is blurry, but the basic difference is that system calls are
	essentially communicating with the kernel, and library calls are just
	running more userland code. System calls are usually required for
	things like I/O, process control, memory management issues,
	and other "kernel" things.

	Library calls are, by bulk, generally calls to the standard
	C library (glibc..), but can of course be calls to any library,
	say Gtk, libjpeg, libnss, etc. Luckily most glibc functions
	are well documented and have either man or info pages. Documentation
	for other libraries varies greatly.
                                                                                          
        ltrace supports the -r, -tt, -p, and -c options the same
        as strace. In addition it supports the -S option which
        tells it to print out system calls as well as library
        calls.
                                                                                                  
        One of the more useful options is "-n 2" which will
        indent 2 spaces for each nested calls. This can make it
        much easier to read.
                                                                                                  
        Another useful option is the "-l" option, which
        allows you to specify a specific library to trace, potentionall
        cutting down on the rather verbose output.
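
        Putting a few of those together, a reasonable starting invocation
        might look something like this (the app name is just a placeholder):

                ltrace -S -n 2 -o /tmp/ltrace.out ./some-app

        That writes an indented trace of library calls and system calls
        to /tmp/ltrace.out for later reading.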



   - gdb

	`gdb` is the GNU debugger. A debugger is typically used by developers
     	to debug applications in development. It allows for a very detailed
	examination of exactly what a program is doing.

	That said, gdb isn't as useful as strace/ltrace for troubleshooting/sysadmin
	types of issues, but occasionally it comes in handy.

	For troubleshooting, it's useful for determining what caused a
	core file (`file core` will also typically show you this information).
	But gdb can also show you "where" the program crashed. Once you determine
	the name of the app that caused the failure, you can start gdb with:

		gdb filename core 

		then at the prompt type
		`where`

	The unfortunate thing is that all the binaries we ship are
	stripped of debugging symbols to make them smaller, so this often returns
	less than useful information. However, starting in Red Hat Enterprise Linux
	3, and included in Fedora, there are "debuginfo" packages. These
	packages include all the debugging symbols. You can install them the
	same as any other rpm, so `rpm`, `up2date`, and `yum` all work.

	The only difficult part about debuginfo rpms is figuring out
	which ones you need. Generally, you want the debuginfo package
	for the src rpm of the package that's crashing.

		rpm -qif /path/to/app

	will tell you the info for the binary package the app is part of.
	Part of that info includes the src.rpm. Just use the package name
	of the src rpm plus "-debuginfo".
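
	For example, to track down the debuginfo package for httpd
	(version numbers here are only illustrative):

		rpm -qif /usr/sbin/httpd | grep "Source RPM"
			Source RPM: httpd-2.0.46-25.src.rpm

	so the package to install would be httpd-debuginfo, via
	up2date or yum.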


   - python debugging

   - perl debugging
	- `splain`
	- `perl -V`
	- perldoc -q
	- perldoc -l

   - sh debugging

   - bugbuddy etc

   - top
	`top` is a simple text based system monitoring tool. It packs
	a lot of information onto the screen, which can be helpful when
	troubleshooting problems, particularly performance related
	problems.

 	The top of the "top" output includes a basic summary of the system.
	The top line is current time, uptime since the last reboot, users
	logged, and load average. The load average values here are the
	load for the last 1, 5,and 15 minutes. A load of 1.0 is considerd
	100% cpu utilization, so loads over one typically means stuff
	is having to wait. There is a lot of leeway and approxiation in
	these load values however.
                                                                                                                                                                                     
        The memory line shows the total physical ram available
	on the system, how much of it is used, how much is free, and how
	much is shared, along with the amount of ram in buffers. These
	buffers are typically file system caching, but can be other things.
	On a system with a significant uptime, expect the buffer value to
	take up all free physical ram not in use by a process.
	The swap line is similar.
                                                                                                                                                                                     
	Each of the process entries shown contains several
	fields by default. The most interesting are RSS, %CPU, and
	TIME. RSS shows the amount of physical ram the process is consuming.
	%CPU shows the percentage of the available processor time a process
	is taking, and TIME shows the total amount of processor time the process
	has had. A processor intensive program can easily rack up more "TIME"
	in just a few seconds than a long running low cpu process.

	Sorting the output:

        	M  : sorts the output by memory usage. Pretty handy for figuring
		     out which version of openoffice.org to kill.
        
		P : sorts the process by the % of cpu time they are using.

        	T : sorts by cumulative time
        	
		A : sorts by age of the process, newest process first

	Command line options:

        	The only really useful command line options are:
                                                                                                                                                                                     
        	b [batch mode] writes the standard top output to
        	stdout. Useful for a quick "system monitoring hack".
                                                                                                                                                                                     
        	ie, top d 360 b >> foo.output
        	to get a snapshot of the system appended to foo.output every
        	six minutes.

   - ps 
	`ps` can be thought of as a one shot top. But it's a bit
	more flexible in its output than top.

	As far as `ps` commandline options go, it can get pretty
	hairy. The linux version of `ps` inherits ideas from both
	the BSD version, and the SYSV version. So be warned.

	The `ps` man page does a pretty good job of explaining
	this, so look there for more examples.

	some examples:

		ps aux

	shows all the processes on the system in a "user" oriented
	format. In this case meaning the username of the owner
	of the process is shown in the first column.

		ps auxww

	the "w" option, when used twice, allows the output to be
	of unlimited width. For apps started with lots of commandline
	options, this will allow you to see all the options. 

		ps auxf

	the 'f" option, for "forest" tries to present the list
	of processes in a tree format. This is a quick and easy
	way to see which process are child processes of what.

		ps -eo pid,%cpu,vsz,args,wchan

	This is an interesting example of the -e and -o options. -o
	allows you to customize the output of `ps`. In this
	case, the interesting bit is the wchan field, which
	attempts to show the kernel function (often a syscall) the
	process is sleeping in when `ps` checks.

	For things like apache httpds, this can be useful
	to get an idea what all the processes are doing
	at one time. See the info in the strace section
	on understanding system calls for more info.
   
   - sysstat/sar

        Sysstat works in two parts: a data collection service,
        and a "monitoring" tool.

	The collection service is called "sysstat", and the monitoring
        tool is called `sar`.

	To start it, start the sysstat service:

		service sysstat start
        
	To see a list of `sar` options, just try `sar --help`
        
	Things to note: there are lots of commandline options.
        The trailing number is the interval in seconds between
        updates (an optional second number is a count of samples).

                sar 3

        will run the default sar report every three seconds.
       
	 For a complete summary, try:
                
		sar -A
        
	This generates a big pile of info ;->
        
	To get a good idea of disk i/o activity:
                
		sar -b 3
        
	For something like a heavily used web server, you
        may want to get a good idea how many processes
        are being created per second:
                
		sar -c 2
        
	Kind of surprising to see how many processes can
        be created.
        
	There's also some degree of hardware monitoring built in.
        Monitoring how many times an IRQ is triggered can also
        provide good hints at what's causing system performance problems.
                
		sar -I SUM 3
        
	Will show the total number of system interrupts
                
		sar -I 14 2
        
	Watch the standard IDE controller IRQ every two seconds.
        
	Network monitoring is in here too:
                
		sar -n DEV 2
        
	Will show # of packets sent/received, # of bytes transferred, etc.

                sar -n EDEV 2
        
	Will show stats on network errors.
        
	Memory usage can be monitored with something like:
                
		sar -r 2
        
	This is similar to the output from `free`, except more easily
        parsed.
        
	For SMP machines, you can monitor per CPU stats with:

		sar -P 0

        where 0 is the first processor. The keyword ALL will show
        all of them.
        
	A really useful one on web servers and other configurations
        that use lots and lots of open files is:

                sar -v

        This will show the number of used file handles, the % of
        available file handles in use, and the same for inodes.

       To show the number of context switches (a good indication
        of how much time the system is wasting switching between processes):
       
                sar -w 2

   - vmstat 

      This util is part of the procps package, and can provide lots of useful
       info when diagnosing performance problems.

      	Here's a sample vmstat output on a lightly used desktop:

   procs                      memory    swap          io     system  cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy id
 1  0  0   5416   2200   1856  34612   0   1     2     1  140   194   2   1 97

      	And here's some sample output on a heavily used server:

   procs                      memory    swap          io     system  cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy id
16  0  0   2360 264400  96672   9400   0   0     0     1   53    24   3   1 96
24  0  0   2360 257284  96672   9400   0   0     0     6 3063 17713  64  36 0
15  0  0   2360 250024  96672   9400   0   0     0     3 3039 16811  66  34 0

      	The interesting number here is the first one: the number of
	processes on the run queue. This value shows how many processes are
	ready to be executed, but cannot be run at the moment because other
	processes need to finish. For lightly loaded systems, this is almost
	never above 1-3, and numbers consistently higher than 10 indicate
	the machine is getting pounded.

      	Other interesting values include the "system" numbers for in and cs. The
	in value is the number of interrupts per second a system is getting. A system
	doing a lot of network or disk I/O will have high values here, as interrupts
	are generated every time something is read or written to the disk or network.

	The cs value is the number of context switches per second. A context
	switch is when the kernel has to stop running one process and switch
	the cpu over to running another. It's actually _way_ more complicated
	than that, but that's the basic idea. Lots of context switches are bad,
	since it takes a fairly large number of cycles to perform a context
	switch, so if you are doing lots of them, you are spending all your
	time changing jobs and not actually doing any work. I think we can
	all understand that concept.
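
	To watch these numbers change over time, give vmstat an interval
	(and optionally a count of samples):

		vmstat 5
		vmstat 5 10

	The first form prints a new line every five seconds until interrupted,
	the second prints ten samples and exits.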


   - tcpdump/ethereal
   - netstat 
	     Netstat is an app for getting general information about the status
	of network connections to the machine.

        		netstat

        will just show all the current open sockets on the machine. This will
	include unix domain sockets, tcp sockets, udp sockets, etc.

	One of the more useful options is:

	        netstat -pa
        
	The `-p` option tells it to try to determine what program has the
	socket open, which is often very useful info. For example, someone nmaps
	their system and wants to know what is using port 666. Running
	netstat -pa will show you that it's `satand` running on that tcp port.
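
	A related trick is narrowing that down to just listening tcp sockets,
	with numeric ports and the owning program:

		netstat -tlnp

	which is a quick way to audit what is accepting connections on a box.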

	One of the most twisted, but useful invocations is:

		netstat -a -n|grep -E "^(tcp)"| cut -c 68-|sort|uniq -c|sort -n

	This will show you a sorted list of how many sockets are in each connection
	state. For example:

 	     9  LISTEN      
	     21  ESTABLISHED 

	
	- what process is doing what and
	  to whom over the network
	- number of sockets open
	- socket status

   - lsof

	/usr/sbin/lsof is a utility that checks to see what all open
	files are on the system. There's a ton of options, almost none
	of which you ever need.

	This is mostly useful for seeing what processes
	have what files open. Useful in cases where you need to umount a partition,
	or perhaps you have deleted some file, but its space wasn't reclaimed
	and you want to know why.
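
	A couple of invocations that cover those cases (the mount point
	is just an example):

		/usr/sbin/lsof /home        # what has files open on the /home filesystem
		/usr/sbin/lsof +L1          # open files that have been deleted (link count 0)

	The first is handy before a umount, the second for finding deleted
	files whose space hasn't been freed yet.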


	The EXAMPLES section of the lsof man page includes many
	useful examples.

   - fuser
   - ldd

	ldd prints out shared library dependencies.

	For apps that are reporting missing libraries, this is a handy
	utility. It shows all the libraries a given app or library is
	linked to. 
	
	For most cases, what you will be looking for is missing libs.
	In the ldd output, they will show up something like:

		libpng.so.3 => (file not found)

	In this case, you need to figure out why libpng.so.3 isn't
	being found. It might not be in the standard lib paths,
	or perhaps not in a path in /etc/ld.so.conf. Or you may
	need to run `ldconfig` again to update the ld cache.
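
	A quick way to check just for missing libs (the binary name is
	hypothetical):

		ldd /usr/bin/some-app | grep "not found"

	No output means every lib was resolved.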

	ldd can also be useful when tracking down cases where
	an app is finding a library, but finding the wrong
	library. This can happen if there are two libraries
	with the same name installed on a system in different
	paths.
	
	Since the `ldd` output includes the full path to
	the lib, you can see if anything is pointing at
	a wrong path. One thing to look for when
	scanning for this is one lib that's in a different
	lib path than the rest. If an app uses libs from
	/usr/lib, except for one from /usr/local/lib, there's
	a good chance that's your culprit.

   - nm
   - file

	`file` is a simple utility that tries to figure out
	what kind of file a given file is. It does this
	by magic(5). 

	Where this sometimes comes in handy for troubleshooting
	is looking for rogue files. A .jpg file that is
	actually a .html file. A tar.gz that's not actually
	compressed. Cases like those can sometimes cause
	apps to behave very strangely.
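
	For example, checking a download that refuses to open (file name
	and output here are illustrative):

		file suspect.jpg
		suspect.jpg: HTML document text

	A "jpg" that is really an HTML error page explains a lot of
	strange image viewer behavior.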

   - netcat
	- to see network stuff
   - md5sum
	- verifying files
	- verifying iso's

   - diff

	diff compares two files, and shows the difference
	between the two. 

	For troubleshooting, this is most often used on
	config files. If one version of a config file works,
	but another does not, a `diff` of the two files
	can often be enlightening. Since it can be very
	easy to miss a small difference in a file, being
	able to see just the differences is useful.
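
	For example, comparing a working config against a broken one
	(file names are just placeholders):

		diff -u httpd.conf.working httpd.conf.broken

	The -u option gives unified output with a few lines of context
	around each change, which is usually the easiest form to read.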

	For debugging during development, diff (especially
	the versions built into revision control systems
	like cvs) is invaluable. Seeing exactly what
	changed between two versions is a great help.

	For example, if foo-2.2 is acting weird, where
	foo-2.1 worked fine, it's not uncommon to `diff`
	the source code between the two versions to
	see if anything related to your problem changed. 

   - find

	For troubleshooting a system that seems to have
	suddenly stopped working, find has a few tricks
	up its sleeve. 

	When a system stops working suddenly, the first
	question to ask is "what changed?". 

		find / -mtime -1 

	That command will recursively list all the
	files under / that have changed in the last
	day.

		find /usr/lib -mmin -30

	Will list all the files in /usr/lib that
	changed in the last 30 minutes. 

	Similar options exist for ctime and atime.
	
		find /tmp -amin -30

	Will show all the files in /tmp that have
	been accessed in the last 30 minutes.

	The -atime/-amin options are useful when trying
	to determine if an app is actually reading
	the files it is supposed to. If you run the app,
	then run that command where the files are, and
	nothing has been accessed, something is wrong.

	If no "+" or "-" is given for the time value,
	find will match only exactly that time. This
	is handy in several cases. You can determine
	what files were modified/created at the
	same time. 

	A good example of this is cleaning up
	from a tar package that was unpacked into
	the wrong directory. Since all the files
	will have the same access time, you can
	use find and -exec to delete them all. 	
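
	A rough sketch of that sort of cleanup (the "47" is a made up
	value for "changed exactly 47 minutes ago"):

		find . -cmin 47 -exec rm -i {} \;

	The -i on rm gives a chance to confirm each deletion, which is
	wise for this sort of bulk removal.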

	- executables

	`find` can also find files with particular
	permissions set.

		find / -perm -0777

	will find all files that are readable, writable, and
	executable by everyone, from / down (use -perm -0002
	to match just world writeable files).


		find /tmp -user "alikins"

	will find all files in /tmp owned
	by "alikins"
 
	- used in combo with grep to find
	  markers (errors, filename, etc)

	When troubleshooting, there are plenty of
	cases where you want to find all instances of
	a filename, or a hostname, etc. 

	To recursively grep a large number of files,
	you can use find and its -exec option:

		find . -exec grep foo {} \;

	This will grep for "foo" on all files from
	the current working directory and down.

	Note that in many cases, you can also
	use `grep -r` to do this as well. 

   - ls/stat
        - finding [sym|hard] links
        - out of space
   - df

	Running out of disk space causes so
	many apps to fail in weird and bizarre
	ways that a quick `df -h` is a pretty
	good troubleshooting starting point.

	Use is easy: look for any volume that's
	100% full. Or, in the case of apps that
	might be writing lots of data at once,
	reasonably close to being filled.

	It's pretty common to spend more time
	than anyone would like to admit debugging
	a problem, only to suddenly hear someone yell
	"Damnit! It's out of disk space!".

	A quick check avoids that problem.

	In addition to running out of space,
	it's possible to run out of file system
	inodes. A `df -h` will not show this,
	but a `df -i` will show the number of
        inodes available on each filesystem.

	Being out of inodes can cause even
        more obscure failures than being
	out of space, so something to
	keep in mind. 

   - watch
        - used to see if process output changes
        - free, df, etc
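        A minimal example of the idea (the interval and commands are just
        one reasonable choice):

                watch -n 5 'df -h; free'

        reruns the quoted commands every five seconds (add -d to highlight
        what changed between runs).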

   - ipcs/ipcrm
       - anything that uses shm/ipc
         - oracle/apache/etc
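       A couple of hedged examples for the ipcs/ipcrm notes above:

                ipcs -m                 # list shared memory segments, with owner and size
                ipcrm -m <shmid>        # remove a leftover segment by its id

       Stale shared memory segments left behind by a crashed oracle or
       apache are a classic reason to need ipcrm.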

   - google
	- googling for error messages can
	be very handy

   - source code

	For Red Hat Linux, you have the source code,
	so it can often be useful to search through
	the code for error messages, filenames, or
	other markers related to the problem. In many
	cases, you don't really need to be able to
	understand the programming language to
	get some useful info.

	 Kernel drivers are a great example of this, since they
	often include very detailed info about which hardware
	is supported, what's likely to break, etc.

    - strings
	`strings` is a utility that will search through a
	file and try to find text strings. For troubleshooting,
	sometimes it is handy to be able to look for strings
	in an executable.

	For example, you can run strings on a binary to
	see if it has any hard coded paths to helper utilities.
	If those utils are in the wrong place, the app may
	fail.
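
	For example, to look for hard coded paths in a binary (the binary
	name is a placeholder):

		strings /usr/bin/some-app | grep ^/

	will print anything in the binary that looks like an absolute path.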

	Searching for error messages can help as well,
	especially in cases where you're not sure which
	binary is reporting an error message.

	In some ways, it's a bit like grepping through
	source code for error messages, but a bit
	easier. Unfortunately, it also provides far
	less info.

    - syslog/log levels
	- what goes to syslog
	- how to get more stuff there

   - ksymoops
	- get somewhat meaning info out of kernel
	 traces
	- netdump?

   - xev
	- debugging keycode/mouseclick weirdness, etc

Logs
   - messages, dmesg, lastlog, etc
   - log filtering tools?
	
Using RPM to help troubleshoot
   - package verify
   - missing deps

Types Of Problems
   - Things are missing.

	This type of problem occurs in many forms. Shell scripts that
	expect an executable to exist that doesn't. Applications linked
	against a library that can not be found. Applications expecting
	a config file to exist that doesn't.

	It can get even more subtle when file permissions are involved.
	An app can report a file as "not found" when in reality it
	exists, but the permissions are wrong.

        - Missing Files

	Often an app will fail because of missing files, but will
	not be so helpful as to tell you which file is missing. Or it
	reports the error in a vague manner like "config file not found".

	For most of these cases where something is missing, but you
	are not sure _what_, strace is the best tool.
	
		strace -eopen trouble_causing_app

	That commandline will list all the files the app is
	trying to open, and whether it succeeded or not. The type
	of line to look for is something like:

	open("/path/to/some/file/", O_RDONLY) = -1 ENOENT (No such file or directory)

	That indicates the file wasn't found. In many cases, these errors
	are harmless. For example, some apps will try to open config files
	in the users home directory, in addition to system config files. 
	If the user config file doesn't exist, the app might just continue.

	- Missing Libs

	For missing libraries, the same approach will work. Another
	approach is to run `ldd` against the app, and see if any
	shared libraries show up as missing. See the `ldd` section
	for more details.

	- File Permissions

	For cases where it's the file permissions that are causing
	the problem, you are looking for a line like:

	open("/path/to/file/you/cant/read", O_RDONLY) = -1 EACCES (Permission denied)

	Something about that file is not letting you read it. So the
	permissions need to be checked, or perhaps elevated privileges
	obtained (aka, does the app require running as root?)
	

  - networking
	
	On modern systems, having networking problems is crippling
	at times. Troubleshooting what's causing them can be just
	as painful.

	Some common problems include firewall issues (both on
	the client and external), kernel/kernel module issues,
	routing problems, name resolution, etc.

	Name resolution issues deserve their own category, so
	see the name resolution section for more info.

    - firewall checks
	
	When seeing odd network behavior these days, local
	firewalls are a pretty good suspect. Client side
	firewalls are getting more and more aggressive.

	If you see issues using a network service, especially
	a non standard service, it's possible local firewall
	rules are causing it.

	 Insert info about seeing what firewall rules
	are up.

	Insert info about increasing log levels to see
	firewall rejections in system logs.

	Insert info about temporarily dropping firewalls
	to diagnose problems.
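
	In the meantime, a rough sketch (chain names and rules will
	vary per system):

		iptables -L -n -v           # list the current rules with counters
		service iptables stop       # temporarily drop the firewall
		service iptables start      # ...and put it back when done

	Dropping the firewall is only for short diagnostic windows on
	machines you trust the network around.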

    - Crappy Connections

	A common problem is flaky connections. A few easy
	things to look for to see if an external connection
	might be having issues.

	- ping to a remote host

	  `ping` is very simple, and very low level, so
	  it's a good tool to get an idea if an interface
	  or route is working correctly. 

		ping www.yahoo.com

	  That will start pinging www.yahoo.com and
	  reporting ping times. Stopping it with ctrl-c
	  will show a report of any missed packets. 

	  Generally healthy links will have 0 dropped
	  packets, so anything higher than that is something
	  to be worried about. 

	- traceroute

		traceroute www.yahoo.com

	  Attempts to gather info about each node in
	  the connection. Generally these map to
	  physical routers, but in these days of
	  VPNs, it's hard to tell.

	  If a traceroute stalls at some point,
	  it usually indicates a problem. Also
	  look for high ping times, particularly
	  any node that seems much slower than the
	  others.

	- /sbin/ifconfig

	  ifconfig does a lot. It can control
	  and configure network interfaces of all
	  types. See the man page.

	  When trying to determine if there are networking
	  issues, run `ifconfig` and look for the interface
	  showing issues. If there is a high "error" count,
	  there could be physical layer issues, or possibly
	  overloaded routers etc.

	  That said, with modern networks, it's pretty
	  rare to see interface errors, but it's still
	  something to take a quick look at. 

    - Bandwidth Usage
	When the available network bandwidth runs dry,
	it can be difficult to find the culprits. There's
	a couple subtle variations of this. One is a
	client machine that has some process using a
	lot of bandwidth. Another is a server application
	that has one or more clients using a lot of
	bandwidth.


	- /sbin/ifconfig

	ifconfig reports the number of packets
	sent/received on a network interface, so
	this can be a quick way to get an idea
	what interface is out of bandwidth.

	- sar

	As mentioned in the section on sar, `sar -n DEV`
	can be used to see info about the number of
	packets each interface is sending at a
	given time.

	- trafshow

		 I don't know anything
	about trafshow

	- ntop/intop
		 haven't used in ages


	- netstat

	  `netstat` won't show bandwidth usage, but it
	   is a quick way to see what applications have
	   open network connections, which is often a
	   good start to finding bandwidth hogs.

	   See the netstat section for more info.		

	- tcpdump/ethereal

	  tcpdump and ethereal are both applications
	  to monitor network traffic. tcpdump is pretty
	  standard, but ethereal is more featureful.

	  ethereal also has a nice graphical user
	  interface which can be very handy when
	  attempting to digest the large amounts of
	  data a network trace can deliver.

	  The basic approach is to fire up ethereal,
	  start a capture, let whatever weird networking
	  you're trying to diagnose happen, then stop
	  the capture.

	  Ethereal will display all the connections
	  it traced during the capture. There are
	  a couple ways to look for bandwidth hogs.

	  The "Statitics" menu has a couple useful
	  options. The "Protocol Hierarchy" shows
	  what % of packets in the trace is from
          each type of protocol. In the case of
	  a bandwith hog, at least what protocol
	  is the culprit should be easy to spot
	  here. 

	  The "Conversations" screen is also helpful
	  for looking for bandwidth hogs. Since you
	  can sort the "conversations" by number of
	  packets, the culprit is likely to hop to
	  the top. This isn't always the case, as it
	  could easily be many small connections killing
	  the bandwidth, not one big heavy connection. 

	  As far as tcpdump goes, the best way to spot
	  bandwidth hogs is just to start it up. Since
	  it pretty much dumps all traffic to the screen
	  in a text format, just keep your eyes peeled for
	  what seems to be coming up a lot.

	- using iptables

	  iptables can count how much traffic is flowing through
	  a given rule.

		something like:

		iptables -L -n -v -x

	  will list the rules along with packet and byte counters
	  for each one.

	 
	  
    - routing issues
    - kernel module flakyeness
    - dropped connections


    - tcpdump/ethereal
    - netcat
    - netstat

  - Programs Crashing

	You just finished the last page in your 1200 page
	novel about how aliens invaded Siberia in the
	19th century and made everyone depressed. *boom*
	the word processor disappears off the screen faster
	than it really should. It segfaulted. Your work
	is lost. 

	Crashing applications are annoying to say the
	least. But sometimes, there are ways to figure
	out why they crashed. And if you can figure out
	why, you might be able to avoid it next time. 

	- Crash Catchers

	  Most GNOME/KDE apps now are linked against libs
	  that include a crash catching utility. Basically,
	  whenever the app gets a segfault, the handler for
	  it invokes a process, attaches to it with a debugger,
	  gets a stacktrace, and offers to upload it to a
	  bug tracking system.

	  Since these include the option to see the stack
	  trace, it can be a handy way to get a stack trace.
	  Once you have a stack trace, it should point
	  you to where the app is crashing. Figuring out
	  why it crashed varies greatly in complexity.
 
	- strace

	  `strace` can also be handy for tracking down
	  crashes. It doesn't provide as much detail
	  as ltrace or gdb, but it is commonly available.

	  The idea being to start the app under strace,
	  wait for it to crash, and see what the last
	  few things it did were. Some things to look for
	  include recently opened files (maybe the app
	  is trying to load a corrupted file), memory
	  management calls (maybe something is causing
	  it to use large amounts of ram), failed network
	  connections (maybe the app has poor error handling).

	- ltrace

	  `ltrace` is a bit more useful for debugging crashing
	   apps, as it can give you an idea what function
	   an app was in when it crashed. Not as useful
	   as a real stack trace, but it's easier.

	- gdb

	  When it comes to figuring out all the gory details
	  of why an app crashed, nothing is better than
	  `gdb`. 

	  For basic usage, see the gdb section in the
	  tools section of this document.

	  For more detailed usage, see the gdb documentation.

	  need some examples here

	 - debuginfo packages

	   One caveat with using gdb on most apps, is that
	   they are stripped of debugging information. You can
	   still get a stack trace, but it will not be as meaningful
	   as one with the debug information available. 

	   In the past, this meant recompiling the application with
	   debugging turned on, and "stripping" turned off. Which
	   can at times, be a slow and painful process.

	   In Red Hat Enterprise Linux 3 and later, and in Fedora,
	   you can install the "debuginfo" packages instead.

	   See the gdb section in the tools section for more info
	   on debug packages.

  	 - core files

	   If an application has crashed, and left a core
	   file (see the "Allowing Core Files" section under
           the "Enviroment Settings" section for info on how
           to do this), you can use gdb to debug the core file.

	   Invocation is easy:

			gdb /path/to/the/binary /path/to/core/file

	   After loading the core file, you can issue `bt`
	   to get a backtrace.

	   See the gdb section above for information about
	   "debuginfo" packages.

  - configs screwed up
     An incorrect, missing, or corrupt config file can wreak
     havoc. Well coded apps will usually give you some idea
     if a config file is bogus, but that's not always the case.
     
     - Finding the config files

	The first thing is figuring out if an app
	uses a config file and what it is. There's
	a couple ways to do this.

	- finding config files with rpm
		
	  If a package was installed from an rpm, it
	  should have the config files flagged as such.
	  To query a package to see what its config files
	  are, issue the command:
	
		rpm -q --configfiles packagename

	  While you are using rpm, you should see if
 	  the config files have been modified from
	  the defaults.
	
		rpm -V packagename

	  That command will list all the files in that
	  package that have been changed in some way.
	  The rpm man page includes more details on what
	  the output means, but the basics are:

	  if there is an "S", the file's size has changed.
	  if there is a "5", the file's contents have changed (the md5 checksum differs).
	  if there is an "M", the file's permissions have changed.
	
       - strace
	
	 Using `strace -eopen process` is a good way to see what
	 files a process is opening, including any config files. 

       - documentation
	 
	 If all else fails, try reading the docs. Often the
	 man pages or docs describe where and what the config
	 files are. 

      - Verifying the Config Files

	Once you know what the config files are, then
	you need to verify they are correct. This is
	highly application dependent.

	- diff'ing against known good files

	  If you have a known good config file,
	  diffing the old file and the new one 
	  can often be useful. 

	- look for .rpmnew or .rpmorig files.

	  In some cases, rpm will install a new config
	  file along side the existing one. This happens
	  on package upgrades where the default config
	  file has changed between the two packages, and
	  the version on disk is different from either 
	  version. 

	  The idea being, if the default config file is
	  different, then it's possible the config file
	  format changed. Which means the previous on
	  disk config file may not work with the new
	  version, so a .rpmnew version is installed
	  alongside the existing one. 

	  So if an app is suddenly behaving oddly,
	  especially after a package update, see
	  if there are any .rpmnew or .rpmorig files.
	  If so, you may need to update the existing
	  config file to use the new format.

	- stat/ls
	
	  If an app is behaving oddly, and you believe
	  it is because of a config file, you should
	  check to see when that file was modified.

		stat /path/to/config/file

	  The `stat` command will give you the last
	  modified and last accessed times. If the
	  file seems to have changed later than you
	  think, it's possible something or someone
	  has changed it more recently.

	  See the information on the `find` utility
	  for ways to look for all files modified
	  at/since/before a certain time. 

   - gconf
   - The config file has changed but the app is ignoring it
	- is it the correct config file?

	  Often an application will look for config
	  files in several places. In some cases, 
	  some versions of the config file have
	  precedence over other versions. 

	  A common example is for an app to have
	  a default config file, a per system
	  config file, and a per user config file,
	  with the user and system versions overriding
	  the default one. For some apps, individual
	  config items have their own inheritance
	  rules.

	  So for example, if you're modifying a system
	  wide config file, make sure there isn't
	  a per user config file that overrides the
	  change.
	
	- is it a daemon?

	  daemon and server processes typically
	  only read their config files when they
	  start up.

	  Sometimes a restart of the process is
	  required. Sometimes it is possible to
	  send a "HUP" signal to an app to force
	  it to reload configs. To send a "HUP"
	  signal:

		kill -HUP $pid
	
	  Where $pid is the process id of the 
	  running process. 

	  Sometimes init scripts will have
          options for reloading config files.

	  Apache httpd's init script has
	  a reload option.

		service httpd reload

	- shell config?

	  Some processes, user shells in particular,
	  have fairly complicated rules about when
	  each of their config files is read.

	  See the "INVOCATION" section of the
          bash man page for an example of when
	  the various bash config files get loaded.


  - kernel issues
   - single user
   - init=/bin/bash
   - bootloader configs
   - log levels
 
  - stuff not writing to disk
     - out of space
	You run a command to write a file, or save a file from an
	app. When you go to look at the file, it's not there, or
	it's empty. Or the app complains that it is "unable to write to device".
	What's going on?

	More than likely, the system does not have any storage space
	for the file. The file system that the app is trying to write
	to is full, or nearly full.

	This case can cause a wide variety of problems. The
	easiest way to check to see if this is the case is
	the `df` command. See the df section in the tools
	section for more info on df. 

	One thing to keep in mind is to make sure the correct
	filesystem has space. Just because something in
	`df` shows free space doesn't mean the app
	can use it.

     - out of inodes

	`df -i` can catch this one as well. It's fairly
	uncommon these days, but it can still happen. 

     - file permissions

	Check the file permissions for the file, and
	directory the app is trying to write to. 

	You can use strace to see where it's writing to
	if nothing else tells you. 

     - ACL's
	
	If the system is using ACL's, you need to 
	verify the user/app is in the proper ACL's.

     - selinux

	selinux can control what app can write where
	and how. So verify the selinux perms are
	correct.

	 need more info on tracking down
	selinux issues 

     - quotas

	If the system has file system quotas enabled,
	it's possible the user is over quota. 

		`quota`
	
	That command will show the current quota
	usage, and indicate if quotas are in
	effect or not.

     - read-only mounts

	Network file systems in particular tend
	to mount shared partitions read-only. The
	mount option overrides any file permissions
	on the file system that is being shared.

     - read only media

	cd-roms are read-only media. The app isn't
	trying to write to it is it?

     - chattr/lsattr

	One feature of ext2/3 is the ability to
	`chattr` files. These are per file attributes
	beyond the standard unix permissions.

	See the chattr/lsattr section of the tools
	section for more details. 

	If a file has had the command `chattr +i` run
	on it, then the file is "immutable" and nothing
	can modify it or delete it, including the root
	user. The only way to change it is to run `chattr -i`
	on it. Then it can be treated as a normal file. 
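
	To check for this (the path is just an example):

		lsattr /etc/resolv.conf

	An "i" in the attribute column means the file is immutable
	and `chattr -i` is needed before it can be changed.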


  - files doing weird stuff
	The app is reading the right file. The file _looks_
	correct, but it is still behaving weirdly. A few
	things to look for.

	- hidden chars
	
	 Sometimes a file can get hidden characters in
	 it that give parsers headaches. This is increasingly
	 common as support for more character encodings
	 becomes common.

	 - dos style carriage returns
	 - embedded tabs
	 - high byte chars

	 One good approach is to open the file with
	 vi in bin mode:
		
		vi -b filename

	 Then put vi into 'list' mode. Do this
	 by hitting escape and entering ":set list".

	 This should show any non ascii chars, new
	 lines, tabs, etc.

	 the `od` utility can be useful for viewing
	 files "in the raw" as well.  some
	 useful od invocations
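
	 One that covers most of the cases above (the file name is a
	 placeholder):

		od -c suspect.conf | less

	 The -c flag prints every byte as a character or escape, so dos
	 carriage returns show up as \r, tabs as \t, and high byte chars
	 as octal escapes.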

    - ending new line

	Some apps are picky about having any
	extra new lines at the end of files. So
	something to look for.

    - trailing spaces
	
	A particularly hard to spot circumstance
	that can break some parsers: trailing
	spaces after a string. This can be
	particularly difficult to spot in
	cases where it's a space, then a
	new line.

	This seems to be particularly common
	for config options for usernames and
	passwords: "foobar" != "foobar "


  - env stuff
      The users "enviroment" can often cause problems
      for applications. Some typical cases and
      how to detect them. X DISPLAY settings,
      the PATH, HOME, TERM settings etc can
      cause issues.

    - things work as user/not root, vice versa

	There can be any number of reasons something
	works as root, but not as a user. Most
	of them are related to permissions (on
	files or devices).

	Another common cause is PATH. On Red Hat
	at least, users do not have /sbin:/usr/sbin
	in their PATH by default. So some scripts
	or commands can fail because they are not
	in the PATH. Even the PATH order
	being different between root/user can
	cause problems.

	 X forwarding crap

    - env

	The easiest way to see environment
	variables is just to run:

		env

    - what basic env stuff means
    - su/sudo issues
    - env -i to launch with clean env

	If an app seems to be having issues
	that are environment dependent, one
	thing that can be useful when
	troubleshooting is to launch it with `env -i`.
	Something like:

		env -i /bin/someapp

	`env -i` basically strips all environment
	variables, so the app launches with
	nothing set.

    - su -, etc 

	If `su` is being used to gain root,
	one thing to keep in mind is the difference
	between `su` and `su -`. The '-' tells su
	to start up a new login shell. In practice,
	this means `su -` essentially gets a shell
	with the same environment a normal root
	shell would have.

	A shell created with `su` still has the
	user's SHELL, PATH, USERNAME, HOME, and
	other variables from the user's shell.

	A shell created with `su -` has the
	same variables a root shell has.

	This can often cause weird behavior
	in apps that depend on the environment.
 
    - sudo -l

  - shell scripting

	Scripting in bash or sh is often
	the quickest and easiest way to solve
	a problem. It's also heavily used in
	the configuration and startup of Red Hat
	Linux systems. 

	Unfortunately, debugging shell scripts
	can be quite painful. Debugging shell
	scripts someone else wrote a decade
	ago is even worse.
	
   - echo

	The bash builtin "echo" is often the
	best debugging tool. A common trick
	is just to add "echo" to the begining
	of a line of code that you believe is
	doing something incorrect. 

	This will just print out the line, but
	after variable expansion. Particularly
	handy if the line in question is using
	lots of shell variables.
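
	For example (the variable names are hypothetical):

		# original line:
		#   rm -rf $BUILDDIR/$SUBDIR
		# debugging version:
		echo rm -rf $BUILDDIR/$SUBDIR

	If $BUILDDIR happens to be empty, the echo makes that obvious
	before anything gets deleted.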

   - sh -x 
	
	Bash includes some support for getting
	more verbose information out of
	scripts as they run. Invoking
	a shell script as follows:

		sh -x /path/to/someshell.sh

	That command will at least try
	to print out every line as it
	executes it.

   - trap 
   - bash debugger

	There is a bash debugger available
	at http://bashdb.sourceforge.net/.

	It's essentially a "gdb" style debugger
	but for bash. Including support for
	step debugging, breakpoints, and 
	watchpoints. 

  - DNS/name resolution
	
	Once a network is up and going, name resolution
	can continue to be a source of problems. Since
	so many applications expect reliable name resolution,
	lots of things can break in odd ways when it fails.
	
     - usage of dig

	`dig` is probably the most useful tool for
	tracking down DNS issues.
	
	insert useful dig examples
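
	A few to start with (the nameserver and the reverse address
	are only examples):

		dig www.redhat.com                      # normal lookup, shows which server answered
		dig @ns1.example.com www.redhat.com     # ask a specific nameserver directly
		dig -x 192.0.2.1                        # reverse lookup of an address

	Comparing the answer from a specific nameserver against the default
	resolver is a quick way to spot a stale or broken caching server.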
	   
     - /etc/hosts
	
	Check /etc/hosts for spurious entries. It's
	not uncommon for "temporary" /etc/hosts entries
	to become permanent, and when the host ip does
	change, things break. 

     - nscd

	nscd is a name service caching daemon. It's
	very useful when using name services info
	like hesiod and ldap. But it can also
	cache DNS as well.

	Most of the time, it just works. But
	it's been known to break in odd and
	mysterious ways before. So trying
	DNS with and without nscd running
	is a good idea. 
	

     - /etc/nsswitch.conf
     - splat names/host typos

	"*" matching on DNS servers is pretty
	common these days. It normally doesn't
	cause any problems, as much as it can
	make certain types of errors harder
	to track down.

	A typo in a hostname will get redirected
	to another server (typically a web server)
	instead of giving a name resolution error.

	Since the obvious "host not found" errors
	don't happen, these kinds of problems can be
	harder to track down when "wildcard" DNS
	is in use.

  - auth info
   - getent
   - ypwhich/match/cat
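    A quick hedged example of using getent (the user and host names
    are placeholders):

		getent passwd someuser
		getent hosts somehost

    getent resolves through the same nsswitch.conf setup the rest of
    the system uses, so it's a handy check of whether NIS/LDAP/DNS
    lookups are actually working.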
   
  - certificate/crypto issues
   - ssl CA certs
   - gpg keys/signatures
   - rpm gpg keys
   - ssltool
   - curl

  - Network File Systems
    - NFS causes weird issues
      - timestamps
      - perms/rootsquash/etc
      - weird inode caching
    - samba
	- it touches windows stuff, icky
   
  - Some app/apps is rapidly forking shortlived process
    - gah, what a PITA to troubleshoot
    - psacct?
    - sar -X?
    - watching pids grow?
    - dump-acct + parsing?

App specific
   - apache
     - scorecard stuff
     - module debugging
     - log files
     - init file "configtest"
     - -X debug mode

  - php
	- 

  - gtk apps
    - event debuging stuff?

  - X apps
    - nosync stuff
    - X log

  - ssh
    - debug flags
    - sshd -d -d 

  - pam/auth/nss
    - logging options?
    - getent

  - sendmail  

Credits

  Comments, suggestions, hints, ideas, criticisms, pointers,
  and other useful info from various folks including:

  Mihai Ibanescu
  Chip Turner
  Chris MacLeod
  Todd Warner
  Nicholas Hansen
  Sven Riedel
  Jacob Frelinger
  James Clark
  Brian Naylor
  Drew Puch