Troubleshooting obscure "that's not supposed to happen" problems is often the most difficult kind of problem solving. When you have exhausted all the obvious options, can't think of anything else to ask, and don't know of any other way to get information, there may still be a few options left. Debugging tools (really meant for development) and system monitoring tools can provide a lot of insight into what is causing a problem. Some of these tools include strace, ltrace, netstat, vmstat, gdb, and others.

Situation 1.

Program Foo exits with "foo: file not readable"

You have checked every file imaginable, read the docs, checked permissions, checked disk space, etc., but you still get this error.

There are a couple of ways to proceed. One is to check the source code for that error message and see what generates it. That is not always the quickest or easiest thing to do, though. A simpler and faster alternative is to try running `strace` on the application.

`strace` is a utility that shows you all the system calls a program makes, which typically include opening and closing files, network access, checking file permissions, etc. This is very useful info in situations like this one.

To invoke `strace`, you may need to be root (for example, to trace another user's processes); then just run the program as:

    strace programname

You will get a _pile_ of debugging info on stderr detailing every system call the program makes. For the example above of a program being unable to open a file, you can start by looking for the error message in the strace output. It will typically be in a "write" syscall, if you want to narrow your search down.

Keep in mind that strace output goes to stderr, so you can use a command like:

    strace programname 2>&1 | less

and then search for "write" or the text of the error message, etc. Once you find the error message, the lines shortly preceding it will typically contain the call that actually caused the error.
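The search described above can also be done with grep instead of paging through the output by hand. The following is only a rough sketch: the two trace lines are hand-written in the style of strace output (they are not captured from a real run), and the variable names are made up for illustration.

```shell
# Hand-written sample in the style of strace output (NOT from a real run).
trace=$(mktemp)
cat > "$trace" <<'EOF'
open("/foo/bar", O_RDONLY) = -1 ENOENT (No such file or directory)
write(2, "foo: file not readable\n", 23) = 23
EOF

# Find the line number of the write() that prints the error message.
msg_line=$(grep -n 'file not readable' "$trace" | head -n 1 | cut -d: -f1)

# List the calls that failed (returned -1) at or before that line.
failed=$(head -n "$msg_line" "$trace" | grep -e '= -1 ')
printf '%s\n' "$failed"

rm -f "$trace"
```

In practice the file would come from a real run, for instance by saving the trace with `strace -o trace.txt programname` and pointing the same grep commands at trace.txt.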
More info on what the output of `strace` means.

Being somewhat familiar with the common Linux syscalls is helpful in understanding strace output, but most of the common ones are simple enough to figure out from context.

A line in strace output is essentially the system call name, the arguments to the call in parentheses (sometimes truncated...), and then the return value. The return value for an error is typically -1, but this varies. For more information about the return value of a particular system call, see `man 2 syscallname`. The return value is usually documented in the "RETURN VALUE" section.

Another thing to note about strace is that it often shows the "errno" status. If you're not familiar with Unix system programming, errno is a global variable that is set to a specific value when a system call fails; the value indicates what kind of error occurred. More info on this can be found in `man errno`. Typically, strace will show a brief description for any errno value it gets, e.g.:

    open("/foo/bar", O_RDONLY) = -1 ENOENT (No such file or directory)

Other Useful strace options

In many cases, you want to see what an already running process is doing. For example, some process appears to have stalled, or perhaps is taking 100% CPU time. To attach to an already running process, you need to know its pid (just use ps, top, etc.), and then issue the command:

    strace -p pid

For hung processes, this will typically show something like:

    select(blah blah blah

or

    wait(blah blah blah

These are just common syscalls that block, so they are typical of a hung process. select monitors file descriptors or sockets and waits for input, and wait, well, waits for a process to finish. A select will hang if it doesn't get any input on the files it is monitoring. A wait will hang if the process it is waiting for hasn't exited.
More info is in `man 2 select` and `man 2 wait`.

    strace -f

This option tells strace to follow fork()s in the program. fork() is the system call a program uses to create a child process, which is how it starts another program (or another copy of itself). By default, strace doesn't follow these, but for many programs you need this option to get any useful output.

A good example is shell scripts. By default, strace will only show the system calls made by the shell itself, which is often not the info you need. `strace -f` will show all the syscalls for the shell and for any other programs it calls.

    strace -s X

The -s option tells strace to show the first X characters of strings. The default is 32 characters, which sometimes isn't enough. Raising it increases the info available to the user.

    strace -e something_or_another

For big programs that generate large amounts of strace spewage, the -e option is often useful, especially if you have a good idea of what you are looking for. Let's say, for example, netscape is dying because it can't open some file. Running `strace` on netscape is going to take a long time and generate megs and megs of output. You can use the -e trace= option to narrow down the output of strace. To see just the syscalls that do something with a filename, try:

    strace -f -e trace=file netscape

Another useful option is "-e trace=network", which of course traces network-related calls.

Those are the basics of strace.

Ltrace

ltrace is very similar to strace, but instead of tracing system calls, it traces library calls. Most of the commentary about strace applies to ltrace as well, and many of their command line options are the same.

Note on the difference between system calls and library calls.

There is a distinction between a system call and a call to a library function. Sometimes the line between the two is blurry, but the basic difference is that system calls are essentially communicating with the kernel, and library calls are just running more userland code.
System calls are usually required for things like I/O, process control, memory management, and other "kernel" things.

Library calls are, in bulk, generally calls to the standard C library (glibc), but can of course be calls to any library, say Gtk, libjpeg, libnss, etc. Luckily most glibc functions are well documented and have either man or info pages. Documentation for other libraries varies greatly.

An interesting note about ltrace is that it can also display system calls if the "-S" option is used. This is often useful for getting a quick overview of what a program is doing. strace, however, allows more fine-grained control over system calls.

GDB

gdb isn't as useful as strace/ltrace for troubleshooting/support types of issues, but occasionally it comes in handy. For troubleshooting, it's useful for determining which program produced a core file (`file core` will typically show you this information too). But gdb can also show you "where" the program crashed. Once you determine the name of the app that caused the failure, you can start gdb with:

    gdb filename core

then at the prompt type `where`.

The unfortunate thing is that all the binaries we ship are stripped of debugging symbols to make them smaller, so this often returns less than useful information. But it can sometimes give you a few useful bits of info if you're really hard up.

Stracing a shell script

To see what a shell script is doing, try:

    sh -x shell_script.sh

It will produce output of the variety:

    + echo foo
    + echo blah

etc. Occasionally handy.
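To try `sh -x` without a real script at hand, something like the following should do. It is only a sketch: the two-line script is a throwaway written for illustration, and the variable names are arbitrary. The one thing worth noticing is that the "+" trace lines go to stderr, just like strace output, so they have to be redirected to be captured.

```shell
# Write a tiny throwaway script (hypothetical contents, for illustration only).
script=$(mktemp)
cat > "$script" <<'EOF'
echo "foo"
echo "blah"
EOF

# -x prints each command to stderr, prefixed with "+", before running it.
# Keep the trace (stderr) and discard the script's normal output (stdout).
xtrace=$(sh -x "$script" 2>&1 >/dev/null)
printf '%s\n' "$xtrace"

rm -f "$script"
```

The same redirection works with a pager, e.g. `sh -x shell_script.sh 2>&1 | less`, when the script is long.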