views:

4000

answers:

10

A collegue told me once that the last option when everything has failed to debug on linux was to use strace.

I tried to learn the science there is behind this strange tool but I am not a system admin guru and I Don't really get results.

So what is it exactly and what does it do ?

How to use it ? In which case ?

How to understand the output and process it ?

In brief, how does this stuff work ? In simple words.

+7  A: 

strace lists all system calls done by the process it's applied to. If you don't know what system calls mean, you won't be able to get much mileage from it.

Nevertheless, if your problem involves files or paths or environment values, running strace on the problematic program and redirecting the output to a file and then grepping that file for your path/file/env string may help you see what your program is actually attempting to do, as distinct from what you expected it to.

Asaf Bartov
And for non-trivial programs this is often like drinking from a fire-hose, so you have your work cut out for you going through the results...
dmckee
+1  A: 

The man page for strace is available here, for more information: http://linux.die.net/man/1/strace

warren
+5  A: 

Strace Overview
strace can be seen as a light weight debugger. It allows a programmer / user to quickly find out how a program is interacting with the OS. It does this by monitoring system calls and signals.

Uses
Good for when you don't have source code or don't want to be bothered to really go through it.
Also, useful for your own code if you don't feel like opening up GDB, but are just interested in understanding external interaction.

A good little introduction
I ran into this into to strace use just the other day: strace hello world

John Mulder
+1  A: 

In simple words, strace traces all system calls issued by a program along with their return codes. Think things such as file/socket operations and a lot more obscure ones.

It is most useful if you have some working knowledge of C since here system calls would more accurately stand for standard C library calls.

Let's say your program is /usr/local/bin/cough. Simply use:

strace /usr/local/bin/cough <any required argument for cough here>

All strace output will go to stderr (beware, the sheer volume of it often asks for a redirection to a file). In the simplest cases, your program will abort with an error and you'll be able to see what where its last interactions with the OS in strace output.

More information should be available with:

man strace
bltxd
A: 

Strace is a tool that tells you how your application interacts with your operating system.

It does this by telling you what OS system calls your application uses and with what parameters it calls them.

So for instance you see what files your program tries to open, and weather the call succeeds.

You can debug all sorts of problems with this tool. For instance if application says that it cannot find library that you know you have installed you strace would tell you where the application is looking for that file.

And that is just a tip of the iceberg.

Luka Marinko
+1  A: 

After you've finished learning strace, have a look at ltrace. It is similar except it applies to library calls, not system calls.

camh
+2  A: 

Strace stands out as a tool for investigating production systems where you can't afford to run these programs under a debugger. In particular, we have used strace in the following two situations:

  • Program foo seems to be in deadlock and has become unresponsive. This could be a target for gdb; however, we haven't always had the source code or sometimes were dealing with scripted languages that weren't straight-forward to run under a debugger. In this case, you run strace on an already running program and you will get the list of system calls being made. This is particularly useful if you are investigating a client/server application or an application that interacts with a database
  • Investigating why a program is slow. In particular, we had just moved to a new distributed file system and the new throughput of the system was very slow. You can specify strace with the '-T' option which will tell you how much time was spent in each system call. This helped to determine why the file system was causing things to slow down.

For an example of analyzing using strace see my answer to this question.

terson
A: 

I have successfully used strace for determining that my application was trying to open a file but could not open it. fopen was the call.

Basically get a reference for the system calls and it will aid in your debugging.

Brian G
+1  A: 

strace is a good tool for learning how your program makes various system calls (requests to the kernel) and also reports the ones that have failed along with the error value associated with that failure. Not all failures are bugs. For example, a code that is trying to search for a file may get a ENOENT (No such file or directory) error but that may be an acceptable scenario in the logic of the code.

One good use case of using strace is to debug race conditions during temporary file creation. For example a program that may be creating files by appending the process ID (PID) to some predecided string may face problems in multi-threaded scenarios. [A PID+TID (process id + thread id) or a better system call such as mkstemp will fix this].

It is also good for debugging crashes. You may find this (my) article on strace and debugging crashes useful.

tc
+1  A: 

Strace can be used as a debugging tool, or as a primitive profiler.

As a debugger, you can see how given system calls were called, executed and what they return. This is very important, as it allows you to see not only that a program failed, but WHY a program failed. Usually it's just a result of lousy coding not catching all the possible outcomes of a program. Other times it's just hardcoded paths to files. Without strace you get to guess what went wrong where and how. With strace you get a breakdown of a syscall, usually just looking at a return value tells you a lot.

Profiling is another use. You can use it to time execution of each syscalls individually, or as an aggregate. While this might not be enough to fix your problems, it will at least greatly narrow down the list of potential suspects. If you see a lot of fopen/close pairs on a single file, you probably unnecessairly open and close files every execution of a loop, instead of opening and closing it outside of a loop.

Ltrace is strace's close cousin, also very useful. You must learn to differenciate where your bottleneck is. If a total execution is 8 seconds, and you spend only 0.05secs on system calls, then stracing the program is not going to do you much good, the problem is in your code, which is usually a logic problem, or the program actually needs to take that long to run.

The biggest problem with strace/ltrace is reading their output. If you don't know how the calls are made, or at least the names of syscalls/functions, it's going to be difficult to decypher the meaning. Knowing what the functions return can also be very beneficial, especially for different error codes. While it's a pain to decypher, they sometimes really return a pearl of knowledge; once I saw a situation where I ran out of inodes, but not out of free space, thus all the usual utilities didn't give me any warning, I just couldn't make a new file. Reading the error code from strace's output pointed me in the right direction.

Marcin