I have a program written in Ruby and C. The C portion is a shared library, which is an extension for the Ruby program. I want to profile the C shared library I wrote, using gprof. I compile the shared library like this:

gcc -I. -I/usr/lib/ruby/1.8/i486-linux -I/usr/lib/ruby/1.8/i486-linux -I. -D_FILE_OFFSET_BITS=64  -fPIC -fno-strict-aliasing -g -march=i686 -O2 -ggdb -pg -fPIC -c extension.c
gcc -shared -o extension.so extension.o -L. -L/usr/lib -L. -Wl,-Bsymbolic-functions -rdynamic -Wl,-export-dynamic  -lruby1.8  -lpthread -lrt -ldl -lcrypt -lm -lc

Then I execute the Ruby program, which loads this shared library, and I expect a gmon.out file to appear in the current directory, but for some reason gmon.out never gets created. How do I get gprof profiling data for the shared library?

I googled for this but couldn't find a satisfactory answer that actually worked.

P.S. - As a workaround I can keep a modified version of the extension that builds as a pure C program (instead of as a shared library) and profile that, but it becomes tedious to maintain two versions of the same C extension (there are a large number of differences between the two).

I also tried writing a C program that uses the shared library directly. I immediately get a page fault in one of the Ruby library functions that gets called during the initialization of the shared library. I think the library really expects to be loaded from a Ruby program, which may internally be doing some magic.

(gdb) bt
#0  0x0091556e in st_lookup () from /usr/lib/libruby1.8.so.1.8
#1  0x008e87c2 in rb_intern () from /usr/lib/libruby1.8.so.1.8
#2  0x008984a5 in rb_define_module () from /usr/lib/libruby1.8.so.1.8
#3  0x08048dd0 in Init_SimilarStr () at extension.c:542
#4  0x0804933e in main (argc=2, argv=0xbffff454) at extension.c:564
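
Presumably the "magic" is the interpreter initialization that ruby performs before it calls Init_SimilarStr(). A standalone driver would apparently have to do the same thing before touching the extension. A minimal sketch, assuming Ruby 1.8's embedding API (I have not verified this):

/* standalone.c -- sketch only: initialize the interpreter before calling the
 * extension's entry point, so rb_intern()/rb_define_module() have a live VM
 * behind them instead of crashing in st_lookup(). */
#include <ruby.h>

void Init_SimilarStr(void);   /* entry point defined in extension.c */

int main(void)
{
    ruby_init();              /* sets up the interpreter state, including the symbol table */
    ruby_init_loadpath();     /* optional: standard $LOAD_PATH setup */

    Init_SimilarStr();        /* the call that was faulting in the plain driver */

    /* ... call the extension's C functions here and profile as usual ... */
    return 0;
}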

Update: Never mind. I used #ifdef to compile out the Ruby portions of the extension and got a profile. Closing.

A: 

You can run the ruby interpreter itself through the profiler. If that is too much, write a small C program that loads the shared library and calls its exported functions. Then profile that C program. It saves you from maintaining two versions of the library.
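
Something along these lines, for example. This is a sketch only; compute_distance is a placeholder for whatever C function the extension actually exports, so the name and signature need to be adjusted.

/* driver.c -- loads the extension and calls one exported C function directly,
 * bypassing the Ruby interpreter entirely (placeholder names throughout). */
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *handle = dlopen("./extension.so", RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return EXIT_FAILURE;
    }

    /* look up a plain C function exported by the library (placeholder name) */
    int (*compute_distance)(const char *, const char *) =
        (int (*)(const char *, const char *)) dlsym(handle, "compute_distance");
    if (!compute_distance) {
        fprintf(stderr, "dlsym: %s\n", dlerror());
        return EXIT_FAILURE;
    }

    /* exercise the code paths that should show up in the profile */
    printf("%d\n", compute_distance("Rome", "Damascus"));

    dlclose(handle);
    return EXIT_SUCCESS;
}

Build the driver with -pg (for example gcc -pg -o driver driver.c -ldl), run it, and point gprof at the resulting gmon.out. Bear in mind that gprof's coverage of code that lives inside the shared object itself is limited.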

Vijay Mathew
Yeah, I guess I can profile the ruby interpreter itself - except maybe I'll have to go through a lot of cruft I am not interested in. I think it's possible to profile just the shared library itself, since I googled and found references suggesting that's possible, but I didn't find an authoritative link telling me exactly how to do it.
Sudhanshu
+1  A: 

I found oprofile to be a much better option than gprof for profiling in this situation; its reports are far more comprehensive. Using #ifndef PROFILE, I compiled out the Ruby portions of the C extension that were causing the seg-faults (not all of them were) and replaced them with non-Ruby code, wrote a main() routine within the extension itself to call its functions, and set up a makefile that compiles the extension as a plain C program with PROFILE defined. Then I installed oprofile on Ubuntu.
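
Roughly, the #ifndef PROFILE split inside extension.c looks like this. This is only an illustration: the names, signatures, and the rb_* glue shown are placeholders, not the actual code.

/* extension.c -- illustrative sketch of the #ifndef PROFILE split.
 * The functions being profiled (levenshtein(), corpora_pass1(), ...) are
 * defined elsewhere in the file, unchanged, and shared by both builds. */
int levenshtein(const char *a, const char *b);   /* real signature may differ */

#ifndef PROFILE
/* normal Ruby-extension build: only here does the code touch Ruby */
#include <ruby.h>

void Init_SimilarStr(void)
{
    VALUE mod = rb_define_module("SimilarStr");
    /* ... rb_define_module_function(mod, ...) glue goes here ... */
    (void)mod;   /* silence unused-variable warnings in this stripped-down sketch */
}
#else
/* profiling build (compiled with -DPROFILE): no Ruby at all, plus a main() driver */
#include <stdio.h>

int main(int argc, char **argv)
{
    int i;
    for (i = 2; i < argc; i++)
        printf("%s vs %s: %d\n", argv[1], argv[i], levenshtein(argv[1], argv[i]));
    return 0;
}
#endif

With that profiling build in place, I wrote this script to drive the run: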

#!/bin/bash
sudo opcontrol --reset                          # discard any previously collected samples
sudo opcontrol --start                          # start the oprofile daemon and begin sampling
./a.out Rome Damascus NewYork Delhi Bangalore   # run the workload being profiled
sudo opcontrol --shutdown                       # stop sampling and kill the daemon
opreport -lt1                                   # per-symbol report, hiding entries below 1%

Compiled my program, and executed the above script, which gives output like this from the "opreport" command:

...
...
Killing daemon.
warning: /no-vmlinux could not be found.
warning: [vdso] (tgid:10675 range:0x920000-0x921000) could not be found.
warning: [vdso] (tgid:1270 range:0xba1000-0xba2000) could not be found.
warning: [vdso] (tgid:1675 range:0x973000-0x974000) could not be found.
warning: [vdso] (tgid:1711 range:0x264000-0x265000) could not be found.
warning: [vdso] (tgid:1737 range:0x990000-0x991000) could not be found.
warning: [vdso] (tgid:2477 range:0xa53000-0xa54000) could not be found.
warning: [vdso] (tgid:5658 range:0x7ae000-0x7af000) could not be found.
CPU: Core Solo / Duo, speed 1000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        app name                 symbol name
12731    32.8949  a.out                    levenshtein
11958    30.8976  a.out                    corpora_pass2
5231     13.5161  no-vmlinux               /no-vmlinux
4021     10.3896  a.out                    corpora_pass1
1733      4.4778  libc-2.10.1.so           /lib/tls/i686/cmov/libc-2.10.1.so
542       1.4004  ld-2.10.1.so             /lib/ld-2.10.1.so
398       1.0284  a.out                    method_top_matches

There it is: the top consumer is the function levenshtein(). I followed this with another command, opannotate, to generate disassembled output annotated with the source code and the execution count/time of each line. The output looks like this (counts/times are on the left of each executed line):

> opannotate --source --assembly ./a.out > report.as.handcoded.1
> cat report.as.handcoded.1

...
...
...
           :         __asm__ (
 2  0.0069 : 804918a:       mov    -0x50(%ebp),%ecx
 4  0.0137 : 804918d:       mov    -0x54(%ebp),%ebx
           : 8049190:       mov    -0x4c(%ebp),%eax
12  0.0412 : 8049193:       cmp    %eax,%ecx
10  0.0344 : 8049195:       cmovbe %ecx,%eax
 8  0.0275 : 8049198:       cmp    %eax,%ebx
11  0.0378 : 804919a:       cmovbe %ebx,%eax
16  0.0550 : 804919d:       mov    %eax,-0x4c(%ebp)
           :                   "cmp     %0, %2\n\t"
           :                   "cmovbe  %2, %0\n\t"
           :                  : "+r"(a) :
           :                    "%r"(b), "r"(c)
           :                  );
           :          return a;
 ...
 ...
 ...
Sudhanshu
+1  A: 

You can do better than gprof. Consider stackshots. You can do it using pstack, lsstack (if you can get it), or by manually pausing under the debugger. Here's a short intro to the technique.

Mike Dunlavey
Mike, isn't the per-instruction count + time given by the opannotate program (see my own answer here) more or less the same technique that you describe there?
Sudhanshu
@Sudhanshu: I just looked at the oprofile doc, and I'm not sure, but it looks like it's close to what I think is needed. It does sample the call stack, but I can't tell for sure if it samples on wall-clock time. (Some profilers don't sample during system calls, and that's useless.) Then if the number it shows on a call-site (or an ordinary instruction) is the fraction of samples containing that call site, then it's doing the right thing. Another profiler that does this is RotateRight/Zoom.
Mike Dunlavey
... the other thing it needs to do is let you control when it samples and when not (like when it's waiting for user input).
Mike Dunlavey
... I don't think counts are useful. The times are useful if what they mean is fraction-of-samples. Then I want to see fractions of at least 0.1 or 0.05. In my experience, problems worth fixing tend to cost > 10%.
Mike Dunlavey
... I like to see the actual stack samples, rather than just summary numbers, because then I can form a verbal description of exactly what it's doing and why, and tell what fraction of the time it's doing it. I also like to include, if necessary, other state information as well (like what *was* the string being looked up). That tells me if it was really necessary.
Mike Dunlavey
+1 @Mike: Your suggested methodology is good, and I think oprofile more or less does that kind of sampling. Oprofile actually profiles the whole system (kernel + all user programs), although I can tell it to output a report for a specific executable (in this case, my own program). The "no-vmlinux" line in the report is the time spent inside the kernel; in my program, however, most of the time is spent in user space. And the annotated disassembled output gives me the time spent line by line. I can see where you're going with stack samples, and I agree it can give some good insights.
Sudhanshu
@Sudhanshu: Thanks for making me look at oprofile. So now that's two profilers I'm aware of that do "the right thing" (more or less). BTW, if you take your own samples in the debugger, you can trace through the Ruby interpreted stack (with some work) if you want to explore the possibility that some of the Ruby code could be optimized.
Mike Dunlavey