ansaurus

Question

What is this cProfile result telling me I need to fix?

Answer 1

A:

The entries relevant for possible optimization are those with high values for ncalls and tottime. bgchr:4(<module>) and <string>:1(<module>) probably refer to the execution of your module body and are not relevant here.

Obviously, your performance problem comes from string processing. This should perhaps be reduced. The hot spots are split, join and sys.stdout.write. bz2.decompress also seems to be costly.

I suggest you try the following:

Your main data seems to consist of tab separated CSV values. Try out, if CSV reader performs better.
sys.stdout is line buffered and flushed each time a newline is written. Consider writing to a file with a larger buffer size.
Instead of joining elements before writing them out, write them sequentially to the output file. You may also consider using CSV writer.
Instead of decompressing the data at once into a single string, use a BZ2File object and pass that to the CSV reader.

It seems that the loop body that actually uncompresses data is only invoked once. Perhaps you find a way to avoid the call dataHandle.read(size), which produces a huge string that is then decompressed, and to work with the file object directly.

Addendum: BZ2File is probably not applicable in your case, because it requires a filename argument. What you need is something like a file object view with integrated read limit, comparable to ZipExtFile but using BZ2Decompressor for decompression.

My main point here is that your code should be changed to perform a more iterative processing of your data instead of slurping it in as a whole and splitting it again afterwards.

Bernd Petersohn 2010-10-10 02:18:23

-1 See my answer.

John Machin 2010-10-10 04:14:54

Answer 2

+2 A:

This output is going to be more useful if your code is more modular as Lie Ryan has stated. However, a couple of things you can pick up from the output and just looking at the source code:

You're doing a lot of comparisons that aren't actually necessary in Python. For example, instead of:

if len(entryText) > 0:

You can just write:

if entryText:

An empty list evaluates to False in Python. Same is true for an empty string, which you also test for in your code, and changing it would also make the code a bit shorter and more readable, so instead of this:

   for line in metadataLines:      
        if line == '':
            break
        else:
            metadataList.append(line)

You can just do:

for line in metadataLines:
    if line:
       metadataList.append(line)

There are several other issues with this code in terms of both organization and performance. You assign variables multiple times to the same thing instead of just creating an object instance once and doing all accesses on the object, for example. Doing this would reduce the number of assignments, and also the number of global variables. I don't want to sound overly critical, but this code doesn't appear to be written with performance in mind.

jonesy 2010-10-10 02:21:23

I have replaced the source code and resulting cProfile analysis results. If you have a moment to take a look and offer suggestions, I would appreciate it. Thanks.

Alex Reynolds 2010-10-10 17:37:33

First, I think what's been suggested here by various people are generally good advice. Second, I'm still thinking that the rampant use of globals is going to have an impact on something that parses a large amount of data -- put the appropriate functions into a class, and that goes almost completely away. Also, wrapping things in a class makes it far easier to experiment with solutions like threading and multiprocessing (I favor multiprocessing over threading for this case, fwiw). Showing sample input would likely get you more feedback.

jonesy 2010-10-11 12:17:26

Answer 3

+4 A:

ncalls is relevant only to the extent that comparing the numbers against other counts such as number of chars/fields/lines in a file may highligh anomalies; tottime and cumtime is what really matters. cumtime is the time spent in the function/method including the time spent in the functions/methods that it calls; tottime is the time spent in the function/method excluding the time spent in the functions/methods that it calls.

I find it helpful to sort the stats on tottime and again on cumtime, not on name.

bgchar definitely refers to the execution of the script and is not irrelevant as it takes up 8.9 seconds out of 13.5; that 8.9 seconds does NOT include time in the functions/methods that it calls! Read carefully what @Lie Ryan says about modularising your script into functions, and implement his advice. Likewise what @jonesy says.

string is mentioned because you import string and use it in only one place: string.find(elements[0], 'p'). On another line in the output you'll notice that string.find was called only once, so it's not a performance problem in this run of this script. HOWEVER: You use str methods everywhere else. string functions are deprecated nowadays and are implemented by calling the corresponding str method. You would be better writing elements[0].find('p') == 0 for an exact but faster equivalent, and might like to use elements[0].startswith('p') which would save readers wondering whether that == 0 should actually be == -1.

The four methods mentioned by @Bernd Petersohn take up only 3.7 seconds out of a total execution time of 13.541 seconds. Before worrying too much about those, modularise your script into functions, run cProfile again, and sort the stats by tottime.

Update after question revised with changed script:

"""Question: What can I do about join, split and write operations to reduce the apparent impact they have on the performance of this script?""

Huh? Those 3 together take 2.6 seconds out of the total of 13.8. Your parseJarchLine function is taking 8.5 seconds (which doesn't include time taken by functions/methods that it calls. assert(8.5 > 2.6)

Bernd has already pointed you at what you might consider doing with those. You are needlessly splitting the line completely only to join it up again when writing it out. You need to inspect only the first element. Instead of elements = line.split('\t') do elements = line.split('\t', 1) and replace '\t'.join(elements[1:]) by elements[1].

Now let's dive into the body of parseJarchLine. The number of uses in the source and manner of the uses of the long built-in function are astonishing. Also astonishing is the fact that long is not mentioned in the cProfile output.

Why do you need long at all? Files over 2 Gb? OK, then you need to consider that since Python 2.2, int overflow causes promotion to long instead of raising an exception. You can take advantage of faster execution of int arithmetic. You also need to consider that doing long(x) when x is already demonstrably a long is a waste of resources.

Here is the parseJarchLine function with removing-waste changes marked [1] and changing-to-int changes marked [2]. Good idea: make changes in small steps, re-test, re-profile.

def parseJarchLine(chromosome, line):
    global pLength
    global lastEnd
    elements = line.split('\t')
    if len(elements) > 1:
        if lastEnd != "":
            start = long(lastEnd) + long(elements[0])
            # [1] start = lastEnd + long(elements[0])
            # [2] start = lastEnd + int(elements[0])
            lastEnd = long(start + pLength)
            # [1] lastEnd = start + pLength
            sys.stdout.write("%s\t%ld\t%ld\t%s\n" % (chromosome, start, lastEnd, '\t'.join(elements[1:])))
        else:
            lastEnd = long(elements[0]) + long(pLength)
            # [1] lastEnd = long(elements[0]) + pLength
            # [2] lastEnd = int(elements[0]) + pLength
            sys.stdout.write("%s\t%ld\t%ld\t%s\n" % (chromosome, long(elements[0]), lastEnd, '\t'.join(elements[1:])))
    else:
        if elements[0].startswith('p'):
            pLength = long(elements[0][1:])
            # [2] pLength = int(elements[0][1:])
        else:
            start = long(long(lastEnd) + long(elements[0]))
            # [1] start = lastEnd + long(elements[0])
            # [2] start = lastEnd + int(elements[0])
            lastEnd = long(start + pLength)
            # [1] lastEnd = start + pLength
            sys.stdout.write("%s\t%ld\t%ld\n" % (chromosome, start, lastEnd))               
    return

Update after question about sys.stdout.write

If the statement that you commented out was anything like the original one:

sys.stdout.write("%s\t%ld\t%ld\t%s\n" % (chromosome, start, lastEnd, '\t'.join(elements[1:])))

Then your question is ... interesting. Try this:

payload = "%s\t%ld\t%ld\t%s\n" % (chromosome, start, lastEnd, '\t'.join(elements[1:]))
sys.stdout.write(payload)

Now comment out the sys.stdout.write statement ...

By the way, someone mentioned in a comment about breaking this into more than one write ... have you considered this? How many bytes on average in elements[1:] ? In chromosome?

=== change of topic: It worries me that you initialise lastEnd to "" rather than to zero, and that nobody has commented on it. Any way, you should fix this, which allows a rather drastic simplification plus adding in others' suggestions:

def parseJarchLine(chromosome, line):
    global pLength
    global lastEnd
    elements = line.split('\t', 1)
    if elements[0][0] == 'p':
        pLength = int(elements[0][1:])
        return
    start = lastEnd + int(elements[0])
    lastEnd = start + pLength
    sys.stdout.write("%s\t%ld\t%ld" % (chromosome, start, lastEnd))
    if elements[1:]:
        sys.stdout.write(elements[1])
    sys.stdout.write(\n)

Now I'm similarly worried about the two global variables lastEnd and pLength -- the parseJarchLine function is now so small that it can be folded back into the body of its sole caller, extractData, which saves two global variables, and a gazillion function calls. You could also save a gazillion lookups of sys.stdout.write by putting write = sys.stdout.write once up the front of extractData and using that instead.

BTW, the script tests for Python 2.5 or better; have you tried profiling on 2.5 and 2.6?

John Machin 2010-10-10 04:13:46

I agree with the interpretation of numbers. But I also looked at the code: the main work is obviously done in the last section marked "Extract data". What I see there is a lot of string creations of what must be a massive amount of data. And this is the point which needs optimization in my eyes. You will probably not get much more informative numbers if you put that code into a function. And yes, personally I would have structured this code differently.

Bernd Petersohn 2010-10-10 10:09:43

I have replaced the source code and resulting cProfile analysis results. If you have a moment to take a look and offer suggestions, I would appreciate it. Thanks.

Alex Reynolds 2010-10-10 17:37:08

Thank you for your helpful advice. I'm not a Python expert and would have had no idea that `int` types are promoted to `long` automatically, based on work in other languages. However, the changes you suggested seem only to shave off 0.8 sec. My script is still taking about as twice as long as a `csh/awk` solution (one that doesn't allow use of `seek`-based random access, like Python, and should be slower). Unless there are other Python-language-specific optimizations and tricks (which still allow writing to standard output) I think I may have to look into a C-based solution at this point.

Alex Reynolds 2010-10-10 22:39:41

I started out running this under Python 2.5.2 and it was just as slow — I went to Python 2.7 to try to get any speed enhancements. I have been writing a C-based equivalent of this script the last couple days, and the result is lightning fast, in comparison, albeit with some bzip difficulties I'm still working out. I may post code later for comparison purposes.

Alex Reynolds 2010-10-12 06:33:11

@Alex Reynolds: Did you test the last suggestion of John Machin? I think one main problem of your script is that it maintains an unnecessarily large memory footprint. In `extractData()`, you first read the uncompressed data block for one chromosome. Then a second string of the decompressed data is created. It contains about 1.75 millions of lines. It is easy to reach 1 GB here. Then you create a list of those 1.75 million lines. It would be better to wrap the uncompressed data in StringIO and iterate over its lines. You should also release the compressed data early using `del`.

Bernd Petersohn 2010-10-12 14:20:20

Answer 4

A:

Good grief. Look, there is a basic thing to understand.

In a given execution of a program, every instruction (line of code, statement) is responsible for a certain percent of the total time. If it were not there, that percent of time would not be spent. During that time, it is on the stack.

So, if there is a line of code or a small set of lines (you can't edit anything else) that are worth optimizing, then they are on the stack for at least as much time as changing them could save. Those are what to look for.

All the rest is "beating around the bush".

Here's another way to say it.

Mike Dunlavey 2010-10-10 12:35:33

@downvoter: Plz explain.

Mike Dunlavey 2010-10-10 13:41:21

I think you're confusing your language here. Optimizing in a high-level language like Python has an entirely different nature than optimizing in a compiled language like C/C++.

Lie Ryan 2010-10-10 14:43:13

@Lie: Python has functions, doesn't it? You call them, don't you? and they return? Then there has to be something that acts like a stack, no? If the program runs for 10 seconds, and if a particular call site is on the stack for 3 of those seconds, whether in one call or spread over 1000 calls, then if that call were not there, those 3 seconds would not be spent, no? What's compiling, or optimizing, got to do with it?

Mike Dunlavey 2010-10-10 14:48:13

@Mike Dunlavey: I don't get what you meant. In all languages that I know, stack does not take time, it takes memory. I'm not sure what you meant by functions taking time in the stack. Unless you meant you somehow caused a data in the stack to be executed (i.e. in the fashion of self-modifying code), which is just not possible in very high-level language (a much safer, and more useful alternative is exec/eval, but it's a different beast).

Lie Ryan 2010-10-10 15:02:38

@Lie: It's very simple. The purpose of the stack is to allow subroutines to return, so as to allow the caller to finish whatever it was doing. The time that a routine takes is the time it spends locally, plus the time taken by any subroutines that it calls. While it is in those calls, the location where it came from is on the stack. So if you can see what's on the stack over time (such as by sampling it), you can see which lines of code are calling for the most time to be spent. Maybe this is something you might need to sleep on. If so, you're not alone.

Mike Dunlavey 2010-10-10 15:10:06

@Mike Dunlavey: I see what you meant now, and as I said before, you're confusing your language (though now, I meant terminologies, instead of programming language). First, what you're referring solely by "stack" actually means "function call stack". Second, what you meant by "stack taking time" is really "counting how likely a particular functions is part of backtrace". While I agree that's a reasonable method to profile a program, that's not how Python's cProfile works, and thus my original sense of "you're confusing your (programming) language" also holds.

Lie Ryan 2010-10-10 15:22:48

@Lie: Ok, now I think we're on the same page. That's why profilers like `cProfile`, based on the same concepts as `gprof` are not very effective. Things that sample the function call stack, at random wall-clock times, and report percent at line-of-code level (such as Zoom and LTProf, or simple manual sampling) are so phenomenally more effective that it's hard to understand why the old way is even used. People say "well that's all I had", which is small comfort when they are nearly useless.

Mike Dunlavey 2010-10-10 15:30:15

@Mike Duvanley: I fail to see your point here. cProfile is still extremely useful as a function level profiler, sure it has flaws, but all measurement tools do. It's a programmer's job to interpret the data and determine whether the result of your measurement is useful or whether you should look at another measurement method. cProfile just happens to be in the standard library, so it's convenient to use. I fail to see how you make conclusion from "function-level profiling are not accurate" to "its result is totally useless".

Lie Ryan 2010-10-10 15:56:04

@Lie: It's a different paradigm. It's not about accuracy and measuring, interpreting data - in other words, being a sleuth. The stack-sampling method simply points you at the problem. There's no comparison. mmyers expressed it pretty well: http://stackoverflow.com/questions/2624667/whats-a-very-easy-c-profiler-vc/2624725#2624725 It's easy to debate this issue. It's even easier to try it.

Mike Dunlavey 2010-10-10 16:20:37

@Mike Dunlavey: I'm not debating about the efficacy of stack-sampling or function-instrumentation profiling. But I fail to see why you think they are relevant to the OP's case. You might want to suggest that it's easier to find bottlenecks using stack-sampling, if your code averages 200 lines/functions, that most people cannot interpret gprof results correctly; but they're irrelevant to the usefulness of function-level measurement.

Lie Ryan 2010-10-10 16:42:34

I have replaced the source code and resulting cProfile analysis results. If you have a moment to take a look and offer suggestions, I would appreciate it. Thanks.

Alex Reynolds 2010-10-10 17:38:11

@Alex: I'll try and get a Python, but, sure, console output might be a dominant fraction of cost depending on how much of it you're doing. Plenty of times I've found that between the formatting of numbers to strings, the frequent flushing of output buffers, and the presumed cost of painting characters to a screen window and scrolling it, yes, console I/O can be a "bottleneck".

Mike Dunlavey 2010-10-11 02:47:04

@Lie: I am guilty of broadly interpreting questions. Often when someone asks a question, they are really asking a deeper question that they don't know how to frame. I haven't met anyone who doesn't want to know how to make their software take less time, and a *lot* less would be even better. So yes, I often take liberties with the literal question. Most of the time, people would rather have their problem addressed than their question taken too literally.

Mike Dunlavey 2010-10-11 02:52:36

@Alex: OK, I just installed Python 2.5 (Windows) and tried to run your script. I got an indentation error in runpy.py at line 85, so I've just got learning pains, I'm sure. However, I did try an infinite loop and verified that Ctrl-C interrupts it nicely and gives a stack traceback, which is all that's needed. Anything it's doing for x% of time, like printing or whatever, that's the chance you'll catch it in the act on each interrupt, and you'll see exactly why it's doing it. That's all I ever do. I have no patience for gprof-style output.

Mike Dunlavey 2010-10-11 13:19:52

@Alex: I got further with your script just by removing the 4 leading spaces on each line (duh). Then when I ran it, it wanted some kind of chromosome file, so I still couldn't stackshot it. (If you want to email me that, you can, from my home page.) I just did the Ctrl-C technique on a dumb old fibonnacci example on http://www.doughellmann.com/PyMOTW/profile/ and it said what was expected, in 1 sample.

Mike Dunlavey 2010-10-11 19:47:08

ansaurus

tags:

views:

answers:

What is this cProfile result telling me I need to fix?

related questions