Say I have something like the following:

dest = "\n".join( [line for line in src.split("\n") if line[:1]!="#"] )

(i.e. strip any lines starting with # from the multi-line string src)

src is very large, so I'm assuming .split() will create a large intermediate list. I can change the list comprehension to a generator expression, but is there some kind of "xsplit" I can use to only work on one line at a time? Is my assumption correct? What's the most (memory) efficient way to handle this?

Clarification: This arose due to my code running out of memory. I know there are ways to entirely rewrite my code to work around that, but the question is about Python: Is there a version of split() (or an equivalent idiom) that behaves like a generator and hence doesn't make an additional working copy of src?

+1  A: 

The problem is that strings are immutable in Python, so it's going to be very difficult to do anything at all without intermediate storage.

Andrew Jaffe
Understood. What I want to do is minimize the intermediate storage used. There's src, then a copy of that made by split, then a copy in the list comprehension, then dest. Twice as many copies as theoretically required.
Tom
I expect the generator will save you a lot for very little effort. Anything beyond is possibly premature optimization?
Andrew Jaffe
Unfortunately, no, the process is running out of memory... The generator gets it down to 3 copies though, as does StringIO, but the generator is definitely less effort which is better.
Tom
@Tom: If you are running out of memory perhaps you shouldn't be doing this operation in memory in the first place? You could instead store the data on the disk and read and write it in a streaming fashion - read a line, write a line, read a line, write a line, etc... Then you only need a few KB of memory for the entire operation. But you should measure the performance otherwise you are just guessing and guessing the performance of even a simple program is very, very difficult.
Mark Byers
@Mark, you're probably right, but I'm hoping to avoid rewriting this bit of code as much as possible. Halving the memory use is probably good enough, 33% less may have to do.
Tom
@Tom: "Halving the memory use is probably good enough". Would buying twice as much memory be an option?
Mark Byers
+4  A: 

In your existing code you can change the list to a generator expression:

dest = "\n".join(line for line in src.split("\n") if line[:1]!="#")

This very small change avoids the construction of one of the two temporary lists in your code, and requires no effort on your part.

A completely different approach that avoids the temporary construction of both lists is to use a regular expression:

import re
regex = re.compile('^#.*\n?', re.M)
dest = regex.sub('', src)

This will not only avoid creating temporary lists, it will also avoid creating temporary strings for each line in the input. Here are some performance measurements of the proposed solutions:

init = r'''
import re, StringIO
regex = re.compile('^#.*\n?', re.M)
src = ''.join('foo bar baz\n' for _ in range(100000))
'''

method1 = r'"\n".join([line for line in src.split("\n") if line[:1] != "#"])'
method2 = r'"\n".join(line for line in src.split("\n") if line[:1] != "#")'
method3 = 'regex.sub("", src)'
method4 = '''
buffer = StringIO.StringIO(src)
dest = "".join(line for line in buffer if line[:1] != "#")
'''

import timeit

for method in [method1, method2, method3, method4]:
    print timeit.timeit(method, init, number = 100)

Results:

 9.38s   # Split then join with temporary list
 9.92s   # Split then join with generator
 8.60s   # Regular expression
64.56s   # StringIO

As you can see the regular expression is the fastest method.

From your comments I can see that you are not actually interested in avoiding creating temporary objects. What you really want is to reduce the memory requirements for your program. Temporary objects don't necessarily affect the memory consumption of your program as Python is good about clearing up memory quickly. The problem comes from having objects that persist in memory longer than they need to, and all these methods have this problem.

If you are still running out of memory then I'd suggest that you shouldn't be doing this operation entirely in memory. Instead store the input and output in files on the disk and read from them in a streaming fashion. This means that you read one line from the input, write a line to the output, read a line, write a line, etc. This will create lots of temporary strings but even so it will require almost no memory because you only need to handle the strings one at a time.
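A minimal sketch of this streaming approach (the file names and sample content are placeholders for illustration; a temporary directory stands in for real input and output files):

```python
# Filter comment lines from one file into another, holding only
# one line in memory at a time.
import tempfile, os

tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, "input.txt")
dst_path = os.path.join(tmpdir, "output.txt")

# Create a small sample input file for the demonstration.
with open(src_path, "w") as f:
    f.write("keep me\n# drop me\nkeep me too\n")

with open(src_path) as infile, open(dst_path, "w") as outfile:
    for line in infile:          # reads one line at a time
        if not line.startswith("#"):
            outfile.write(line)  # writes it straight back out

with open(dst_path) as f:
    result = f.read()
# result == "keep me\nkeep me too\n"
```

Peak memory here is bounded by the longest line, not the size of the input.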

Mark Byers
I'm after memory efficiency, not speed, avoiding temporary copies of the data is the goal. The generator expression helps a little, but there are still 3 copies created (by my estimate). Do you know how regex.sub() measures up in this regard?
Tom
@Tom: Also, what exactly is the problem or the error you are getting with your current code? If you get an exception, can you post the traceback in your question? If your code just behaves undesirably, can you describe in more detail which specific undesirable behaviour you are seeing and why it is a problem for you?
Mark Byers
I get a `MemoryError` exception on a line similar to the one posted. There's no other pertinent information in the traceback.
Tom
Nice, I was about to write "try re.sub", just with a different regex: `r'(#[^\n]*)(?:\n|$)'`. REs *should* create the output string in one go (instead of split *and* join).
THC4k
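For what it's worth, THC4k's suggested pattern can be sanity-checked on a small sample. Note that it has no `^` anchor, so unlike the OP's code it would also strip a `#` comment that starts mid-line:

```python
import re

# Sample input for illustration, using THC4k's suggested pattern.
src = "hello\n#foo\n#bar\n#baz\nworld\n"
dest = re.sub(r'(#[^\n]*)(?:\n|$)', '', src)
# dest == "hello\nworld\n"
```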
+5  A: 
from StringIO import StringIO

buffer = StringIO(src)
dest = "".join(line for line in buffer if line[:1]!="#")

Of course, this really makes the most sense if you use StringIO throughout. It works mostly the same as files. You can seek, read, write, iterate (as shown), etc.
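A small demonstration of the file-like behaviour, written against Python 3's `io.StringIO` (the Python 2 `StringIO` module used in the answer works the same way for these operations):

```python
from io import StringIO

src = "hello\n#foo\nworld\n"
buffer = StringIO(src)

# Iterating a StringIO yields one line at a time, like a file.
dest = "".join(line for line in buffer if line[:1] != "#")
# dest == "hello\nworld\n"

buffer.seek(0)             # rewind, just like a real file
first = buffer.readline()  # first == "hello\n"
```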

Matthew Flaschen
That's pretty good. Won't work for more generic calls to split() though will it? Using a StringIO instead of join is an option. Does StringIO make a copy of the buffer or is it referenced?
Tom
No, I don't know any split that returns an iterator. I think StringIO immediately makes a copy, since strings are immutable. It could be copy-on-write, but I don't think so.
Matthew Flaschen
I would avoid using `str` as a variable name.
Mark Byers
Thanks, @Mark. I've changed it to `src`.
Matthew Flaschen
+3  A: 

Here's a way to do a general type of split using itertools:

>>> import itertools as it
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>> line_gen = (''.join(j) for i,j in it.groupby(src, "\n".__ne__) if i)
>>> '\n'.join(s for s in line_gen if s[0]!="#")
'hello\nworld'

groupby treats each char in src separately, so the performance probably isn't stellar, but it does avoid creating any intermediate huge data structures.

Probably better to spend a few lines and make a generator:

>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>>
>>> def isplit(s, t): # iterator to split string s at character t
...     i=j=0
...     while True:
...         try:
...             j = s.index(t, i)
...         except ValueError:
...             if i<len(s):
...                 yield s[i:]
...             raise StopIteration
...         yield s[i:j]
...         i = j+1
...
>>> '\n'.join(x for x in isplit(src, '\n') if x[0]!='#')
'hello\nworld'

re has a function called finditer that could be used for this purpose too:

>>> import re
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>> line_gen = (m.group(1) for m in re.finditer("(.*?)(\n|$)",src))
>>> '\n'.join(s for s in line_gen if not s.startswith("#"))
'hello\nworld'

Comparing the performance is an exercise for the OP to try on the real data.

gnibbler
Bang on! Thanks, re.finditer() is the kind of thing I was thinking of. Your isplit() function is good too, but I was hoping it would be in the standard library and hopefully have a C implementation or something.
Tom
-1 isplit fails if `not s.endswith(t)`
John Machin
@John, is that ok now?
gnibbler
@gnibbler: Yes and no. Yes: behaviour suits the OP's requirements. No: it doesn't mimic the behaviour of `str.split` by producing a final '' when `s.endswith(t)` -- it behaves more like str.splitlines(False).
John Machin
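The difference being discussed is easy to check: with a trailing separator, `str.split` produces a final empty string, while `str.splitlines` does not.

```python
# With a trailing separator, split emits a final empty string;
# splitlines does not.
s = "hello\nworld\n"
parts_split = s.split("\n")        # ['hello', 'world', '']
parts_splitlines = s.splitlines()  # ['hello', 'world']
```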
+2  A: 

If I understand your question about "more generic calls to split()" correctly, you could use re.finditer, like so:

import re

output = ""

for match in re.finditer("^.*\n", input, re.M):
    line = match.group(0).strip()
    if line.startswith("#"):
        continue
    output += line + "\n"

Here you can replace the regular expression with something more sophisticated. Note that `^.*\n` assumes the input ends with a newline; a final unterminated line would be dropped.
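As a small variation on the above (a sketch with sample data), collecting the kept lines in a list and joining once avoids the quadratic cost of growing the output string with `+=` on every iteration:

```python
import re

src = "hello\n#foo\nworld\n"  # sample input for illustration

# Keep non-comment lines; join once at the end instead of
# repeatedly concatenating onto a growing output string.
kept = [m.group(0) for m in re.finditer("^.*\n", src, re.M)
        if not m.group(0).startswith("#")]
output = "".join(kept)
# output == "hello\nworld\n"
```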

loevborg