I need to process files containing data segments separated by a blank line, for example:

93.18 15.21 36.69 33.85 16.41 16.81 29.17 
21.69 23.71 26.38 63.70 66.69 0.89 39.91 
86.55 56.34 57.80 98.38 0.24 17.19 75.46 
[...]
1.30 73.02 56.79 39.28 96.39 18.77 55.03

99.95 28.88 90.90 26.70 62.37 86.58 65.05 
25.16 32.61 17.47 4.23 34.82 26.63 57.24 
36.72 83.30 97.29 73.31 31.79 80.03 25.71 
[...]
2.74 75.92 40.19 54.57 87.41 75.59 22.79

.
.
.

For this I am using the following function. Each call returns the data I need, but I need to speed up the code.

Is there a more efficient way?

EDIT: I will keep updating the code with the changes that achieve improvements.

ORIGINAL:

import numpy as np

def get_pos_nextvalues(pos_file, indices):
    result = []
    for line in pos_file:
        line = line.strip()
        if not line:
            break
        values = [float(value) for value in line.split()]
        result.append([float(values[i]) for i in indices])
    return np.array(result)

NEW:

def get_pos_nextvalues(pos_file, indices):
    result = ''
    for line in pos_file:
        if len(line) > 1:  # data line; 'empty' lines contain only '\n'
            s = line.split()
            result += ' '.join(s[i] for i in indices) + ' '
        else:
            break
    else:  # reached EOF without hitting a blank line
        return np.array([])
    result = np.fromstring(result, dtype=float, sep=' ')
    result = result.reshape(result.size // len(indices), len(indices))
    return result

.

pos_file = open(filename, 'r', buffering=1024*10)

[...]

while some_condition:
    vs = get_pos_nextvalues(pos_file, (4,5,6))
    [...]

speedup = 2.36

+2  A: 

Not converting floats to floats would be the first step. I would suggest, however, first profiling your code and then optimizing the bottleneck parts.
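
For instance, a minimal profiling sketch; the driver function and file name here are only assumptions, not your actual code:

import cProfile
import pstats

def read_all_segments(filename):
    # hypothetical driver: call the segment reader until it returns nothing
    with open(filename, 'r') as pos_file:
        while True:
            vs = get_pos_nextvalues(pos_file, (4, 5, 6))
            if vs.size == 0:
                break

cProfile.run('read_all_segments("data.txt")', 'read_stats')
pstats.Stats('read_stats').sort_stats('cumulative').print_stats(10)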

I understand that you've changed your code from the original, but

values = [value for value in line.split()]

is not a good thing either. Just write values = line.split() if that is what you mean.

Seeing how you're using NumPy, I'd suggest some methods of file reading that are demonstrated in their docs.
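
For example, np.loadtxt can pick out just the columns you need; note, though, that it reads the whole input and skips blank lines, so it does not preserve the segment boundaries by itself (file name and column numbers below are assumptions):

import numpy as np

# every data line in the file, columns 4-6 only; segment structure is lost
all_vals = np.loadtxt('data.txt', usecols=(4, 5, 6))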

SilentGhost
O_o ..oO(oops) edited
remosu
+1  A: 

You are only reading every character exactly once, so there isn't any real performance to gain.

You could combine strip and split if the empty lines contain a lot of whitespace.

You could also save some time by initializing the numpy array from the start, instead of first creating a Python list and then converting it.
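
A rough sketch of that idea, assuming you can put an upper bound on the number of lines per segment (the bound below is made up):

import numpy as np

def get_pos_nextvalues_prealloc(pos_file, indices, max_rows=100000):
    # fill a preallocated array row by row instead of growing a Python list
    out = np.empty((max_rows, len(indices)), dtype=float)
    n = 0
    for line in pos_file:
        values = line.split()
        if not values:          # blank line ends the segment
            break
        for j, i in enumerate(indices):
            out[n, j] = float(values[i])
        n += 1
    return out[:n]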

Thomas Ahle
change «line = line.strip(); if not line:» to «if len(line) <= 1:»; empty lines only contain '\n'
remosu
A: 

numpy.fromfile doesn't work for you?

arr = np.fromfile('tmp.txt', sep=' ', dtype=float)
PabloG
nope... my data file is too huge to be read completely into memory
remosu
I was looking for an efficient way to read each segment of data into a str and use numpy.fromstring
remosu
+1  A: 

Try increasing the read buffer; IO is probably the bottleneck of your code.

open('file.txt', 'r', 1024 * 10) 

Also, if the data is fully sequential, you can try to skip the line-by-line code and convert a bunch of lines at once.
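
Something along these lines, just a sketch assuming every data line has the same number of columns as in the question:

import numpy as np
from itertools import takewhile

def read_segment(pos_file, indices):
    # collect the whole segment up to the next blank line, then parse it once
    lines = list(takewhile(lambda l: l.strip(), pos_file))
    if not lines:
        return np.array([])
    data = np.fromstring(''.join(lines), dtype=float, sep=' ')
    data = data.reshape(len(lines), -1)
    return data[:, list(indices)]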

arthurprs
this helps too. Thanks
remosu
+1  A: 

Instead of:

if len(line) <= 1: # only '\n' in «empty» lines
    break
values = line.split()

try this:

values = line.split()
if not values: # line is wholly whitespace, end of segment
    break
John Machin
A: 

Here's a variant that might be faster when only a few indices are needed. It builds a string of only the desired values so that np.fromstring does less work.

def get_pos_nextvalues_fewindices(pos_file, indices):
    result = ''
    for line in pos_file:
        if len(line) > 1:
            s = line.split()
            for i in indices:
                result += s[i] + ' '   # keep only the wanted columns
        else:
            break                      # blank line ends the segment
    result = np.fromstring(result, dtype=float, sep=' ')
    result = result.reshape(result.size // len(indices), len(indices))
    return result

This trades off the overhead of split() and an added loop for less parsing. Or perhaps there's some clever regex trick you can do to extract the desired substrings directly?
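
If you want to try the regex route, here's a hedged sketch for three consecutive columns; the pattern and column positions are assumptions, and it only pays off if it actually beats split() when timed:

import re
import numpy as np

# skip the first 4 whitespace-separated fields, capture the next 3 (columns 4-6)
FIELDS_4_TO_6 = re.compile(r'^(?:\S+\s+){4}(\S+)\s+(\S+)\s+(\S+)', re.MULTILINE)

def parse_segment_regex(segment_text):
    # segment_text is one blank-line-delimited block as a single string
    return np.array(FIELDS_4_TO_6.findall(segment_text), dtype=float)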

Old Answer

np.mat('1.23 2.34 3.45 6\n1.32 2.43 7 3.54') converts the string to a numpy matrix of floating point values. This might be a faster kernel for you to use. For instance:

import numpy as np

def ReadFileChunk(pos_file):
    chunktxt = ""
    for line in pos_file:
        if len(line) > 1:
            chunktxt += line
        else:
            break

    return np.mat(chunktxt).tolist()
    # or alternatively
    #return np.array(np.mat(chunktxt))

Then you can move your indexing stuff to another function. Hopefully having numpy parse the string internally is faster than calling float() repeatedly.
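
For instance, once the chunk has been parsed into a 2-D array (however that is done), the column selection is just a slice; a small sketch using the column numbers from the question:

import numpy as np

def select_columns(chunk_array, indices=(4, 5, 6)):
    # chunk_array: one segment as an (n_lines, n_cols) array
    return chunk_array[:, list(indices)]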

mtrw
I was trying something like this, but was looking for a more efficient way of reading chunks of data than concatenating line by line. I use np.fromstring instead of np.mat, since I need an array and the reshape is not expensive. But letting numpy parse the string is definitely a substantial improvement.
remosu
return np.array(np.mat(s)) is more expensive than using np.fromstring
remosu
So np.fromstring + reshape is faster than np.array(np.mat(s))? Ah well, I learned something new at least; I thought np.fromstring was for binary data packed into a string.
mtrw
The new answer runs at a similar speed, but I like it more because it frees me from worrying about future changes to the line size (7 now)
remosu
oops... I voted for this answer again but Stack Overflow put 0 and won't let me vote again :(. I need a deep read of the FAQ.
remosu
No problem, I'm glad it helped!
mtrw