
I would like to trim long sequences of the same value from a binary file in Python. A simple way of doing it is to read in the file and use re.sub to replace the unwanted sequence. This will of course not work on large binary files. Can it be done with something like numpy?
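For a file that does fit in memory, the re.sub approach mentioned above might look like this sketch (the sample bytes stand in for the real file contents):

```python
import re

# Naive in-memory approach: collapse any run of a repeated byte
# down to a single byte using a backreference.  The whole file
# must fit in memory for this to work.
data = b"1122233333444555"  # stands in for open("big.file", "rb").read()
trimmed = re.sub(rb"(.)\1+", rb"\1", data, flags=re.DOTALL)
# trimmed == b"12345"
```

re.DOTALL is needed so `.` also matches newline bytes, which can occur anywhere in binary data.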

+2  A: 

If two copies of the file fit in memory, then you can easily make a copy; the second copy is the compressed version. Sure, you can use numpy, but you can also use the array module. Additionally, you can treat your big binary object as a string of bytes and manipulate it directly.

It sounds like your file may be REALLY large, and you can't fit two copies into memory. (You didn't provide a lot of details, so this is just a guess.) You'll have to do your compression in chunks. You'll read in a chunk, do some processing on that chunk and write it out. Again, numpy, array or simple string of bytes will work fine.
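If a chunk (or the whole file) does fit in memory, the run-trimming itself can be vectorized with numpy; a minimal sketch (the byte values are just an illustration):

```python
import numpy as np

# Collapse runs of repeated bytes to a single byte: keep each byte
# that differs from its predecessor (and always keep the first).
data = np.frombuffer(b"\x01\x01\x02\x02\x02\x03", dtype=np.uint8)
keep = np.empty(len(data), dtype=bool)
keep[0] = True
keep[1:] = data[1:] != data[:-1]
trimmed = data[keep].tobytes()  # -> b"\x01\x02\x03"
```

When chunking, remember to compare the first byte of each chunk against the last byte of the previous one, or a run spanning a chunk boundary will leave a duplicate.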

S.Lott
A: 

You need to make your question more precise. Do you know the values you want to trim ahead of time?

Assuming you do, I would probably search for the matching sections using subprocess to run "fgrep -o -b <search string>" and then change the relevant sections of the file using the python file object's seek, read and write methods.
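A sketch of that idea (the function name `patch_in_place` and its equal-length restriction are my additions, not from the answer): ask grep for the byte offsets of each match, then overwrite those offsets in place with seek/write.

```python
import subprocess

def patch_in_place(path, pattern, replacement):
    # Only safe when the replacement is exactly as long as the pattern,
    # since we overwrite bytes in place.  Assumes a printable pattern
    # and a GNU-compatible grep on PATH.
    assert len(pattern) == len(replacement)
    # -o: print only matches, -b: with byte offsets,
    # -a: treat binary as text, -F: fixed string (fgrep)
    result = subprocess.run(
        ["grep", "-obaF", pattern.decode("latin-1"), path],
        capture_output=True, text=True)
    offsets = [int(line.split(":", 1)[0])
               for line in result.stdout.splitlines()]
    with open(path, "r+b") as f:
        for offset in offsets:
            f.seek(offset)
            f.write(replacement)
    return offsets
```

Shortening the file (replacement shorter than the pattern) cannot be done in place this way; it needs the copy-to-a-new-file approaches shown in the other answers.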

fivebells
+4  A: 

If you don't have the memory to do open("big.file").read(), then numpy won't really help. It uses the same memory as ordinary Python variables do (if you have 1GB of RAM, you can only load about 1GB of data into numpy).

The solution is simple: read the file in chunks. Open it with f = open("big.file", "rb"), then do a series of f.read(500) calls, remove the sequence from each chunk, and write the result out to another file object. This is pretty much how you do file reading/writing in C.

The problem then is that you can miss the pattern you are replacing when it spans a chunk boundary. For example:

target_seq = "567"
input_file = "1234567890"

input_file.read(5) # reads "12345", doesn't contain 567
input_file.read(5) # reads "67890", doesn't contain 567 -- the match was split across the two reads

The obvious solution is to start at the first character in the file, check len(target_seq) characters, then move forward one character and check again.

For example (pseudo code!):

seek_start = 0
chunk_size = len(target_seq)

while True:
    input_file.seek(seek_start)  # seek from the start of the file
    cur_data = input_file.read(chunk_size)  # first read: "123"
    if cur_data == "":
        break

    if cur_data == target_seq:
        # Found it!
        out_file.write(replacement_string)
        seek_start += chunk_size  # skip past the match
    else:
        # not it, shove just the first character in the new file,
        # then try again one position later
        out_file.write(cur_data[0])
        seek_start += 1

It's not exactly the most efficient way, but it will work, and not require keeping a copy of the file in memory (or two).
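For reference, a runnable version of this byte-stepping idea (the function name and parameters are illustrative, not from the answer):

```python
def replace_stepwise(in_path, out_path, target_seq, replacement):
    # Compare a len(target_seq)-sized window at every byte offset;
    # on a match, emit the replacement and jump past it, otherwise
    # emit one byte and step forward by one.
    with open(in_path, "rb") as inf, open(out_path, "wb") as outf:
        offset = 0
        size = len(target_seq)
        while True:
            inf.seek(offset)
            window = inf.read(size)
            if not window:
                break
            if window == target_seq:
                outf.write(replacement)
                offset += size
            else:
                outf.write(window[:1])
                offset += 1
```

For example, replacing b"567" with b"X" in a file containing b"1234567890" yields b"1234X890".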

dbr
Thanks, that helps a lot. I was hoping numpy would have some auto memory management for large files - I'm not too familiar with it.
bluegray
+1  A: 

dbr's solution is a good idea, but a bit overly complicated: all you really have to do is rewind the file pointer by (at most) the length of the sequence you are searching for before reading the next chunk, so a match spanning a chunk boundary isn't missed.

def ReplaceSequence(inFilename, outFilename, oldSeq, newSeq):
    inputFile  = open(inFilename, "rb")
    outputFile = open(outFilename, "wb")

    chunk = 1024

    while 1:
        data = inputFile.read(chunk)
        if len(data) < len(oldSeq):
            # too short to contain a match; flush and stop
            outputFile.write(data)
            break

        i = 0
        limit = len(data) - len(oldSeq) + 1
        while i < limit:
            if data.startswith(oldSeq, i):
                outputFile.write(newSeq)
                i += len(oldSeq)
            else:
                outputFile.write(data[i:i+1])
                i += 1

        # rewind past the unprocessed tail so a sequence spanning
        # the chunk boundary is seen again on the next read
        inputFile.seek(i - len(data), 1)

    inputFile.close()
    outputFile.close()
A: 

This generator-based version will keep exactly one character of the file content in memory at a time.

Note that I am taking your question title quite literally: you want to reduce runs of the same character to a single character. For replacing arbitrary patterns, this approach does not work:

import StringIO

def gen_chars(stream):
    while True:
        ch = stream.read(1)
        if ch:
            yield ch
        else:
            break

def gen_unique_chars(stream):
    lastchar = ''
    for char in gen_chars(stream):
        if char != lastchar:
            yield char
        lastchar = char

def remove_seq(infile, outfile):
    for ch in gen_unique_chars(infile):
        outfile.write(ch)

# Represents a file open for reading
infile  = StringIO.StringIO("1122233333444555")

# Represents a file open for writing
outfile = StringIO.StringIO()

# Will print "12345"
remove_seq(infile, outfile)
outfile.seek(0)
print outfile.read()
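The same run-collapsing idea can be written more compactly with itertools.groupby, which yields one (key, group) pair per run of consecutive equal items (a sketch; remove_runs is an illustrative name):

```python
from itertools import groupby

def remove_runs(s):
    # keep only the key of each run of equal characters
    return "".join(key for key, _ in groupby(s))

# remove_runs("1122233333444555") -> "12345"
```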
Triptych