I would like to trim long sequences of the same value from a binary file in Python. A simple approach is to read in the file and use re.sub to replace the unwanted sequence. This will of course not work on large binary files. Can it be done in something like numpy?
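For reference, the in-memory version I have in mind is roughly this (the function name is just illustrative; the regex uses a backreference to collapse any run of a repeated byte, and re.S so that "." also matches newline bytes):

```python
import re

def collapse_runs(data):
    # Collapse every run of a repeated byte down to a single byte
    return re.sub(rb"(.)\1+", rb"\1", data, flags=re.S)

collapse_runs(b"1122233333444555")  # -> b"12345"
```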
If two copies fit in memory, then you can easily make a copy. The second copy is the compressed version. Sure, you can use numpy, but you can also use the array module. Additionally, you can treat your big binary object as a string of bytes and manipulate it directly.
It sounds like your file may be REALLY large, and you can't fit two copies into memory. (You didn't provide a lot of details, so this is just a guess.) You'll have to do your compression in chunks. You'll read in a chunk, do some processing on that chunk and write it out. Again, numpy, array or simple string of bytes will work fine.
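A minimal sketch of that chunked read/process/write loop might look like the following, where process is a stand-in for whatever compression you do on each chunk:

```python
def process(chunk):
    # Stand-in for the real per-chunk compression/processing step
    return chunk

def compress_file(infile, outfile, chunk_size=64 * 1024):
    # Read, process, and write one chunk at a time, so at most one
    # chunk (plus its processed copy) is ever held in memory.
    while True:
        chunk = infile.read(chunk_size)
        if not chunk:
            break
        outfile.write(process(chunk))
```

The same loop works whether the chunks end up in numpy arrays, array objects, or plain byte strings.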
You need to make your question more precise. Do you know the values you want to trim ahead of time?
Assuming you do, I would probably search for the matching sections using subprocess to run "fgrep -o -b <search string>", and then change the relevant sections of the file using the Python file object's seek, read and write methods.
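A rough sketch of that approach follows. The function names are made up; "grep -F" is the portable spelling of fgrep, "-a" forces grep to treat a binary file as text, and patching in place like this only works cleanly when the replacement is the same length as the search string:

```python
import subprocess

def find_matches(filename, pattern):
    # `grep -F -o -b` prints one "offset:match" line per hit
    out = subprocess.run(
        ["grep", "-F", "-o", "-b", "-a", pattern, filename],
        capture_output=True, text=True).stdout
    return [int(line.partition(":")[0]) for line in out.splitlines()]

def patch_matches(filename, pattern, replacement):
    # Overwrite each match in place; equal lengths keep later
    # byte offsets valid after earlier writes.
    assert len(replacement) == len(pattern)
    with open(filename, "r+b") as f:
        for offset in find_matches(filename, pattern):
            f.seek(offset)
            f.write(replacement.encode())
```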
If you don't have the memory to do open("big.file").read()
, then numpy won't really help. It uses the same memory as ordinary Python variables do (if you have 1 GB of RAM, you can only load 1 GB of data into numpy).
The solution is simple: read the file in chunks. Open it with f = open("big.file", "rb")
, then do a series of f.read(500)
calls, remove the sequence from each chunk, and write the result out to another file object. This is pretty much how you do file reading/writing in C.
The problem then is if you miss the pattern you are replacing.. For example:
target_seq = "567"
input_file = "1234567890"
input_file.read(5) # reads "12345", which doesn't contain "567"
input_file.read(5) # reads "67890", which doesn't contain "567"
The obvious solution is to start at the first character in the file, check the next len(target_seq)
characters, then step forward one character and check again.
For example (pseudo code!):
seek_start = 0
chunk_size = len(target_seq)
while True:
    input_file.seek(seek_start)  # whence defaults to 0: seek from start of file
    cur_data = input_file.read(chunk_size)
    if cur_data == "":
        break
    if cur_data == target_seq:
        # Found it! Write the replacement and skip past the match
        out_file.write("replacement_string")
        seek_start += len(target_seq)
    else:
        # Not it; copy just one character and advance by one
        out_file.write(cur_data[0])
        seek_start += 1
It's not exactly the most efficient way, but it will work, and not require keeping a copy of the file in memory (or two).
dbr's solution is a good idea, but a bit over-complicated. All you really have to do is rewind the file pointer by the length of the sequence you are searching for before you read the next chunk.
def ReplaceSequence(inFilename, outFilename, oldSeq, newSeq):
    # Assumes len(oldSeq) == len(newSeq), so the two files stay aligned
    inputFile = open(inFilename, "rb")
    outputFile = open(outFilename, "wb")
    chunk = 1024
    while True:
        data = inputFile.read(chunk)
        data = data.replace(oldSeq, newSeq)
        outputFile.write(data)
        if len(data) < chunk:
            break
        # Step back so a sequence straddling a chunk boundary is re-read
        inputFile.seek(-len(oldSeq), 1)
        outputFile.seek(-len(oldSeq), 1)
    inputFile.close()
    outputFile.close()
This generator-based version will keep exactly one character of the file content in memory at a time.
Note that I am taking your question title quite literally - you want to reduce runs of the same character to a single character. For replacing patterns in general, this does not work:
import io

def gen_chars(stream):
    while True:
        ch = stream.read(1)
        if ch:
            yield ch
        else:
            break

def gen_unique_chars(stream):
    lastchar = ''
    for char in gen_chars(stream):
        if char != lastchar:
            yield char
        lastchar = char

def remove_seq(infile, outfile):
    for ch in gen_unique_chars(infile):
        outfile.write(ch)

# Represents a file open for reading
infile = io.StringIO("1122233333444555")
# Represents a file open for writing
outfile = io.StringIO()

remove_seq(infile, outfile)
outfile.seek(0)
print(outfile.read())  # Will print "12345"
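For completeness, here is a chunked variant of the same idea for real binary files, using itertools.groupby to collapse runs within each chunk and carrying the last byte across chunk boundaries so runs that straddle a boundary are still collapsed (collapse_file is just an illustrative name):

```python
import io
from itertools import groupby

def collapse_file(infile, outfile, chunk_size=64 * 1024):
    # Reduce every run of a repeated byte to a single byte, one chunk
    # at a time; `last` remembers the final byte of the previous chunk.
    last = None
    while True:
        chunk = infile.read(chunk_size)
        if not chunk:
            break
        out = bytearray()
        for byte, _ in groupby(chunk):
            if byte != last:
                out.append(byte)
            last = byte
        outfile.write(bytes(out))

infile = io.BytesIO(b"1122233333444555")
outfile = io.BytesIO()
collapse_file(infile, outfile)
print(outfile.getvalue())  # b'12345'
```

This keeps only one chunk in memory at a time, like the read/seek answers above, but avoids the rewind bookkeeping entirely.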