After implementing some of the solutions in my previous question, I've come up with the following solution:

reader = open('C://text.txt') 
writer = open('C://nona.txt', 'w')
counter = 1    
names, nums = [], []    
row = reader.read().split(' ')
x = len(row)/2
for (a, b) in [(c, d) for c, d in zip(row[:x], row[x:]) if d!='na']:
    print counter
    counter +=1
    names.append(a)
    nums.append(b)

writer.write(' '.join(names))
writer.write(' ')
writer.write(' '.join(nums))

This program works quite well for a smaller sample data set. However, it freezes when I run it on the full data set and causes Python to crash. Any suggestions on how I can overcome this?

A: 

Your file is organized in an unfortunate manner for Pythonic processing.

Note that when you call reader.read(), you are reading the entire file into memory. Let's say this takes up X bytes.

Calling split will add at least another X bytes of memory usage, as it creates a new string object for every item in the file, and each of those objects carries its own overhead.

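Then you take row[:x] and row[x:], which adds more memory again, because the slice operator makes a copy. The copies are new lists of references to the same strings rather than new strings, so the cost grows with the number of items.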

Then you call zip, and make a list comprehension, etc, etc. Strings and tuples are immutable data, which means you are always creating them from scratch.
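
One rough way to check these estimates is to measure a small sample with sys.getsizeof. This is only a sketch: the sample data is made up, the numbers are specific to CPython, and getsizeof on a list reports just the list's own array of references, not the strings it points to.

```python
import sys

# Made-up sample in the same layout as the question: names first, then values.
names = ['name%d' % i for i in range(1000)]
values = ['na', '10', '11', '12'] * 250
text = ' '.join(names + values)

row = text.split(' ')
x = len(row) // 2

text_bytes = sys.getsizeof(text)                    # the single string read() returns
pieces_bytes = sum(sys.getsizeof(s) for s in row)   # total size of the item strings
list_bytes = sys.getsizeof(row)                     # the list object itself
slice_bytes = sys.getsizeof(row[:x]) + sys.getsizeof(row[x:])  # the two slice copies

print('text:   %d bytes' % text_bytes)
print('pieces: %d bytes' % pieces_bytes)
print('list:   %d bytes' % list_bytes)
print('slices: %d bytes' % slice_bytes)
```

Running this on a slice of the real file should give a feel for how quickly the intermediate copies add up before the loop even starts.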

I would approach this problem at a lower level. Open one file descriptor and point it to the beginning of the file. Open another and have it seek to the beginning of the (na/0/1/2) values (you will know where this is by counting the spaces). Now, read one name and one value at a time, and if the value is not "na" you can write the name to an output file. If you need to write the values to the output file also, hold them in memory and write them all at once when you are done.

Unfortunately this will be more difficult to code than just using the high-level functions that Python provides (you will need to write code that operates at the character level), but as you have seen there is a price to pay for those high-level functions.
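
A minimal sketch of the two-handle approach, under some assumptions: the file holds whitespace-separated items, the first half are names and the second half are values, and the file fits the "count the separators" idea by being scanned twice before filtering. The helper names iter_tokens and drop_na are hypothetical, and the files are opened in binary mode so that byte offsets can be passed to seek safely.

```python
import re

TOKEN = re.compile(br'\S+')  # a run of non-whitespace bytes


def iter_tokens(handle, chunk_size=1 << 16):
    """Yield (byte_offset, token) pairs from a file opened in binary mode."""
    base = handle.tell()  # file offset of data[0]
    pending = b''         # tail of the previous chunk that may be a partial token
    while True:
        chunk = handle.read(chunk_size)
        data = pending + chunk
        keep_from = len(data)
        for match in TOKEN.finditer(data):
            if chunk and match.end() == len(data):
                # The token touches the end of the buffer, so it may
                # continue in the next chunk; hold it back for now.
                keep_from = match.start()
                break
            yield base + match.start(), match.group()
        pending = data[keep_from:]
        base += keep_from
        if not chunk:  # end of file
            break


def drop_na(src_path, dst_path, chunk_size=1 << 16):
    """Write names, then values, to dst_path, skipping pairs whose value is 'na'."""
    # Pass 1: count the items so we know how many names there are.
    with open(src_path, 'rb') as src:
        total = sum(1 for _ in iter_tokens(src, chunk_size))
    half = total // 2

    # Pass 2: find the byte offset of the first value.
    values_start = 0
    with open(src_path, 'rb') as src:
        for index, (offset, _) in enumerate(iter_tokens(src, chunk_size)):
            if index == half:
                values_start = offset
                break

    kept_values = []  # held in memory and written at the end
    with open(src_path, 'rb') as names, open(src_path, 'rb') as values, \
            open(dst_path, 'wb') as out:
        values.seek(values_start)
        name_tokens = iter_tokens(names, chunk_size)
        value_tokens = iter_tokens(values, chunk_size)
        for _ in range(half):
            name = next(name_tokens)[1]
            value = next(value_tokens)[1]
            if value != b'na':
                out.write(name + b' ')
                kept_values.append(value)
        out.write(b' '.join(kept_values))
```

Calling drop_na('C://text.txt', 'C://nona.txt') would mirror the paths in the question. Only one chunk per handle plus the kept values are in memory at any time.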

danben
A: 

What you should do is break your file up into two separate files. Your logic should do something like this:

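  1. Open the data file.
  2. Open the name file.
  3. Read the next item.
  4. Is it a name? Go to 5. Otherwise go to 6.
  5. Write the name to the name file and go to 3.
  6. It is a number or na: close the name file, open the number file, and write the item to it.
  7. Read the next item.
  8. Is it a number or na? Write it to the number file and go to 7. Otherwise you have reached the end of the data, so close the files.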
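
The steps above could be sketched roughly as follows. The helper names (iter_words, is_value, split_data) are hypothetical, and the sketch assumes that items are separated by whitespace and that no name consists only of digits or is literally "na". Each item is written on its own line so the two output files can later be read line by line.

```python
def iter_words(handle, chunk_size=1 << 16):
    """Yield whitespace-separated words from a text file, one chunk at a time."""
    pending = ''
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        data = pending + chunk
        words = data.split()
        # If the buffer ends mid-word, hold the last piece back for the next round.
        pending = words.pop() if words and not data[-1].isspace() else ''
        for word in words:
            yield word
    if pending:
        yield pending


def is_value(word):
    """Assumption: values are 'na' or plain digits; everything else is a name."""
    return word == 'na' or word.isdigit()


def split_data(data_path, names_path, numbers_path, chunk_size=1 << 16):
    """Write names to one file and values to another, one item per line."""
    with open(data_path) as data, open(names_path, 'w') as names, \
            open(numbers_path, 'w') as numbers:
        target = names
        for word in iter_words(data, chunk_size):
            if target is names and is_value(word):
                target = numbers  # first value seen: switch files for good
            target.write(word + '\n')
```

For example, split_data('C://text.txt', 'names.txt', 'numbers.txt') would produce the two files used in the next step.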

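Once you have your file split into two pieces, you can iterate over them together:

names = open('names.txt')
numbers = open('numbers.txt')
output = open('output.txt', 'w')  # output file name is an assumption

# On Python 2, use itertools.izip here so zip does not build the whole list in memory.
for name, number in zip(names, numbers):
    # Each line keeps its trailing newline, so strip it before comparing.
    if number.strip() != 'na':
        output.write(name.strip() + " " + number)

output.close()

Or you could write to two different files and then join them together, if that's what you need.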

Wayne Werner
Since it appears that his data is a huge list of names followed by a huge list of numbers, he could probably even do the splitting in a good text editor. It is also worth noting that this approach requires each name and each number to be on its own line.
Wilduck
Can you recommend a good text editor?
Robert A. Fettikowski
Any of them? Notepad++ is a simple one for beginners. I personally use Vim (www.vim.org), which has a pretty steep learning curve but is incredibly useful once you get it down.
Wayne Werner