




I have information about 12340 cars. This info is stored sequentially in two different files:

  1. car_names.txt, which contains one line for the name of each car
  2. car_descriptions.txt, which contains the descriptions of each car. So 40 lines for each one, where the 6th line reads @CAR_NAME

What I would like to do in Python: for each car in the car_descriptions.txt file, add the car's name (which comes from the other file) on the 7th line (it is empty), just after @CAR_NAME

I thought about:

  1. read the 1st file and store the car names in a matrix/list
  2. start to read the 2nd file and, each time it finds the string @CAR_NAME, just write the name on the next line

But I wonder if there is a faster approach, where the program reads one line at a time from each file and makes the modification as it goes.


for line1, line2 in zip(file(filename1), file(filename2)):
    # do your thing

or similar

Corey Porter
what does "zip" do here?
It interleaves the elements from a list of iterables. You could use itertools.izip as an alternative.
In this specific case, it returns a list of tuples, each tuple being (line x from file1, line x from file2)
But how does this help **at all** with the OP's problem of processing 40 lines from file 2 for each line from file 1?!
Alex Martelli
Downvoted because this won't work for the original question. This solution assumes each file has the same number of lines, but the original question was clear that file 2 had 40 lines for every one in file 1.
Bryan Oakley
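
To make the objection in these comments concrete, here is a minimal Python 3 sketch. The sample lists are made up, and 3 lines per car stand in for the real 40. Plain `zip` pairs element i with element i and stops at the shorter input, whereas taking a fixed-size block with `itertools.islice` keeps the two inputs in step.

```python
from itertools import islice

names = ["Alpha\n", "Beta\n"]  # hypothetical car names
descr = ["a1\n", "a2\n", "a3\n", "b1\n", "b2\n", "b3\n"]  # 3 lines per car here, not 40

# zip pairs element i with element i and stops at the shorter input,
# so only the first two description lines are ever seen.
pairs = list(zip(names, descr))

# Pulling a fixed-size block per name keeps names and descriptions aligned.
LINES_PER_CAR = 3
descr_iter = iter(descr)
blocks = [(name, list(islice(descr_iter, LINES_PER_CAR))) for name in names]
```

With the real data, `LINES_PER_CAR` would be 40 and both inputs would be open file objects instead of lists.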

12340 records is not much data (there are far bigger data sets being processed out there).

An even better approach would be to use the built-in sqlite module. If not, use a simple structured format such as CSV. Failing that, use threads, so that you can process the two files simultaneously.

can that sqlite module be used from python? how?
import sqlite3
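
As a rough illustration of that suggestion, here is a minimal Python 3 sketch using the standard-library `sqlite3` module. The table name, columns, and sample rows are assumptions made up for the example and are not part of the original question.

```python
import sqlite3

# Hypothetical schema: one row per car, description kept as a single text field.
conn = sqlite3.connect(":memory:")  # pass a filename instead to persist the data
conn.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY, name TEXT, description TEXT)")

sample = [("Alpha", "line 1\nline 2\n"), ("Beta", "line 1\nline 2\n")]  # made-up data
conn.executemany("INSERT INTO cars (name, description) VALUES (?, ?)", sample)
conn.commit()

# Look a car up by name rather than by its position in a file.
row = conn.execute("SELECT id, name FROM cars WHERE name = ?", ("Beta",)).fetchone()
```

Once the name and description live in the same row, there is nothing left to keep in step between two files.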
+8  A: 

I'm not sure if I completely understand what you're trying to do. Is it something like this?

f1 = open('car_names.txt')
f2 = open('car_descriptions.txt')
for car_name in f1:
    for i in range(5):    # echo the first 5 lines
        print f2.readline(),
    marker = f2.readline()
    assert marker.startswith('@CAR_NAME')  # the 6th line must be the marker
    print marker,
    f2.readline()         # skip the empty 7th line
    print car_name,       # print the real car name in its place
    for i in range(33):   # echo the remaining 33 of the original 40
        print f2.readline(),
yes, i guess so! I will check it now! thanks
+4  A: 

Reading car_names.txt one line at a time will save you a piddling amount of memory (really really tiny by today's standards;-) but it absolutely won't be any faster than slurping it down at one gulp (best case it will be exactly the same speed, probably even a little bit slower unless your underlying operating system and storage system do a great job at read-lookahead caching / buffering). So I suggest:

import fileinput

carnames = open('car_names.txt').readlines()
carnamit = iter(carnames)

skip = False
for line in fileinput.input(['car_descriptions.txt'], True, '.bak'):
  if not skip:
    print line,
  if '@CAR_NAME' in line:
    print next(carnamit),
    skip = True   # drop the empty line that follows the marker
  else:
    skip = False

So measure the speed of this, and an alternative that does

carnamit = open('car_names.txt')

at the start instead of reading all lines into a list like my first version -- I bet that the first version (in as much as there's any measurable and repeatable difference) will prove to be faster.

BTW, the fileinput module of the standard library is documented here, and it's truly a convenient way to perform "virtual rewriting in-place" of text files (typically keeping the old version as a backup, just in case -- but even if the machine should crash in the middle of the operation the old version of the data will still be there, so in a sense the "rewriting" operates atomically with respect to machine crashes, a nice little touch;-).
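
For readers on Python 3, a minimal sketch of the same in-place technique follows. The function name `insert_car_names` and its parameters are made up for illustration. It assumes the names file has one name per line and that the line after each marker is the empty placeholder to be replaced.

```python
import fileinput

def insert_car_names(names_path, descr_path, marker="@CAR_NAME"):
    """Rewrite descr_path in place, replacing the line after each marker with a name."""
    with open(names_path) as f:
        names = iter(f.readlines())
    skip = False
    # With inplace=True, fileinput redirects print() output into the file being read
    # and keeps the original under descr_path + ".bak".
    with fileinput.input(files=[descr_path], inplace=True, backup=".bak") as lines:
        for line in lines:
            if not skip:
                print(line, end="")
            skip = False
            if marker in line:
                print(next(names), end="")  # the name takes the place of the next line
                skip = True
```

Calling `insert_car_names('car_names.txt', 'car_descriptions.txt')` would leave the original content in `car_descriptions.txt.bak`.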

Alex Martelli
Hi, excuse me, I understand your approach, but I get the error: NameError: name 'Next' is not defined. Am I missing some other library?
I believe next is new in Python 2.6. Are you running an earlier version?
Brent Newey
In 2.5 or earlier, you need `carnamit.next()` instead of the nicer `next(carnamit)` that works in 2.6 and later.
Alex Martelli
+8  A: 

First, make a generator that retrieves the car name from a sequence. You could yield every 7th line; I've made mine yield whatever line follows the line that starts with @CAR_NAME:

def car_names(seq):
    yieldnext = False  # becomes True right after a line starting with @CAR_NAME
    for line in seq:
        if yieldnext: yield line
        yieldnext = line.startswith('@CAR_NAME')

Now you can use itertools.izip to go through both sequences in parallel:

from itertools import izip
with open(r'c:\temp\cars.txt') as f1:
    with open(r'c:\temp\car_names.txt') as f2:
        for (c1, c2) in izip(f1, car_names(f2)):
            print c1, c2
Robert Rossney
Who told you that this was windows?
I test the code that I post. If the fact that you can infer that my machine runs Windows troubles you, I suggest lying in a quiet, darkened room with a cool, damp washcloth over your eyes until the feeling passes.
Robert Rossney
Wow, what a heated reply to my question! It looks like you need the cool washcloth more than I do! Anyway, my comment was only meant as a sad note: Windows users often tend to think everybody is a Windows user. You could have tested your script with a file in the same directory (like this other guy), or you could just have abstracted the path with `filename`. Improve your answer instead of ranting!
You were, after all, moved to comment on a matter of perfect irrelevance. What next, complaining when someone uses "fizz" and "buzz" as temporary variable names because "foo" and "bar" are standard?
Robert Rossney

I think this fits the question:

  • it reads the description file one line at a time
  • when it sees @CAR_NAME, it still emits it, but replaces the next line in the description file with the next line from the names file

def merge_car_descriptions(namefile, descrfile):
    names = open(namefile,'r')
    descr = open(descrfile,'r')
    for d in descr:
        if '@CAR_NAME' in d:
            yield d + names.readline()  # marker line, then the car name
            next(descr, None)           # discard the line being replaced
        else:
            yield d

if __name__=='__main__':
    import sys
    if len(sys.argv) != 3:
        sys.exit("Syntax: %s car_names.txt car_descriptions.txt" % sys.argv[0])
    for l in merge_car_descriptions(sys.argv[1], sys.argv[2]):
        print l,
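
For comparison, here is a Python 3 sketch of the same idea that works on any pair of line iterables, so it can be tried without files. The name `merge_lines` and the `marker` parameter are illustrative, and the empty-line fallback when the names run out is an assumption.

```python
import io

def merge_lines(names, descr, marker="@CAR_NAME"):
    """Yield descr lines, replacing the line after each marker with the next name."""
    names = iter(names)
    descr = iter(descr)
    for line in descr:
        yield line
        if marker in line:
            next(descr, None)        # drop the empty placeholder line
            yield next(names, "\n")  # assumed fallback: an empty line if names run out

# In-memory stand-ins for the two files (made-up content).
names = io.StringIO("Alpha\nBeta\n")
descr = io.StringIO("h\n@CAR_NAME\n\nt\nh\n@CAR_NAME\n\nt\n")
merged = "".join(merge_lines(names, descr))
```

Open file objects can be passed in place of the `io.StringIO` stand-ins, since both are iterated line by line.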