tags:
views: 152
answers: 2

edit: Initially I was trying to be general but it came out vague. I've included more detail below.

I'm writing a script that pulls in data from two large CSV files: one of people's schedules and the other of information about those schedules. The data is mined and combined to eventually create pajek-format graphs of people's connections for Monday through Saturday, with a seventh graph representing all connections over the week, using a string of 1's and 0's to indicate on which days of the week each connection is made. This last graph is a break from the pajek format and is used by a separate program written by another researcher.

Pajek format has a large header, and then lists connections as unordered (vertex1 vertex2) pairs. It's difficult to store these pairs in a dictionary, because there are often multiple connections on the same day between the same two vertices.
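One workaround for the duplicate-pair problem is to key a dictionary on the sorted vertex pair and store a count instead of a single entry. A minimal sketch (the names here are hypothetical, not from the script in question):

from collections import Counter

# Key a Counter on the sorted vertex pair so (a, b) and (b, a)
# collapse to one unordered edge; duplicates just bump the count.
day_edges = Counter()

def add_connection(v1, v2):
    day_edges[tuple(sorted((v1, v2)))] += 1

add_connection("alice", "bob")
add_connection("bob", "alice")  # same unordered pair; count is now 2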

I'm wondering what the best way to output these graphs is. Should I build the large single graph and have a second script deconstruct it into the several smaller graphs? Should I keep seven streams open and write to each as I determine a connection? Or should I keep some other data structure for each (like a queue) and output them when I can?

+2  A: 

I would open seven file streams, as accumulating the data first might be quite memory intensive if there is a lot of it. Of course that is only an option if you can sort the connections as you go and don't need to read all the data first to do the sorting.
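For illustration, a minimal sketch of that write-as-you-go approach, assuming a hypothetical stream_edges() generator standing in for the real mining step:

def stream_edges():
    # Hypothetical stand-in: yields (vertex1, vertex2, set of day indices).
    yield ("alice", "bob", set([0, 2]))  # connected Monday and Wednesday
    yield ("bob", "carol", set([5]))     # connected Saturday

streams = [open("day%d.pajek" % d, "w") for d in range(6)]
combined = open("combined.dat", "w")
try:
    for v1, v2, days in stream_edges():
        for d in days:
            streams[d].write("%s %s\n" % (v1, v2))
        # Day-membership flags for the combined weekly file.
        flags = "".join("1" if d in days else "0" for d in range(6))
        combined.write("%s %s %s\n" % (v1, v2, flags))
finally:
    for f in streams + [combined]:
        f.close()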

MrTopf
+1  A: 

"...pulls in data from two large CSV files, one of people's schedules and the other of information about their schedules." Vague, but I think I get it.

"The data is mined and combined to eventually create pajek format graphs for Monday-Sat of peoples connections," Mined and combined. Cool. Where? In this script? In another application? By some 3rd-party module? By some web service?

Is this a row-at-a-time algorithm? Does one row of input produce one connection that gets sent to one or more daily graphs?

Is this an algorithm that has to see an entire schedule before it can produce anything? [If so, it's probably wrong, but I don't really know and your question is pretty vague on this central detail.]

"... a seventh graph representing all connections over the week with a string of 1's and 0's to indicate which days of the week the connections are made." Incomplete, but probably good enough.

import csv

def makeKey2( row2 ):
    return ( row2[1], row2[2] ) # Whatever the lookup key is for source2

def makeKey1( row1 ):
    return ( row1[3], row1[0] ) # Whatever the lookup key is for source1

# One stream per day (Mon-Sat) plus the combined weekly file.
dayFile = [ open("day%d.pajek" % i, "w") for i in range(6) ]
combined = open("combined.dat","w")
source1 = open( schedules, "r" )       # schedules: path to the facts CSV
rdr1= csv.reader( source1 )
source2 = open( aboutSchedules, "r" )  # aboutSchedules: path to the dimension CSV
rdr2= csv.reader( source2 )

# "Combine" usually means a relational join between source 1 and source 2.
# We'll assume that source2 is a small-ish dimension and the
# source1 is largish facts

aboutDim = dict( (makeKey2(row),row) for row in rdr2 )

for row in rdr1:
    connection, dayList = mine_and_combine( row, aboutDim[ makeKey1(row) ] )
    for d in dayList:
        dayFile[d].write( connection )
    # Build the Mon-Sat membership flags for the combined weekly file.
    flags = [ 1 if d in dayList else 0 for d in range(6) ]
    combined.write( "%s %s\n" % ( connection, "".join(map(str, flags)) ) )

Something like that.

The points are:

  1. One pass through each data source. No nested loops. O(n) processing.

  2. Keep as little in memory as you need to create a useful result.
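One caveat the sketch above skips: a pajek file's header needs the vertex count up front (*Vertices N), which isn't known until the pass finishes. A workaround (a sketch under that assumption, not the poster's code) is to assign vertex numbers on first sight, spool numbered edges to a temporary file per day, and stitch header plus edges together at the end:

import os

vertex_ids = {}  # vertex name -> 1-based pajek vertex number

def vid(name):
    # Assign a number on first sight; reuse it on later sightings.
    return vertex_ids.setdefault(name, len(vertex_ids) + 1)

def finalize_pajek(day, edge_tmp_path):
    # edge_tmp_path holds lines like "3 7\n" written during the main
    # pass via vid(); here we prepend the now-known header and vertex list.
    with open("day%d.pajek" % day, "w") as out:
        out.write("*Vertices %d\n" % len(vertex_ids))
        for name, num in sorted(vertex_ids.items(), key=lambda kv: kv[1]):
            out.write('%d "%s"\n' % (num, name))
        out.write("*Edges\n")
        with open(edge_tmp_path) as edges:
            out.write(edges.read())
    os.remove(edge_tmp_path)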

S.Lott
Combined early in the script. One row of personal-data input results in ~60 lines of output, and there is a lot of normalization and processing because some of the data isn't standardized (it's hand-entered). Thank you, this has been very helpful.
Lonnen