tags:
views: 152
answers: 2

edit: Initially I was trying to be general but it came out vague. I've included more detail below.

I'm writing a script that pulls in data from two large CSV files: one of people's schedules and the other of information about those schedules. The data is mined and combined to eventually create pajek-format graphs of people's connections for Monday through Saturday, with a seventh graph representing all connections over the week, using a string of 1's and 0's to indicate on which days of the week each connection is made. This last graph is a break from the pajek format and is used by a separate program written by another researcher.

Pajek format has a large header, and then lists connections as unordered (vertex1 vertex2) pairs. It's difficult to store these pairs in a dictionary, because there are often multiple connections on the same day between the same two vertices.
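One workaround for the duplicate-pair problem is to key a dictionary on the sorted vertex pair and store a count instead of a single entry. A minimal sketch (the names here are hypothetical, not from the script in question):

from collections import Counter

# Key a Counter on the sorted vertex pair so (a, b) and (b, a)
# collapse to one unordered edge; duplicates just bump the count.
day_edges = Counter()

def add_connection(v1, v2):
    day_edges[tuple(sorted((v1, v2)))] += 1

add_connection("alice", "bob")
add_connection("bob", "alice")  # same unordered pair; count is now 2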

I'm wondering what the best way to output these graphs is. Should I build the large single graph and have a second script deconstruct it into the several smaller graphs? Should I keep seven streams open and write to each as I determine a connection? Or should I keep some other data structure for each (like a queue) and output them when I can?

+2  A: 

I would open seven file streams, as accumulating the data first might be quite memory intensive if there is a lot of it. Of course that is only an option if you can sort the connections as you go and don't need to read all the data first to do the sorting.
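For illustration, a minimal sketch of that write-as-you-go approach, assuming a hypothetical stream_edges() generator standing in for the real mining step:

def stream_edges():
    # Hypothetical stand-in: yields (vertex1, vertex2, set of day indices).
    yield ("alice", "bob", set([0, 2]))  # connected Monday and Wednesday
    yield ("bob", "carol", set([5]))     # connected Saturday

streams = [open("day%d.pajek" % d, "w") for d in range(6)]
combined = open("combined.dat", "w")
try:
    for v1, v2, days in stream_edges():
        for d in days:
            streams[d].write("%s %s\n" % (v1, v2))
        # Day-membership flags for the combined weekly file.
        flags = "".join("1" if d in days else "0" for d in range(6))
        combined.write("%s %s %s\n" % (v1, v2, flags))
finally:
    for f in streams + [combined]:
        f.close()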

MrTopf
+1  A: 

"...pulls in data from two large CSV files, one of people's schedules and the other of information about their schedules." Vague, but I think I get it.

"The data is mined and combined to eventually create pajek format graphs for Monday-Sat of peoples connections," Mined and combined. Cool. Where? In this script? In another application? By some 3rd-party module? By some web service?

Is this a row-at-a-time algorithm? Does one row of input produce one connection that gets sent to one or more daily graphs?

Is this an algorithm that has to see an entire schedule before it can produce anything? [If so, it's probably wrong, but I don't really know and your question is pretty vague on this central detail.]

"... a seventh graph representing all connections over the week with a string of 1's and 0's to indicate which days of the week the connections are made." Incomplete, but probably good enough.

import csv

def makeKey2( row2 ):
    return ( row2[1], row2[2] ) # Whatever the lookup key is for source2

def makeKey1( row1 ):
    return ( row1[3], row1[0] ) # Whatever the lookup key is for source1

# One stream per day (Mon-Sat) plus the combined weekly file.
dayFile = [ open("day%d.pajek" % i, "w") for i in range(6) ]
combined = open("combined.dat","w")
source1 = open( schedules, "r" )       # schedules: path to the facts CSV
rdr1= csv.reader( source1 )
source2 = open( aboutSchedules, "r" )  # aboutSchedules: path to the dimension CSV
rdr2= csv.reader( source2 )

# "Combine" usually means a relational join between source 1 and source 2.
# We'll assume that source2 is a small-ish dimension and the
# source1 is largish facts

aboutDim = dict( (makeKey2(row),row) for row in rdr2 )

for row in rdr1:
    connection, dayList = mine_and_combine( row, aboutDim[ makeKey1(row) ] )
    for d in dayList:
        dayFile[d].write( connection )
    # Build the Mon-Sat membership flags for the combined weekly file.
    flags = [ 1 if d in dayList else 0 for d in range(6) ]
    combined.write( "%s %s\n" % ( connection, "".join(map(str, flags)) ) )

Something like that.

The points are:

  1. One pass through each data source. No nested loops. O(n) processing.

  2. Keep as little in memory as you need to create a useful result.
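One caveat the sketch above skips: a pajek file's header needs the vertex count up front (*Vertices N), which isn't known until the pass finishes. A workaround (a sketch under that assumption, not the poster's code) is to assign vertex numbers on first sight, spool numbered edges to a temporary file per day, and stitch header plus edges together at the end:

import os

vertex_ids = {}  # vertex name -> 1-based pajek vertex number

def vid(name):
    # Assign a number on first sight; reuse it on later sightings.
    return vertex_ids.setdefault(name, len(vertex_ids) + 1)

def finalize_pajek(day, edge_tmp_path):
    # edge_tmp_path holds lines like "3 7\n" written during the main
    # pass via vid(); here we prepend the now-known header and vertex list.
    with open("day%d.pajek" % day, "w") as out:
        out.write("*Vertices %d\n" % len(vertex_ids))
        for name, num in sorted(vertex_ids.items(), key=lambda kv: kv[1]):
            out.write('%d "%s"\n' % (num, name))
        out.write("*Edges\n")
        with open(edge_tmp_path) as edges:
            out.write(edges.read())
    os.remove(edge_tmp_path)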

S.Lott
Combined early in the script. One row of personal-data input results in ~60 lines of output, and there is a lot of normalization and processing because some of the data isn't standardized (it's hand-entered). Thank you, this has been very helpful.
Lonnen