I am trying to push some big files (around 4 million records) into a mongo instance. What I am basically trying to achieve is to update the existing data with the data from the files. The algorithm looks something like this:

import time

rowHeaders = ('orderId', 'manufacturer', 'itemWeight')
for row in dataFile:
    # Parse a tab-separated line into a dict keyed by the column names
    row = row.strip('\n').split('\t')
    row = dict(zip(rowHeaders, row))

    # Look up the existing document for this order, if any
    mongoRow = mongoCollection.find_one({'orderId': row['orderId']})
    if mongoRow is not None:
        # Bump the timestamp only when the weight actually changed
        if mongoRow['itemWeight'] != row['itemWeight']:
            row['tsUpdated'] = time.time()
    else:
        # New order: set the timestamp on insert
        row['tsUpdated'] = time.time()

    # Replace the document, or insert it if it does not exist yet
    mongoCollection.update({'orderId': row['orderId']}, row, upsert=True)

In short: if the weights are the same, update the whole row except 'tsUpdated'; if the row is not in mongo yet, insert it; otherwise update the whole row including 'tsUpdated'. That is the algorithm.

The question is: can this be done faster, more easily and more efficiently from mongo's point of view (possibly with some kind of bulk insert)?

A: 

Combine a unique index on orderId with an update query that also checks for a change in itemWeight. When the orderId is already present and itemWeight is unchanged, the query matches nothing, so the upsert tries to insert a new document; the unique index rejects that insert, leaving the existing document (and its timestamp) untouched.

# When orderId exists and itemWeight is unchanged, the filter matches nothing
# and the unique index blocks the duplicate insert the upsert would attempt
mongoCollection.ensure_index('orderId', unique=True)
mongoCollection.update({'orderId': row['orderId'],
                        'itemWeight': {'$ne': row['itemWeight']}},
                       row, upsert=True)
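
For completeness, here is a minimal sketch of how that update could sit inside the file loop from your question, assuming an acknowledged write concern (the default with MongoClient): when orderId already exists and itemWeight is unchanged, the blocked insert surfaces as a DuplicateKeyError, which can simply be ignored. mongoCollection, dataFile and rowHeaders are the names from your code.

import time
from pymongo.errors import DuplicateKeyError

mongoCollection.ensure_index('orderId', unique=True)

rowHeaders = ('orderId', 'manufacturer', 'itemWeight')
for line in dataFile:
    row = dict(zip(rowHeaders, line.strip('\n').split('\t')))
    # tsUpdated is always set here; the update only goes through when the
    # document is new or its itemWeight changed, which are exactly the cases
    # where the timestamp should be refreshed
    row['tsUpdated'] = time.time()
    try:
        mongoCollection.update({'orderId': row['orderId'],
                                'itemWeight': {'$ne': row['itemWeight']}},
                               row, upsert=True)
    except DuplicateKeyError:
        # orderId exists and itemWeight is unchanged: leave the document alone
        pass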

My benchmark shows a 5-10x performance improvement over your algorithm (depending on the ratio of inserts to updates).
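
If you want to batch the writes (the bulk insert you mention), the same trick should carry over to PyMongo 3.x's bulk_write with unordered ReplaceOne upserts. This is an untested sketch under that assumption; the duplicate-key errors for unchanged rows come back inside a BulkWriteError that can be ignored.

import time
from pymongo import ReplaceOne
from pymongo.errors import BulkWriteError

# assumes the unique index on orderId already exists
# (create_index in PyMongo 3.x, where ensure_index is gone)
rowHeaders = ('orderId', 'manufacturer', 'itemWeight')
requests = []
for line in dataFile:
    row = dict(zip(rowHeaders, line.strip('\n').split('\t')))
    row['tsUpdated'] = time.time()
    requests.append(ReplaceOne({'orderId': row['orderId'],
                                'itemWeight': {'$ne': row['itemWeight']}},
                               row, upsert=True))

try:
    # ordered=False keeps processing past the duplicate-key errors raised
    # for rows whose itemWeight did not change
    mongoCollection.bulk_write(requests, ordered=False)
except BulkWriteError:
    pass

For 4 million records you would flush requests in chunks (say every few thousand operations) rather than building one giant list.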

mischa_u