I am trying to push some big files (around 4 million records each) into a MongoDB instance. What I am basically trying to achieve is to update the existing data with the data from the files. The algorithm looks something like this:
import time

rowHeaders = ('orderId', 'manufacturer', 'itemWeight')

for row in dataFile:
    row = row.strip('\n').split('\t')
    row = dict(zip(rowHeaders, row))
    # look up the existing document by this row's orderId
    mongoRow = mongoCollection.find_one({'orderId': row['orderId']})
    if mongoRow is not None:
        if mongoRow['itemWeight'] != row['itemWeight']:
            row['tsUpdated'] = time.time()
        else:
            # update() below replaces the whole document, so carry
            # the old timestamp over when the weight is unchanged
            row['tsUpdated'] = mongoRow.get('tsUpdated')
    else:
        row['tsUpdated'] = time.time()
    mongoCollection.update({'orderId': row['orderId']}, row, upsert=True)
So the algorithm is: update the whole row except 'tsUpdated' if the weights are the same, insert a new row if the row is not in Mongo yet, or update the whole row including 'tsUpdated' if the weight has changed.
The question is: can this be done faster, more easily, and more efficiently from Mongo's point of view? (perhaps with some kind of bulk insert?)
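For reference, here is roughly what I was imagining for a bulk variant. This is only a sketch, not tested at scale: it assumes pymongo's bulk_write / UpdateOne API and a MongoDB 4.2+ aggregation-pipeline update (the $cond trick) for the conditional timestamp. Note that $set merges fields into the existing document rather than replacing it wholesale like my loop above does:

from pymongo import UpdateOne
import time

BATCH_SIZE = 1000   # flush a batch of operations every 1000 rows
now = time.time()

ops = []
for row in dataFile:
    row = dict(zip(rowHeaders, row.strip('\n').split('\t')))
    ops.append(UpdateOne(
        {'orderId': row['orderId']},
        # pipeline update (MongoDB 4.2+): touch tsUpdated only when
        # itemWeight changed or the document is inserted for the first time
        [{'$set': {
            'manufacturer': row['manufacturer'],
            'itemWeight': row['itemWeight'],
            'tsUpdated': {'$cond': [
                {'$eq': ['$itemWeight', row['itemWeight']]},
                '$tsUpdated',   # same weight: keep the old timestamp
                now,            # changed or missing: stamp it
            ]},
        }}],
        upsert=True,
    ))
    if len(ops) >= BATCH_SIZE:
        mongoCollection.bulk_write(ops, ordered=False)
        ops = []

if ops:   # flush whatever is left over
    mongoCollection.bulk_write(ops, ordered=False)

Is something along these lines the right direction, or is there a better server-side way to do this?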