views:

31

answers:

1

I'm designing a MongoDB database that works with a script that periodically polls a resource and gets back a response which is stored in the database. Right now my database has one collection with four fields , id, name, timestamp and data.

I need to be able to find out which names had changes in the data field between script runs, and which did not. In pseudocode,

if(data[name][timestamp]==data[name][timestamp+1]) //data has not changed
store data in collection 1
else //data has changed between script runs for this name
store data in collection 2

Is there a query that can do this without iterating and running javascript over each item in the collection? There are millions of documents, so this would be pretty slow.

Should I create a new collection named timestamp for every time the script runs? Would that make it faster/more organized? Is there a better schema that could be used?

The script runs once a day so I won't run into a namespace limitation any time soon.

A: 

OK, this is a neat question b/c the short is basically: you will have to iterate and run javascript over each item.

The part where this gets "neat" is that this isn't really different from what an SQL solution would have to do. I mean, you're basically joining a table to itself where x.1=x.1 and y.1=y.2. Even if the relational DB can handle such a beast, it's definitely not going to be fast with millions of entries.

So the truth is, you're doing this right way. Here are the extra details I would use to make this cleaner.

  1. Ensure that you have an index on Name/Timestamp.
  2. Run a db.mycollection.find().foreach() across the data set.
  3. Foreach entry you're going to a) Perform comparison. b) Save appropriately. c) Update a flag indicating that this record has been processed.
  4. On future loads you should be able to add a query to your find. db.mycollection.find({flag:{$exists:false}}).foreach()
  5. Use db.eval() to help with speed.

The reason for the "Name/Timestamp" index is that you're going to be looking up each "successor" by "Name/Timestamp", so you want to be quick here.

The reason for the "processed" flag is that you should never have to re-run the same item. If given timestamp 'n' you find 'n+1', then that's the only 'n+1' you're going to have.

Honestly, if you're only running this once / day, it's quite likely that the speed will be just fine, especially if you only running on new records. Just assume that it's going to take several minutes.

Gates VP