I want to store some genomic positions (chromosome,position) using MongoDB.

something like:

{
chrom:"chr2",
position:100,
name:"rs25"
}

I want to be able to quickly find all the records in a given segment (chrom, [posStart - posEnd]). What would be the best key/_id to use?

a (chrom, position) object?

db.snps.save({_id:{chrom:"chr2",position:100},name:"rs25"})

a padded string?

db.snps.save({_id:"chr02:00000000100",chrom:"chr2",position:100,name:"rs25"})

an auto-generated id with an index on chrom and position?

db.snps.save({chrom:"chr2",position:100,name:"rs25"})

other ?

???

thanks for your suggestion(s)

Pierre

PS: this question was cross-posted on biostar: http://biostar.stackexchange.com/questions/2519

+1  A: 

I believe the two-column index will offer the fastest access path, because it will be the most compact index.

However, it will be an additional index (you already have the _id index, which you would not be using), so the first two options are nice in that they eliminate the extra index.
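As a rough sketch of the two-column approach, assuming the `snps` collection and the `chrom`/`position` fields from the question: the helper name `segmentFilter` is made up for illustration, and the shell calls are shown as comments so the snippet also runs outside the mongo shell.

```javascript
// Hypothetical helper (not part of MongoDB): builds the filter that selects
// every record on one chromosome whose position lies in [posStart, posEnd].
function segmentFilter(chrom, posStart, posEnd) {
  return {
    chrom: chrom,                                  // equality on the first index field
    position: { $gte: posStart, $lte: posEnd }     // range on the second index field
  };
}

// Assumed usage in the mongo shell:
//   db.snps.createIndex({ chrom: 1, position: 1 });  // ensureIndex on older shells
//   db.snps.find(segmentFilter("chr2", 50, 150));
```

Putting `chrom` first in the index lets the equality match narrow the scan to one chromosome before the range on `position` is applied.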

The padded string is shorter than the complex object, and a shorter key means less memory use and therefore a faster scan. I'd only go for the complex object if flattening/padding is not possible. Also, since the field names of a complex object key need to be encoded into every index entry (not the case with the other indexes), choose shorter field names (c and p).
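For the padded string, something along these lines could build the _id and the bounds of a range query. This is only a sketch: the helper names (`padLeft`, `paddedId`, `segmentIdFilter`) and the fixed widths (2 digits for numeric chromosomes, 11 for positions) are assumptions, not anything MongoDB prescribes.

```javascript
var CHROM_WIDTH = 2;      // assumed width for numeric chromosome names
var POSITION_WIDTH = 11;  // assumed width for positions

// Left-pad a value with zeros to a fixed width.
function padLeft(value, width) {
  var s = String(value);
  while (s.length < width) s = "0" + s;
  return s;
}

// Hypothetical helper: ("chr2", 100) -> "chr02:" followed by the padded position.
// Non-numeric names such as "chrX" are left unpadded.
function paddedId(chrom, position) {
  var name = chrom.replace(/^chr/, "");
  var label = /^\d+$/.test(name) ? padLeft(name, CHROM_WIDTH) : name;
  return "chr" + label + ":" + padLeft(position, POSITION_WIDTH);
}

// Hypothetical helper: range filter on _id for one chromosome segment.
function segmentIdFilter(chrom, posStart, posEnd) {
  return { _id: { $gte: paddedId(chrom, posStart), $lte: paddedId(chrom, posEnd) } };
}

// Assumed usage in the mongo shell:
//   db.snps.save({ _id: paddedId("chr2", 100), name: "rs25" });
//   db.snps.find(segmentIdFilter("chr2", 50, 150));
```

The fixed width is what makes this work: strings are compared character by character, so without padding "99" would sort after "100".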

So, I'd go for the two-column index (if you do not mind "wasting" the _id index) or the padded string. You could even go for padded binary (saving a few bytes on encoding the integer), but that is probably not worth the hassle.

Thilo
thanks, I'm going to validate this interesting answer.
Pierre