Hello,

I need to build an indexed database of all the domains in the world.

Example:

domain1.com ips: 1.1.1.1,2.2.2.2,3.3.3.3 

domain2.com ips: 1.1.1.1,4.4.4.4

requirements:

  1. fast insertions

  2. fast "selects"

  3. an index on IPs: I need a fast "select" for all domains on a given IP, e.g. 1.1.1.1

I built it in Berkeley DB, and it seems fine (please pay attention to the "MANY_TO_MANY" annotation):

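// Nested inside an outer class (hence "static"); annotations come from
// com.sleepycat.persist.model, MANY_TO_MANY from its Relationship enum.
@Entity
public static class DomainInfo {
  // Primary key: the domain name itself.
  @PrimaryKey
  String domain;

  // Secondary key: each IP may map to many domains, and each domain to many IPs.
  @SecondaryKey(relate=MANY_TO_MANY)
  Set<String> IP = new HashSet<String>();
}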

Can I build something like this in Cassandra?

Thanks a lot!

+1  A: 

Yes, it's possible. With Cassandra you get fast inserts more or less for free. Fast "selects"? As long as you design your column families with appropriate indexes, you will have fast "selects" too.

Index on IPs: fine, just create a second column family that serves as the index. Or wait for the upcoming 0.7 release (a release candidate is due very soon, and betas are already available) and use its built-in support for secondary indexes.
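
To make the "second column family" idea concrete, here is a minimal in-memory sketch in Java. It is not Cassandra client code: two plain maps stand in for the two column families, and the class and method names (DomainIndexSketch, put, remove, domainsOn, ipsOf) are made up for illustration. The only point is that every write updates both the forward and the reverse structure. Keeping each domain as its own entry under an IP, instead of one comma-separated string, is an assumption of this sketch, chosen so that adding or removing a single domain touches a single entry.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for two column families; not a Cassandra API.
final class DomainIndexSketch {
    // Forward "column family": domain -> IPs.
    private final Map<String, Set<String>> domainToIps = new HashMap<>();
    // Reverse "column family": IP -> domains, one entry per domain.
    private final Map<String, Set<String>> ipToDomains = new HashMap<>();

    // Insert or overwrite a domain, keeping the reverse index in sync.
    void put(String domain, Collection<String> ips) {
        remove(domain); // drop reverse entries left over from an earlier version
        Set<String> copy = new TreeSet<>(ips);
        domainToIps.put(domain, copy);
        for (String ip : copy) {
            ipToDomains.computeIfAbsent(ip, k -> new TreeSet<>()).add(domain);
        }
    }

    // Delete a domain from both structures.
    void remove(String domain) {
        Set<String> oldIps = domainToIps.remove(domain);
        if (oldIps == null) {
            return; // unknown domain, nothing to clean up
        }
        for (String ip : oldIps) {
            Set<String> domains = ipToDomains.get(ip);
            domains.remove(domain);
            if (domains.isEmpty()) {
                ipToDomains.remove(ip);
            }
        }
    }

    // The "select all domains on this IP" query.
    Set<String> domainsOn(String ip) {
        return Collections.unmodifiableSet(
                ipToDomains.getOrDefault(ip, Collections.<String>emptySet()));
    }

    // The plain lookup by domain.
    Set<String> ipsOf(String domain) {
        return Collections.unmodifiableSet(
                domainToIps.getOrDefault(domain, Collections.<String>emptySet()));
    }

    public static void main(String[] args) {
        DomainIndexSketch index = new DomainIndexSketch();
        index.put("domain1.com", Arrays.asList("1.1.1.1", "2.2.2.2", "3.3.3.3"));
        index.put("domain2.com", Arrays.asList("1.1.1.1", "4.4.4.4"));
        System.out.println(index.domainsOn("1.1.1.1")); // [domain1.com, domain2.com]
        index.remove("domain1.com");
        System.out.println(index.domainsOn("1.1.1.1")); // [domain2.com]
    }
}
```

In a real cluster the two updates would be two separate writes, so the application has to cope with one of them failing; the sketch ignores that.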

Schildmeijer
A: 

As an example, you could build a lookup model with these two column families:

DomainLookup = {
  'domain1.com' : {
    'ips' : '1.1.1.1,2.2.2.2,3.3.3.3'
  },
  'domain2.com' : {
    'ips' : '1.1.1.1,4.4.4.4'
  }
}

ReverseLookup = {
  '1.1.1.1' : {
    'domains' : 'domain1.com,domain2.com'
  },
  '2.2.2.2' : {
    'domains' : 'domain1.com'
  },
  '3.3.3.3' : {
    'domains' : 'domain1.com'
  },
  '4.4.4.4' : {
    'domains' : 'domain2.com'
  }
}

This example is probably not ideal for your case. But remember that Cassandra is optimized for writes, so you can afford to maintain whatever other indexes best fit your query scenario. Plus, Cassandra adopts Dynamo's fully distributed design, which makes it easier to scale. It is self-managed, meaning you can add a new machine to your Cassandra cluster and it will automatically balance the storage and load. One thing you need to pay attention to is the choice between the Random and Order Preserving partitioners.
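
To illustrate what the partitioner choice means, here is a rough Java sketch. The Random partitioner places rows by an MD5 hash of the row key, while the Order Preserving partitioner keeps rows in key order, which allows range scans over keys but can load nodes unevenly. Everything else below is an assumption for illustration: the class PartitionerSketch and its methods are made up, and picking a node with "token mod node count" is a simplification of Cassandra's token ring, not how it actually assigns ranges.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of the two placement strategies; not Cassandra code.
final class PartitionerSketch {
    // Hash-based token: the MD5 digest of the key read as a non-negative integer.
    static BigInteger md5Token(String key) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return new BigInteger(1, md.digest(key.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Simplified node choice for hashed keys (assumption: token mod node count).
    static int hashedNode(String key, int nodeCount) {
        return md5Token(key).mod(BigInteger.valueOf(nodeCount)).intValue();
    }

    // Order preserving: rows are laid out in plain key order.
    static List<String> orderPreserving(Collection<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        return sorted;
    }

    // Hash-based: rows are laid out in token order, which scatters neighbouring keys.
    static List<String> hashed(Collection<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.comparing(PartitionerSketch::md5Token));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("domain2.com", "domain1.com", "example.org");
        System.out.println("key order:   " + orderPreserving(keys));
        System.out.println("token order: " + hashed(keys));
        for (String key : keys) {
            System.out.println(key + " -> node " + hashedNode(key, 4));
        }
    }
}
```

For point lookups by domain or by IP, as in this question, the hashed layout is enough; key order only matters if you need to scan ranges of keys.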

Sheng Chien
Thank you so much for answering. The problem is that some hosting providers' IPs host thousands of domains (e.g. parked domains), so insertions into and deletions from the reverse index will be painful. :( Do you know whether the secondary index Schildmeijer proposed can solve this (like the Berkeley DB 'Set')? Thanks again.
aa aaa
Starting with version 0.7 there will be new per-column settings, "index_name" and "index_type", to support secondary indexes. It seems that, once configured, it will create the inverted index for you. So you might only need to maintain the domain lookup, with an index on the ip column. Also, Cassandra is replacing the old configuration with cassandra.yaml. I believe this should save you some of the trouble, if not all.
Sheng Chien
Another indexing solution is Lucandra, a Cassandra-based backend for Lucene, the open-source search engine. But you will still need to maintain the index. See this blog post: http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/
Sheng Chien