I'm going to build a search engine on Solr, with Nutch as the crawler. I have to index about 13 million documents. I have 3 servers for this job:

  1. 4-core Xeon 3 GHz, 20 GB RAM, 1.5 TB SATA
  2. 2×4-core Xeon 3 GHz, 16 GB RAM, 500 GB IDE
  3. 2×4-core Xeon 3 GHz, 16 GB RAM, 500 GB IDE

I could use one of the servers as a master for crawling and indexing and the other two as slaves for searching, or I could use one for searching and the other two for indexing with two shards. What architecture can you recommend? Should I use sharding? If so, how many shards, and which of the servers should I use for what?
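For reference, the two-shard option means every search goes through Solr's distributed-search `shards` parameter. A minimal sketch of building such a query URL, assuming hypothetical hostnames `solr1` and `solr2` for the two shard servers:

```python
from urllib.parse import urlencode

# Hypothetical shard hosts -- substitute the real server addresses.
SHARDS = [
    "solr1:8983/solr",  # shard 1 (e.g. server 2)
    "solr2:8983/solr",  # shard 2 (e.g. server 3)
]

def sharded_query_url(q, rows=10):
    """Build a Solr distributed-search URL.

    Solr fans the query out to every shard listed in the `shards`
    parameter and merges the partial results before responding.
    """
    params = urlencode({
        "q": q,
        "rows": rows,
        "shards": ",".join(SHARDS),
    })
    return f"http://{SHARDS[0]}/select?{params}"

print(sharded_query_url("nutch crawler"))
```

Note that whichever node receives the query does the merging, so with two shards you would still want that node sized for search load.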

+1  A: 

I'd try both. Read up on what the HathiTrust has done. I would start out with a single master and two slaves; that is the simplest approach. And with only 13 million documents, I am guessing the load will be on the indexing/crawling side. But 13 million pages crawled over a month works out to only ~300 pages a minute, so I think your Nutch crawler will be the bottleneck.
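A master-plus-two-slaves setup maps onto Solr's HTTP ReplicationHandler. A sketch of the `solrconfig.xml` fragments involved, assuming a hypothetical master hostname `solr-master`; poll interval and config-file list are illustrative, not prescribed:

```xml
<!-- On the master (indexing) server -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- On each slave (search) server -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://solr-master:8983/solr/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>
```

The slaves pull index updates from the master on each poll, so crawling and indexing load stays off the search boxes.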

Eric Pugh
A: 

I'd tend towards using two servers for search and one for indexing.

As a general rule you want to keep search as fast as possible, at the expense of indexing performance. Also, two search servers gives you some natural redundancy.

I'd use the third server for searching, too, whenever it isn't actually indexing. (13 million docs isn't a huge index, and indexing it shouldn't take very long compared to how often you'll reindex.)

Nick Lothian