views: 521 · answers: 2

hey guys

OK, I'm totally new to Solr and Lucene, but I've got Solr running out of the box under Tomcat 6.x and have just gone over some of the basic wiki entries.

I have a few questions, and require some suggestions too.

  1. Solr can index data in files (XML, CSV) and it can also index DBs. Can you also just point it at a URI/domain and have it index a website the way Google would?

  2. If I have a website with "Pages" data ("Page Name", "Page Content", etc.) and "Products" data ("Product Name", "SKU", etc.), do I need two different schema.xml files? And if so, does that mean two different instances of Solr?

Finally, if you have a project with a large relational, normalized database, what would you say is the best approach from the three options below?

  1. Have a middleware service running in the background, which mines the DB and manually creates the relevant XML files to send to Solr

  2. Have Solr index the DB directly. In this case, would it be best to just point Solr at views, which would abstract away all the table relationships?

  3. Any other options I'm unaware of?

Context: we're running in a Windows 2003 environment with .NET 3.5 and SQL Server 2005/2008.

cheers!

+4  A: 
  1. No, you need a crawler for that, e.g. Nutch.
  2. Yes, you want two separate indexes (= two schema.xml files) since the datasets don't seem to be related. This doesn't mean two instances of Solr: you can manage the two indexes with cores.
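A sketch of what that multi-core setup could look like in solr.xml (the core names "pages" and "products" are just assumptions for this example; each instanceDir gets its own conf/schema.xml and solrconfig.xml):

```xml
<!-- solr.xml: one Solr instance hosting two independent indexes.
     Core names and instanceDirs here are illustrative. -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="pages" instanceDir="pages" />
    <core name="products" instanceDir="products" />
  </cores>
</solr>
```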

As for populating the Solr index, it depends on your particular project; for example, can it tolerate stale data or does it have to be absolutely fresh?

Other options to index data include:

  • Database triggers
  • If you're using some sort of ORM, use its interception capabilities. For example, you can use NHibernate events to update the index on update, insert, or delete. If you use NHibernate and SolrNet, this is taken care of automatically
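For the "middleware builds XML" route mentioned in the question, the update messages you'd POST to Solr's /update handler have this shape (the core URL and field names are hypothetical and must match your schema.xml):

```xml
<!-- POST to e.g. http://localhost:8983/solr/products/update
     Field names are placeholders for this sketch. -->
<add>
  <doc>
    <field name="sku">SKU-1234</field>
    <field name="name">Example product</field>
  </doc>
</add>
```

A separate `<commit/>` message (or a delete-by-id/delete-by-query message) is then posted to make the changes visible to searchers.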
Mauricio Scheffer
+1 Thanks Mauricio, this is really useful. I wonder if you could expand a little on one point, possibly two. In terms of stale vs. fresh data, it doesn't matter which data source I use, does it? Only how often I commit changes... and all commits (adds/updates/deletes) have to be done manually, right? As for SolrNet, does that mean I don't need to worry about communicating manually with Solr at all? Thanks again
andy
About data freshness: it depends on the *user* (consumer) of the data. If the consumer needs to *always* see up-to-date data, that rules out offline/background indexing methods and you'd have to go with something more reactive, like triggers or ORM interception. Of course, when indexing webpages you don't get any "triggers"; your only option is a crawler. Yes, SolrNet handles .NET <-> Solr communication.
Mauricio Scheffer
@mauricio: thanks man. We use a custom CMS to build our site. So, would it be an intelligent decision, do you think, to just commit updates/deletes to Solr via XML whenever Pages/Products are edited in the CMS? Also, we don't use NHibernate, so I guess there are no benefits to SolrNet. Thanks again, this is really helpful
andy
NHibernate integration is only one of the features of SolrNet. Its main purpose is handling all the Solr XML/HTTP communication and providing a .NET interface for all Solr operations.
Mauricio Scheffer
Thanks Mauricio, I think I will use SolrNet; thanks for making it open source. Does SolrNet take care of writing the schema for Solr? If so, how? If not, then I have to write the schema myself? Cheers!
andy
Nope, you still have to write the schema and configuration yourself. There are lots of server-only settings and tweaks; use the Solr wiki (http://wiki.apache.org/solr/) or Eric's book (http://www.packtpub.com/solr-1-4-enterprise-search-server/book) as a reference.
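To give a feel for what writing the schema involves, a minimal schema.xml for a hypothetical "products" core might start out like this (field names and analysis chain are only an illustration; the example schema that ships with Solr is the usual starting point):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal illustrative schema for a hypothetical "products" core -->
<schema name="products" version="1.1">
  <types>
    <fieldType name="string" class="solr.StrField" />
    <fieldType name="text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="sku"  type="string" indexed="true" stored="true" required="true" />
    <field name="name" type="text"   indexed="true" stored="true" />
  </fields>
  <uniqueKey>sku</uniqueKey>
  <defaultSearchField>name</defaultSearchField>
</schema>
```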
Mauricio Scheffer
+1  A: 

I think Mauricio is dead on with his advice. The only point I would add concerns deciding between a "middleware" indexer and using the database directly. If your database (or the views?) maps very closely to what a good Solr schema wants, then DIH is great. But if you are indexing from multiple sources of data, or if you have to munge the data in your database to meet what Solr would like, then having a dedicated middleware indexer is better.
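For the DIH route, the database-to-schema mapping lives in a data-config.xml along these lines (the JDBC URL, credentials, view name, and columns are all placeholders for this sketch):

```xml
<!-- data-config.xml: DataImportHandler pulling from a SQL Server view.
     Connection details and column names are placeholders. -->
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost;databaseName=Shop"
              user="solr" password="secret" />
  <document>
    <entity name="product" query="SELECT SKU, ProductName FROM vw_Products">
      <field column="SKU"         name="sku" />
      <field column="ProductName" name="name" />
    </entity>
  </document>
</dataConfig>
```

Pointing DIH at a view like this is one way to get the "abstract away the table relationships" effect asked about in the question.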

Eric Pugh
And by "dead on", I mean very accurate! Just in case anyone was confused!
Eric Pugh
Cool, thanks for the extra advice Eric. I was just wondering if having the middleware was totally stupid, but I think it makes sense in an environment where, as you say, the data sources are varied. Cheers! +1
andy