I have three databases that all have the contents of several web pages in them. What would be the best way to go about searching all three and having the most relevant web page at the top of the search results?

The only approach I can think of is to break the content down by word count and/or create a complex set of search rules that give one piece of content priority over another. That might be more trouble than it's worth, so I was wondering whether anybody knows of a technique or product that would help.

A: 

If you're using MS SQL Server, its full-text search can return a ranking for you. I haven't used it myself, so you'll need to check the documentation or search online for specifics.

Tom H.
A: 

To further support Ivan's answer above, Lucene is the way to go. You haven't mentioned what platform you're on, so I'll point out that there is a .NET port of it (Lucene.Net) as well.

If you do use Lucene, there is a very good book from Manning on the subject, which I recommend you look at.

When it comes to populating your index, you have a couple of choices. For starters, you can just dump all of your text into the index and let the engine search on it. However, I'd recommend adding fixed fields to your index, which lets you support things such as partitioned searches or searches against those fields only.

To explain, let's say you have a field for the website. You can then partition your index by restricting a search to those documents that have a given website in that field.
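
To make the idea of a fixed field concrete, here is a rough sketch in plain Python. It is not Lucene and does not use Lucene's API; it is a toy in-memory index with a simple TF-IDF style score, just to show pages from all three databases going into one index with a `site` field that a search can be restricted to. The class name `TinyIndex`, the document ids, and the sample rows are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy index with one fixed field (site) and one free-text field (body)."""

    def __init__(self):
        self.docs = {}                     # doc_id -> {"site": ..., "body": ...}
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}

    def add(self, doc_id, site, body):
        self.docs[doc_id] = {"site": site, "body": body}
        for term, tf in Counter(tokenize(body)).items():
            self.postings[term][doc_id] = tf

    def search(self, query, site=None):
        """Return (doc_id, score) pairs, best first, optionally for one site only."""
        scores = defaultdict(float)
        n_docs = len(self.docs)
        for term in tokenize(query):
            matches = self.postings.get(term, {})
            if not matches:
                continue
            # Terms that occur in fewer documents get a higher weight.
            idf = math.log(1 + n_docs / len(matches))
            for doc_id, tf in matches.items():
                # The fixed field acts as a partition filter.
                if site is not None and self.docs[doc_id]["site"] != site:
                    continue
                scores[doc_id] += tf * idf
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = TinyIndex()
# Hypothetical rows pulled from the three databases.
index.add("db1:1", "example.com", "Lucene search tutorial: search ranking with Lucene")
index.add("db2:7", "example.org", "Cooking pasta at home")
index.add("db3:3", "example.com", "A short note that mentions search once")

all_hits = index.search("lucene search")                       # across everything
site_hits = index.search("pasta search", site="example.org")   # one partition only
```

A real engine does the scoring and filtering far better than this, but the shape is the same: one index fed from all three sources, with the website stored as its own field so a query can be limited to it.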

The other approach is to extract points of interest from each document and allow searches on those without searching the entire index entry. Your mileage may vary with this, as the Lucene engine is very well written; it may simply let you group your searches into more logical units, which helps with your solution.
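
As a rough illustration of extracting points of interest, the sketch below pulls the title and the headings out of an HTML page using Python's standard `html.parser`, so they could be stored as separate fields next to the full text. Treating the title and `h1`/`h2` headings as the "points of interest" is my assumption for the example; the class and function names are hypothetical, and what counts as interesting will depend on your pages.

```python
from html.parser import HTMLParser


class PointsOfInterest(HTMLParser):
    """Collect the text inside <title> and <h1>/<h2> tags of one HTML page."""

    WANTED = {"title", "h1", "h2"}

    def __init__(self):
        super().__init__()
        self.current = None  # the wanted tag we are currently inside, if any
        self.fields = {"title": [], "headings": []}

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if self.current and text:
            key = "title" if self.current == "title" else "headings"
            self.fields[key].append(text)


def extract_fields(html):
    """Return a dict of field name -> extracted text for one page."""
    parser = PointsOfInterest()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.fields.items()}


page = (
    "<html><head><title>Lucene basics</title></head><body>"
    "<h1>Indexing</h1><p>Long body text about many other things.</p>"
    "<h2>Searching</h2></body></html>"
)
fields = extract_fields(page)
```

Each extracted field can then be indexed under its own name, so a search can target titles or headings only instead of the whole page text.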

I've done this myself and it helps when answering management questions about what exactly is searched and indexed.

HTH!

Simon