I'm trying to create a search engine for all literature (books, articles, etc.), music, and videos relating to a particular spiritual group. When a keyword is entered, I want to display a link to every PDF article in which the keyword appears, and also to every music and video file tagged with that keyword. The user should be able to filter the results by information such as author/artist, place, date/time, etc. When the user clicks one of the result links (a book name, for instance), they are taken to another page that displays a snippet from that book for every place the keyword is found.

I thought of using the Lucene library (or Searcharoo) to implement my PDF search, but I also need a database to tag all the other information so that results can be filtered by author/artist information, etc. So I was thinking of having tables for Text, Music, and Videos, each with a field containing the path to the file. When a keyword is entered, I need to search the DB for music and video files and also search the PDFs. When a filter is applied, the music and video search is easy, but limiting the text search based on the filters is getting confusing.

Is my approach correct? Are there better ways to do this? Since the search content is limited only to the spiritual group, there is not an infinite number of items to search. I'd say about 100-500 books and 1000-5000 songs.

+3  A: 

Yes, there is a better approach. Try Solr and in particular check out facets. It will save you a lot of trouble.
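As a rough illustration of what a faceted request looks like, here is a minimal Python sketch that builds a Solr select URL. The `q`, `fq`, `facet`, `facet.field`, and `wt` parameters are standard Solr query parameters; the host, the core name `library`, and the field names `author` and `media_type` are assumptions made up for this example.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, keyword, facet_fields, filters=None):
    """Build a Solr select URL with faceting and optional filter queries."""
    params = [
        ("q", keyword),      # the user's keyword
        ("facet", "true"),   # turn faceting on
        ("wt", "json"),      # ask for a JSON response
    ]
    # One facet.field parameter per field we want counts for.
    for field in facet_fields:
        params.append(("facet.field", field))
    # One fq (filter query) per filter the user has selected.
    # Note: values containing double quotes would need escaping first.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and field names, for illustration only.
url = build_solr_query(
    "http://localhost:8983/solr/library",
    "meditation",
    facet_fields=["author", "media_type"],
    filters={"media_type": "book"},
)
print(url)
```

The facet counts in the response can drive the filter links in the UI, and each link the user clicks becomes another `fq` parameter on the next request.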

jro
Multi-faceted search is a good option if the user wants to do things like filter search results by author, content type, file size, etc. However, to my knowledge, Solr can only be installed as a web-service, so it would take a little longer to get up and running, and installing the software on client machines could become a configuration nightmare.
ph0enix
Correct, Solr provides a web service interface to a Lucene search index. And yes, facets can be used for filtering, but they also tell you about the metadata of your search objects. Not sure what to make of "installing software on client machines", as Solr is a server-based implementation. No client-side stuff is involved, other than the application that exposes the search.
jro
By "client-side", I just meant the case where this is built as an "off the shelf" application (e.g., users could build their own index). Based on the question, it doesn't appear that way, but if it's a possibility for the future, then it's certainly something worth considering.
ph0enix
+1  A: 

You could try using MS Search Server Express Edition. One of its major benefits is that it is free.

http://www.microsoft.com/enterprisesearch/en/us/search-server-express.aspx#none

Shiraz Bhaiji
Thanks for the tip. I looked into it briefly, but I feel the full text search might be easier.
Prabhu
+1  A: 

If you definitely want to go the database route then you should use SQL Server with Full Text Search enabled. You can use this with Express versions, too. You can then store and search the contents of PDFs very easily (so long as you install the free Adobe PDF iFilter).

Dan Diplo
Thanks. I think this might work. Is there a way to get "snippets" of text from everywhere the keyword is found in a document using a SQL query?
Prabhu
Sorry, not that I know of easily (but I've never tried).
Dan Diplo
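On the snippet question above: if the SQL query cannot return snippets directly, one option is to build them in application code once the matching document's text has been retrieved. The following is only a sketch under that assumption. It takes plain text that has already been extracted from the PDF, and the function name `find_snippets` and the `context` parameter are invented for this example.

```python
import re


def find_snippets(text, keyword, context=40):
    """Return one snippet per occurrence of keyword (case-insensitive),
    with up to `context` characters of surrounding text on each side."""
    snippets = []
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    for match in pattern.finditer(text):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        # Mark snippets that were cut out of the middle of the text.
        prefix = "..." if start > 0 else ""
        suffix = "..." if end < len(text) else ""
        snippets.append(prefix + text[start:end].strip() + suffix)
    return snippets


# Made-up sample text, for illustration only.
sample = "Peace begins within. Those who seek peace must first be still."
for snippet in find_snippets(sample, "peace", context=10):
    print(snippet)
```

This cuts on character counts, so snippets may begin or end in the middle of a word; trimming to word boundaries would be a natural refinement.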
+4  A: 

Lucene is a great way to get up and running quickly without too much effort, and it offers several extension points for tailoring the indexing and searching functionality to your needs. Note that Lucene itself indexes plain text: for common file types such as HTML/XML, PDF, and MS Word documents, the text is usually extracted first with a companion library (for example, PDFBox or Apache Tika) and then passed to Lucene.

It provides the ability to use a variety of Fields, and they don't necessarily have to be uniform across all Documents (in other words, music files might have different attributes than text-based content, such as artist, title, length, etc.), which is great for storing different types of content.

Not knowing the exact implementation of what you're working on, this may or may not be feasible, but for tagging and other related features, you might also consider using a database, such as MySQL or SQL Server, side by side with the Lucene index. Use the Lucene index for full-text search, then once you have a result set, go to the database to extract all the relational content. Our company has done this before, and it's actually not as big of a headache as it sounds.

NOTE: If you decide to go this route, BE CAREFUL, as the "unique id" provided by Lucene is highly volatile (it changes every time the index is optimized), so you will want to store the actual id (the primary key in the database) as a separate field on the Document.
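To make the two-step flow concrete, here is a minimal Python sketch. It is not Lucene code: a plain list of dicts stands in for the full-text index, and SQLite stands in for MySQL or SQL Server. The table layout, the `db_id` field name, and the sample rows are all assumptions made up for illustration. The point it shows is that each indexed document carries the database primary key as its own field, and that filters such as author are applied on the database side.

```python
import sqlite3

# Relational side: metadata keyed by the database primary key.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT, author TEXT, kind TEXT)"
)
db.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?)",
    [
        (1, "Book One", "Author A", "book"),
        (2, "Book Two", "Author B", "book"),
        (3, "Song One", "Author A", "music"),
    ],
)

# Stand-in for the full-text index. Each "document" stores db_id as a field,
# so the index's own internal document numbers are never relied on.
index_docs = [
    {"db_id": 1, "content": "a chapter about peace and stillness"},
    {"db_id": 2, "content": "a chapter about peace and service"},
    {"db_id": 3, "content": "lyrics about peace"},
]


def full_text_search(keyword):
    """Return the stored db_id of every document containing the keyword."""
    needle = keyword.lower()
    return [d["db_id"] for d in index_docs if needle in d["content"].lower()]


def search(keyword, author=None):
    """Full-text search first, then fetch and filter metadata in the database."""
    ids = full_text_search(keyword)
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    sql = f"SELECT id, title, author, kind FROM items WHERE id IN ({placeholders})"
    params = list(ids)
    if author is not None:
        sql += " AND author = ?"
        params.append(author)
    sql += " ORDER BY id"
    return db.execute(sql, params).fetchall()


print(search("peace"))
print(search("peace", author="Author A"))
```

With a collection of a few hundred books and a few thousand songs, passing the matching ids to the database in an `IN` clause like this should stay manageable; for much larger result sets the filter fields could instead be indexed alongside the text.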

Another added benefit: if you are set on using C#/.NET, there is a port called Lucene.Net, which is written entirely in C#. The downside here is that you're a few months behind on all the latest features, but if you really need them, you can always check out the Java source and implement the required updates manually.

ph0enix
So say I have a bunch of files in a folder that I want searched. How do I add the attributes (artist, etc.) to each file?
Prabhu
If you're using files, you can probably categorize them properly by file extension, then depending on the extension of each file, you just need to write code to properly build the Lucene Document that you want to add to your index. Hope that helps!
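Roughly, the idea looks like the sketch below, written in Python with a plain dict standing in for a Lucene Document. The extension sets, the field names, and the `metadata` argument (which could come from your database, from file tags, or from manual entry) are assumptions for illustration, not part of any Lucene API.

```python
from pathlib import Path

# Hypothetical mapping from file extension to content type.
TEXT_EXTS = {".pdf", ".txt"}
MUSIC_EXTS = {".mp3", ".wav"}
VIDEO_EXTS = {".mp4", ".avi"}


def build_document(path, metadata=None):
    """Return a dict of fields for one file, or None for unsupported types."""
    path = Path(path)
    ext = path.suffix.lower()
    if ext in TEXT_EXTS:
        kind = "text"
    elif ext in MUSIC_EXTS:
        kind = "music"
    elif ext in VIDEO_EXTS:
        kind = "video"
    else:
        return None
    # Fields shared by every type of content.
    doc = {"path": str(path), "kind": kind, "title": path.stem}
    # Type-specific attributes (artist, author, place, ...) supplied by the caller.
    doc.update(metadata or {})
    return doc


print(build_document("hymn.mp3", {"artist": "Example Artist"}))
print(build_document("notes.xyz"))
```

Because the fields do not have to be uniform across documents, a music file can carry an `artist` field while a PDF carries an `author` field, and both can live in the same index.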
ph0enix