views:

467

answers:

3

I am writing an in-house application that holds several pieces of text information as well as a number of pieces of data about these pieces of text. These pieces of data will be held within a database (SQL Server, although this could change) in order of entry.

I'd like to be able to search for the most relevant of these pieces of information, with the most relevant of these to be at the top. I originally looked into using SQL Server Full-Text Search but it's not as flexible for my other needs as I had hoped so it seems that I'll need to develop my own solution to this.

From what I understand what is needed is an inverted index, then for the contents of said inverted index to be restored and modified based on the results of the additional information held (although for now this can be left for a later date as I just want the inverted index to index the main text from the database table/strings provided).

I've had a crack at writing this code in Java using a Hashtable with the key as the words and the value as a list of the occurrences of the word but in all honesty I'm still rather new at C# and have only really used things like DataSets and DataTables when handling information. If requested I'll upload the Java code soon once I've cleared this laptop of viruses.

If given a set of entries from a table or from a List of Strings, how could one create an inverted index in C# that will preferably save into a DataSet/DataTable?

EDIT: I forgot to mention that I have already tried Lucene and Nutch, but require my own solution as modifying Lucene to meet my needs would take far longer than writing an inverted index. I'll be handling a lot of meta-data that'll also need handling once the basic inverted index is completed, so all I require for now is a basic full-text search on one area using the inverted index. Finally, working on an inverted index isn't something I get to do every day so it'd be great to have a crack at it.

A: 

Lucene.net might be your best bet. Its a mature full text search engine using inverted indexes.

http://codeclimber.net.nz/archive/2009/09/02/lucene.net-your-first-application.aspx

mcintyre321
I should have explained in my question that I had already looked into using Lucene or replacing parts of its functionality with what I've written. Sadly Lucene isn't flexible enough for me to change what I need to meet the criteria for the information I need to hold, so I'll need to write the inverted index myself.
EnderMB
Strangely, my experience with Lucene.net is that it's *too* flexible, making should-be-simple tasks a chore. Plus it doesn't work right in medium trust. Plus the philosophy of staying true to the Java means many convenient and performant C#/.NET idioms are not used. A shame because it's awesome in many ways.
James M.
+1  A: 

Here's a rough overview of an approach I've used successfully in C# in the past:

 struct WordInfo
 {
     public int position;
     public int fieldID;
 }

 Dictionary<string,List<WordInfo>> invertedIndex=new Dictionary<string,List<WordInfo>>();

       public void BuildIndex()
       {
            foreach (int  fieldID in GetDatabaseFieldIDS())
            {    
                string textField=GetDatabaseTextFieldForID(fieldID);

                string word;

                int position=0;

                while(GetNextWord(textField,out word,ref position)==true)
                {
                     WordInfo wi=new WordInfo();

                     if (invertedIndex.TryGetValue(word,out wi)==false)
                     {
                         invertedIndex.Add(word,new List<WordInfo>());
                     }

                     wi.Position=position;
                     wi.fieldID=fieldID;
                     invertedIndex[word].Add(wi);

                }

            }
        }

Notes:

GetNextWord() iterates through the field and returns the next word and position. To implement it look at using string.IndexOf() and char character type checking methods (IsAlpha etc).

GetDatabaseTextFieldForID() and GetDatabaseFieldIDS() are self explanatory, implement as required.

Ash
Sorry for the huge delay in getting back to this answer. This looks great! The one question I have with this is how one would then write your dictionary back to a database. I have edited the question with what I mean.
EnderMB
Sorry, I've just looked over the code and realised that I could just duplicate the words should they appear in more than one document. It should be quite easy to send this to my database handling classes; once I get this implemented I'll accept this answer.
EnderMB
@Ender, glad it was helpful. Serialization is one option to save/load from the databases. Alternatively iterating through the Dictionary Keys collection and getting each corresponding value would be another.
Ash
I've managed to successfully implement it by using LINQ to get each value out, so I've accepted this answer. Just out of interest, how would you have implemented the GetNextWord() method, as I've split the full field into separate words and iterated through until they were all gone.
EnderMB
A: 

If you're looking to spin your own, the Dictionary<T> class is most likely going to be your base, like your Java hashtables. As far as what is stored as the values in the dictionary, its hard to tell based on the information you provide, but typically search algorithms use some type of Set structure so you can run unions and intersections. LINQ gives you a much of that functionality on any IEnumerable, although a specialized Set class may boost performance.

One such implementation of a Set is in the Wintellect PowerCollections. I'm not sure if that would give you any performance benefit or not over LINQ.

As far as saving to a DataSet, I'm not sure what you're envisioning. I'm not aware of anything that "automagically" writes to a DataSet. I suspect you will have to write this yourself, especially since you mentioned several times about other third-party options not being flexible enough.

Dave