ansaurus

Question

Storing relational data in a Lucene.NET index

Answer 1

+2 A:

I've had my share of problems with storing relational data in Lucene, but the one you have should be easy to fix.

I guess you tokenize the group fields, which splits each value into separate words, so a search for stuff also matches other stuff. Just add the field untokenized and it should work as expected.

Please check the following small piece of code:

internal class Program {
    private static void Main(string[] args) {
        var directory = new RAMDirectory();
        var writer = new IndexWriter(directory, new StandardAnalyzer());
        AddDocument(writer, "group", "stuff", Field.Index.UN_TOKENIZED);
        AddDocument(writer, "group", "other stuff", Field.Index.UN_TOKENIZED);
        writer.Close(true);

        var searcher = new IndexSearcher(directory);
        Hits hits = searcher.Search(new TermQuery(new Term("group", "stuff")));

        for (int i = 0; i < hits.Length(); i++) {
            Console.WriteLine(hits.Doc(i).GetField("group").StringValue());
        }
    }

    private static void AddDocument(IndexWriter writer, string name, string value, Field.Index index) {
        var document = new Document();
        document.Add(new Field(name, value, Field.Store.YES, index));
        writer.AddDocument(document);
    }
}

The sample adds two untokenized documents to the index, searches for stuff and gets one hit. If you change the code to add them tokenized, you will get two hits, which is what you see now.
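To make the difference concrete without a Lucene dependency, here is a rough sketch that models the two indexing modes. The Tokenize and CountHits helpers are hypothetical stand-ins, not Lucene.NET APIs, and the real StandardAnalyzer does more than lower-casing and splitting on spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

internal static class TokenizationSketch {
    // Hypothetical stand-in for an analyzer: lower-case and split on spaces.
    private static IEnumerable<string> Tokenize(string value) {
        return value.ToLowerInvariant()
                    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Counts the values whose indexed terms contain the exact search term.
    // Untokenized: the whole value is the single indexed term.
    // Tokenized: every word of the value is an indexed term.
    private static int CountHits(IEnumerable<string> values, string term, bool tokenized) {
        return values.Count(value =>
            (tokenized ? Tokenize(value) : new[] { value }).Contains(term));
    }

    private static void Main() {
        var groups = new[] { "stuff", "other stuff" };
        Console.WriteLine(CountHits(groups, "stuff", false)); // 1: only the exact value matches
        Console.WriteLine(CountHits(groups, "stuff", true));  // 2: "other stuff" also yields the term "stuff"
    }
}
```

An exact term lookup only sees whole values when the field is untokenized, which is why the first sample returns a single hit.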

The issue with using Lucene for relational data is that you might expect wildcard and range searches to always work. That is not really the case if the index is big, due to the way Lucene resolves those queries.
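As far as I know, Lucene versions of this era resolve a wildcard, prefix or range query by expanding it into one clause per matching indexed term, and the clause count is capped (1024 by default). The sketch below only models that behavior with plain collections; ExpandPrefix is a hypothetical helper and the generated term names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

internal static class WildcardExpansionSketch {
    // Assumed to mirror Lucene's default BooleanQuery clause limit.
    private const int MaxClauseCount = 1024;

    // Hypothetical helper: collects one clause per indexed term that starts
    // with the prefix, and fails once the clause limit would be exceeded.
    private static List<string> ExpandPrefix(IEnumerable<string> terms, string prefix) {
        var clauses = new List<string>();
        foreach (var term in terms) {
            if (!term.StartsWith(prefix, StringComparison.Ordinal)) continue;
            if (clauses.Count == MaxClauseCount)
                throw new InvalidOperationException("too many clauses");
            clauses.Add(term);
        }
        return clauses;
    }

    private static void Main() {
        // A "big" index: 5000 distinct terms, group0000 through group4999.
        var terms = new List<string>();
        for (int i = 0; i < 5000; i++) terms.Add("group" + i.ToString("D4"));

        // A narrow prefix stays under the limit.
        Console.WriteLine(ExpandPrefix(terms, "group00").Count); // 100

        // A broad prefix matches all 5000 terms and hits the limit.
        try {
            ExpandPrefix(terms, "group");
        } catch (InvalidOperationException e) {
            Console.WriteLine(e.Message); // too many clauses
        }
    }
}
```

The point is that the cost of such a query grows with the number of distinct terms it matches, not with the number of documents returned, so a query that works on a small index can fail on a large one.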

Another sample to illustrate the behavior:

    private static void Main(string[] args) {
        var directory = new RAMDirectory();
        var writer = new IndexWriter(directory, new StandardAnalyzer());

        var documentA = new Document();
        documentA.Add(new Field("name", "A", Field.Store.YES, Field.Index.UN_TOKENIZED));
        documentA.Add(new Field("group", "stuff", Field.Store.YES, Field.Index.UN_TOKENIZED));
        documentA.Add(new Field("group", "other stuff", Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.AddDocument(documentA);
        var documentB = new Document();
        documentB.Add(new Field("name", "B", Field.Store.YES, Field.Index.UN_TOKENIZED));
        documentB.Add(new Field("group", "stuff", Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.AddDocument(documentB);
        var documentC = new Document();
        documentC.Add(new Field("name", "C", Field.Store.YES, Field.Index.UN_TOKENIZED));
        documentC.Add(new Field("group", "other stuff", Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.AddDocument(documentC);

        writer.Close(true);

        var query1 = new TermQuery(new Term("group", "stuff"));
        SearchAndDisplay("First sample", directory, query1);

        var query2 = new TermQuery(new Term("group", "other stuff"));
        SearchAndDisplay("Second sample", directory, query2);

        var query3 = new BooleanQuery();
        query3.Add(new TermQuery(new Term("group", "stuff")), BooleanClause.Occur.MUST);
        query3.Add(new TermQuery(new Term("group", "other stuff")), BooleanClause.Occur.MUST);
        SearchAndDisplay("Third sample", directory, query3);
    }

    private static void SearchAndDisplay(string title, Directory directory, Query query) {
        var searcher = new IndexSearcher(directory);
        Hits hits = searcher.Search(query);
        Console.WriteLine(title);
        for (int i = 0; i < hits.Length(); i++) {
            Console.WriteLine(hits.Doc(i).GetField("name").StringValue());
        }
    }

HakonB 2009-11-19 09:31:17

Hi HakonB, thanks for the reply. I have used untokenized for a few other lookups, but the problem is that one item can be in both "Stuff" and "Other Stuff", and it needs to be found when searching for either or both. E.g.:

A in stuff and other stuff
B in just stuff
C in just other stuff

Search for stuff {A,B}
Search for other stuff {A,C}
Search for stuff and other stuff {A}

Tim Schneider 2009-11-19 10:08:44

I've added another sample that illustrates how to get the correct results, assuming I understand you correctly now :-)

HakonB 2009-11-19 12:09:00

Ah, thanks! That seems like exactly what I was after. It never occurred to me that I could actually add 2 fields of the same name to 1 document. Still thinking too much like a typical relational database I guess =)

Tim Schneider 2009-11-19 22:44:09
