views:

86

answers:

2

I have a data set with multiple layers of annotation over the underlying text, such as part-of-speech tags, chunks from a shallow parser, named entities, and others from various natural language processing (NLP) tools. For a sentence like The man went to the store, the annotations might look like:


Word  POS  Chunk       NER
====  ===  =====  ========
The    DT     NP    Person     
man    NN     NP    Person
went  VBD     VP         -
to     TO     PP         - 
the    DT     NP  Location
store  NN     NP  Location

I'd like to index a bunch of documents with annotations like these using Lucene and then perform searches across the different layers. An example of a simple query would be to retrieve all documents where Washington is tagged as a person. While I'm not absolutely committed to the notation, syntactically end-users might enter the query as follows:

Query: Word=Washington,NER=Person

I'd also like to do more complex queries involving the sequential order of annotations across different layers, e.g. find all the documents where there's a word tagged person followed by the words arrived at followed by a word tagged location. Such a query might look like:

Query: "NER=Person Word=arrived Word=at NER=Location"

What's a good way to approach this with Lucene? Is there any way to index and search over document fields that contain structured tokens?

Payloads

One suggestion was to try to use Lucene payloads. But, I thought payloads could only be used to adjust the rankings of documents, and that they aren't used to select what documents are returned.

The latter is important since, for some use-cases, the number of documents that contain a pattern is really what I want.

Also, only the payloads on terms that match the query are examined. This means that payloads could only help with the rankings for the first example query, Word=Washington,NER=Person, where we just want to make sure the term Washington is tagged as a Person. However, for the second example query, "NER=Person Word=arrived Word=at NER=Location", I need to check the tags on unspecified, and thus non-matching, terms.

+1  A: 

What you are looking for are payloads. Lucid Imagination has a detailed blog entry on the subject. Payloads allow you to store a byte array of metadata about individual terms. Once you have indexed your data with the payloads included, you can create a new similarity mechanism that takes your payloads into account when scoring.
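To illustrate the idea, here is a minimal plain-Java sketch. It does not call Lucene; `PayloadSketch`, `Occurrence`, and `score` are hypothetical names, and storing the NER tag as UTF-8 bytes is just one assumed encoding. It only models a byte-array payload attached to each term occurrence and a scoring function that reads the payload of occurrences matching the query term.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class PayloadSketch {
    // One term occurrence with an opaque byte-array payload.
    // Assumption: the payload holds the UTF-8 bytes of the NER tag.
    static final class Occurrence {
        final String term;
        final byte[] payload;

        Occurrence(String term, String nerTag) {
            this.term = term;
            this.payload = nerTag.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Each occurrence of `term` contributes 1.0 to the score, or `boost`
    // when its payload decodes to `wantedTag`.
    static double score(List<Occurrence> doc, String term, String wantedTag, double boost) {
        double total = 0.0;
        for (Occurrence o : doc) {
            if (!o.term.equals(term)) {
                continue; // payloads on non-matching terms are never read
            }
            String tag = new String(o.payload, StandardCharsets.UTF_8);
            total += tag.equals(wantedTag) ? boost : 1.0;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Occurrence> doc = Arrays.asList(
            new Occurrence("Washington", "Person"),
            new Occurrence("arrived", "-"));
        System.out.println(score(doc, "Washington", "Person", 2.0));
    }
}
```

Note that in this sketch a document with no occurrence of the term scores zero, while a document where the term carries a different tag still gets a nonzero score, so the payload changes the ranking rather than acting as a filter on its own.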

Eric Hauser
I thought payloads could only be used to adjust the rankings of documents. Can they also be used to actually select what documents are returned?
dmcer
Sure, payloads work with scoring, but scoring is the way that documents are retrieved. Documents can be excluded based on terms -- think of NOT queries. You may have to write your own QueryParser to handle the second item.
Eric Hauser
A: 

You can indeed search for patterns of text in Lucene using SpanQuery. Adjusting the slop distance limits how many terms apart the query terms may occur, and you can also constrain the order in which they appear.
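As a rough sketch of what ordered, slop-limited matching means for the layered query in the question, here is a plain-Java model that does not use Lucene. It rests on an assumption of mine that is not stated above: each annotation is indexed as an extra token at the same position as its word, so `NER=Person` shares a position with `Washington`. `LayeredSpanSketch` and its methods are hypothetical names; in Lucene the analogous pieces would be `SpanTermQuery` clauses inside a `SpanNearQuery` with `inOrder` set to true.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LayeredSpanSketch {
    // Builds one document: position i holds the word plus its annotation tokens.
    // Each row is {word, POS, chunk, NER}.
    static List<Set<String>> doc(String[][] rows) {
        List<Set<String>> positions = new ArrayList<>();
        for (String[] row : rows) {
            Set<String> tokens = new HashSet<>();
            tokens.add("Word=" + row[0]);
            tokens.add("POS=" + row[1]);
            tokens.add("Chunk=" + row[2]);
            tokens.add("NER=" + row[3]);
            positions.add(tokens);
        }
        return positions;
    }

    // True if some single position carries both the word and the NER tag.
    static boolean taggedAs(List<Set<String>> doc, String word, String nerTag) {
        for (Set<String> tokens : doc) {
            if (tokens.contains("Word=" + word) && tokens.contains("NER=" + nerTag)) {
                return true;
            }
        }
        return false;
    }

    // True if the clauses occur in order, skipping at most `slop` positions in total.
    static boolean matches(List<Set<String>> doc, String[] clauses, int slop) {
        if (clauses.length == 0) {
            return false;
        }
        for (int start = 0; start < doc.size(); start++) {
            if (matchFrom(doc, clauses, 0, start, slop)) {
                return true;
            }
        }
        return false;
    }

    private static boolean matchFrom(List<Set<String>> doc, String[] clauses,
                                     int clause, int pos, int slopLeft) {
        if (pos >= doc.size() || !doc.get(pos).contains(clauses[clause])) {
            return false;
        }
        if (clause == clauses.length - 1) {
            return true;
        }
        // Try every allowed gap before the next clause, spending slop as we go.
        for (int gap = 0; gap <= slopLeft; gap++) {
            if (matchFrom(doc, clauses, clause + 1, pos + 1 + gap, slopLeft - gap)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Set<String>> d = doc(new String[][] {
            {"Washington", "NNP", "NP", "Person"},
            {"arrived", "VBD", "VP", "-"},
            {"at", "IN", "PP", "-"},
            {"Boston", "NNP", "NP", "Location"}
        });
        String[] query = {"NER=Person", "Word=arrived", "Word=at", "NER=Location"};
        System.out.println(matches(d, query, 0));
        System.out.println(taggedAs(d, "Washington", "Person"));
    }
}
```

Because a non-matching document is simply not returned, this kind of matching selects documents rather than only re-ranking them, which is what the counting use-case in the question needs.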

Mikos