Hi all,

I am testing Lucene.NET for our search requirements, and I've got a couple of questions.

We have documents in XML format. Every document contains multi-language text. The number of languages and the languages themselves vary from document to document. See the example below:

<document>This is a sample document, which is describing a <word lang="de">tisch</word>, a <word lang="en">table</word> and a <word lang="en">desk</word>.</document>

The keywords of a document are tagged with a special element and language attribute.

When I create the Lucene index, I extract the text content from the XML, plus the language/keyword pairs (I am not sure whether I need to), like this:

This is a sample document, which is describing a tisch, a table and a desk.

de - tisch
en - table
en - desk
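
For reference, the extraction step described above can be sketched in a few lines of Python using only the standard library. This is an illustration of the idea, not Lucene.NET code; the function name `extract` is made up for this example, and it assumes keywords are always direct text inside `<word lang="...">` elements, as in the sample.

```python
import xml.etree.ElementTree as ET


def extract(xml_text):
    """Return (plain_text, [(lang, keyword), ...]) for one <document> element.

    Hypothetical helper: assumes every keyword is the text of a <word>
    element that carries a lang attribute, as in the sample document.
    """
    root = ET.fromstring(xml_text)
    # itertext() yields text and tail nodes in document order, so the
    # tagged keywords stay in place inside the running text.
    plain_text = "".join(root.itertext())
    pairs = [(word.get("lang"), word.text) for word in root.iter("word")]
    return plain_text, pairs


sample = (
    '<document>This is a sample document, which is describing a '
    '<word lang="de">tisch</word>, a <word lang="en">table</word> and a '
    '<word lang="en">desk</word>.</document>'
)
text, pairs = extract(sample)
print(text)
print(pairs)
```

The plain text would go into the full-text field, and each (language, keyword) pair is what a language-aware index needs to keep.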

I don't know exactly how to build the index so that I can search, for example, for all documents that contain the word tisch in German (and not documents that contain the word tisch in other languages).

I also want to specify the sort order at runtime: results should be sorted by a user-specified language order (depending on the user interface language). For example, if we have two documents:

<document>This is a sample document, which is describing a <word lang="de">tisch</word>.</document>
<document>This is another sample document, which is describing a <word lang="en">table</word>.</document>

and a user on an English interface searches for "tisch OR table", I want to get the second document first.

Any information or advice is appreciated.

Many thanks!

+1  A: 

You have a design decision to make, where the options are:

  • Use a single index, where each document has one field for each language it uses, or
  • Use M indexes, M being the number of languages in the corpus.

If you use the multi-index approach, it is easier to restrict a search to a specific language or set of languages: just search the indexes for those languages and skip the rest. Sorting by language also becomes easier. Therefore, if you do not need an "AND" search that requires keywords from different languages to appear in the same document, I would suggest the M-index approach.
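
To make the M-index idea concrete, here is a minimal Python sketch. It is not Lucene.NET code: the class `LanguageIndexes` and its methods are hypothetical, and each "index" is just an in-memory dictionary from keyword to document ids. It only shows how restricting a search to a language, and ordering results by a user's language preference, fall out of keeping one index per language.

```python
from collections import defaultdict


class LanguageIndexes:
    """Toy stand-in for M separate indexes, one per language (not Lucene API)."""

    def __init__(self):
        # lang -> keyword -> set of document ids
        self.indexes = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, pairs):
        """Index one document given its (lang, keyword) pairs."""
        for lang, keyword in pairs:
            self.indexes[lang][keyword.lower()].add(doc_id)

    def search(self, keywords, language_order):
        """OR-search the keywords, visiting languages in preference order.

        Only the listed languages are searched, so the list both restricts
        the search and decides which hits come first.
        """
        results = []
        seen = set()
        for lang in language_order:
            index = self.indexes.get(lang, {})
            for keyword in keywords:
                for doc_id in sorted(index.get(keyword.lower(), ())):
                    if doc_id not in seen:
                        seen.add(doc_id)
                        results.append((doc_id, lang))
        return results


idx = LanguageIndexes()
idx.add(1, [("de", "tisch")])   # first sample document
idx.add(2, [("en", "table")])   # second sample document

# English interface: English hits first, then German.
english_ui = idx.search(["tisch", "table"], ["en", "de"])
# "tisch" in German only, and in English only.
german_only = idx.search(["tisch"], ["de"])
english_only = idx.search(["tisch"], ["en"])
print(english_ui, german_only, english_only)
```

With the two sample documents from the question, the English-interface search returns document 2 before document 1, and searching for "tisch" in the English index alone finds nothing. In a real setup each dictionary would be replaced by an actual index, with relevance scoring applied within each language.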

Based on your example, I assume that the part of each document that is not specially tagged is in English. If so, you can add the document text to the English index as a separate field; the other indexes need only store a document id, which will make them lighter.

Yuval F