Hi All,

Using Hibernate Search annotations (mostly just @Field(index = Index.TOKENIZED)) I've indexed a number of fields related to a persisted class of mine called Compound. I've set up text search over all the indexed fields using the MultiFieldQueryParser, which has so far worked fine.

Among the fields indexed and searchable is a field called compoundName, with sample values:

  • 3-Hydroxyflavone
  • 6,4'-Dihydroxyflavone

When I search for either of these values in full, the related Compound instances are returned. However, problems occur when I use a partial name and introduce wildcards:

  • searching for 3-Hydroxyflav* still gives the correct hit, but
  • searching for 6,4'-Dihydroxyflav* fails to find anything.

As I'm quite new to Lucene / Hibernate Search, I'm not quite sure where to look at this point. I think it might have something to do with the ' present in the second query, but I don't know how to proceed. Should I look into Tokenizers / Analyzers / QueryParsers or something else entirely?

Or can anyone tell me how I can get the second wildcard search to match, preferably without breaking the MultiField-search behavior?

I'm using Hibernate-Search 3.1.0.GA & Lucene-core 2.9.3.


Some relevant code bits to illustrate my current approach:

Relevant parts of the indexed Compound class:

@Entity
@Indexed
@Data
@EqualsAndHashCode(callSuper = false, of = { "inchikey" })
public class Compound extends DomainObject {
    @NaturalId
    @NotEmpty
    @Length(max = 30)
    @Field(index = Index.TOKENIZED)
    private String                  inchikey;

    @ManyToOne
    @IndexedEmbedded
    private ChemicalClass           chemicalClass;

    @Field(index = Index.TOKENIZED)
    private String                  commonName;
...
}

How I currently search over the indexed fields:

String[] searchfields = Compound.getSearchfields();
MultiFieldQueryParser parser = 
    new MultiFieldQueryParser(Version.LUCENE_29, searchfields, new StandardAnalyzer(Version.LUCENE_29));
FullTextSession fullTextSession = Search.getFullTextSession(getSession());
FullTextQuery fullTextQuery = 
    fullTextSession.createFullTextQuery(parser.parse("searchterms"), Compound.class);
List<Compound> hits = fullTextQuery.list();
+1  A: 

I think your problem is a combination of analyzer and query language problems. It is hard to say what exactly causes it. To find out, I recommend you inspect your index using the Lucene index tool Luke.

Since your Hibernate Search configuration does not specify a custom analyzer, the default, StandardAnalyzer, is used. This is consistent with the fact that you use StandardAnalyzer in the constructor of MultiFieldQueryParser (always use the same analyzer for indexing and searching!). What I am not so sure of is how "6,4'-Dihydroxyflavone" gets tokenized by StandardAnalyzer. That is the first thing you have to find out. For example, the javadoc says:

Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.

It might be that you need to write your own analyzer which tokenizes your chemical names the way you need for your use cases.

Next, the query parser. Make sure you understand the Lucene query syntax. Some characters have special meaning, for example '-'. It could be that your query is parsed the wrong way.

Either way, the first step is to find out how your chemical names get tokenized. Hope that helps.
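To rule out the query syntax as the culprit, the special characters in the user's input can be backslash-escaped before the string is handed to the parser. Lucene's QueryParser has a static escape(String) method for this, but it escapes '*' as well, so it cannot be applied unchanged to a wildcard term. Below is a minimal plain-Java sketch that escapes everything except a trailing '*'. The class and method names are hypothetical, and the character list is my reading of the query syntax documentation, so check it against your Lucene version. Note that escaping only affects parsing; it does not change how the indexed text was tokenized.

```java
public class QueryEscapeSketch {

    // Characters assumed to have special meaning in the classic Lucene query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Hypothetical helper: backslash-escapes special characters,
    // but keeps a single trailing '*' so the term stays a prefix query.
    static String escapeKeepingTrailingWildcard(String input) {
        boolean prefixQuery = input.endsWith("*");
        String body = prefixQuery ? input.substring(0, input.length() - 1) : input;

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        if (prefixQuery) {
            sb.append('*');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints 6,4'\-Dihydroxyflav*
        System.out.println(escapeKeepingTrailingWildcard("6,4'-Dihydroxyflav*"));
    }
}
```

The escaped string would then be passed to parser.parse(...) in place of the raw input.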

Hardy
FYI, I quickly checked how the StandardAnalyzer tokenizes your examples. "3-Hydroxyflavone" seems to fall under the product rule mentioned above. It becomes a single token "3-hydroxyflavone". "6,4'-Dihydroxyflavone" on the other hand becomes two tokens "6,4" and "dihydroxyflavone".
Hardy
Wow, thanks! I was just trying to use Luke here to test the same. Guess this means I need to use an alternate Analyzer? (I've tried to set the field to UN_TOKENIZED, but that even breaks the first search example.)
Tim
It appears the StandardTokenizer splits words at apostrophes. That at least pinpoints the problem, but it will take me some time to fix this. :) Thanks for the help!
Tim
+1  A: 

Use WhitespaceAnalyzer instead of StandardAnalyzer. It will just split at whitespace, and not at commas, hyphens etc. (It will not lowercase the tokens though, so you will need to build your own chain of whitespace + lowercase, assuming you want your search to be case-insensitive.) If you need to do things differently for different fields, you can use a PerFieldAnalyzerWrapper.
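To illustrate why the whitespace + lowercase chain helps, here is a plain-Java simulation (not Lucene code; all names are hypothetical). It splits on whitespace, lowercases each token, and then checks whether any token starts with the lowercased prefix, which is roughly what a wildcard query like 6,4'-Dihydroxyflav* needs in order to match. In Lucene itself the equivalent chain would be a WhitespaceTokenizer wrapped in a LowerCaseFilter inside a custom Analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class WhitespaceLowercaseSketch {

    // Simulates whitespace tokenization followed by lowercasing.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Simulates a prefix query: true if any indexed token starts with the lowercased prefix.
    static boolean anyTokenStartsWith(List<String> tokens, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String token : tokens) {
            if (token.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Whitespace + lowercase keeps the name as the single token "6,4'-dihydroxyflavone".
        List<String> whitespaceTokens = tokenize("6,4'-Dihydroxyflavone");
        System.out.println(anyTokenStartsWith(whitespaceTokens, "6,4'-Dihydroxyflav")); // prints true

        // Tokens reported in the comments above for StandardAnalyzer.
        List<String> standardTokens = Arrays.asList("6,4", "dihydroxyflavone");
        System.out.println(anyTokenStartsWith(standardTokens, "6,4'-Dihydroxyflav")); // prints false
    }
}
```

The second check shows why the original search fails: once the name has been split into "6,4" and "dihydroxyflavone", no single token begins with the full prefix.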

You can't just set it to un-tokenized, because that will interpret your entire body of text as one token.

Xodarap
Tim