Lucene.NET: Camel case tokenizer? | ansaurus

+2 Q:

Lucene.NET: Camel case tokenizer?

I've started playing with Lucene.NET today and I wrote a simple test method to do indexing and searching on source code files. The problem is that the standard analyzers/tokenizers treat the whole camel case source code identifier name as a single token.

I'm looking for a way to split camel case identifiers like MaxWidth into three tokens: maxwidth, max and width. I've looked for such a tokenizer, but I couldn't find one. Before I write my own: does something like this already exist? Or is there a better approach than writing a tokenizer from scratch?

UPDATE: in the end I decided to get my hands dirty and I wrote a CamelCaseTokenFilter myself. I'll write a post about it on my blog and I'll update the question.
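For illustration, here is a minimal sketch of the splitting step such a filter might perform. It is written in Python for brevity rather than C#, it is not the CamelCaseTokenFilter mentioned above, and the name `camel_case_tokens` and the regular expression are assumptions made up for this example. It emits the lowercased whole identifier first, followed by its lowercased parts.

```python
import re

# Hypothetical pattern, not part of Lucene.NET. It matches, in order:
#   - a run of capitals not followed by a lowercase letter (acronyms like "XML")
#   - an optional capital followed by lowercase letters ("Max", "width")
#   - a run of digits
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def camel_case_tokens(identifier):
    """Return the lowercased identifier followed by its lowercased parts."""
    whole = identifier.lower()
    parts = [p.lower() for p in _PART.findall(identifier)]
    # An identifier with a single part yields one token, to avoid duplicates.
    if len(parts) <= 1:
        return [whole] if whole else []
    return [whole] + parts


print(camel_case_tokens("MaxWidth"))  # ['maxwidth', 'max', 'width']
```

In an actual Lucene.NET token filter the same logic would sit inside the filter's token loop, and the extra tokens would probably need sensible position increments so that phrase queries keep working. That part is not shown here.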

+1 A:

Solr has a WordDelimiterFilterFactory, which creates a token filter similar to what you need. Maybe you can translate its source code into C#.

Yuval F 2010-09-10 21:23:17

Yes, I've noticed it, although it doesn't really do what I'm looking for. In the end I wrote CamelCaseTokenFilter myself. But I'll accept your answer.

Igor Brejc 2010-09-11 06:13:19
