I started playing with Lucene.NET today and wrote a simple test method to index and search source code files. The problem is that the standard analyzers/tokenizers treat an entire camel case identifier as a single token.
I'm looking for a way to split camel case identifiers like MaxWidth into three tokens: maxwidth, max and width. I've looked for such a tokenizer, but I couldn't find one. Before writing my own: is there something that already does this? Or is there a better approach than writing a tokenizer from scratch?
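To make the desired output concrete, a minimal sketch of the splitting step alone might look like this. It is plain Java (Lucene's original language) rather than C#, it does not use any Lucene or Lucene.NET API, and the class name CamelCaseSplitter and its split method are hypothetical names chosen for illustration. It assumes ASCII identifiers and treats a run of capitals such as XML as one word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper: only the string-splitting logic, not a Lucene TokenFilter.
final class CamelCaseSplitter {
    // Zero-width boundaries: lower/digit followed by upper ("max|Width"),
    // or the end of an acronym followed by a capitalized word ("XML|Parser").
    private static final Pattern BOUNDARY =
        Pattern.compile("(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

    // Returns the lowercased whole identifier first, then its lowercased parts.
    // An identifier with no internal boundary yields a single token.
    static List<String> split(String identifier) {
        List<String> tokens = new ArrayList<>();
        if (identifier.isEmpty()) {
            return tokens;
        }
        tokens.add(identifier.toLowerCase(Locale.ROOT));
        String[] parts = BOUNDARY.split(identifier);
        if (parts.length > 1) {
            for (String part : parts) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```

With this sketch, CamelCaseSplitter.split("MaxWidth") gives [maxwidth, max, width] and CamelCaseSplitter.split("XMLParser") gives [xmlparser, xml, parser]. How such parts would be emitted from within an analysis chain (offsets, position increments) is a separate question and is not covered here.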
UPDATE: in the end I decided to get my hands dirty and wrote a CamelCaseTokenFilter myself. I'll write a post about it on my blog and update the question.