ansaurus

Question

Using Lucene to search for email addresses

Answer 1

+1 A:

So... what's the question?

EDIT: "How can I do this?" wasn't there when I posted my answer :)

Juan Manuel 2008-08-20 22:45:46

Just wondering: isn't there an option to remove replies later, so as to reduce clutter on the page?

Photodeus 2010-03-23 15:34:08

Answer 2

A:

Sorry, Juan, I've updated the question to clarify.

I saw. It's no problem; the good thing about the beta is that we are all learning how to use the system. Sorry I can't help with your question.

Juan Manuel 2008-08-20 23:02:51

Answer 3

+6 A:

No one gave a satisfactory answer, so we started poking around Lucene documentation and discovered we can accomplish this using custom Analyzers and Tokenizers.

The answer is this: create a WhitespaceAndAtSymbolTokenizer and a WhitespaceAndAtSymbolAnalyzer, then recreate your index using this analyzer. Once you do this, a search for "@gmail.com" will return all gmail addresses, because the tokenizer we just created splits each address at the @ symbol, so the domain is indexed as a separate word.

Here's the source code; it's actually very simple:

using System.IO;
using Lucene.Net.Analysis;

class WhitespaceAndAtSymbolTokenizer : CharTokenizer
{
    public WhitespaceAndAtSymbolTokenizer(TextReader input)
        : base(input)
    {
    }

    protected override bool IsTokenChar(char c)
    {
        // Make whitespace characters and the @ symbol be indicators of new words.
        return !(char.IsWhiteSpace(c) || c == '@');
    }
}


internal class WhitespaceAndAtSymbolAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new WhitespaceAndAtSymbolTokenizer(reader);
    }
}

That's it! Now you just need to rebuild your index and do all searches using this new Analyzer. For example, to write documents to your index:

IndexWriter index = new IndexWriter(indexDirectory, new WhitespaceAndAtSymbolAnalyzer());
index.AddDocument(myDocument);

Performing searches should use the analyzer as well:

IndexSearcher searcher = new IndexSearcher(indexDirectory);
Query query = new QueryParser("TheFieldNameToSearch", new WhitespaceAndAtSymbolAnalyzer()).Parse("@gmail.com");
Hits hits = searcher.Search(query);
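
To see why this works without setting up an index, here is a minimal, dependency-free sketch of the splitting rule that IsTokenChar expresses. AtSymbolSplitDemo and Split are hypothetical names used only for illustration; they are not part of Lucene, and the real CharTokenizer also handles buffering and offsets.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class AtSymbolSplitDemo
{
    // Same rule as IsTokenChar above: whitespace and '@' end the current token.
    static bool IsTokenChar(char c)
    {
        return !(char.IsWhiteSpace(c) || c == '@');
    }

    // Hypothetical helper (not a Lucene API): collects each run of token characters.
    public static List<string> Split(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (IsTokenChar(c))
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                // A delimiter closes the token collected so far.
                tokens.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }
        return tokens;
    }
}
```

Under this rule, Split("foo@gmail.com") gives "foo" and "gmail.com", while Split("@gmail.com") gives just "gmail.com". That is presumably why the query term lines up with the indexed domain, provided both indexing and searching go through the same analyzer.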

Judah Himango 2008-08-21 16:38:40

Answer 4

+4 A:

I see you have your solution, but I would have avoided the custom analyzer and instead added a field called email_domain to the documents you're indexing, holding the domain parsed out of the email address. It might sound silly, but the amount of storage associated with this is pretty minimal. If you feel like getting fancier (say some domain had many subdomains), you could instead make a field that holds the reversed domain, so you'd store com.gmail, com.company.department, or ae.eim. You could then find all the United Arab Emirates related addresses with a prefix query of 'ae.'
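
As a rough sketch of how the two field values could be computed before indexing, assuming addresses are simple strings with the domain after the last @. EmailDomainFields, Domain, and ReversedDomain are hypothetical names for illustration, not part of Lucene:

```csharp
using System;

static class EmailDomainFields
{
    // Returns the part after the last '@': "foo@gmail.com" -> "gmail.com".
    public static string Domain(string email)
    {
        int at = email.LastIndexOf('@');
        if (at < 0 || at == email.Length - 1)
        {
            throw new ArgumentException("Not an email address: " + email);
        }
        return email.Substring(at + 1);
    }

    // Reverses the order of the dot-separated labels:
    // "department.company.com" -> "com.company.department".
    public static string ReversedDomain(string email)
    {
        string[] labels = Domain(email).Split('.');
        Array.Reverse(labels);
        return string.Join(".", labels);
    }
}
```

The result of Domain would go into the email_domain field, and the result of ReversedDomain into a second field (whatever you choose to call it) that you query by prefix.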

dlamblin 2008-08-22 21:07:01

Answer 5

A:

You could add a separate field that indexes the email address reversed: index 'foo@gmail.com' as 'moc.liamg@oof', which enables you to do a query for "moc.liamg@*".
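
A minimal sketch of producing the reversed value, assuming plain ASCII addresses. ReversedEmailField and Reverse are hypothetical names for illustration, not a Lucene API:

```csharp
using System;

static class ReversedEmailField
{
    // Reverses the characters of the address so the domain comes first.
    // Note: this reverses UTF-16 code units, so it is only safe for
    // text without surrogate pairs or combining characters.
    public static string Reverse(string email)
    {
        char[] chars = email.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }
}
```

The same helper can be applied to the search text, so a suffix the user types (such as "@gmail.com") becomes the prefix "moc.liamg@" for the wildcard query.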

2008-09-17 14:13:41

Hmm. That sounds really hackish.

Judah Himango 2008-10-05 21:15:13

Answer 6

+2 A:

There is also QueryParser's setAllowLeadingWildcard.

But be careful. This could get very expensive in terms of performance (that's why it is disabled by default). Maybe in some cases this would be an easy solution, but I would prefer a custom Tokenizer as stated by Judah Himango, too.

Markus Lux 2008-09-19 07:37:46
