I'm currently modifying my regex for this:

http://stackoverflow.com/questions/2782031/extracting-email-addresses-in-an-html-block-in-ruby-rails

Basically, I'm making another obfuscator that uses ROT13 by parsing a block of text for all links that contain a mailto: href (using hpricot). One use case this doesn't catch is when the user just typed in an email address (without turning it into a link via TinyMCE).

So here's the basic flow of my method:

1. Parse a block of text for all tags with href="mailto:...".
2. Replace each tag with a JavaScript function that changes it into ROT13 (using this script: http://unixmonkey.net/?p=20).
3. Once all links are obfuscated, pass the resulting block of text into another function that parses for all emails (this one has an email regex that reverses the email address and then wraps it in a span that reverses it back).
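To make steps 2 and 3 concrete, here is a minimal Ruby sketch of the two transformations. It is not the script linked above: `rot13`, `reverse_emails`, the `EMAIL` pattern and the CSS on the span are all assumptions made for illustration.

```ruby
# ROT13: rotate each ASCII letter 13 places; applying it twice restores the input.
def rot13(text)
  text.tr('A-Za-z', 'N-ZA-Mn-za-m')
end

# Hypothetical email pattern, for illustration only.
EMAIL = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

# Step 3 sketch: reverse each plain-text address and wrap it in a span.
# The inline CSS (an assumption) asks the browser to render it right-to-left,
# so the reader sees the address in its original order.
def reverse_emails(text)
  text.gsub(EMAIL) do |email|
    %(<span style="unicode-bidi: bidi-override; direction: rtl;">#{email.reverse}</span>)
  end
end

puts rot13('abc@example.com')   # nop@rknzcyr.pbz
puts reverse_emails('mail user@example.com today')
```

Note that the ROT13 output still looks like an email address to the `EMAIL` pattern, which is exactly the problem described next.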

Step 3 is supposed to clean the block of text of any remaining emails that AREN'T in href attributes (meaning they weren't parsed by hpricot). The problem is that the emails that were converted to ROT13 are still found by my regex. What I want to catch are just the emails that WEREN'T CONVERTED to ROT13.

How do I do this? Well, all emails that WERE CONVERTED are followed by a trailing "'.replace", meaning I need to get all emails WITHOUT that string. So far I have this regex:

/\b([A-Z0-9._%+-]+@[A-Z0-9.-]+.[A-Z]{2,4}('.replace))\b/i

But this gets all the emails with the trailing '.replace. I want the opposite, and I'm currently stumped. Any help from regex gurus out there?

MORE INFO:

Here's the regex + the block of text I'm parsing:

http://www.rubular.com/r/NqXIHrNqjI

As you can see, the first two 'email addresses' are already obfuscated using ROT13. I need a regex that gets the emails [email protected] and [email protected]

+3  A: 

On negative lookaheads

You can use a negative lookahead to assert that a pattern doesn't match.

For example, the following regex matches all strings that do not end with ".replace":

^(?!.*\.replace$).*$
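A quick way to try this pattern in Ruby is sketched below; the constant name and the sample strings are made up for illustration. One caveat: in Ruby, `^` and `$` anchor at line boundaries, so for multi-line input `\A` and `\z` would be the anchors for the whole string.

```ruby
# Matches any single-line string that does not end with ".replace".
NO_REPLACE_SUFFIX = /^(?!.*\.replace$).*$/

%w[foo.replace foo.replaced replace.foo].each do |s|
  puts "#{s}: #{NO_REPLACE_SUFFIX.match?(s)}"
end
# foo.replace: false
# foo.replaced: true
# replace.foo: true
```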

As another example, this regex matches all a*b*, except aabb:

^(?!aabb$)a*b*$
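The same idea can be checked in Ruby with a few hand-picked strings (the constant name and the inputs are illustrative only). The lookahead runs at the start of the string and vetoes exactly one candidate, `aabb`, before `a*b*` gets to match.

```ruby
# Matches a*b* (any number of a's followed by any number of b's), except "aabb".
AB_EXCEPT_AABB = /^(?!aabb$)a*b*$/

['', 'ab', 'aabb', 'aabbb', 'ba'].each do |s|
  puts "#{s.inspect}: #{AB_EXCEPT_AABB.match?(s)}"
end
# "": true
# "ab": true
# "aabb": false
# "aabbb": true
# "ba": false
```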

Ideally,


Specific solution

The following regex works in this scenario: (see on rubular.com):

/\b([A-Z0-9._%+-]+@(?![A-Z0-9.-]*'\.replace\b)[A-Z0-9.-]+\.[A-Z]{2,4})\b/i
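As a rough illustration of how this could be used from Ruby, the sketch below runs the regex over a made-up sample string (not the text from the rubular link); `PLAIN_EMAIL`, `text` and `plain` are names assumed for the example.

```ruby
# Regex from above: the negative lookahead right after "@" rejects any
# address whose domain part is immediately followed by '.replace
PLAIN_EMAIL = /\b([A-Z0-9._%+-]+@(?![A-Z0-9.-]*'\.replace\b)[A-Z0-9.-]+\.[A-Z]{2,4})\b/i

# Made-up sample: one ROT13-obfuscated address and two plain ones.
text = "document.write('nop@rknzcyr.pbz'.replace(/[a-z]/gi, rot13)); " \
       "write to user@example.com or admin@example.org"

# scan returns one array per match because of the capture group; flatten it.
plain = text.scan(PLAIN_EMAIL).flatten
p plain  # ["user@example.com", "admin@example.org"]
```

The lookahead sits immediately after the `@` so that it inspects the whole domain before the engine commits to it; the obfuscated address is skipped because its domain is followed by `'.replace`.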
polygenelubricants
Of course, the "appropriate" regex feature (for asserting that a string doesn't *end* in a certain way) would have been lookbehind, but Ruby 1.8 doesn't support that (Ruby 1.9 and later do), so this is the correct workaround.
Tim Pietzcker
thanks dude, that is one epic regex i couldn't have thought of(they are my weakness, sadly)
corroded