I am creating a parser that guards against spamming and email harvesting in a block of text that comes from TinyMCE (so it may or may not contain HTML tags).

I've tried regexes, and so far this one has been successful:

/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

The problem is that I need to ignore all email addresses inside mailto hrefs. For example:

<a href="mailto:test@mail.com">test@mail.com</a>

should only return the second email address.

For background on what I'm doing: I'm reversing the email addresses in a block of text, so the above example would look like this:

<a href="mailto:test@mail.com">moc.liam@tset</a>

The problem with my current regex is that it also replaces the address in the href. Is there a way to do this with a single regex, or do I have to check for one and then the other? Can I do this with gsub alone, or do I need some Nokogiri/Hpricot magic to parse the mailtos? Thanks in advance!

Here were my references btw:

so.com/questions/504860/extract-email-addresses-from-a-block-of-text

so.com/questions/1376149/regexp-for-extracting-a-mailto-address

I'm also testing using this:

http://rubular.com/

Edit:

Here's my current helper code:

def email_obfuscator(text)
  text.gsub(/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i) do |m|
    "<span class='anti-spam'>#{m.reverse}</span>"
  end
end

which results in this:

<a target="_self" href="mailto:<span class='anti-spam'>moc.liamg@tset</span>"><span class="anti-spam">moc.liamg@tset</span></a>
A: 

Would this work?

/\b(?<!mailto:)[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

The (?<!mailto:) is a negative lookbehind, which rejects any match that is immediately preceded by mailto:

I don't have Ruby set up at work, unfortunately, but it worked with PHP when I tested it...
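
For reference, here is a minimal sketch of how the lookbehind version could be plugged into gsub. It assumes a Ruby version whose regex engine supports lookbehind (1.9 or later); the constant and method names are made up for illustration.

```ruby
# Sketch only: assumes Ruby 1.9+, where the (?<!...) syntax is supported.
# EMAIL_NOT_AFTER_MAILTO and reverse_plain_emails are hypothetical names.
EMAIL_NOT_AFTER_MAILTO = /\b(?<!mailto:)[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

# Reverses every address that is not immediately preceded by "mailto:".
def reverse_plain_emails(text)
  text.gsub(EMAIL_NOT_AFTER_MAILTO) { |m| m.reverse }
end

html = '<a href="mailto:test@mail.com">test@mail.com</a>'
puts reverse_plain_emails(html)
# prints: <a href="mailto:test@mail.com">moc.liam@tset</a>
```

The leading \b matters here: without it, the engine could skip one character past "mailto:" and match the rest of the address inside the href.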

John Yeates
I tried it using Rubular, but it says "Undefined (?...) sequence". I think the < is the culprit. What does it stand for again?
corroded
Hmm, looks like Ruby doesn't support lookbehind according to http://www.ruby-doc.org/docs/ProgrammingRuby/html/language.html#UJ (that reference describes the older regex engine; Ruby 1.9 and later do support it) - that's annoying. The (?<! means that the string you're matching (the email address) mustn't be preceded by the lookbehind string (mailto:) in order for the match to succeed. In this case you'd probably be best off with serg555's suggestion.
John Yeates
I'd upvote this since it is also helpful, but I don't have the right privileges. Anyway, thanks for the help!
corroded
A: 

Another option if lookbehind doesn't work:

/\b(mailto:)?([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i

This matches all emails; you can then check whether the first captured group is "mailto:" and, if so, skip that match.
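
A rough sketch of what that check might look like in Ruby, using the block form of gsub. The method name email_obfuscator and the span markup come from the question; the constant name is made up, and this is only one possible way to wire it up.

```ruby
# Sketch only: EMAIL_WITH_OPTIONAL_MAILTO is a hypothetical constant name.
# Group 1 is the optional "mailto:" prefix, group 2 is the address itself.
EMAIL_WITH_OPTIONAL_MAILTO = /\b(mailto:)?([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i

def email_obfuscator(text)
  text.gsub(EMAIL_WITH_OPTIONAL_MAILTO) do |match|
    prefix  = Regexp.last_match(1) # nil when there is no "mailto:" in front
    address = Regexp.last_match(2)
    if prefix
      match # part of a mailto href: return the match unchanged
    else
      "<span class='anti-spam'>#{address.reverse}</span>"
    end
  end
end

html = '<a target="_self" href="mailto:test@gmail.com">test@gmail.com</a>'
puts email_obfuscator(html)
```

Returning the whole match unchanged from the block is what "skipping" means here: gsub still visits the mailto address, but substitutes it with itself.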

serg
It works using Rubular, but just one more question: how do I check whether the first captured group is mailto? Do I pass it to the function again? Here is my current code for the obfuscator: (see above)
corroded
Sorry, I am not familiar with Ruby. Usually a regexp search returns an array of matched elements, each of which is split into its captured groups.
serg
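
In Ruby, that "array of matches split into groups" behaviour roughly corresponds to String#scan when the pattern contains capture groups. A small sketch, with a made-up constant name and sample text:

```ruby
# Sketch only: MAILTO_OR_EMAIL and the sample text are made up for illustration.
MAILTO_OR_EMAIL = /\b(mailto:)?([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i

text = 'mailto:a@b.com and c@d.org'

# With capture groups, scan returns one [group 1, group 2] pair per match.
pairs = text.scan(MAILTO_OR_EMAIL)

pairs.each do |prefix, address|
  kind = prefix ? 'mailto link' : 'plain address'
  puts "#{address}: #{kind}"
end
```

Note that scan only reports what matched, not where, so it suits inspection better than in-place replacement.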
I researched that too, but then again, you have to know which group to pick. What I'm aiming for here is to 'replace on the fly', where something like this could happen:
1. Start parsing the block of text.
2. Oh, I see an email address, let me invert that.
3. Oh, I see another email address, but this one has a mailto: before it, so it must be a hyperlink. Move on.
4. I see an email again, this time with no mailto:, so invert it again.
5. Back to step 2, and so on.
corroded
So do you have the ability to check every matched email and act differently based on what it contains?
serg
I think so. Does that mean each email I get should be checked against another regex? Or maybe I can write a regex that returns either an email address or an email address with a mailto: in front of it, then use an if statement to decide whether or not to reverse it?
corroded
I think this is what you were suggesting yesterday and I kind of got lost (maybe because I had been at it for hours). I got back to this today, tried your regex with an if-else statement that checks whether the string has a mailto:, and voila! Thanks!
corroded
A: 

Why not just store all the matched emails in an array and remove any duplicates? You can do this easily with the Ruby standard library, and (I imagine) it's probably quicker and more maintainable than adding more complexity to your regex.

emails = ["[email protected]", "[email protected]", "[email protected]"]
emails.uniq # => ["[email protected]", "[email protected]"]
Damien Wilson
As shown in the function above, I just replace the emails with their inverted counterparts, meaning that if I put them in an array I would have to remember which part of the text block I got them from.
corroded