STILL NOT RESOLVED :(   [Feb 11th]

I have a large text file full of random data and want to pull out all the email addresses from it.

I would like to do this in Ruby, with pseudo code like this:

monster_data_string = "asfsfsdfsdfsf  sfda **joe@example.com** sdfdsf"
monster_data_string.match(EMAIL_REGEX)

Does anyone know what Ruby email regular expression I would use to accomplish this?

Please keep in mind that I'm looking for a Ruby answer to this. I have already tried numerous regexes found by googling, but most of them cause Ruby runtime errors stating that characters like "+" and "*" are invalid/unrecognized.

What I have already tried is:

monster_data_string.match(/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i)

but I receive Ruby errors stating that "+" is an invalid character

Thanks in advance

A: 

The first hit while googling "email regex" was exactly the link I was looking for. That site is also great for learning regular expressions. Hope this helps.

Mike
I've tried those regular expressions but they don't work in Ruby
Heh, that page is great -- nothing like redefining the question to something you *can* answer, and then answering that instead of the question you were asked. At least he does eventually give a decent answer.
womble
Just to note again, using the regular expression in the linked page above generates Ruby error messages :(
A: 

Blimey, two in 10 minutes... see http://stackoverflow.com/questions/535600

womble
This post is to further clarify my original http://stackoverflow.com/questions/535600 post
So why wouldn't you just clarify your original question?
womble
A: 

Given that it is not possible to parse every valid email address using a regexp, you are left with two choices:

Make a regexp that matches as many valid email addresses as possible and live with the fact that some valid but rarely used forms of email address might get overlooked.

or

Make a regexp that matches anything that "might be" an email address and then live with the false positives.

I use the second approach to weed out obviously wrong email addresses when validating user sign-up email addresses on a web page.

Gleaned from Ruby Cookbook which has a very good section on email address validation:

valid = '[^ @]+'
/^#{valid}@#{valid}\.#{valid}/

Apparently there is a 6343 character Perl regexp written by Paul Warren that does a very good job and also works in Ruby, but even that is not foolproof (I think it might also have some performance implications).
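For what it's worth, here is a minimal sketch of how the second (permissive) approach might be wrapped up for sign-up validation. The names VALID, LOOSE_EMAIL and plausible_email? are made up for illustration, and unlike the Cookbook pattern above I have assumed \A and \z anchors so the whole string has to match:

```ruby
# Permissive check: accepts anything that "might be" an email address.
# VALID, LOOSE_EMAIL and plausible_email? are illustrative names only.
VALID = '[^ @]+'

# Assumption: \A and \z anchor the whole string (the pattern quoted
# above uses ^ and has no end anchor).
LOOSE_EMAIL = /\A#{VALID}@#{VALID}\.#{VALID}\z/

def plausible_email?(candidate)
  LOOSE_EMAIL.match?(candidate)
end

puts plausible_email?("joe@example.com")  # true
puts plausible_email?("joe@example")      # false, no dot after the @
```

This only weeds out obviously wrong input; it will still accept plenty of strings that are not deliverable addresses.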

Noel Walters
+1  A: 

What kind of runtime error messages are you getting? Is it regarding the regexps as invalid, or is it breaking due to the target string being too large?

Andrew Grimm
It's related to the regexp being invalid. Errors stating that the "+" or "*" characters are invalid/unrecognized.
Are you sure you've tried escaping them properly?
Andrew Grimm
I've tried using the \ character to escape them but it's still not working.
Specifically, I have tried the following code: string_of_data.match(/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i), where string_of_data is the string variable read in that contains the randomly mixed data of words and email addresses.
You probably don't want to hear "Works for me", right? Can you try generating the simplest combination of string_of_data and regular expression that doesn't work, and the most complex combination that does work, and pasting all that on a gist or a pastie?
Andrew Grimm
I tried using monster_data_string = "aa **joe@example.com** sf" and regexp = /([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})/i (I removed the ^ and $) in "try ruby! (in your browser)", and that worked.
Andrew Grimm
A: 

To try and help you get there (though not very elegantly, I admit):

I think the start and end anchors (^ and $) aren't helping. You may also want to filter the asterisks?:

irb(main):001:0> mds = "asfsfsdfsdfsf  sfda **joe@example.com** sdfdsf"
  => "asfsfsdfsdfsf  sfda **joe@example.com** sdfdsf"
irb(main):003:0> mds.match(/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i)
  => nil
irb(main):004:0> mds.match(/([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})/i)
  => #<MatchData "**joe@example.com" 1:"**joe" 2:"example.com">
irb(main):005:0> mds.match(/([^@\s*]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})/i)
  => #<MatchData "joe@example.com" 1:"joe" 2:"example.com">
Brent.Longborough
+1  A: 

If you're getting an error message about + or * being invalid in regexes, you're doing something very wrong. This is a valid regex in Ruby, although it's not the one you want:

/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i

For one thing, you don't want to anchor the regex to the start and end of lines (^ and $) if you're trying to pluck the addresses from "random" text. But once you've gotten rid of the anchors, your regex will match **joe@example.com in your test string (leading asterisks included), which I presume you don't want. This regex from Regular-Expressions.info does a better job, but read that page for tips on tweaking it to meet your particular needs.

/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

Finally (and you may already know this), you won't want to use the match() method because that will only find the first match. Try scan() instead.
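To make the match() versus scan() difference concrete, here is a small sketch using the regex above. The sample string and the names text and EMAIL are made up for illustration:

```ruby
# Sample text is invented; EMAIL is the Regular-Expressions.info pattern quoted above.
text  = "contact joe@example.com or **ann@example.org** today"
EMAIL = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

# match() stops at the first hit and returns a MatchData (or nil).
first = text.match(EMAIL)[0]
puts first          # joe@example.com

# scan() returns every non-overlapping hit as an array of strings,
# because the pattern has no capture groups.
all = text.scan(EMAIL)
puts all.inspect    # ["joe@example.com", "ann@example.org"]
```

Note that the \b word boundary also keeps the surrounding asterisks out of the second address.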

Alan Moore
+1  A: 

Watch this...

require 'yaml'

# Read the whole file, then collect every unique address-like match.
content = File.read("content.txt")
r = /\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b/
emails = content.scan(r).uniq
puts YAML.dump(emails)
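Since the question mentions a large file, reading it all at once may not be ideal. A rough sketch of a line-by-line variant follows. It assumes an address never spans a line break, and extract_emails and EMAIL_PATTERN are illustrative names, not part of any library:

```ruby
require 'set'

# Same pattern as above, kept in a constant (illustrative name).
EMAIL_PATTERN = /\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b/

# Streams the file one line at a time so the whole file never has to
# sit in memory. Assumes no address is split across two lines.
def extract_emails(path)
  found = Set.new
  File.foreach(path) do |line|
    found.merge(line.scan(EMAIL_PATTERN))
  end
  found.to_a
end

# Example call (assumes content.txt exists):
# puts extract_emails("content.txt")
```

The Set takes the place of the uniq call, so duplicates are dropped as they are read.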
A: 

Even better,

require 'yaml'

content = "asfsfsdfsdfsf  sfda **joe@example.com.au** sdfdsf cool_me@example.com.au"

# Capture groups: local part, first domain label, remaining suffix.
r = /\b([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+?)(\.[a-zA-Z.]*)\b/
emails = content.scan(r).uniq
puts YAML.dump(emails)

will give you

    ---
    - - joe
      - example
      - .com.au
    - - cool_me
      - example
      - .com.au
cool_me5000