I'm writing a simple screen-scraping application to play around with the HTMLAgilityPack library. After getting it to work on several different types of HtmlNodes, I figured I'd get fancy and throw in a Regex for email addresses as well. The only problem is that the application never finds any matches, or perhaps it finds them but doesn't return them properly. This happens even on sites known to contain email addresses. Can anyone spot what I'm doing wrong here?

    string url = String.Format("http://{0}", mainForm.Target);
    string reg = "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+.[A-Z]{2,4}\b";
    try
    {
        WebClient wClient = new WebClient();
        Stream data = wClient.OpenRead(url);
        StreamReader read = new StreamReader(data);
        MatchCollection matches = Regex.Matches(read.ReadToEnd(), reg, RegexOptions.IgnoreCase|RegexOptions.Multiline);
        foreach (Match match in matches)
        {
            textBox1.AppendText(match.ToString() + Environment.NewLine);
        }
+2  A: 

Use raw strings:

string reg = @"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b";

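Without the @ prefix, the C# compiler turns \b into a backspace character, so the regex never sees a word-boundary anchor. Also, your last period should be \. so that it matches only a literal period; an unescaped . matches any character.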
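
To make the difference concrete, here is a minimal sketch comparing the two literals. The class name `EscapeDemo`, the helper `CountMatches`, and the sample text are made up for illustration. The regular literal escapes the period as `\\.` so that the only difference between the two patterns is how `\b` is interpreted.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    // Regular literal: the compiler converts \b to U+0008 (backspace),
    // so the regex looks for literal backspace characters around the address.
    public const string Regular = "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\b";

    // Verbatim literal: backslashes reach the regex engine unchanged,
    // so \b is a word boundary and \. is a literal period.
    public const string Verbatim = @"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b";

    // Hypothetical helper: counts case-insensitive matches of pattern in input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern, RegexOptions.IgnoreCase).Count;
    }

    static void Main()
    {
        string sample = "Contact: someone@example.com"; // made-up sample text

        Console.WriteLine((int)Regular[0]);                // 8 (backspace)
        Console.WriteLine(CountMatches(sample, Regular));  // 0
        Console.WriteLine(CountMatches(sample, Verbatim)); // 1
    }
}
```

The first pattern compiles and runs without error, which is why the problem shows up only as "no matches" rather than as an exception.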

Matthew Flaschen
+1 - I think the official term is 'verbatim string literal'.
Alex Humphrey
Sweet, it's displaying matches now :) How would I go about ensuring it doesn't display duplicates?
Stev0
@Stev, you mean if the same email appears more than once in the text? You could add each match to a [`HashSet`](http://msdn.microsoft.com/en-us/library/bb359438.aspx), then only append it if `Add` returns true (it wasn't already there).
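
A rough sketch of that idea, not tested against the original application: the class name `UniqueEmails`, the method `Extract`, and the sample text are hypothetical. It also assumes that addresses differing only in letter case should count as duplicates, which is why the set uses `StringComparer.OrdinalIgnoreCase`; drop that argument if case should matter.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class UniqueEmails
{
    const string Pattern = @"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b";

    // Returns each matched address once, in order of first appearance.
    public static List<string> Extract(string text)
    {
        // Assumption: addresses that differ only by case are duplicates.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var unique = new List<string>();

        foreach (Match match in Regex.Matches(text, Pattern, RegexOptions.IgnoreCase))
        {
            // HashSet<T>.Add returns false when the value is already present.
            if (seen.Add(match.Value))
            {
                unique.Add(match.Value);
            }
        }
        return unique;
    }

    static void Main()
    {
        // Made-up sample text containing one repeated address.
        string sample = "a@example.com, b@example.org, A@Example.com";
        foreach (string email in Extract(sample))
        {
            Console.WriteLine(email); // a@example.com, then b@example.org
        }
    }
}
```

In the original loop, the same check would wrap the `textBox1.AppendText(...)` call.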
Matthew Flaschen
A: 

Check the string that is returned by read.ReadToEnd() and see if you can find email addresses in this string with your regex. I guess that your problem doesn't have anything to do with StreamReader.
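
One way to run that check, as a sketch only: keep the downloaded text in a variable and summarize what the pattern finds in it. The class `RegexCheck`, the method `Describe`, and the sample page below are made up for illustration; in the real application the text would come from read.ReadToEnd().

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexCheck
{
    // Hypothetical helper: reports the text length, how many '@' characters
    // it contains, and how many times the pattern matches.
    public static string Describe(string text, string pattern)
    {
        int atSigns = text.Split('@').Length - 1;
        int matches = Regex.Matches(text, pattern, RegexOptions.IgnoreCase).Count;
        return string.Format("length={0}, '@' count={1}, matches={2}",
                             text.Length, atSigns, matches);
    }

    static void Main()
    {
        // Stand-in for the value of read.ReadToEnd(); made up for illustration.
        string page = "<p>Mail us: info@example.com</p>";

        // Pattern exactly as written in the question (regular string literal).
        Console.WriteLine(Describe(page, "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+.[A-Z]{2,4}\b"));

        // Same pattern as a verbatim literal with the period escaped.
        Console.WriteLine(Describe(page, @"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"));
    }
}
```

If the text contains '@' characters but the match count is zero, the download is fine and the pattern is the place to look.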

empi