So I am working with some email header data, and for the to:, from:, cc:, and bcc: fields the email address(es) can be expressed in a number of different ways:

First Last <name@domain.com>
Last, First <name@domain.com>
name@domain.com

And these variations can appear in the same message, in any order, all in one comma separated string:

First, Last <name@domain.com>, name@domain.com, First Last <name@domain.com>

I've been trying to come up with a way to parse this string into separate First Name, Last Name, E-Mail for each person (omitting the name if only an email address is provided).

Can someone suggest the best way to do this?

I've tried to Split on the commas, which would work except in the second example, where the last name is placed first. I suppose this method could still work if, after I split, I examine each element and see whether it contains a '@' or '<'/'>'; if it doesn't, it could be assumed that the next element is the first name. Is this a good way to approach this? Have I overlooked another format the address could be in?


UPDATE: Perhaps I should clarify a little. Basically, all I am looking to do is break up the string containing the multiple addresses into individual strings, each containing one address in whatever format it was sent in. I have my own methods for validating and extracting the information from an address; it was just tricky for me to figure out the best way to separate each address.

Here is the solution I came up with to accomplish this:

using System.Collections.Generic;

string str = "Last, First <name@domain.com>, name@domain.com, First Last <name@domain.com>, \"First Last\" <name@domain.com>";

List<string> addresses = new List<string>();
int atIdx = 0;      // index of the last '@' seen in the current address (0 = none yet)
int commaIdx = 0;   // index of the last ',' seen
int lastComma = 0;  // index where the current address starts
for (int c = 0; c < str.Length; c++)
{
    if (str[c] == '@')
        atIdx = c;

    if (str[c] == ',')
        commaIdx = c;

    // A comma that comes after an '@' ends the current address; a comma seen
    // before any '@' is part of a "Last, First" name and is left alone.
    if (commaIdx > atIdx && atIdx > 0)
    {
        string temp = str.Substring(lastComma, commaIdx - lastComma).Trim();
        addresses.Add(temp);
        lastComma = commaIdx + 1;  // start the next address after the comma
        atIdx = 0;                 // no '@' seen yet in the next address
    }

    if (c == str.Length - 1)
    {
        // Whatever follows the last separating comma is the final address.
        string temp = str.Substring(lastComma, str.Length - lastComma).Trim();
        if (temp.Length > 0)
            addresses.Add(temp);
    }
}

The above code generates the individual addresses that I can process further down the line.

+3  A: 

There isn't really an easy solution to this. I would recommend making a little state machine that reads char-by-char and does the work that way. Like you said, splitting by comma won't always work.

A state machine will allow you to cover all possibilities. I'm sure there are many others you haven't seen yet. For example: "First Last" name@domain.com

Look for the RFC about this to discover what all the possibilities are. Sorry, I don't know the number. There are probably several, as this is the kind of thing that evolves.
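
As a rough illustration of the state-machine idea, here is a minimal sketch in C#. It is not an RFC-complete parser: the class name AddressListSplitter and the splitting rule (a comma only ends an address once an '@' has been seen outside quotes and angle brackets) are assumptions made for this example, and escaped quotes inside a quoted name are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of any library.
public static class AddressListSplitter
{
    // Splits a header value into individual address strings.
    // Assumption: a comma separates two addresses only when the current
    // address has already shown an '@' and we are outside "..." and <...>.
    public static List<string> Split(string header)
    {
        var result = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;  // inside a "quoted display name"
        bool inAngle = false;   // inside <...>
        bool seenAt = false;    // an '@' was seen in the current address

        foreach (char ch in header)
        {
            if (ch == '"' && !inAngle)
                inQuotes = !inQuotes;
            else if (!inQuotes && ch == '<')
                inAngle = true;
            else if (!inQuotes && ch == '>')
                inAngle = false;
            else if (!inQuotes && ch == '@')
                seenAt = true;

            bool isSeparator = ch == ',' && !inQuotes && !inAngle && seenAt;
            if (isSeparator)
            {
                Flush(current, result);
                seenAt = false;
            }
            else
            {
                current.Append(ch);
            }
        }

        Flush(current, result);  // the last address has no trailing comma
        return result;
    }

    // Moves the text collected so far into the result list, skipping blanks.
    private static void Flush(StringBuilder current, List<string> result)
    {
        string address = current.ToString().Trim();
        if (address.Length > 0)
            result.Add(address);
        current.Clear();
    }
}
```

Calling AddressListSplitter.Split on a string such as Last, First <name@domain.com>, name@domain.com, "Smith, John" <js@example.org> (the last address is made up for the example) should give three strings, with the commas inside the names left untouched.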

sjbotha
A: 

You could use regular expressions to try to separate this out, try this guy:

^(?<name1>[a-zA-Z0-9]+?),? (?<name2>[a-zA-Z0-9]+?),? (?<address1>[a-zA-Z0-9.-_<>]+?)$

This will match: Last, First name@domain.com; Last, First <name@domain.com>; First Last name@domain.com; First Last <name@domain.com>. You can add another optional match at the end of the regex to pick up the last segment of First, Last <name@domain.com>, name@domain.com, that is, the bare address that follows the one enclosed in angle brackets.

Hope this helps somewhat!

EDIT:

And of course you can add more characters to each of the sections to accept quotation marks etc., for whatever format is being read in. As sjbotha mentioned, this could be difficult, as the string that is submitted is not necessarily in a set format.

This link can give you more information about matching AND validating email addresses using regular expressions.

Anders
This regex isn't going to validate all of the possible email address formats.
Scott Dorman
Correct. Read my post again and notice I did not say it would validate the address, only match it. The regex (according to specs) to match "all possible" email addresses is very long and complex. Since his question isn't about validating email, but parsing a string, this could work decently well.
Anders
A: 

There is no simple, generic solution to this. The RFC you want is RFC 2822, which describes all of the possible configurations of an email address. The best you are going to get that will be correct is to implement a state-based tokenizer that follows the rules specified in the RFC.

Scott Dorman
but your method doesnt validate emails!!!!!!!!!1!1! OH NOES
Anders
it's not important to validate the emails, only to extract the important info regardless of what format it is in
Jason Miesionczek
A: 

Here is how I would do it:

  • You can try to standardize the data as much as possible, i.e. get rid of such things as the < and > symbols and all of the commas after the '.com'. You will still need the commas that separate the first and last names.
  • After getting rid of the extra symbols, put every grouped email record in a list as a string. You can use the '.com' to determine where to split the string if need be.
  • Once you have the email addresses in the list of strings, you can further split each one using only whitespace as the delimiter.
  • The final step is to determine which part is the first name, which is the last name, etc. This would be done by checking the three components for: a comma, which would indicate the last name; a '.', which would indicate the actual address; and whatever is left is the first name. If there is no comma, then the first name is first, the last name is second, etc.

    I don't know if this is the most concise solution, but it would work and does not require any advanced programming techniques.
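
The last step above could be sketched roughly as follows in C#. This is a variation rather than the exact steps listed: it keeps the < and > symbols and uses them to find the address instead of stripping them first. The class name AddressParts and the choice to split "First Last" on the last space are assumptions made for this example.

```csharp
using System;

// Hypothetical helper, not part of any library.
public static class AddressParts
{
    // Returns (first, last, email) for ONE address string.
    // Both names are empty when only an email address is given.
    public static Tuple<string, string, string> Parse(string address)
    {
        string text = address.Trim();
        string email = text;  // default: the whole string is the address
        string name = "";

        int open = text.LastIndexOf('<');
        int close = text.LastIndexOf('>');
        if (open >= 0 && close > open)
        {
            email = text.Substring(open + 1, close - open - 1).Trim();
            name = text.Substring(0, open).Trim().Trim('"');
        }

        string first = "";
        string last = "";
        int comma = name.IndexOf(',');
        if (comma >= 0)
        {
            // "Last, First": the part before the comma is the last name.
            last = name.Substring(0, comma).Trim();
            first = name.Substring(comma + 1).Trim();
        }
        else if (name.Length > 0)
        {
            // "First Last": split on the last space (an assumption),
            // so any middle names stay with the first name.
            int space = name.LastIndexOf(' ');
            if (space >= 0)
            {
                first = name.Substring(0, space).Trim();
                last = name.Substring(space + 1);
            }
            else
            {
                first = name;
            }
        }

        return Tuple.Create(first, last, email);
    }
}
```

For example, AddressParts.Parse("Last, First <name@domain.com>") should return First, Last and name@domain.com, while a bare name@domain.com comes back with both names empty. Because it never looks for '.com', it does not depend on the top-level domain.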
jle
The problem with this is the '.com'. Any top-level domain or country code could be present there.
Jason Miesionczek
+1  A: 

At the risk of creating two problems, you could create a regular expression that matches any of your email formats. Use "|" to separate the formats within this one regex. Then you can run it over your input string and pull out all of the matches.

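using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using NUnit.Framework;

public class Address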
{
 private string _first;
 private string _last;
 private string _name;
 private string _domain;

 public Address(string first, string last, string name, string domain)
 {
  _first = first;
  _last = last;
  _name = name;
  _domain = domain;
 }

 public string First
 {
  get { return _first; }
 }

 public string Last
 {
  get { return _last; }
 }

 public string Name
 {
  get { return _name; }
 }

 public string Domain
 {
  get { return _domain; }
 }
}

[TestFixture]
public class RegexEmailTest
{
 [Test]
 public void TestThreeEmailAddresses()
 {
  Regex emailAddress = new Regex(
   @"((?<last>\w*), (?<first>\w*) <(?<name>\w*)@(?<domain>\w*\.\w*)>)|" +
   @"((?<first>\w*) (?<last>\w*) <(?<name>\w*)@(?<domain>\w*\.\w*)>)|" +
   @"((?<name>\w*)@(?<domain>\w*\.\w*))");
  string input = "First, Last <name@domain.com>, name@domain.com, First Last <name@domain.com>";

  MatchCollection matches = emailAddress.Matches(input);
  List<Address> addresses =
   (from Match match in matches
    select new Address(
     match.Groups["first"].Value,
     match.Groups["last"].Value,
     match.Groups["name"].Value,
     match.Groups["domain"].Value)).ToList();
  Assert.AreEqual(3, addresses.Count);

  Assert.AreEqual("Last", addresses[0].First);
  Assert.AreEqual("First", addresses[0].Last);
  Assert.AreEqual("name", addresses[0].Name);
  Assert.AreEqual("domain.com", addresses[0].Domain);

  Assert.AreEqual("", addresses[1].First);
  Assert.AreEqual("", addresses[1].Last);
  Assert.AreEqual("name", addresses[1].Name);
  Assert.AreEqual("domain.com", addresses[1].Domain);

  Assert.AreEqual("First", addresses[2].First);
  Assert.AreEqual("Last", addresses[2].Last);
  Assert.AreEqual("name", addresses[2].Name);
  Assert.AreEqual("domain.com", addresses[2].Domain);
 }
}

There are several downsides to this approach. One is that it doesn't validate the string: if the string contains any characters that don't fit one of your chosen formats, those characters are simply ignored. Another is that the accepted formats are all expressed in one place, so you cannot add new formats without changing the monolithic regex.

Michael L Perry
A: 

Here is the solution I came up with to accomplish this:

using System.Collections.Generic;

string str = "Last, First <name@domain.com>, name@domain.com, First Last <name@domain.com>, \"First Last\" <name@domain.com>";

List<string> addresses = new List<string>();
int atIdx = 0;      // index of the last '@' seen in the current address (0 = none yet)
int commaIdx = 0;   // index of the last ',' seen
int lastComma = 0;  // index where the current address starts
for (int c = 0; c < str.Length; c++)
{
    if (str[c] == '@')
        atIdx = c;

    if (str[c] == ',')
        commaIdx = c;

    // A comma that comes after an '@' ends the current address; a comma seen
    // before any '@' is part of a "Last, First" name and is left alone.
    if (commaIdx > atIdx && atIdx > 0)
    {
        string temp = str.Substring(lastComma, commaIdx - lastComma).Trim();
        addresses.Add(temp);
        lastComma = commaIdx + 1;  // start the next address after the comma
        atIdx = 0;                 // no '@' seen yet in the next address
    }

    if (c == str.Length - 1)
    {
        // Whatever follows the last separating comma is the final address.
        string temp = str.Substring(lastComma, str.Length - lastComma).Trim();
        if (temp.Length > 0)
            addresses.Add(temp);
    }
}
Jason Miesionczek
A: 

I use the following regular expression in Java to get the email string from an RFC-compliant email address: "[A-Za-z0-9]+[A-Za-z0-9.-]+@[A-Za-z0-9]+[A-Za-z0-9.-]+[.][A-Za-z0-9]{2,3}"
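
Since the rest of the thread is in C#, here is a small sketch of how that same pattern might be applied with .NET's Regex class to pull every bare address out of a header string. The class name EmailExtractor is made up for this example, and the pattern is used unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class EmailExtractor
{
    // The pattern from the answer above, copied as-is.
    private static readonly Regex Email = new Regex(
        "[A-Za-z0-9]+[A-Za-z0-9.-]+@[A-Za-z0-9]+[A-Za-z0-9.-]+[.][A-Za-z0-9]{2,3}");

    // Returns every substring of the header that matches the pattern,
    // in the order they appear. Display names are ignored.
    public static List<string> Extract(string header)
    {
        var found = new List<string>();
        foreach (Match m in Email.Matches(header))
            found.Add(m.Value);
        return found;
    }
}
```

Note that this only recovers the addresses, not the names, and that the {2,3} at the end means a top-level domain longer than three characters would be cut short in the match.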

Alex Yakimovich
A: 

Hello everyone,

I am trying to figure out how to transfer this email over to cfscript; any help I can get would be awesome.

Please ask your own question.
Jason Miesionczek