Hi, we need to parse email headers and extract the domains/IPs through which the mail has passed. We also need to work out whether a given IP is an internal IP. Is there an existing library that can help, especially in C/C++?

For example,

Received: from server.mymailhost.com (mail.mymailhost.com [126.43.75.123])
    by pilot01.cl.msu.edu (8.10.2/8.10.2) with ESMTP id NAA23597;
    Fri, 12 Jul 2002 16:11:20 -0400 (EDT)

We need to extract the "by" server.

thanks

A: 

Such headers are not difficult to parse, even by hand, line by line. A regex can help, for example by\s+([\w.-]+)\s*\( to capture the host name that follows "by". For C++, you could try that library or that one.

Keltia
Headers can be multiline
Sergej Andrejev
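To make the multiline point concrete: a folded header continues on any following line that starts with a space or tab, so it is usually easiest to join those lines before matching anything. The following is only a minimal sketch; the function name unfold_headers is made up for illustration, and collapsing the leading whitespace of a continuation line to a single space is a simplification, not what the RFC strictly prescribes.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns one string per header field, with folded
// continuation lines joined onto the field they belong to.
std::vector<std::string> unfold_headers(const std::string& raw) {
    std::vector<std::string> fields;
    std::istringstream in(raw);
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate CRLF line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        // A blank line ends the header block.
        if (line.empty()) break;
        const bool continuation = (line[0] == ' ' || line[0] == '\t');
        if (continuation && !fields.empty()) {
            // Simplification: replace the leading whitespace run with one space.
            const std::size_t first = line.find_first_not_of(" \t");
            fields.back() += ' ';
            if (first != std::string::npos) fields.back() += line.substr(first);
        } else {
            fields.push_back(line);
        }
    }
    return fields;
}
```

Each element of the returned vector can then be handed to whatever matching code is used, without worrying about where the line breaks fell.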
A: 

You may want to use a regular expression:

(?<=by).*(?=with)

For the header above, this will give you pilot01.cl.msu.edu (8.10.2/8.10.2), with a space on either side.

Edit: I find it amusing that this was modded down when it actually gets what the OP asked for.

C#:

string header = "Received: from server.mymailhost.com (mail.mymailhost.com [126.43.75.123]) by pilot01.cl.msu.edu (8.10.2/8.10.2) with ESMTP id NAA23597; Fri, 12 Jul 2002 16:11:20 -0400 (EDT)";
System.Text.RegularExpressions.Regex r = new System.Text.RegularExpressions.Regex(@"(?<=by).*(?=with)");
System.Text.RegularExpressions.Match m = r.Match(header);
Console.WriteLine(m.Captures[0].Value);
Console.ReadKey();

I didn't claim that it was complete, but I am wondering whether the person who gave it a -1 even tried it. Meh.

RandomNoob
A: 

Have you considered using regular expressions?

Here is a list of internal, non-routable address ranges.
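As a rough illustration of checking an address against such ranges, here is a minimal C++ sketch. It assumes IPv4 only and covers the three RFC 1918 private blocks plus loopback and link-local; the names parse_ipv4 and is_internal_ipv4 are hypothetical, and what counts as "internal" for your own network may well differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: parses dotted-quad text into a 32-bit value.
// Returns false on anything that is not exactly four octets in 0..255.
bool parse_ipv4(const std::string& text, std::uint32_t& out) {
    std::uint32_t value = 0;
    int octets = 0;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] < '0' || text[i] > '9') return false;
        unsigned octet = 0;
        int digits = 0;
        while (i < text.size() && text[i] >= '0' && text[i] <= '9') {
            octet = octet * 10 + static_cast<unsigned>(text[i] - '0');
            if (++digits > 3 || octet > 255) return false;
            ++i;
        }
        value = (value << 8) | octet;
        ++octets;
        if (i == text.size()) break;
        if (text[i] != '.' || octets == 4) return false;
        ++i;
        if (i == text.size()) return false;  // trailing dot
    }
    if (octets != 4) return false;
    out = value;
    return true;
}

// Hypothetical helper: true if the address falls in a non-routable range.
bool is_internal_ipv4(const std::string& text) {
    std::uint32_t ip = 0;
    if (!parse_ipv4(text, ip)) return false;
    struct Range { std::uint32_t network; std::uint32_t mask; };
    static const Range ranges[] = {
        {0x0A000000u, 0xFF000000u},  // 10.0.0.0/8     (RFC 1918)
        {0xAC100000u, 0xFFF00000u},  // 172.16.0.0/12  (RFC 1918)
        {0xC0A80000u, 0xFFFF0000u},  // 192.168.0.0/16 (RFC 1918)
        {0x7F000000u, 0xFF000000u},  // 127.0.0.0/8    loopback
        {0xA9FE0000u, 0xFFFF0000u},  // 169.254.0.0/16 link-local
    };
    for (const Range& r : ranges) {
        if ((ip & r.mask) == r.network) return true;
    }
    return false;
}
```

With this sketch, is_internal_ipv4("192.168.0.10") is true, while the address from the question, 126.43.75.123, is reported as not internal. Malformed input is treated as "not internal", which may or may not be the right default for your use.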

Dave Swersky
+2  A: 

vmime should be fine; more or less any mail library will let you do that.

Axelle Ziegler
A: 

You can use regular expressions. It would look something like this (not tested):

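#include <errno.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

regex_t re;

/* Group 1: host name after "by". Group 2: the parenthesised text that follows it. */
/* Backslashes are doubled because this is a C string literal. */
const char *restr = "by ([A-Za-z0-9.-]+) \\(([^)]*)\\)";

check(regcomp(&re, restr, REG_EXTENDED | REG_ICASE), "regcomp");

/* Slot 0 holds the whole match; slots 1 and 2 hold the two groups. */
size_t nmatch = 3;
regmatch_t matches[3];

int ret = regexec(&re, YOUR_STRING, nmatch, matches, 0);
check(ret != 0, "regexec");

size_t size;

size = matches[1].rm_eo - matches[1].rm_so;
char *host = malloc(size + 1);   /* +1 for the terminating NUL */
memcpy(host, YOUR_STRING + matches[1].rm_so, size);
host[size] = '\0';

/* In a "by" clause this is typically the MTA version string, not an IP address. */
size = matches[2].rm_eo - matches[2].rm_so;
char *info = malloc(size + 1);
memcpy(info, YOUR_STRING + matches[2].rm_so, size);
info[size] = '\0';

regfree(&re);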

check is a macro to help you figure out if there are any problems:

#define check(condition, description) do { if (condition) { fprintf(stderr, "%s:%i - %s - %s\n", __FILE__, __LINE__, description, strerror(errno)); exit(1); } } while (0)
Tiago
+2  A: 

The format used by 'Received' lines is defined in RFC 2821 (since obsoleted by RFC 5321), and a regex can't parse it.

(You can try anyway, and for a limited subset of headers produced by known software you might succeed, but when you attach this to the range of strange stuff found in real-world mail it will fail.)

Use an existing RFC 2821 parser and you should be OK, but otherwise you should expect failure, and write the software to cope with it. Don't base anything important like a security system around it.

We need to extract the "by" server.

'from' is more likely to be of use. The hostname given in a 'by' clause is the name as seen by the host itself, so there is no guarantee it will be a publicly resolvable FQDN. And of course you don't tend to get a valid (TCP-Info) there.
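If you try the limited-subset route anyway, a sketch along these lines pulls out the 'from' clause and reports failure instead of guessing. It uses std::regex from the C++ standard library; the struct ReceivedFrom, its field names and the function parse_received_from are made up for illustration. It assumes the header has already been unfolded onto one line, handles only the common "from NAME (RDNS [IPv4])" shape, and will miss IPv6 literals, nested comments and anything else unusual.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for the 'from' clause of one Received header.
struct ReceivedFrom {
    std::string helo;  // name the sending host announced
    std::string rdns;  // name inside the parentheses; empty if absent
    std::string ip;    // bracketed IPv4 literal from the TCP-Info part
};

// Returns true and fills `out` when the clause is recognised; returns false
// otherwise, so the caller has to cope with headers this sketch cannot read.
bool parse_received_from(const std::string& header, ReceivedFrom& out) {
    static const std::regex re(
        R"re(\bfrom\s+(\S+)\s+\((?:(\S+)\s+)?\[([0-9.]+)\]\))re",
        std::regex::icase);
    std::smatch m;
    if (!std::regex_search(header, m, re)) return false;
    out.helo = m[1].str();
    out.rdns = m[2].matched ? m[2].str() : std::string();
    out.ip = m[3].str();
    return true;
}
```

For the header in the question this fills helo with server.mymailhost.com, rdns with mail.mymailhost.com and ip with 126.43.75.123. A header with no 'from' clause makes the function return false, which is the case the calling code must be written to expect.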

bobince