Hi,

I'm trying to work out how to extract POP3 headers using this regex:

^(?[a-zA-Z-]+)(?(?=:).+)$

Delivered-To: [email protected]

The second group returns the ':' character as well, which I want to avoid. I've been struggling to work this out but can't.

Need collective wisdom :-)

+1  A: 

I would go with something like

/^([^:]+):(.*)$/

Then you would have

  • $1 - header name
  • $2 - value
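
As a minimal sketch of how the two groups come out, here is the same pattern applied in Python. The language choice and the sample address are assumptions for illustration, not something from the question:

```python
import re

# Pattern from the answer: everything before the first ':' is the name,
# everything after it is the value.
HEADER_RE = re.compile(r"^([^:]+):(.*)$")

line = "Delivered-To: user@example.com"  # hypothetical sample header

match = HEADER_RE.match(line)
name = match.group(1)            # equivalent of $1
value = match.group(2).strip()   # equivalent of $2, minus the space after ':'

print(name, "->", value)
```

The colon sits between the two groups, so it ends up in neither of them. The value still carries the space that follows the colon, hence the `strip()`.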
Sergej Andrejev
Very clever and minimal approach. Thanks, that worked.
Sir Psycho
Keep in mind that this is a very common trick that can be used in many situations.
Sergej Andrejev
I would make the quantifier on the [^:] character class possessive, in order to prevent needless backtracking: [^:]++
Geert
+2  A: 

Hi,

Just so you are aware, this will not handle wrapped headers. In fact, that regex will take a wrapped line and prepend it to the next real header, especially if the wrapped line doesn't contain a ":".

Building upon Sergej Andrejev's regex, this one avoids capturing the wrapped lines:

^([^:\s]+):(.*)$
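
To make the difference concrete, here is a small Python sketch that runs both patterns over a wrapped header. The sample text and variable names are made up for illustration, and the stricter pattern is written as a name class that excludes ':' and whitespace, in the spirit of the regex above:

```python
import re

# Hypothetical sample: a wrapped "Received" header followed by "Subject".
raw = (
    "Received: from a.example\n"
    "\tby b.example\n"
    "Subject: hello"
)

# Original pattern: [^:] also matches newlines, so a wrapped line can leak
# into the name of the following header.
loose = re.compile(r"^([^:]+):(.*)$", re.MULTILINE)

# Stricter variant: the name may contain neither ':' nor whitespace, so a
# line that starts with a tab or space can never start a match.
strict = re.compile(r"^([^:\s]+):(.*)$", re.MULTILINE)

loose_names = [m.group(1) for m in loose.finditer(raw)]
strict_names = [m.group(1) for m in strict.finditer(raw)]

print(loose_names)
print(strict_names)
```

With the first pattern the second "name" comes out as the wrapped line glued onto "Subject". With the stricter one the names are clean, but the wrapped text "by b.example" is skipped rather than attached to the Received value.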

However, the best thing to do is to actually read the headers line by line and parse accordingly. It's a pain (I've had to do it for production code), but it's the most accurate approach.
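
A rough sketch of what line-by-line parsing could look like, in Python. This is not the production code mentioned above; `parse_headers` and the sample message are hypothetical, and it assumes wrapped lines start with a space or tab and that a blank line ends the header block:

```python
def parse_headers(raw):
    """Parse a raw header block into a list of (name, value) pairs."""
    headers = []
    for line in raw.splitlines():
        if not line:
            break  # a blank line ends the header block
        if line[0] in " \t" and headers:
            # Wrapped line: append it to the previous header's value.
            name, value = headers[-1]
            headers[-1] = (name, value + " " + line.strip())
            continue
        name, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon at all
            headers.append((name.strip(), value.strip()))
    return headers


# Hypothetical sample message
raw = (
    "Received: from a.example\r\n"
    "\tby b.example\r\n"
    "Subject: hello\r\n"
    "\r\n"
    "Body: not a header\r\n"
)

parsed = parse_headers(raw)
print(parsed)
```

A list of pairs is used instead of a dict because headers such as Received can legitimately appear more than once. In Python specifically, the standard `email` package also understands wrapped headers, which may be preferable to hand-rolled parsing.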

Cheers!

Dave

dave wanta
A: 

Sorry, copied the wrong code:

^(\S+):\s((\s\S:))*)

It works with multi-line headers.

That regex is not going to work at all. Its syntax is invalid to begin with.
Geert
Sorry, copied the wrong code: ^(\S+):\s(([\s\S](?!^(\S+):))*) It works with multi-line headers.
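
For anyone wanting to try the corrected pattern, here is a quick Python check. The multiline flag and the sample text are assumptions; the flavour the pattern was written for isn't stated:

```python
import re

# Corrected pattern from the comment above. re.MULTILINE is assumed so that
# '^' matches at the start of every line, including inside the lookahead.
MULTILINE_HEADER_RE = re.compile(
    r"^(\S+):\s(([\s\S](?!^(\S+):))*)",
    re.MULTILINE,
)

# Hypothetical sample with one wrapped header
raw = (
    "Received: from a.example\n"
    "\tby b.example\n"
    "Subject: hello"
)

pairs = [(m.group(1), m.group(2)) for m in MULTILINE_HEADER_RE.finditer(raw)]
print(pairs)
```

The value of the wrapped header keeps its embedded newline and tab, so it probably wants collapsing afterwards, for example with `" ".join(value.split())`.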