I have a vCard application that needs to read vCard data. I found a regular expression that extracts the field name, encoding and field value from the file:

^(?<FIELDNAME>[\w-]{1,})(?:(?:;?)(?:ENCODING=(?<ENC>[^:;]*)|CHARSET=(?<CHARSET>[^:;]*))){0,2}:(?:(?<CONTENT>(?:[^\r\n]*=\r\n){1,}[^\r\n]*)|(?<CONTENT>[^\r\n]*))

This regular expression reads values like these fine:

ORG:Company
FN;ENCODING=QUOTED-PRINTABLE;CHARSET=UTF-8:RoguePlanetoid

However, I also want it to read values like this one without skipping them:

TEL;WORK;VOICE:0200 0000000

How can I modify the regular expression so that TEL;WORK;VOICE ends up in "FIELDNAME" and 0200 0000000 ends up in "CONTENT"?

I am unfamiliar with complex regular expressions and cannot figure out how to modify it. There is a regular expression that matches these lines:

^(?:TEL)([^:]*):(?<TEL>[^\r\n]*)

However, it only captures the field name as "TEL", and I need the whole TEL;WORK;VOICE value so I can tell the numbers apart in my application.


Ideally, the regular expression would also read the WORK and VOICE elements, the way the current one reads CHARSET and ENCODING, so that they can be treated as an attribute and a type, for example. However, anything that lets the regular expression read the whole TEL;WORK;VOICE as the FIELDNAME will be fine.


Edit

^(?<FIELDNAME>[^:]{1,})(?:(?:;?)(?:ENCODING=(?<ENC>[^:;]*)|CHARSET=(?<CHARSET>[^:;]*))){0,2}:(?:(?<CONTENT>(?:[^\r\n]*=\r\n){1,}[^\r\n]*)|(?<CONTENT>[^\r\n]*))

This reads up to the first colon, which covers the whole field name. However, it would be nice to store each semicolon-separated element in a separate item such as ATTRIBUTE or TYPE.

A: 

If all you want is to capture TEL;WORK;VOICE then this will do it:

^(.*?:)

This captures everything from the beginning of the line up to and including the first colon. To exclude the colon, simply move it outside the capturing parentheses.

Here's the full regex (without the named groups FIELDNAME and CONTENT):

^(.*?):(.*)$

So ^(.*?): captures everything up to the first colon, and (.*)$ matches everything after the first colon until the end of the line. You can add the group names FIELDNAME and CONTENT to the two capturing groups.
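As a rough illustration of the last point, here is a minimal C# sketch with the two groups named FIELDNAME and CONTENT. The class name `VCardLineDemo` and the method `TrySplit` are made up for this example, and it assumes each line is passed in without its line terminator.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VCardLineDemo
{
    // FIELDNAME: everything before the first colon (lazy match).
    // CONTENT: everything after the first colon, up to the end of the line.
    private static readonly Regex LineRegex = new Regex(
        @"^(?<FIELDNAME>.*?):(?<CONTENT>.*)$");

    // Hypothetical helper: returns false when the line contains no colon.
    public static bool TrySplit(string line, out string field, out string content)
    {
        Match m = LineRegex.Match(line);
        field = m.Success ? m.Groups["FIELDNAME"].Value : "";
        content = m.Success ? m.Groups["CONTENT"].Value : "";
        return m.Success;
    }
}
```

For `TEL;WORK;VOICE:0200 0000000` this should give `TEL;WORK;VOICE` as the field name and `0200 0000000` as the content. It does not split the parameters apart; the field name still has to be split on semicolons afterwards.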

ennuikiller
That sounds like what I need. However, where do I put the modification in my regex so that FIELDNAME is captured correctly, like all the other fields? I read FIELDNAME and CONTENT in my code to populate a field list.
RoguePlanetoid
A: 

I believe this does what you want. It's in C# because I'm not set up to test VB, but you shouldn't have any trouble converting it.

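using System;
using System.Text.RegularExpressions;

// FIELD is the property name, PARAM is captured once per ";..." segment,
// and CONTENT is the value plus any folded continuation lines.
Regex r = new Regex(
    @"^(?<FIELD>[^\s:;]+)(;(?<PARAM>[^;:]+))*:(?<CONTENT>.*(?>\r\n[ \t].*)*)$",
    RegexOptions.ExplicitCapture | RegexOptions.Multiline);
string target = @"TEL;WORK;VOICE:0200 0000000";
Match m = r.Match(target);
if (m.Success)
{
  Console.WriteLine("field name: {0}", m.Groups["FIELD"].Value);
  // A repeated group keeps every capture in .NET, not just the last one.
  foreach (Capture c in m.Groups["PARAM"].Captures)
  {
    Console.WriteLine("  type:  {0}", c.Value);
  }
  Console.WriteLine("content: {0}", m.Groups["CONTENT"].Value);
}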

EDIT: Now that I know where you got the regex from, I can see the author is trying to do too much work in the regex. "Encoding" and "charset" are just two of many possible parameter names; I don't see any reason to match those two by name and not any others. Just iterate through the "PARAM" captures like I did and handle each one as appropriate.

The author also allows for line folding, which probably does belong in the regex. The rules governing line folding seem pretty simple: if a line starts with a space or a tab, it's a continuation of the previous line. That also means the "FIELD" subexpression needs to be revised to disallow whitespace as well as colons and semicolons.

I've revised my regex and added the Multiline modifier, which should have been there all along. :-/
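One caveat when running this over a whole file: in .NET, `.` matches a carriage return, so with CRLF line endings the `.*` in CONTENT can consume the `\r`, and the `\r\n[ \t]` continuation then has nothing to match. The sketch below is one possible variant, not the original pattern. It uses `[^\r\n]*` instead of `.*`, accepts both CRLF and bare LF (an assumption about the input), and unfolds the value afterwards. The names `FoldedLineDemo`, `Parse` and `FoldRegex` are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FoldedLineDemo
{
    // [^\r\n]* never swallows a carriage return, so the continuation part
    // can see the line break. The trailing \r? lets $ match before the \n.
    private static readonly Regex PropertyRegex = new Regex(
        @"^(?<FIELD>[^\s:;]+)(;(?<PARAM>[^;:]+))*:" +
        @"(?<CONTENT>[^\r\n]*(?>\r?\n[ \t][^\r\n]*)*)\r?$",
        RegexOptions.ExplicitCapture | RegexOptions.Multiline);

    // Unfolding: drop each line break together with the single space or
    // tab that marks the next line as a continuation.
    private static readonly Regex FoldRegex = new Regex(@"\r?\n[ \t]");

    // Returns (field name, unfolded content) pairs in document order.
    public static List<KeyValuePair<string, string>> Parse(string vcardText)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in PropertyRegex.Matches(vcardText))
        {
            string content = FoldRegex.Replace(m.Groups["CONTENT"].Value, "");
            result.Add(new KeyValuePair<string, string>(
                m.Groups["FIELD"].Value, content));
        }
        return result;
    }
}
```

The PARAM captures are still available on each `Match` exactly as in the snippet above; they are left out of the return value here only to keep the sketch short.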

I feel I should mention that, if you're writing a complete vCard processing app, you probably shouldn't be building it on top of regexes. A non-regex solution will be easier to write (though not as much fun) and easier to maintain.
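To give an idea of what a non-regex version might look like, here is a minimal sketch that splits one already-unfolded line on the first colon and then on semicolons. The types `VCardProperty` and `VCardLineParser` are hypothetical names for this example. It deliberately ignores quoted parameter values and escaping, which a complete parser would need to handle.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container for one parsed vCard line.
public sealed class VCardProperty
{
    public string Name = "";
    public List<string> Parameters = new List<string>();
    public string Value = "";
}

public static class VCardLineParser
{
    // Parses a line such as "TEL;WORK;VOICE:0200 0000000".
    // Returns null when the line has no colon or no property name.
    public static VCardProperty ParseLine(string line)
    {
        int colon = line.IndexOf(':');
        if (colon <= 0) return null;

        // Everything before the first colon is NAME followed by ;PARAM items.
        string[] head = line.Substring(0, colon).Split(';');
        if (head[0].Length == 0) return null;

        var prop = new VCardProperty
        {
            Name = head[0],
            Value = line.Substring(colon + 1)
        };
        for (int i = 1; i < head.Length; i++)
        {
            prop.Parameters.Add(head[i]);
        }
        return prop;
    }
}
```

Each parameter can then be inspected individually, for example by checking whether it starts with `ENCODING=` or `CHARSET=`, instead of encoding those names in a pattern.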

Alan Moore
I got the regular expression from here: http://blog.smithfamily.dk/CategoryView,category,vcard.aspx. It was the only example that was not tied to the field names themselves.
RoguePlanetoid
Actually, this inspired me to find a solution that works, so I will mark this as the answer!
RoguePlanetoid
Okay, thanks. I was editing the answer when you did that, and SO didn't notify me. I still think my way is better, but if you're happy...
Alan Moore
A: 

The Regular Expression which works is:

^(?<FIELDNAME>[\w-]{1,})(?:(?:;?)(?:ENCODING=(?<ENC>[^:;]*)|CHARSET=(?<CHARSET>[^:;]*)|(?<PARAM>[^:;]+))){0,2}:(?:(?<CONTENT>(?:[^\r\n]*=\r\n){1,}[^\r\n]*)|(?<CONTENT>[^\r\n]*))

Hopefully someone else finds this useful, as it solved the problem of getting the parameters from the vCard data.
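For anyone reading the parameters in code, a small C# sketch of how the PARAM captures could be collected with the pattern above follows. The class `ParamCaptureDemo` and the method `GetParams` are names made up for this example; the pattern itself is unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParamCaptureDemo
{
    // The pattern from this answer, unchanged.
    private static readonly Regex FieldRegex = new Regex(
        @"^(?<FIELDNAME>[\w-]{1,})(?:(?:;?)(?:ENCODING=(?<ENC>[^:;]*)|CHARSET=(?<CHARSET>[^:;]*)|(?<PARAM>[^:;]+))){0,2}:(?:(?<CONTENT>(?:[^\r\n]*=\r\n){1,}[^\r\n]*)|(?<CONTENT>[^\r\n]*))");

    // Returns the PARAM captures in order, or null when the line does not match.
    public static List<string> GetParams(string line)
    {
        Match m = FieldRegex.Match(line);
        if (!m.Success) return null;

        var result = new List<string>();
        // Each repetition of the PARAM group is kept as a separate Capture.
        foreach (Capture c in m.Groups["PARAM"].Captures)
        {
            result.Add(c.Value);
        }
        return result;
    }
}
```

Note that the `{0,2}` quantifier allows at most two parameters, so a line with three, such as `TEL;WORK;VOICE;PREF:0200 0000000`, does not match at all. The `(;(?<PARAM>[^;:]+))*` form shown in the other answer does not have that limit.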

RoguePlanetoid
A: 

This is a pretty good and detailed blog post that describes parsing vCard fields and gives the regular expressions that it uses. It could be of help to you.

http://borick.blogspot.com/

Rick