Hello,
I use a library to parse an iCalendar file, but I don't understand the regex it uses to split a property line.
An iCalendar property line can take three different forms:

BEGIN:VEVENT
DTSTART;VALUE=DATE:20080402
RRULE:FREQ=YEARLY;WKST=MO

The library uses this regex that I would like to understand:

var matches:Array = data.match(/(.+?)(;(.*?)=(.*?)((,(.*?)=(.*?))*?))?:(.*)$/);
p.name = matches[1];
p.value = matches[9];                   
p.paramString = matches[2];

Thanks.

+4  A: 

That's a terrible regular expression! .* and .*? match as many (greedy) or as few (lazy) characters as possible, of any kind. They should only be used as a last resort. Improper use can result in catastrophic backtracking when the regex cannot match the input text. All you need to understand about this regular expression is that you don't want to write regexes like this.
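That said, if you just want to see which group ends up holding what, here is a rough sketch. It is written in TypeScript rather than ActionScript so it can be run anywhere; the regex literal is copied unchanged from the question, and the helper name splitWithOriginal is made up for illustration, not part of the library.

```typescript
// The library's pattern, copied unchanged from the question.
const originalPattern = /(.+?)(;(.*?)=(.*?)((,(.*?)=(.*?))*?))?:(.*)$/;

// Groups are numbered by their opening parenthesis, left to right:
//   1    = property name (lazy: grows one character at a time)
//   2    = the whole optional parameter part, including the leading ";"
//   3, 4 = the text before and after the first "=" inside group 2
//   5-8  = optional further ",x=y" pairs (5 = all of them, 6 = the last pair,
//          7 and 8 = its two halves)
//   9    = everything after the colon that ends the match of groups 1 and 2
// splitWithOriginal is a hypothetical helper used only for this illustration.
function splitWithOriginal(
  line: string
): { name: string; paramString: string | undefined; value: string } | null {
  const m = line.match(originalPattern);
  if (m === null) {
    return null; // no colon in the line, so the pattern cannot match
  }
  return { name: m[1], paramString: m[2], value: m[9] };
}

console.log(splitWithOriginal("BEGIN:VEVENT"));
console.log(splitWithOriginal("DTSTART;VALUE=DATE:20080402"));
console.log(splitWithOriginal("RRULE:FREQ=YEARLY;WKST=MO"));
```

For the three sample lines, group 2 is only filled in for the DTSTART line (";VALUE=DATE"); for the other two it stays undefined because the optional group is skipped.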

Let me show how I would approach the problem. Apparently the iCalendar File Format is line-based. Each line has a property and a value separated by a colon. The property can have parameters that are separated from it by a semicolon. This implies that a property cannot contain line breaks, semicolons or colons, that the optional parameters cannot contain line breaks or colons, and that the value cannot contain line breaks. This knowledge allows us to write an efficient regular expression that uses negated character classes:

([^\r\n;:]+)(;[^\r\n:]+)?:(.+)

Or in ActionScript:

var matches:Array = data.match(/([^\r\n;:]+)(;[^\r\n:]+)?:(.+)/);  
p.name = matches[1];
p.value = matches[3];
p.paramString = matches[2];
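
To try this outside of Flash, here is a minimal sketch of the same idea in TypeScript, run against the three sample lines from the question. The names parseProperty and ParsedProperty are my own, purely for illustration. It assumes each input is a single line and that parameter values never contain a colon; a quoted parameter value with a colon inside it would be split in the wrong place.

```typescript
// Same negated-character-class pattern as above.
const propertyPattern = /([^\r\n;:]+)(;[^\r\n:]+)?:(.+)/;

// Hypothetical result type, mirroring the p.name / p.value / p.paramString fields.
interface ParsedProperty {
  name: string;
  value: string;
  paramString: string | undefined; // includes the leading ";" when present
}

// Returns null instead of throwing when the line is not "name[;params]:value".
function parseProperty(line: string): ParsedProperty | null {
  const m = line.match(propertyPattern);
  if (m === null) {
    return null;
  }
  return { name: m[1], value: m[3], paramString: m[2] };
}

const samples = [
  "BEGIN:VEVENT",                // name=BEGIN, value=VEVENT, no params
  "DTSTART;VALUE=DATE:20080402", // name=DTSTART, params=;VALUE=DATE, value=20080402
  "RRULE:FREQ=YEARLY;WKST=MO",   // name=RRULE, value=FREQ=YEARLY;WKST=MO, no params
];

for (const line of samples) {
  console.log(parseProperty(line));
}
```

Note the null check: match returns null when a line has no colon, so indexing the result directly would fail on such a line. Also, because the value group is (.+), a line with an empty value such as "SUMMARY:" does not match; change it to (.*) if you need to accept those.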

As explained by RegexBuddy:

Match the regular expression below and capture its match into backreference number 1 «([^\r\n;:]+)»
   Match a single character NOT present in the list below «[^\r\n;:]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      A carriage return character «\r»
      A line feed character «\n»
      One of the characters “;:” «;:»
Match the regular expression below and capture its match into backreference number 2 «(;[^\r\n:]+)?»
   Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
   Match the character “;” literally «;»
   Match a single character NOT present in the list below «[^\r\n:]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      A carriage return character «\r»
      A line feed character «\n»
      The character “:” «:»
Match the character “:” literally «:»
Match the regular expression below and capture its match into backreference number 3 «(.+)»
   Match any single character that is not a line break character «.+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Jan Goyvaerts
+1 great explanation!
aSeptik