I'm trying to convert an existing PHP regular expression match to apply to a slightly different style of document.

Here's the original style of the document:

**FOODS - TYPE A** 
___________________________________ 
**PRODUCT** 
1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese; 
2) La Fe String Cheese 
**CODE** 
Sell by date going back to February 1, 2009 

And here is the working PHP regex match code. It only matches if the line is surrounded by asterisks, and it stores each side of the "-" as $m[1] and $m[2], respectively:

if ( preg_match('#^\*\*([^-]+)(?:-(.*))?\*\*$#', $line, $m) ) {
    // only for **header - subheader** $m[2] is set.
    if ( isset($m[2]) ) {
        return array(TYPE_HEADER, array(trim($m[1]), trim($m[2])));
    }
    else {
        return array(TYPE_KEY, array($m[1]));
    }
}

So, for line 1: $m[1] = "FOODS" AND $m[2] = "TYPE A"; Line 2 would be skipped; Line 3: $m[1] = "PRODUCT", etc.

The question: How would I rewrite the above regex match if the headers did not have the asterisks, but were still all-caps and at least 4 characters long? For example:

FOODS - TYPE A 
___________________________________ 
PRODUCT
1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese; 
2) La Fe String Cheese 
CODE
Sell by date going back to February 1, 2009 

Thank you.

+2  A: 

Along the lines of (don't forget the "u" flag for Unicode regexes):

^(?:\*\*)?(?=[^*]{4,})(\p{Lu}+)(?:\s*-\s*(\p{Lu}+))?(?:\*\*)?\s*$
^               # start of line
(?:\*\*)?       # two stars, optional
(?=[^*]{4,})    # followed by at least 4 non-star characters
(\p{Lu}+)       # group 1, Unicode upper case letters
(?:             # start no capture group
  \s*-\s*       #   space*, dash, space*
  (\p{Lu}+)     #   group 2, Unicode upper case letters
)?              # end no capture group, make optional
(?:\*\*)?       # two stars, optional
\s*             # optional trailing spaces
$               # end of line

EDIT: Simplified, as per the comments:

^(?=[A-Z -]{4,})([A-Z ]+)(?:-([A-Z ]+))?\s*$
^               # start of line
(?=[A-Z -]{4,}) # followed by at least 4 upper case characters, spaces or dashes
([A-Z ]+)       # group 1, upper case letters or space
(?:             # start no capture group
  -             #   a dash
  ([A-Z ]+)     #   group 2, upper case letters or space
)?              # end no capture group, make optional
\s*             # optional trailing spaces
$               # end of line

Contents of groups 1 and 2 must be trimmed before use.
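
As a rough illustration of the trimming step, here is a minimal sketch that plugs the simplified pattern into a function shaped like the one in the question. The function name `classify_line` and the constant values are assumptions made for this example; the question does not show how TYPE_HEADER and TYPE_KEY are defined.

```php
<?php
// Hypothetical stand-ins for the TYPE_HEADER / TYPE_KEY constants from the question.
const TYPE_HEADER = 'header';
const TYPE_KEY    = 'key';

// Classify one line; returns null when the line is not an all-caps header.
function classify_line(string $line): ?array
{
    $pattern = '/^(?=[A-Z -]{4,})([A-Z ]+)(?:-([A-Z ]+))?\s*$/';
    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    // $m[2] only exists for "HEADER - SUBHEADER" lines.
    if (isset($m[2])) {
        return [TYPE_HEADER, [trim($m[1]), trim($m[2])]];
    }
    return [TYPE_KEY, [trim($m[1])]];
}

// Sample lines taken from the question, trailing spaces included.
$lines = [
    'FOODS - TYPE A ',
    '___________________________________ ',
    'PRODUCT',
    '2) La Fe String Cheese ',
    'CODE',
    'Sell by date going back to February 1, 2009 ',
];

foreach ($lines as $line) {
    $result = classify_line($line);
    echo $result === null ? "skip: $line\n" : json_encode($result) . "\n";
}
```

Both captured groups keep the spaces around the dash, which is why trim() is applied to each before returning.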

Tomalak
This is good, but is there a simpler expression, since the asterisks will no longer be used, and all chars are English uppercase letters? I appreciate your detail, but I'm also trying to learn by example. Thanks.
Yaaqov
Well done, and clarified. Thank you. By the way, is there a tool you're aware of that automatically comments on the components of regular expressions? This is very helpful for a beginner like myself.
Yaaqov
@Yaaqov: Maybe. I don't know any such tool, though.
Tomalak
@Yaaqov: I would highly recommend RegexBuddy (http://www.regexbuddy.com/). It's not free but it's in my list of must-have tools when dealing with regex stuff.
Amry
A: 

So all you need to know is that the header starts with four uppercase ASCII letters? This should work:

'#^([A-Z]{4}[^-]*)(?:-(.*))?$#'
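
A quick way to see what this pattern accepts is to run it over a few lines. The sketch below is only an illustration; the sample line 'MILK and cookies' is made up to show that only the first four characters are checked for uppercase, so a mixed-case tail still matches.

```php
<?php
$pattern = '#^([A-Z]{4}[^-]*)(?:-(.*))?$#';

// Sample lines; the last one is hypothetical and mixed case after the first word.
$samples = ['FOODS - TYPE A', 'PRODUCT', 'Sell by date', 'MILK and cookies'];

$matched = [];
foreach ($samples as $line) {
    if (preg_match($pattern, $line, $m)) {
        // Drop the full match at index 0 and trim the captured groups.
        $matched[$line] = array_map('trim', array_slice($m, 1));
    }
}

print_r($matched);
```

If mixed-case lines like the last sample can occur in the real documents, a stricter character class for the rest of the line would be needed.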
Alan Moore
+1  A: 
^([A-Z]{4,}(?:[A-Z ]*[A-Z])?)(?:\s*-\s*([A-Z]{4,}(?:[A-Z ]*)?))?$

What about this one? It matches an uppercase header that starts with at least 4 letters, plus an optional subheader that again starts with at least 4 uppercase letters.
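
To make the 4-letter requirement concrete, here is a small sketch that runs the pattern against a few lines. The helper name `match_header` and the sample lines 'EGG - TYPE A' and 'FOODS - A' are assumptions for this example. Because the pattern consumes the spaces around the dash itself, group 1 comes back without padding.

```php
<?php
$pattern = '/^([A-Z]{4,}(?:[A-Z ]*[A-Z])?)(?:\s*-\s*([A-Z]{4,}(?:[A-Z ]*)?))?$/';

// Hypothetical helper: returns the captured groups, or null when the line does not match.
function match_header(string $pattern, string $line): ?array
{
    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    return array_slice($m, 1); // drop the full match at index 0
}

var_dump(match_header($pattern, 'FOODS - TYPE A')); // header and subheader
var_dump(match_header($pattern, 'PRODUCT'));        // header only
var_dump(match_header($pattern, 'EGG - TYPE A'));   // first word shorter than 4 letters
var_dump(match_header($pattern, 'FOODS - A'));      // subheader shorter than 4 letters
```

Note that a header whose first word is shorter than four letters is rejected even if the whole line is longer than four characters.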

Aurril
+1  A: 

The regular expression:

^(?=.{4})([^-]+)(?:-(.*))?$

The explanation:

^          # start of line
(?=.{4})   # look ahead to make sure there are at least 4 characters
([^-]+)    # capture all characters up to a dash, if there is one
(?:-(.*))? # optional: skip the dash and capture all remaining characters until EOL
$          # end of line

I assumed you were only interested in lines having at least 4 characters.

Also, I cheated a bit so that the regex will match any characters, not just English uppercase letters, since it leads to a simpler expression. Anyhow, if you want to make sure it only accepts uppercase letters, this should do it:

^(?=.{4})([A-Z\s]+)(?:-([A-Z\s]+))?$
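
The difference between the two patterns shows up on the body lines of the sample document. The following sketch compares them side by side; the variable names `$lenient` and `$strict` are labels chosen for this example only.

```php
<?php
$lenient = '/^(?=.{4})([^-]+)(?:-(.*))?$/';            // any characters
$strict  = '/^(?=.{4})([A-Z\s]+)(?:-([A-Z\s]+))?$/';   // uppercase letters and whitespace only

// Lines from the question, plus a hypothetical 3-character line.
$lines = [
    'FOODS - TYPE A',
    'CODE',
    'Sell by date going back to February 1, 2009',
    'ABC',
];

foreach ($lines as $line) {
    printf(
        "%-45s lenient=%d strict=%d\n",
        $line,
        preg_match($lenient, $line),
        preg_match($strict, $line)
    );
}
```

The lenient pattern also accepts ordinary text lines of four or more characters, so it is only suitable when headers are identified by some other means.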
Amry