Hi - I'm looking for a good example of using Regular Expressions in PHP to "reverse engineer" a form letter (with a known format, of course) that has been pasted into a multiline textbox and sent to a script for processing.

So, for example, let's assume this is the original plain-text input (taken from a USDA press release):

WASHINGTON, April 5, 2010 - North American Bison Co-Op, a New Rockford, N.D., establishment is recalling approximately 25,000 pounds of whole beef heads containing tongues that may not have had the tonsils completely removed, which is not compliant with regulations that require the removal of tonsils from cattle of all ages, the U.S. Department of Agriculture's Food Safety and Inspection Service (FSIS) announced today.

For clarity, the fields that are variables are highlighted below:

[pr_city=]WASHINGTON, [pr_date=]April 5, 2010 - [corp_name=]North American Bison Co-Op, a [corp_city=]New Rockford, [corp_state=]N.D., establishment is recalling approximately [amount=]25,000 pounds of [product=]whole beef heads containing tongues that may not have had the tonsils completely removed, which is not compliant with regulations that require [reason=]the removal of tonsils from cattle of all ages, the U.S. Department of Agriculture's Food Safety and Inspection Service (FSIS) announced today.

How could I efficiently extract the contents of the

  • pr_city
  • pr_date
  • corp_name
  • corp_city
  • corp_state
  • amount
  • product
  • reason

fields from my example?

Any help would be appreciated, thanks.

+3  A: 

Well, a regex that works on your example could look like this (line breaks introduced to keep this beast legible, need to be removed prior to use):

/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a 
(?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is 
recalling approximately (?P<amount>.*?) of (?P<product>.*?), 
which is not compliant with regulations that require (?P<reason>.*?), 
the U\.S\. Department of Agriculture\'s Food Safety and Inspection 
Service \(FSIS\) announced today\.$/

So, in PHP you could do

if (preg_match('/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a (?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is recalling approximately (?P<amount>.*?) of (?P<product>.*?), which is not compliant with regulations that require (?P<reason>.*?), the U\.S\. Department of Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\.$/', $subject, $regs)) {
    $prcity = $regs['pr_city'];
    $prdate = $regs['pr_date'];
    // ... etc.
} else {
    $result = "";
}

This assumes a couple of things: that there are no line breaks, and that the input is the entire string (not a larger string from which this part has to be extracted). I've tried to make sensible assumptions about legal values, but there is a very real chance that other inputs could break this, so more test cases are probably needed.
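As a minimal sketch of how the no-line-breaks assumption might be relaxed: since the text comes from a multiline textbox, whitespace can be collapsed to single spaces before matching. The function name `parse_recall_notice` is hypothetical; the pattern is the one above, split over several lines with string concatenation instead of being joined by hand.

```php
<?php
// Hypothetical helper: normalizes whitespace, then applies the pattern from this answer.
// Returns the named fields, or null if the text does not fit the known format.
function parse_recall_notice(string $input): ?array
{
    // A textarea paste may contain line breaks or doubled spaces;
    // the pattern expects single spaces throughout.
    $subject = trim(preg_replace('/\s+/', ' ', $input));

    $pattern = '/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a '
        . '(?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is '
        . 'recalling approximately (?P<amount>.*?) of (?P<product>.*?), '
        . 'which is not compliant with regulations that require (?P<reason>.*?), '
        . 'the U\.S\. Department of Agriculture\'s Food Safety and Inspection '
        . 'Service \(FSIS\) announced today\.$/';

    // preg_match returns 1 on a match, 0 on no match, false on error.
    if (preg_match($pattern, $subject, $regs) !== 1) {
        return null;
    }

    // Keep only the named groups; drop the numeric duplicates.
    return array_filter($regs, 'is_string', ARRAY_FILTER_USE_KEY);
}
```

Calling `parse_recall_notice($_POST['letter'])` (the field name `letter` is an assumption) would then give an array keyed by `pr_city`, `pr_date`, and so on, or `null` when the text does not match.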

Tim Pietzcker
Excellent. Thanks for the quick turnaround. Would you mind breaking down the following expressions? (?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?) - I'm still trying to learn regex. Much appreciated.
Yaaqov
Sure. `(?P<name>...)` denotes a named capturing group, so you can refer to a match by its name instead of its number. The syntax for this is rather inconsistent across regex flavors. `[^,]+` means "match one or more characters that are anything but commas", and `.*?` means "match any number of characters except newlines, trying to match as few as possible to make the overall match work".
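As a small sketch of these constructs in isolation (the sample string is a shortened, made-up variant of the press release, and the group name `first` is only for illustration):

```php
<?php
$sample = 'WASHINGTON, April 5, 2010 - Some Co-Op, a New Rockford';

// [^,]+ stops before the first comma; the named group stores the
// match under 'pr_city' as well as under the numeric index 1.
preg_match('/^(?P<pr_city>[^,]+),/', $sample, $m);
$city = $m['pr_city'];   // same value as $m[1]

// Lazy .*? takes as few characters as possible before a ", " follows.
preg_match('/^(?P<first>.*?), /', $sample, $lazy);

// Greedy .* takes as many characters as possible, so it runs to the LAST ", ".
preg_match('/^(?P<first>.*), /', $sample, $greedy);

var_dump($city, $lazy['first'], $greedy['first']);
```

Here the lazy version stops at `WASHINGTON`, while the greedy one swallows everything up to `Some Co-Op`.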
Tim Pietzcker
Makes sense - thank you
Yaaqov
+2  A: 

If the surrounding text is constant, then something like this partial regex could do the trick:

preg_match('/^(.*?), (.*?) - (.*?), a (.*?), (.*?), establishment is recalling approximately (.*?), which is not compliant with regulations that require (.*?), the U\.S\. Department of Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\./', $text, $matches);

// The apostrophe, dots and parentheses are escaped so they match literally.
// On the example input:
// $matches[1] is 'WASHINGTON'
// $matches[2] is 'April 5, 2010'
// $matches[3] ... etc...

If the surrounding text changes, then you're going to end up with a ton of false matches, no matches, etc. Essentially you'd need an AI to parse and understand press releases.

Marc B
+1  A: 

Edit: Please disregard this crazy answer, as the other two are better. I should probably delete it, but I'm keeping it up for reference.

I have a crazy idea that just might work: build an XML string from the input by adding markup, then parse it. It might look something like this (completely untested) code:

preg_replace('/([^,]*), ([^-]*)- ...etc.../', '<pr_city>\1</pr_city><pr_date>\2</pr_date> ...etc...', $text);

Parsing the XML afterwards is a needlessly complicated process that is best left to the PHP documentation: http://www.php.net/manual/en/function.xml-parse.php .

You could also consider converting it to JSON with this method, then using json_decode() to parse it. In any case, you have to think about what happens when " marks and > symbols appear in the input.

It might be easier to just match and remove one piece of the text at a time.
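A rough sketch of that piece-by-piece idea, using plain string functions rather than a regex. The helper name `take_until` is hypothetical, and the sample is only the first part of the press release:

```php
<?php
// Hypothetical helper: cuts everything before the first occurrence of
// $delimiter off the front of $text and returns it. The delimiter itself
// is discarded. Returns null (leaving $text untouched) if it is absent.
function take_until(string &$text, string $delimiter): ?string
{
    $pos = strpos($text, $delimiter);
    if ($pos === false) {
        return null;
    }
    $piece = substr($text, 0, $pos);
    $text  = substr($text, $pos + strlen($delimiter));
    return $piece;
}

$rest = 'WASHINGTON, April 5, 2010 - North American Bison Co-Op, a New Rockford, '
      . 'N.D., establishment is recalling approximately 25,000 pounds';

// Each call consumes one field plus the constant text that follows it.
$fields = [
    'pr_city'    => take_until($rest, ', '),
    'pr_date'    => take_until($rest, ' - '),
    'corp_name'  => take_until($rest, ', a '),
    'corp_city'  => take_until($rest, ', '),
    'corp_state' => take_until($rest, ', establishment is recalling approximately '),
];
```

After these calls `$rest` holds whatever has not been consumed yet, so a `null` field shows exactly where the input stopped matching the expected format.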

Joey Adams
Thanks - I'll take a look at that link.
Yaaqov