Good morning -

I'm interested in seeing an efficient way of parsing the values of a hierarchical text file (i.e., one that has a Title => Multiple Headings => Multiple Subheadings => Multiple Keys => Multiple Values) into a simple XML document. For the sake of simplicity, the answer would be written using:

  • Regex (preferably in PHP)
  • or, PHP code (e.g., if looping were more efficient)

Here's an example of an Inventory file I'm working with. Note that Header = FOODS, Sub-Header = Type (A, B...), Keys = PRODUCT (or CODE, etc.), and Values may span one or more lines.

**FOODS - TYPE A**
___________________________________
**PRODUCT**
1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese;
2) La Fe String Cheese
**CODE**
Sell by date going back to February 1, 2009
**MANUFACTURER**
Quesos Mi Pueblito, LLC, Passaic, NJ.
**VOLUME OF UNITS**
11,000 boxes
**DISTRIBUTION**
NJ, NY, DE, MD, CT, VA
___________________________________
**PRODUCT**
1) Peanut Brittle No Sugar Added;
2) Peanut Brittle Small Grind;
3) Homestyle Peanut Brittle Nuggets/Coconut Oil Coating
**CODE**
1) Lots 7109 - 8350 inclusive;
2) Lots 8198 - 8330 inclusive;
3) Lots 7075 - 9012 inclusive;
4) Lots 7100 - 8057 inclusive;
5) Lots 7152 - 8364 inclusive
**MANUFACTURER**
Star Kay White, Inc., Congers, NY.
**VOLUME OF UNITS**
5,749 units
**DISTRIBUTION**
NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN
**FOODS - TYPE B**
___________________________________
**PRODUCT**
Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice;
**CODE**
990-10/2 10/5
**MANUFACTURER**
San Mar Manufacturing Corp., Catano, PR.
**VOLUME OF UNITS**
384
**DISTRIBUTION**
PR

And here's the desired output (please excuse any XML syntactical errors):

<foods>
    <food type = "A" >
        <product>Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese</product>
        <product>La Fe String Cheese</product>
        <code>Sell by date going back to February 1, 2009</code>
        <manufacturer>Quesos Mi Pueblito, LLC, Passaic, NJ.</manufacturer>
        <volume>11,000 boxes</volume>
        <distribution>NJ, NY, DE, MD, CT, VA</distribution>
    </food>
    <food type = "A" >
        <product>Peanut Brittle No Sugar Added</product>
        <product>Peanut Brittle Small Grind</product>
        <product>Homestyle Peanut Brittle Nuggets/Coconut Oil Coating</product>
        <code>Lots 7109 - 8350 inclusive</code>
        <code>Lots 8198 - 8330 inclusive</code>
        <code>Lots 7075 - 9012 inclusive</code>
        <code>Lots 7100 - 8057 inclusive</code>
        <code>Lots 7152 - 8364 inclusive</code>
        <manufacturer>Star Kay White, Inc., Congers, NY.</manufacturer>
        <volume>5,749 units</volume>
        <distribution>NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN</distribution>
    </food>
    <food type = "B" >
        <product>Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice</product>
        <code>990-10/2 10/5</code>
        <manufacturer>San Mar Manufacturing Corp., Catano, PR</manufacturer>
        <volume>384</volume>
        <distribution>PR</distribution>
    </food>
</foods>
<!-- and so forth -->

So far, my approach (which might be quite inefficient with a huge text file) would be one of the following:

  1. Loops and multiple Select/Case statements: load the file into a string buffer and, while looping through each line, check whether it matches one of the header/subheader/key lines, append the appropriate XML tag to an XML string variable, and then add the child nodes based on IF statements about which key name is most recent (which seems time-consuming and error-prone, esp. if the text changes even slightly) -- OR

  2. Use REGEX (Regular Expressions) to find and replace key fields with appropriate xml tags, clean it up with an xml library, and export the xml file. Problem is, I barely use regular expressions, so I'd need some example-based help.

Any help or advice would be appreciated.

Thanks.

+1  A: 

See: http://www.tuxradar.com/practicalphp/21/5/6

This tells you how to parse a text file into tokens using PHP. Once parsed you can place it into anything you want.

You need to search for specific tokens in the file based on your criteria:

for example: PRODUCT

This gives you the XML Tag

Then 1) can have special meaning

1) Peanut Brittle...

This tells you what to put in the XML tag.

I do not know if this is the most efficient way to accomplish your task, but it is the way a compiler would parse a file, and it has the potential to be very accurate.
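
As a rough illustration of the token idea above, here is a minimal sketch that classifies a single line of the sample file. It is not the gettoken() function from the linked tutorial; the function name classify_line() and the token labels ('header', 'key', 'separator', 'item', 'text', 'blank') are made up for this example, and the patterns assume the file looks exactly like the sample in the question.

```php
<?php
// Hypothetical helper (not part of the linked tutorial): classify one line
// of the inventory file and return array(token label, captured parts...).
function classify_line(string $line): array {
    $line = trim($line);
    if ($line === '') {
        return ['blank'];
    }
    // **HEADER - SUBHEADER**, e.g. **FOODS - TYPE A**
    if (preg_match('#^\*\*([^*-]+)-([^*]+)\*\*$#', $line, $m)) {
        return ['header', trim($m[1]), trim($m[2])];
    }
    // **KEY**, e.g. **PRODUCT**; the key becomes the XML tag
    if (preg_match('#^\*\*([^*]+)\*\*$#', $line, $m)) {
        return ['key', trim($m[1])];
    }
    // a line of underscores starts a new record
    if (preg_match('#^_+$#', $line)) {
        return ['separator'];
    }
    // "1) some text;" is one entry of a numbered list of values
    if (preg_match('#^\d+\)\s*(.*?);?$#', $line, $m)) {
        return ['item', $m[1]];
    }
    // anything else is a plain single value
    return ['text', $line];
}
```

A parser would then loop over the lines, remember the most recent 'key' token, and emit one element per 'item' or 'text' token under that key.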

Todd Moses
Thanks, but this is for the actual mechanics of "parsing" - the critical piece missing here is how to parse a text file (or string) with different levels of hierarchy and save it into an XML file.
Yaaqov
You parse it, then you get every piece. You then assign levels of hierarchy based upon specifics in the token. This is how a compiler works. It looks like you could use the numbers 1) ... to assign levels for the XML.
Todd Moses
Can you modify your answer to show how to send different levels of hierarchy into an XML output?
Yaaqov
My point was that this process of "load line, **multiple** nested conditions, separate function call to add a node to xml file, loop" is feasible, but is this the **best** or most *efficient* way to do this, vs. running a handful of regular expressions to search and replace, and at least get a rough cut of an xml string finished?
Yaaqov
Thanks Todd for answering. I still am looking for a more "minimalistic" solution based on Regex (see the changed title), but I appreciate your time.
Yaaqov
+2  A: 

An example you can use as a starting point. At least I hope it gives you an idea...

<?php
define('TYPE_HEADER', 1);
define('TYPE_KEY', 2);
define('TYPE_DELIMETER', 3);
define('TYPE_VALUE', 4);

$datafile = 'data.txt';
$fp = fopen($datafile, 'rb') or die('!fopen');

// stores (the first) {header} in 'name' and the root simplexmlelement in 'element'
$container = array('name'=>null, 'element'=>null);
// stores the name for each item element, the value for the type attribute for subsequent item elements and the simplexmlelement of the current item element
$item = array('name'=>null, 'type'=>null, 'current_element'=>null);
// the last **key** encountered, used to create new child elements in the current item element when a value is encountered
$key = null;

while ( false!==($t=getstruct($fp)) ) {
  switch( $t[0] ) {
    case TYPE_HEADER:
      if ( is_null($container['element']) ) {
        // this is the first time we hit **header - subheader**
        $container['name'] = $t[1][0];
        // ugly hack, < . name . />
        $container['element'] = new SimpleXMLElement('<'.$container['name'].'/>');
        // each subsequent new item gets the new subheader as type attribute
        $item['type'] = $t[1][1];
        // dummy implementation: "deducing" the item names from header/container[name]
        $item['name'] = substr($t[1][0], 0, -1);
      }
      else {
        // hitting **header - subheader** the (second, third, nth) time 
        /*
        header must be the same as the first time (stored in container['name']).
        Otherwise you need another container element since 
        xml documents can only have one root element
        */
        if ( $container['name'] !== $t[1][0] ) {
          echo $container['name'], "!==",  $t[1][0], "\n";
          die('format error');
        }
        else {
          // subheader may have changed, store it for future item elements
          $item['type'] = $t[1][1];
        }
      }
      break;
    case TYPE_DELIMETER:
      assert( !is_null($container['element']) );
      assert( !is_null($item['name']) );
      assert( !is_null($item['type']) );
      /* that's maybe not a wise choice.
      You might want to check the complete item before appending it to the document.
      But the example is a hack anyway ...so create a new item element and append it to the container right away
      */
      $item['current_element'] = $container['element']->addChild($item['name']);
      // set the type-attribute according to the last **header - subheader** encountered
      $item['current_element']['type'] = $item['type'];
      break;
    case TYPE_KEY:
      $key = $t[1][0];
      break;
    case TYPE_VALUE:
      assert( !is_null($item['current_element']) );
      assert( !is_null($key) );
      // this is a value belonging to the "last" key encountered
      // create a new "key" element with the value as content
      // and add it to the current item element
      $tmp = $item['current_element']->addChild($key, $t[1][0]);
      break;
    default:
      die('unknown token');
  }
}

if ( !is_null($container['element']) ) {
  $doc = dom_import_simplexml($container['element']);
  $doc = $doc->ownerDocument;
  $doc->formatOutput = true;
  echo $doc->saveXML();
}
die;


/*
Take a look at gettoken() at http://www.tuxradar.com/practicalphp/21/5/6
It breaks the stream into much simpler pieces.
In the next step the parser would "combine" or structure the simple tokens into more complex things.
This function does both....
@return array(id, array(parameter))
*/
function getstruct($fp) {
  if ( feof($fp) ) {
    return false;
  }
  // shortcut: all we care about "happens" on one line
  // so let php read one line in a single step and then do the pattern matching
  $line = trim(fgets($fp));

  // this matches **key** and **header - subheader**
  if ( preg_match('#^\*\*([^-]+)(?:-(.*))?\*\*$#', $line, $m) ) {
    // only for **header - subheader** $m[2] is set.
    if ( isset($m[2]) ) {
      return array(TYPE_HEADER, array(trim($m[1]), trim($m[2])));
    }
    else {
      return array(TYPE_KEY, array($m[1]));
    }
  }
  // this matches _____________ and means "new item"
  else if ( preg_match('#^_+$#', $line, $m) ) {
    return array(TYPE_DELIMETER, array());
  }
  // any other non-empty line is a single value
  else if ( preg_match('#\S#', $line) ) {
    // you might want to filter the 1),2),3) part out here
    // could also be two different token types
    return array(TYPE_VALUE, array($line));
  }
  else {
    // skip empty lines, would be nicer with tail-recursion...
    return getstruct($fp);
  }
}

prints

<?xml version="1.0"?>
<FOODS>
  <FOOD type="TYPE A">
    <PRODUCT>1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese;</PRODUCT>
    <PRODUCT>2) La Fe String Cheese</PRODUCT>
    <CODE>Sell by date going back to February 1, 2009</CODE>
    <MANUFACTURER>Quesos Mi Pueblito, LLC, Passaic, NJ.</MANUFACTURER>
    <VOLUME OF UNITS>11,000 boxes</VOLUME OF UNITS>
    <DISTRIBUTION>NJ, NY, DE, MD, CT, VA</DISTRIBUTION>
  </FOOD>
  <FOOD type="TYPE A">
    <PRODUCT>1) Peanut Brittle No Sugar Added;</PRODUCT>
    <PRODUCT>2) Peanut Brittle Small Grind;</PRODUCT>
    <PRODUCT>3) Homestyle Peanut Brittle Nuggets/Coconut Oil Coating</PRODUCT>
    <CODE>1) Lots 7109 - 8350 inclusive;</CODE>
    <CODE>2) Lots 8198 - 8330 inclusive;</CODE>
    <CODE>3) Lots 7075 - 9012 inclusive;</CODE>
    <CODE>4) Lots 7100 - 8057 inclusive;</CODE>
    <CODE>5) Lots 7152 - 8364 inclusive</CODE>
    <MANUFACTURER>Star Kay White, Inc., Congers, NY.</MANUFACTURER>
    <VOLUME OF UNITS>5,749 units</VOLUME OF UNITS>
    <DISTRIBUTION>NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN</DISTRIBUTION>
  </FOOD>
  <FOOD type="TYPE B">
    <PRODUCT>Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice;</PRODUCT>
    <CODE>990-10/2 10/5</CODE>
    <MANUFACTURER>San Mar Manufacturing Corp., Catano, PR.</MANUFACTURER>
    <VOLUME OF UNITS>384</VOLUME OF UNITS>
    <DISTRIBUTION>PR</DISTRIBUTION>
  </FOOD>
</FOODS>
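
Two things in this output differ from the XML requested in the question: the values still carry their "1)" prefixes and trailing semicolons, and a key such as VOLUME OF UNITS contains spaces, which is not allowed in an XML element name. One possible cleanup is sketched below. The helper names element_name() and clean_value() are made up for this illustration, and the choice of lowercase names joined by underscores is an assumption (the question's desired output shortens that key to volume instead).

```php
<?php
// Hypothetical helper: turn a key such as "VOLUME OF UNITS" into a name
// that is safe to use as an XML element name.
function element_name(string $key): string {
    // lowercase, then collapse every run of other characters into one underscore
    $name = preg_replace('#[^a-z0-9]+#', '_', strtolower(trim($key)));
    $name = trim($name, '_');
    // element names must not be empty or start with a digit
    if ($name === '' || ctype_digit($name[0])) {
        $name = '_' . $name;
    }
    return $name;
}

// Hypothetical helper: strip a leading "1) " list marker and a trailing ";".
function clean_value(string $value): string {
    $value = preg_replace('#^\d+\)\s*#', '', trim($value));
    return rtrim($value, "; \t");
}
```

In the example above, these could be applied in the TYPE_KEY branch (`$key = element_name($t[1][0]);`) and in the TYPE_VALUE branch (`addChild($key, clean_value($t[1][0]))`).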

Unfortunately, the status of the PHP module for ANTLR is currently "Runtime is in alpha status.", but it might be worth a try anyway...

VolkerK
This is a great example - I can take it from here. Thanks a lot for your help!
Yaaqov
BTW, did I miss something? Where did you reference ANTLR in your code?
Yaaqov
Oh no no, I didn't use ANTLR for the example. I just urge you to take a look at the project even though PHP is not a viable target platform yet.
VolkerK
A: 

Instead of Regex or PHP, use the XSLT 2.0 unparsed-text() function to read the file (see http://www.biglist.com/lists/xsl-list/archives/200508/msg00085.html).

Andreas
A: 

Another hint for an XSLT 1.0 solution is here: http://bytes.com/topic/net/answers/808619-read-plain-file-xslt-1-0-a

Andreas
I'll have to take a look at this for future projects - thanks.
Yaaqov