ansaurus

Question

How best to use Regular Expressions to convert a Hierarchical Text File into XML?

Answer 1

+1 A:

See: http://www.tuxradar.com/practicalphp/21/5/6

This explains how to parse a text file into tokens using PHP. Once the file is parsed, you can put the tokens into any output format you want.

You need to search for specific tokens in the file based on your criteria:

For example: PRODUCT

This gives you the XML tag.

Then a prefix such as 1) can have a special meaning:

1) Peanut Brittle...

This tells you what to put inside the XML tag.

I do not know if this is the most efficient way to accomplish your task, but it is the way a compiler would parse a file, and it has the potential to be very accurate.
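A minimal sketch of that idea, assuming value lines carry a leading "1)" style prefix as in the example above. The helper name parseNumberedValue and the n attribute are hypothetical and do not come from the linked tutorial:

```php
<?php
// Hypothetical helper: split a value line such as "1) Peanut Brittle;" into
// its item number and its text. Lines without a leading "N)" get a null number.
function parseNumberedValue(string $line): array
{
    $line = trim($line);
    if (preg_match('/^(\d+)\)\s*(.*)$/', $line, $m)) {
        // Drop the trailing ";" that separates list entries.
        return ['number' => (int) $m[1], 'text' => rtrim($m[2], '; ')];
    }
    return ['number' => null, 'text' => rtrim($line, '; ')];
}

// Build one element per value; the item number becomes an attribute
// (the attribute name "n" is an assumption made for this sketch).
$xml = new SimpleXMLElement('<FOOD/>');
$lines = ['1) Peanut Brittle No Sugar Added;', '2) Peanut Brittle Small Grind;'];
foreach ($lines as $line) {
    $value = parseNumberedValue($line);
    $product = $xml->addChild('PRODUCT', $value['text']);
    if ($value['number'] !== null) {
        $product->addAttribute('n', (string) $value['number']);
    }
}
echo $xml->asXML();
```

Keeping the number as an attribute rather than in the text is only one option; it could just as well be discarded or used to decide the nesting level.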

Todd Moses 2010-02-18 19:27:11

Thanks, but this covers the actual mechanics of "parsing". The critical piece missing here is how to parse a text file (or string) with different levels of hierarchy and save it into an XML file.

Yaaqov 2010-02-18 19:44:33

You parse it, then you get every piece. You then assign levels of hierarchy based upon specifics in the token. This is how a compiler works. It looks like you could use the numbers 1) ... to assign levels for the XML.

Todd Moses 2010-02-18 19:48:20

Can you modify your answer to show how to send different levels of hierarchy into an XML output?

Yaaqov 2010-02-18 19:51:23

My point was that this process of "load line, **multiple** nested conditions, separate function call to add a node to the XML file, loop" is feasible, but is it the **best** or most *efficient* way to do this, compared with running a handful of regular expressions to search and replace and at least get a rough cut of an XML string finished?

Yaaqov 2010-02-18 20:12:27

Thanks, Todd, for answering. I am still looking for a more "minimalistic" solution based on regex (see the changed title), but I appreciate your time.

Yaaqov 2010-02-18 22:02:14

Answer 2

+2 A:

An example you can use as a starting point. At least I hope it gives you an idea...

<?php
define('TYPE_HEADER', 1);
define('TYPE_KEY', 2);
define('TYPE_DELIMETER', 3);
define('TYPE_VALUE', 4);

$datafile = 'data.txt';
$fp = fopen($datafile, 'rb') or die('!fopen');

// stores (the first) {header} in 'name' and the root simplexmlelement in 'element'
$container = array('name'=>null, 'element'=>null);
// stores the name for each item element, the value for the type attribute for subsequent item elements and the simplexmlelement of the current item element
$item = array('name'=>null, 'type'=>null, 'current_element'=>null);
// the last **key** encountered, used to create new child elements in the current item element when a value is encountered
$key = null;

while ( false!==($t=getstruct($fp)) ) {
  switch( $t[0] ) {
    case TYPE_HEADER:
      if ( is_null($container['element']) ) {
        // this is the first time we hit **header - subheader**
        $container['name'] = $t[1][0];
        // ugly hack, < . name . />
        $container['element'] = new SimpleXMLElement('<'.$container['name'].'/>');
        // each subsequent new item gets the new subheader as type attribute
        $item['type'] = $t[1][1];
        // dummy implementation: "deducing" the item name from header/container[name] by dropping the last character (FOODS -> FOOD)
        $item['name'] = substr($t[1][0], 0, -1);
      }
      else {
        // hitting **header - subheader** the (second, third, nth) time
        /*
        header must be the same as the first time (stored in container['name']).
        Otherwise you need another container element since 
        xml documents can only have one root element
        */
        if ( $container['name'] !== $t[1][0] ) {
          echo $container['name'], "!==",  $t[1][0], "\n";
          die('format error');
        }
        else {
          // subheader may have changed, store it for future item elements
          $item['type'] = $t[1][1];
        }
      }
      break;
    case TYPE_DELIMETER:
      assert( !is_null($container['element']) );
      assert( !is_null($item['name']) );
      assert( !is_null($item['type']) );
      /* that's maybe not a wise choice.
      You might want to check the complete item before appending it to the document.
      But the example is a hack anyway ...so create a new item element and append it to the container right away
      */
      $item['current_element'] = $container['element']->addChild($item['name']);
      // set the type-attribute according to the last **header - subheader** encountered
      $item['current_element']['type'] = $item['type'];
      break;
    case TYPE_KEY:
      $key = $t[1][0];
      break;
    case TYPE_VALUE:
      assert( !is_null($item['current_element']) );
      assert( !is_null($key) );
      // this is a value belonging to the "last" key encountered
      // create a new "key" element with the value as content
      // and add it to the current item element
      $tmp = $item['current_element']->addChild($key, $t[1][0]);
      break;
    default:
      die('unknown token');
  }
}

if ( !is_null($container['element']) ) {
  $doc = dom_import_simplexml($container['element']);
  $doc = $doc->ownerDocument;
  $doc->formatOutput = true;
  echo $doc->saveXML();
}
die;


/*
Take a look at gettoken() at http://www.tuxradar.com/practicalphp/21/5/6
It breaks the stream into much simpler pieces.
In the next step the parser would "combine" or structure the simple tokens into more complex things.
This function does both....
@return array(id, array(parameter)) or false at end of file
*/
function getstruct($fp) {
  if ( feof($fp) ) {
    return false;
  }
  // shortcut: all we care about "happens" on one line
  // so let php read one line in a single step and then do the pattern matching
  $line = trim(fgets($fp));

  // this matches **key** and **header - subheader**
  if ( preg_match('#^\*\*([^-]+)(?:-(.*))?\*\*$#', $line, $m) ) {
    // only for **header - subheader** $m[2] is set.
    if ( isset($m[2]) ) {
      return array(TYPE_HEADER, array(trim($m[1]), trim($m[2])));
    }
    else {
      return array(TYPE_KEY, array($m[1]));
    }
  }
  // this matches _____________ and means "new item"
  else if ( preg_match('#^_+$#', $line, $m) ) {
    return array(TYPE_DELIMETER, array());
  }
  // any other non-empty line is a single value
  else if ( preg_match('#\S#', $line) ) {
    // you might want to filter the 1),2),3) part out here
    // could also be two different token types
    return array(TYPE_VALUE, array($line));
  }
  else {
    // skip empty lines, would be nicer with tail-recursion...
    return getstruct($fp);
  }
}

prints

<?xml version="1.0"?>
<FOODS>
  <FOOD type="TYPE A">
    <PRODUCT>1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese;</PRODUCT>
    <PRODUCT>2) La Fe String Cheese</PRODUCT>
    <CODE>Sell by date going back to February 1, 2009</CODE>
    <MANUFACTURER>Quesos Mi Pueblito, LLC, Passaic, NJ.</MANUFACTURER>
    <VOLUME OF UNITS>11,000 boxes</VOLUME OF UNITS>
    <DISTRIBUTION>NJ, NY, DE, MD, CT, VA</DISTRIBUTION>
  </FOOD>
  <FOOD type="TYPE A">
    <PRODUCT>1) Peanut Brittle No Sugar Added;</PRODUCT>
    <PRODUCT>2) Peanut Brittle Small Grind;</PRODUCT>
    <PRODUCT>3) Homestyle Peanut Brittle Nuggets/Coconut Oil Coating</PRODUCT>
    <CODE>1) Lots 7109 - 8350 inclusive;</CODE>
    <CODE>2) Lots 8198 - 8330 inclusive;</CODE>
    <CODE>3) Lots 7075 - 9012 inclusive;</CODE>
    <CODE>4) Lots 7100 - 8057 inclusive;</CODE>
    <CODE>5) Lots 7152 - 8364 inclusive</CODE>
    <MANUFACTURER>Star Kay White, Inc., Congers, NY.</MANUFACTURER>
    <VOLUME OF UNITS>5,749 units</VOLUME OF UNITS>
    <DISTRIBUTION>NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN</DISTRIBUTION>
  </FOOD>
  <FOOD type="TYPE B">
    <PRODUCT>Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice;</PRODUCT>
    <CODE>990-10/2 10/5</CODE>
    <MANUFACTURER>San Mar Manufacturing Corp., Catano, PR.</MANUFACTURER>
    <VOLUME OF UNITS>384</VOLUME OF UNITS>
    <DISTRIBUTION>PR</DISTRIBUTION>
  </FOOD>
</FOODS>
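Note that the VOLUME OF UNITS elements in this output contain spaces, which XML element names do not allow, so a strict XML parser would reject the document. A minimal sketch of one way to handle it, assuming it is acceptable to replace unsupported characters with underscores (the helper name toElementName is hypothetical, and it deliberately restricts names to a plain ASCII subset of what XML permits):

```php
<?php
// Hypothetical helper: turn a key such as "VOLUME OF UNITS" into a string
// that can be used as an XML element name.
function toElementName(string $key): string
{
    // Keep ASCII letters, digits, "_", "." and "-"; collapse anything else to "_".
    $name = preg_replace('/[^A-Za-z0-9_.-]+/', '_', trim($key));
    // An element name must not start with a digit, "-" or ".".
    if ($name === '' || !preg_match('/^[A-Za-z_]/', $name)) {
        $name = '_' . $name;
    }
    return $name;
}

$item = new SimpleXMLElement('<FOOD/>');
$item->addChild(toElementName('VOLUME OF UNITS'), '11,000 boxes');
echo $item->asXML();
```

In the script above, this would be applied where the key is stored, for example $key = toElementName($t[1][0]); in the TYPE_KEY case.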

Unfortunately, the status of the PHP module for ANTLR is currently "Runtime is in alpha status", but it might be worth a try anyway...

VolkerK 2010-02-18 22:18:54

This is a great example - I can take it from here. Thanks a lot for your help!

Yaaqov 2010-02-18 22:34:48

BTW, did I miss something? Where did you reference ANTLR in your code?

Yaaqov 2010-02-18 22:42:27

Oh no no, I didn't use ANTLR for the example. I just urge you to take a look at the project, even though PHP is not a viable target platform yet.

VolkerK 2010-02-18 22:59:25

Answer 3

A:

Instead of regex or PHP, use the XSLT 2.0 unparsed-text() function to read the file (see http://www.biglist.com/lists/xsl-list/archives/200508/msg00085.html).

Andreas 2010-02-19 08:37:21

Answer 4

A:

Another hint for an XSLT 1.0 solution is here: http://bytes.com/topic/net/answers/808619-read-plain-file-xslt-1-0-a

Andreas 2010-02-19 08:42:51

I'll have to take a look at this for future projects - thanks.

Yaaqov 2010-02-19 13:13:48
