I've split a large body of XHTML into individual array elements, and I now need to iterate through them and split the array at regular intervals. That's not a problem in itself, but I want to ensure I don't split it in the middle of an XHTML element. The array looks like this:

[41] => <p>
[42] => materials
[43] => and
[44] => dosage
[45] => forms:</p>
[46] => <ul>
[47] => <li>
[48] => Drug
[49] => substance:
[50] => small
[51] => and
[52] => biomolecule</li>
[53] => <li>
[54] => Excipients</li>
[55] => <li>
[56] => Solid
[57] => oral
[58] => dosages</li>

So if I wanted to split the array at key 50, I would be splitting an unordered list in two, which is no good.

I'd like to iterate through and find the start and end points of all tags, bearing in mind that an unordered list could be nested inside several others.

Here's what I've got so far (granted, it's a little messy):

// Find all xhtml tags
$pattern_to_find_opening_tag = "?????";
$pattern_to_find_closing_tag = "?????";

$tags = array(); $i=0;
foreach ($words as $key => $word)
{
  // If we find an opening tag, add it to the array
  if ( preg_match($pattern_to_find_opening_tag,$word,$matches) )
  {
    // The opened and closed keys represent the tag's position in the words array
    $tags[$i]['tag'] = $matches[0];
    $tags[$i]['opened'] = $key;
    $tags[$i]['closed'] = false;
    $i++;
  }
  // If we find a closing tag, find its opening position
  elseif ( preg_match($pattern_to_find_closing_tag,$word,$matches) )
  {
    // Walk the recorded tags from most recent to oldest, preserving their keys
    $top_down_tags = array_reverse($tags, true);
    foreach ($top_down_tags as $tag_key => $tag)
    {
      // Close the most recent matching tag that is still open, then stop
      if ($tag['tag'] == $matches[0] && $tag['closed'] === false)
      {
        $tags[$tag_key]['closed'] = $key;
        break;
      }
    }
  }
}

Chances are I'm way off the mark with this, being fairly unaccustomed to regex, so I'd appreciate any help whatsoever! Thanks guys & gals.

A: 

You might want to look into Stream Wrappers or SimpleXML to process the HTML. Also, it would help to know a little more about what you are trying to achieve. Why do you want to split the XHTML? To me, this sounds like an approach that doesn't really fit the use case.

Edit: After reading your comments, I don't think this is something you should try to solve at the markup level. It's all about presentation. Check these articles about multi-column layouts at quirksmode, alistapart and cvwdesign.

Gordon
It's a function to split a string of user-generated XHTML into chunks. I'm looking for opening and closing tags because I don't want to split a chunk in the middle of an open XHTML element, which would obviously break the markup. The $words array is generated using $words = preg_split('/\s/', $stringfromdb, -1, PREG_SPLIT_NO_EMPTY);
Wil
But why do you want to split the XHTML into chunks at all?
Gordon
To break it into columns, basically. The function will find convenient points in the XHTML to close the previous column div and open a new one. The column lengths are determined either by a percentage of the total length or by a specific word count.
Wil
A: 

Exploding the XHTML into a large array like this has made life harder for you, because the hierarchy has disappeared. I think you should rethink this approach.

Maybe use a regex first to extract whole tags into an array for splitting?

Update

Here is an example capturing pattern that you could apply repeatedly to a document to work from the outside in (possibly):

<([^<>]+)>.*?</\1>

See the examples in the manual for how to escape this pattern correctly. More information on capturing HTML tags with regexes.
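
As a rough sketch of how the pattern above might be applied in PHP: the sample string and the $blocks array are made up for illustration, and the "~" delimiter is just one way to avoid escaping the "/" in the closing tag. It assumes tags without attributes and no element nested inside another of the same name (a ul inside a ul would stop at the first closing tag).

```php
<?php
// Hypothetical sample input, modelled on the array in the question.
$html = '<p>materials and dosage forms:</p>'
      . '<ul><li>Drug substance: small and biomolecule</li>'
      . '<li>Excipients</li></ul>';

// "~" delimits the pattern; the "s" modifier lets "." match newlines
// so that an element spanning several lines is still captured.
$pattern = '~<([^<>]+)>.*?</\1>~s';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$blocks = array();
foreach ($matches as $m) {
    // $m[0] is the whole element, $m[1] the captured tag name.
    $blocks[] = array('tag' => $m[1], 'html' => $m[0]);
}
```

Each entry in $blocks is then a complete outermost element, so whole entries can be appended to a chunk until it is full.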

Greg K
That would work, because I could happily keep adding the resulting tags and their contents to chunks until they were full. Perhaps then I just need an expression to find XHTML tags and return a) the tag and b) the contents, and I can go from there. Any advice for that?
Wil
A: 

Since an XHTML document has a root node, splitting anywhere inside will at least split that root node.

If your input consists of individual XHTML nodes without a root node, a regular expression is still the wrong way to achieve what you want to do, because XHTML is not a regular language.
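
To illustrate the point, here is a minimal sketch in which a regular expression only recognises individual tags, while an explicit stack keeps track of nesting. The function name safe_split_points is hypothetical, and it assumes a $words array like the one in the question, with tags that carry no attributes (so no tag is spread over two words).

```php
<?php
/**
 * Returns the keys of $words after which no element is open, i.e. the
 * positions where the array can be cut without splitting an element.
 * Assumption: tags have no attributes, so every tag sits inside one word.
 */
function safe_split_points(array $words)
{
    $open = array();  // stack of currently open tag names
    $safe = array();

    foreach ($words as $key => $word) {
        // Find every tag inside this word, in order of appearance.
        preg_match_all('~<(/?)([a-z][a-z0-9]*)\s*(/?)>~i', $word, $tags, PREG_SET_ORDER);
        foreach ($tags as $t) {
            if ($t[3] === '/') {
                continue; // self-closing tag such as <br/>: nesting unchanged
            }
            $name = strtolower($t[2]);
            if ($t[1] === '/') {
                // Closing tag: pop only if it matches the innermost open tag.
                if (end($open) === $name) {
                    array_pop($open);
                }
            } else {
                $open[] = $name;
            }
        }
        if (empty($open)) {
            $safe[] = $key;
        }
    }
    return $safe;
}
```

A column break can then be placed after the safe key closest to the desired word count. Mismatched closing tags are simply ignored here, which is one reason a real parser is the sturdier choice.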

The proper tool is an XHTML or XML parser. If you can't find one that works without assuming the whole document sits under a single root node, you can write one yourself. That's not too hard, since XML is designed to be easy to parse.
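
One possible workaround for the missing root node, sketched with PHP's DOM extension: wrap the fragment in a temporary root element, parse it, and group the top-level nodes into columns. The function name split_into_columns and the <root> wrapper are assumptions for illustration, and the sketch assumes the fragment is otherwise well-formed and free of named entities such as &nbsp;.

```php
<?php
/**
 * Groups the top-level nodes of an XHTML fragment into columns of roughly
 * $wordsPerColumn words, never cutting inside an element.
 * Returns an array of XHTML strings, or false if the fragment is not
 * well-formed. The word count is approximate.
 */
function split_into_columns($fragment, $wordsPerColumn)
{
    $doc = new DOMDocument();
    // Wrap the fragment in a temporary root so the XML parser accepts it.
    if (!$doc->loadXML('<root>' . $fragment . '</root>')) {
        return false;
    }

    $columns = array('');
    $count = 0;
    foreach ($doc->documentElement->childNodes as $node) {
        $words = str_word_count($node->textContent);
        // Start a new column if the current one has content and would overflow.
        if ($count > 0 && $count + $words > $wordsPerColumn) {
            $columns[] = '';
            $count = 0;
        }
        // Serialise the whole node, so its tags always stay together.
        $columns[count($columns) - 1] .= $doc->saveXML($node);
        $count += $words;
    }
    return $columns;
}
```

An element longer than the limit still ends up whole in a column of its own; splitting such an element would mean recursing into its children and reopening the parent tag in the next column.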

Svante
It isn't an entire document, simply a body of formatted, user-generated content. There will only really be tags like p, ul and blockquote in there.
Wil