I'm not very good with regex, so I'd appreciate help with this one (it may be trivial).

[update] First, I'm not looking for the best way of manipulating XML (SimpleXMLElement, DOM, etc. are fine). I'm just looking for this regex, outside the context of XML.

I have XML like this:

<myxml>
<node>21</node> som text with <entite>some</entite> other <b>nodes</b>
<node>22</node> some text
</myxml>

I would like to extract each <node> together with all the other elements and text that follow it, up to the next <node>. The result could look like this:

Array {
 [0] = "<node>21</node> som text with <entite>some</entite> other <b>nodes</b>",
 [1] = "<node>22</node> some text"
}

I don't want to use DOMElement for parsing the XML, so I'm really looking for a regex.

Thanks if you have an idea.

+5  A: 

Please don't use regexes to parse XML. That's what XML parsers are for.

PHP has many built right in. Try the DOM or SimpleXML on for size. Given your requirement of picking up text nodes between two sibling tags, you might also consider XMLReader; it may well be easier to work with for this specific task.
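To make the parser suggestion concrete, here is a minimal sketch using the DOM extension. It assumes the input is well-formed XML and that every <node> is a direct child of the root element; the variable names ($chunks, $current) are only illustrative and not from the question.

```php
<?php

// Sample input copied from the question.
$xml = <<<EOT
<myxml>
<node>21</node> som text with <entite>some</entite> other <b>nodes</b>
<node>22</node> some text
</myxml>
EOT;

$doc = new DOMDocument();
$doc->loadXML($xml);

$chunks  = [];    // finished chunks, one per <node>
$current = null;  // chunk being built; null until the first <node> is seen

// Walk the direct children of the root element in document order.
foreach ($doc->documentElement->childNodes as $child) {
    if ($child instanceof DOMElement && $child->nodeName === 'node') {
        // A new <node> closes the previous chunk and starts a new one.
        if ($current !== null) {
            $chunks[] = trim($current);
        }
        $current = '';
    }
    if ($current !== null) {
        // saveXML() with a node argument serializes only that node.
        $current .= $doc->saveXML($child);
    }
}

// Flush the last chunk, if any.
if ($current !== null) {
    $chunks[] = trim($current);
}

print_r($chunks);
```

Each chunk is rebuilt by serializing the sibling nodes, so the strings come out of the parser rather than being cut from the raw text.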

Charles
Parsing HTML, XHTML, or XML with regular expressions is just asking for trouble. In fact, there's a (somewhat) famous post here on Stack Overflow on this subject, but I don't know where it is right now.
Thomas Owens
With SimpleXML you can't manipulate text blocks... The DOM is fine, I use it for other parts of my app, but I need this regex for some specific reasons. It's a regex question, not one about XML processing and parsing :) I just want to have the expression.
rubijn
Note that the user explicitly says that (s)he doesn't want to use the DOM, perhaps it failed, or the XML is not real XML, or (s)he wants to experiment with regexes?
Abel
@Abel, that phrase was added by the asker after I created my answer... and doesn't change the answer at all. Parsing XML with regular expressions [may invoke the wrath of Zalgo himself](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454).
Charles
I didn't mean it changes your answer and I wasn't aware he added it later. Yes, it's evil, I know. It somehow reminds me of earlier famous XML frameworks that were solely based on regex-parsing (one such in Perl iirc). See also his own remark under your answer: there's apparently a pressing reason to risk Zalgo's wrath :)
Abel
Funny that the discussion is about good and evil but not about the basic answer: a regex. Next time I'll ask my question out of context, without any trace of XML, and mask my real goal... That way I might get an answer rather than doctrine...
rubijn
@rubijn, your regex could easily have created XML that was not well formed. For example, it could accidentally remove critical whitespace between or inside of attributes. This would mean that your regex could effectively *destroy* the XML. This is why you should use an XML parser instead.
Charles
A: 

Use splitting to chunk this down:

<?php

$str = <<<EOT
<myxml>
<node>21</node> som text with <entite>some</entite> other <b>nodes</b>
<node>22</node> some text
</myxml>
EOT;

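// Split at every position that is followed by an opening <node ...> tag or by </myxml>,
// then drop the first chunk ("<myxml>") and the last one ("</myxml>").
$res = array_slice( preg_split( "~(?=<node(?:[^>]|\".*?\"|'.*?')*>|</myxml>)~", $str ), 1, -1 );
print_r( $res );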

Breakdown of the expression:

(?=           # lookahead: match the position just before
  <node       # "<node"
  (?:         # match and don't capture this group
    [^>]        # match non ">"
    |           # OR
    \".*?\"     # match '"' and anything (don't be greedy) until the next '"'
    |           # OR
    '.*?'       # match "'" and anything (don't be greedy) until the next "'"
  )*          # ... as often as you like
  >           # ">"
  |           # OR
  </myxml>    # "</myxml>"
)             # end of lookahead

You can throw out the (?:[^>]|\".*?\"|'.*?')* part if you are sure that <node> never has any attributes.
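A minimal sketch of that simplified variant, assuming <node> never carries attributes and that the wrapper element is always <myxml>. The trim() step is an addition not in the original snippet; it only removes the newline left at the end of each piece.

```php
<?php

// Same sample input as in the snippet above.
$str = <<<EOT
<myxml>
<node>21</node> som text with <entite>some</entite> other <b>nodes</b>
<node>22</node> some text
</myxml>
EOT;

// Split before every literal "<node>" and before the closing "</myxml>".
$parts = preg_split('~(?=<node>|</myxml>)~', $str);

// Drop the leading "<myxml>" chunk and the trailing "</myxml>" chunk,
// then trim the whitespace left around each remaining piece.
$res = array_map('trim', array_slice($parts, 1, -1));

print_r($res);
```

Because the pattern is a zero-width lookahead, nothing is consumed by the split, so each <node> tag stays at the start of its own piece.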

Mandatory disclaimer: Please don't do this. Parsing XML with regexp is a really bad idea!

Borgar
Thanks a lot for the answer and the explanation of the regex!
rubijn