ansaurus

Question

Regex for parsing an XML string with multiple text blocks

Answer 1

+5 A:

Please don't use regexes to parse XML. That's what XML parsers are for.

PHP has many built right in. Try the DOM or SimpleXML on for size. Given your requirement of picking up text nodes between two sibling tags, you might also consider XMLReader; it may well be easier to work with for this specific task.
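For illustration, here is a minimal sketch of the DOM route. It assumes the input looks like the sample in the other answer (a `<myxml>` root whose `<node>` children are each followed by mixed text and inline elements); the helper name `chunksAfterNodes` is made up for this example and is not part of any library.

```php
<?php
// Hypothetical helper: for each <node> child of the root, collect the
// markup that follows it up to the next <node> sibling.
function chunksAfterNodes(string $xml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $chunks = [];
    $current = null;
    foreach ($doc->documentElement->childNodes as $child) {
        if ($child instanceof DOMElement && $child->tagName === 'node') {
            // A new <node> starts a new chunk, keyed by its text content.
            $current = $child->textContent;
            $chunks[$current] = '';
        } elseif ($current !== null) {
            // Serialize the sibling (text or element) back to markup.
            $chunks[$current] .= $doc->saveXML($child);
        }
    }
    // Strip the surrounding whitespace/newlines from each chunk.
    return array_map('trim', $chunks);
}

$xml = <<<EOT
<myxml>
<node>21</node> som text with <entite>some</entite> other <b>nodes</b>
<node>22</node> some text
</myxml>
EOT;

print_r(chunksAfterNodes($xml));
```

Because the parser does the tokenizing, attributes on `<node>`, odd quoting, or a `>` inside an attribute value need no special handling here.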

Charles 2010-07-26 23:38:08

Parsing HTML, XHTML, or XML with regular expressions is just asking for trouble. In fact, there's a (somewhat) famous post here on Stack Overflow on this subject, but I don't know where it is right now.

Thomas Owens 2010-07-26 23:39:52

With SimpleXML you can't manipulate text blocks... The DOM is fine, I use it for other parts of my app, but I need this regex for some specific reasons. It's a regex question, not one about XML treatment and parsing :) I just want to have the expression.

rubijn 2010-07-26 23:44:02

Note that the user explicitly says that (s)he doesn't want to use the DOM. Perhaps it failed, or the XML is not real XML, or (s)he wants to experiment with regexes?

Abel 2010-07-26 23:45:36

@Abel, that phrase was added by the asker after I created my answer... and doesn't change the answer at all. Parsing XML with regular expressions [may invoke the wrath of Zalgo himself](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454).

Charles 2010-07-27 00:06:47

I didn't mean it changes your answer, and I wasn't aware he added it later. Yes, it's evil, I know. It somehow reminds me of earlier famous XML frameworks that were solely based on regex parsing (one such in Perl, iirc). See also his own remark under your answer: there's apparently a pressing reason to risk Zalgo's wrath :)

Abel 2010-07-27 07:24:10

Funny that the discussion is about good and evil but not about the basic answer: a regex. Next time I'll post my question out of context, without any trace of XML, and mask my real goal... That way I might get an answer rather than doctrine...

rubijn 2010-07-27 08:44:52

@rubijn, your regex could easily have created XML that was not well formed. For example, it could accidentally remove critical whitespace between or inside of attributes. This would mean that your regex could effectively *destroy* the XML. This is why you should use an XML parser instead.

Charles 2010-07-27 16:45:57

Answer 2

A:

Use splitting to chunk this down:

<?php

$str = <<<EOT
<myxml>
<node>21</node> som text with <entite>some</entite> other <b>nodes</b>
<node>22</node> some text
</myxml>
EOT;

$res = array_slice( preg_split( "~(?=<node(?:[^>]|\".*?\"|'.*?')*>|</myxml>)~", $str ), 1, -1 );
print_r( $res );

Breakdown of the expression:

(?=           # match before
  <node       # "<node"
  (?:         # match and don't capture this group
    [^>]        # match non ">"
    |           # OR
    \".*?\"     # match '"' and anything (don't be greedy) until the next '"'
    |           # OR
    '.*?'       # match "'" and anything (don't be greedy) until the next "'"
  )*          # ... as often as you like
  >           # ">"
  |           # OR
  </myxml>    # "</myxml>"
)             #

You can throw out the (?:[^>]|\".*?\"|'.*?')* part if you are sure that <node> never has any attributes.
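As a rough sketch of that simplified form, assuming `<node>` really never carries attributes and the root is always `<myxml>` (the function name `splitOnNodes` is hypothetical, chosen only for this example):

```php
<?php
// Simplified variant: split before every bare "<node>" and before the
// closing "</myxml>". Assumes <node> never has attributes.
function splitOnNodes(string $str): array
{
    $parts = preg_split('~(?=<node>|</myxml>)~', $str);
    // Drop the leading "<myxml>" piece and the trailing "</myxml>" piece.
    return array_slice($parts, 1, -1);
}

$str = <<<EOT
<myxml>
<node>21</node> som text with <entite>some</entite> other <b>nodes</b>
<node>22</node> some text
</myxml>
EOT;

print_r(splitOnNodes($str));
```

Each returned piece still begins with its own `<node>…</node>` tag and keeps its trailing newline, just as with the full expression.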

Mandatory disclaimer: Please don't do this. Parsing XML with regexp is a really bad idea!

Borgar 2010-07-27 12:31:46

Thanks a lot for the answer and the explanation of the regex!

rubijn 2010-07-28 09:46:51
