ansaurus

Question

Complex edit xml file

Answer 1

+1 A:

For your entertainment and edification, you may want to read this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

The "correct" solution is to use an XML library and search through the nodes to perform the operation. However, it would probably be much easier to just use a str_replace, even if there's a chance of damaging the XML formatting. You have to gauge the likelihood of receiving something like <a href="[ID]"> and the importance of defending against such cases, and weigh those factors against development time.
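As a rough illustration of that trade-off, here is a minimal sketch of the str_replace route. The helper name stripMarkers and the sample strings are assumptions made for this example and do not come from the question:

```php
<?php
// Hypothetical helper: strips the literal markers from the raw XML string.
// It does not parse the XML, so it cannot tell text content from attribute values.
function stripMarkers($xml)
{
    return str_replace(array('[ID]', '[/ID]'), '', $xml);
}

// Markers inside text content are removed as intended.
echo stripMarkers('<x><y>[ID] hello</y><y>world [/ID]</y></x>'), "\n";

// The risky case mentioned above: a marker inside an attribute is stripped too.
echo stripMarkers('<a href="[ID]">link</a>'), "\n";
```

Whether the second case matters depends on how likely such input is, which is exactly the judgment call described above.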

Joey Adams 2010-06-18 23:26:33

Answer 2

A:

The only other option I can think of is if you could format the xml differently.

<x>
  <y>
    <z>[ID]</z>
  </y>
</x>

Orbit 2010-06-18 23:27:39

Unfortunately, I'm working with a specified format and can't change it.

cru3l 2010-06-18 23:30:32

Answer 3

+1 A:

Removing the specific string is simple:

<?php
$xml = '<x>
    <y>some text</y>
    <y>[ID] hello</y>
    <y>world [/ID]</y>
    <y>some text</y>
    <y>some text</y>
</x>';

$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);
foreach($x->query('//text()[(contains(.,\'[ID]\') or contains(.,\'[/ID]\'))]') as $elm){
    $elm->nodeValue = preg_replace('/\[\/?ID\]/','',$elm->nodeValue);
}
var_dump($d->saveXML());
?>

To also remove the marked text within those text nodes, one could replace the preg_replace with these two:

 $elm->nodeValue = preg_replace('/\[ID\].*$/','',$elm->nodeValue);
 $elm->nodeValue = preg_replace('/^.*\[\/ID\]/','',$elm->nodeValue);

For your example, this results in:

<x>
<y>some text</y>
<y></y>
<y></y>
<y>some text</y>
<y>some text</y>
</x>
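The two replacements can also be wrapped in one function, which makes them easier to try out. This is only a sketch: the name cutMarkedText is made up, the slash in the second pattern is escaped so that it does not end the pattern early, and the "s" modifier is an addition (an assumption about the intent) so that "." also matches line breaks in multi-line text nodes. A node that contains both markers in order would lose everything after [ID], including any text that follows [/ID].

```php
<?php
// Hypothetical helper applying both replacements to a single text node value.
function cutMarkedText($text)
{
    // drop "[ID]" and everything after it
    $text = preg_replace('/\[ID\].*$/s', '', $text);
    // drop everything up to and including "[/ID]"
    $text = preg_replace('/^.*\[\/ID\]/s', '', $text);
    return $text;
}

var_dump(cutMarkedText("keep [ID] hello\nmore"));
var_dump(cutMarkedText("world\n[/ID] tail"));
```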

However, removing tags in between without damaging well-formed XML is quite tricky. Before venturing into a lot of DOM actions, how would you like to handle:

An [/ID] deeper in the DOM tree than its [ID]:

<foo>[ID] foo
    <bar> lorem [/ID] ipsum </bar>
</foo>

An [/ID] higher up in the DOM tree than its [ID]:

<foo> foo
    <bar> lorem [ID] ipsum </bar>
    [/ID]
</foo>

And open/close spanning siblings, as per your example:

<foo> foo
    <bar> lorem [ID] ipsum </bar>
    <bar> lorem [/ID] ipsum </bar>
</foo>

And a real dealbreaker of a question: is nesting possible, is that nesting well formed, and what should it do?

<foo> foo
    <bar> lo  [ID] rem [ID] ipsum </bar>
    <bar> lorem [/ID] ipsum </bar>
    [/ID]
</foo>

Without knowing how these cases should be handled, there is no real answer.
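Before deciding how to handle such cases, it can help to find out whether they occur in the data at all. The following is a minimal sketch, not a full validator: the function name checkMarkers and its return values are assumptions made for this example. It only checks that the markers are balanced in document order, not whether they sit at the same depth.

```php
<?php
// Hypothetical helper: walks all text nodes in document order and reports
// how the [ID] / [/ID] markers are balanced.
// Returns 'ok', 'nested', 'unopened' or 'unclosed'.
function checkMarkers($xml)
{
    $d = new DOMDocument();
    $d->loadXML($xml);
    $x = new DOMXPath($d);
    $open = false;
    foreach ($x->query('//text()') as $node) {
        // find every marker inside this text node, in order of appearance
        preg_match_all('/\[\/?ID\]/', $node->nodeValue, $m);
        foreach ($m[0] as $marker) {
            if ($marker === '[ID]') {
                if ($open) {
                    return 'nested';   // a second [ID] before the first was closed
                }
                $open = true;
            } else {
                if (!$open) {
                    return 'unopened'; // [/ID] without a preceding [ID]
                }
                $open = false;
            }
        }
    }
    return $open ? 'unclosed' : 'ok';
}

var_dump(checkMarkers('<foo> foo<bar> lo [ID] rem [ID] ipsum </bar><bar> lorem [/ID] ipsum </bar>[/ID]</foo>'));
```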

Edit: now that further information has been given, the actual fail-safe solution (i.e. parse the XML, don't use regexes) seems kind of long, but will work in 99.99% of cases (personal typos and brainfarts excluded, of course :) ):

<?php
$xml = '<x>
    <y>some text</y>
    <y>
      <a> something </a>
      well [ID] hello
      <a> and then some</a>
    </y>
    <y>some text</y>
    <x>
      world
      <a> also </a>
        foobar [/ID] something
      <a> these nodes </a>
    </x>
    <y>some text</y>
    <y>some text</y>
</x>';
echo $xml;
$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);
foreach($x->query('//text()[contains(.,\'[ID]\')]') as $elm){
        //if this node also contains [/ID], replace and be done:
        if(($startpos = strpos($elm->nodeValue,'[ID]'))!==false && ($endpos = strpos($elm->nodeValue,'[/ID]',$startpos))!==false){
                $elm->replaceData($startpos, $endpos-$startpos + 5,'');
                continue;
        }
        //no [/ID] in this node: truncate the text to before [ID]
        $elm->nodeValue = substr($elm->nodeValue,0,$startpos);
        //delete all following siblings until a textnode with [/ID] is found
        while($elm->nextSibling){
                $next = $elm->nextSibling;
                if(!($next instanceof DOMText) || ($pos = strpos($next->nodeValue,'[/ID]'))===false){
                        $elm->parentNode->removeChild($next);
                } else {
                        //[/ID] found in same element: keep the text after it and go to next [ID]
                        $elm->appendData(substr($next->nodeValue,$pos+5));
                        $elm->parentNode->removeChild($next);
                        continue 2;
                }
        }
        //siblings of textnode deleted, string truncated to before [ID], now let's delete intermediate nodes
        while($sibling = $elm->parentNode->nextSibling){ // in case of example: other <y> elements:
                //loop through childnodes and search a textnode with [/ID]
                while($child = $sibling->firstChild){
                        //delete if not a textnode
                        if(!($child instanceof DOMText)){
                                $sibling->removeChild($child);
                                continue;
                        }
                        //we have text, check for [/ID]
                        if(($pos = strpos($child->nodeValue,'[/ID]'))!==false){
                                //add remaining text in textnode:
                                $elm->appendData(substr($child->nodeValue,$pos+5));
                                //remove current textnode with match:
                                $sibling->removeChild($child);
                                //sanity check: [ID] was in <y>, is [/ID]?
                                if($sibling->tagName!= $elm->parentNode->tagName){
                                        trigger_error('[/ID] found in other tag than [ID]: '.$sibling->tagName.'<>'.$elm->parentNode->tagName, E_USER_NOTICE);
                                }
                                //add remaining children of sibling to parent of [ID]:
                                while($sibling->firstChild){
                                        $elm->parentNode->appendChild($sibling->firstChild);
                                }
                                //delete the sibling that was found to hold [/ID]
                                $sibling->parentNode->removeChild($sibling);
                                //done: end both whiles
                                break 2;
                        }
                        //textnode, but no [/ID], so remove:
                        $sibling->removeChild($child);
                }
                //no child, no text, so no [/ID], remove:
                $elm->parentNode->parentNode->removeChild($sibling);
        }
}
var_dump($d->saveXML());
?>

Wrikken 2010-06-19 00:02:18

Thanks for the example, but the XML can also look like this: `<x><y>some text</y><y>[ID]hello</y><y>world</y><y>[/ID]some text</y></x>`. The "[ID]" and "[/ID]" are just start and end "tags", and between them there can be many `<y></y>` XML tags without "[ID]" or "[/ID]" in their node values. We need to find and delete all of this text. Does your script handle this? As for the format: the tag in which we are looking for the text is always at the same level in the DOM, so you don't need to worry about that.

cru3l 2010-06-19 00:21:49

"the text is always at same level in DOM" => I assume from your further phrasing and example that they're at the same level, but possibly in siblings, never higher or lower, and that nesting never occurs?

Wrikken 2010-06-19 00:44:06

I mean that `[ID]` and `[/ID]` are always located in `"<y></y>"` tags, and these tags are always on the same level. I wanted to say that the nodes we are looking for are not always adjacent. For example, the XML can be this: `"<x><y>[ID]Lorem ipsum dolor sit amet,</y><y>onsectetur adipiscing elit.</y><y>Aenean placerat porttitor tristique[/ID]</y></x>"`, and we need to delete all text between the IDs (in all "y" tags, in that case). Can your script handle this? Thanks!

cru3l 2010-06-19 01:14:03

Edited the answer with the specific solution.

Wrikken 2010-06-19 02:05:42

awesome! thanks a lot!

cru3l 2010-06-19 04:21:39

The example in your post gives this output: http://goo.gl/MQT3 The script doesn't correctly parse the starting [ID] tag. Any suggestions?

cru3l 2010-06-19 12:57:06

It works here: http://pastebin.com/qzymR5uq What's your PHP version?

Wrikken 2010-06-19 13:02:45

PHP version is 5.3.1.

cru3l 2010-06-19 16:03:29

Wrikken 2010-06-19 16:26:44

Anyway, I checked the output in your pastebin.com link, and... it didn't work as I expected, sorry :( Your input $xml is not valid for my program. "[ID]" and "[/ID]" can be located ONLY (!) in "<y>" tags. While parsing, the script must change only "<y>" tags and should not change "<a>", "<x>" and other tags. This is an example of the correct parsing I want: http://goo.gl/jvDT Main rule: only "<y>" tags can be changed (even descendants of "<y>" tags should not be parsed). This is what I needed from the beginning, and hopefully this problem can be solved. Thanks a lot for your time!

cru3l 2010-06-19 18:12:47

*sigh* That's why I asked you first to define in more detail how the nodes should change. I've taken a lot of time to provide you with a decent example, including a description of what it does; I'm sure you're able to alter the code (for instance: only delete tags when tagName=='y', or text nodes on a certain level) to your liking. If you keep altering the description of the actually needed functionality, I'm not going to do the work for you for free, sorry. I hope the example will get you to your goal.

Wrikken 2010-06-19 20:52:39

thanks anyway. your example is very useful!

cru3l 2010-06-19 21:15:24
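For readers who land here with the requirement as finally stated in the comments (markers occur only in the text of `<y>` elements, and only that text may change), a much shorter sketch is possible. It is not the code from the answer above; the function name removeMarkedTextInY and the sample input are assumptions made for this example, and it presumes that the markers are balanced and never nested.

```php
<?php
// Hypothetical helper: removes marked text from the direct text children of
// <y> elements only. The <y> elements themselves and all other elements stay.
function removeMarkedTextInY($xml)
{
    $d = new DOMDocument();
    $d->loadXML($xml);
    $x = new DOMXPath($d);
    $inside = false; // true while we are between [ID] and [/ID]
    foreach ($x->query('//y/text()') as $node) {
        $out = '';
        // split the text on the markers, keeping the markers as separate parts
        $parts = preg_split('/(\[\/?ID\])/', $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        foreach ($parts as $part) {
            if ($part === '[ID]') {
                $inside = true;
            } elseif ($part === '[/ID]') {
                $inside = false;
            } elseif (!$inside) {
                $out .= $part; // text outside the markers is kept
            }
        }
        $node->nodeValue = $out;
    }
    return $d->saveXML($d->documentElement);
}

echo removeMarkedTextInY('<x><y>keep [ID]drop</y><y>drop too</y><a>other tag</a><y>drop[/ID] tail</y></x>'), "\n";
```

The state flag carries over from one `<y>` to the next, which is what lets the markers span several sibling elements; text in elements other than `<y>` is never visited.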
