For example, we have this XML:

<x>
    <y>some text</y>
    <y>[ID] hello</y>
    <y>world [/ID]</y>
    <y>some text</y>
    <y>some text</y>
</x>

and we need to remove the markers "[ID]" and "[/ID]" together with the text between them (which we don't know in advance when parsing), of course without damaging the XML structure.

The only solution I can think of is this:

  1. Find the text in the XML with a regex, for example: "/\[ID\].*?\[\/ID\]/s" (the "s" modifier is needed because the match spans several lines). In our case, ignoring the whitespace between the tags, the result will be "[ID] hello</y><y>world [/ID]"

  2. In the result of the previous step, find the text outside the XML tags with this regex: "/(?<=^|>)[^><]+?(?=<|$)/", and delete it. The result will be "</y><y>"

  3. Apply the change to the original XML by doing something like this:

    str_replace($step1string,$step2string,$xml);

Is this the correct way to do it? I suspect that this kind of "str_replace" is not the best way to edit XML, so maybe you know a better solution?
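As a minimal sketch of the three steps above (assuming PHP 7 or later), the match and the replacement can be combined in one `preg_replace_callback` call. The function name `stripMarkedSpan` is hypothetical, and the tag-skipping regex is a slightly rewritten variant of the one in step 2, not the original:

```php
<?php
// Hypothetical helper (not from the original post): removes the text between
// [ID] and [/ID] with regexes only, leaving any tags inside the span in place.
function stripMarkedSpan(string $xml): string
{
    // Step 1: match from [ID] to the nearest [/ID]; "s" lets "." cross newlines.
    return preg_replace_callback('/\[ID\].*?\[\/ID\]/s', function (array $m): string {
        // Step 2: inside the match, delete every run of characters that sits
        // outside a tag (at the start of the match or right after a ">").
        // Step 3 is implicit: the returned string replaces the match in $xml.
        return preg_replace('/(?:^|(?<=>))[^<>]+(?=<|$)/', '', $m[0]);
    }, $xml);
}

$xml = '<x><y>some text</y><y>[ID] hello</y><y>world [/ID]</y><y>some text</y></x>';
echo stripMarkedSpan($xml), "\n";
```

This still treats the XML as a plain string, so it shares the weaknesses of any regex approach: a marker inside an attribute value, a comment or a CDATA section would be handled incorrectly.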

+1  A: 

For your entertainment and edification, you may want to read this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

The "correct" solution is to use an XML library and search through the nodes to perform the operation. However, it would probably be much easier to just use a str_replace, even if there's a chance of damaging the XML formatting. You have to gauge the likelihood of receiving something like <a href="[ID]"> and the importance of defending against such cases, and weigh those factors against development time.
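To get a feel for that risk, here is a small sketch (assuming PHP 7 or later with the DOM extension). The helper `isWellFormed` and the sample input are made up for illustration; the regex is the one from the question, and `DOMDocument::loadXML` is only used to check whether the result still parses:

```php
<?php
// Hypothetical check (not from the original answer): does a string still parse as XML?
function isWellFormed(string $xml): bool
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

// A marker inside an attribute value: removing the span as plain text
// also removes the closing quote and the rest of the <a> start tag.
$risky = '<x><a href="[ID]">link</a><y>text [/ID]</y></x>';
$naive = preg_replace('/\[ID\].*?\[\/ID\]/s', '', $risky);

var_dump(isWellFormed($risky)); // the input itself is well formed
var_dump(isWellFormed($naive)); // the string-level removal broke it
```

Running a check like this over real input is one way to estimate how likely such damage is before deciding between the quick string replacement and a DOM-based solution.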

Joey Adams
A: 

The only other option I can think of is if you could format the xml differently.

<x>
  <y>
    <z>[ID]</z>
Orbit
Unfortunately, I'm working with a specified format and can't change it.
cru3l
+1  A: 

Removing the specific string is simple:

<?php
$xml = '<x>
    <y>some text</y>
    <y>[ID] hello</y>
    <y>world [/ID]</y>
    <y>some text</y>
    <y>some text</y>
</x>';

$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);
foreach($x->query('//text()[(contains(.,\'[ID]\') or contains(.,\'[/ID]\'))]') as $elm){
    $elm->nodeValue = preg_replace('/\[\/?ID\]/','',$elm->nodeValue);
}
var_dump($d->saveXML());
?>

When just removing textnodes in a specific tag, one could replace the preg_replace with these two:

 $elm->nodeValue = preg_replace('/\[ID\].*$/','',$elm->nodeValue);
 $elm->nodeValue = preg_replace('/^.*\[\/ID\]/','',$elm->nodeValue);

For your example, this results in:

<x>
<y>some text</y>
<y></y>
<y></y>
<y>some text</y>
<y>some text</y>
</x>

However, removing the tags in between without damaging well-formed XML is quite tricky. Before venturing into a lot of DOM actions, how would you like to handle:

An [/ID] higher in the DOM-tree:

<foo>[ID] foo
    <bar> lorem [/ID] ipsum </bar>
</foo>

An [/ID] lower in the DOM-tree

<foo> foo
    <bar> lorem [ID] ipsum </bar>
    [/ID]
</foo>

And open/close spanning siblings, as per your example:

<foo> foo
    <bar> lorem [ID] ipsum </bar>
    <bar> lorem [/ID] ipsum </bar>
</foo>

And a real dealbreaker of a question: is nesting possible, is that nesting well formed, and what should it do?

<foo> foo
    <bar> lo  [ID] rem [ID] ipsum </bar>
    <bar> lorem [/ID] ipsum </bar>
    [/ID]
</foo>

Without further knowledge of how these cases should be handled, there is no real answer.
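To find out which of these cases a given document falls into, something like the following diagnostic could help. It is only a sketch (assuming PHP 7 or later); the function name `describeMarkerLayout` and the labels it returns are made up, and it does not check whether [/ID] actually comes after [ID]:

```php
<?php
// Hypothetical diagnostic (not part of the original answer): reports where
// the [/ID] marker sits relative to the [ID] marker.
function describeMarkerLayout(string $xml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $text = $doc->documentElement->textContent;
    // More than one opener or closer means nesting or repetition.
    if (substr_count($text, '[ID]') !== 1 || substr_count($text, '[/ID]') !== 1) {
        return 'nested, repeated or unbalanced markers';
    }
    $xpath = new DOMXPath($doc);
    $open  = $xpath->query('//text()[contains(., "[ID]")]')->item(0);
    $close = $xpath->query('//text()[contains(., "[/ID]")]')->item(0);
    if ($open === null || $close === null) {
        return 'marker split across several text nodes';
    }
    if ($open->isSameNode($close)) {
        return 'same text node';
    }
    if ($open->parentNode->isSameNode($close->parentNode)) {
        return 'same parent element';
    }
    // Depth = number of element ancestors of the text node.
    $depth = function (DOMNode $node) use ($xpath): int {
        return (int) $xpath->evaluate('count(ancestor::*)', $node);
    };
    return $depth($open) === $depth($close)
        ? 'different elements at the same depth'
        : 'different depths';
}

echo describeMarkerLayout('<foo>[ID] foo<bar> lorem [/ID] ipsum </bar></foo>'), "\n";
```

Each label corresponds to a different removal strategy, which is why the expected layout has to be known before writing the actual DOM manipulation.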


Edit: well, further information was given. The actual, fail-safe solution (i.e.: parse the XML, don't use regexes) seems kind of long, but will work in 99.99% of cases (personal typos and brainfarts excluded of course :) ):

<?php
$xml = '<x>
    <y>some text</y>
    <y>
      <a> something </a>
      well [ID] hello
      <a> and then some</a>
    </y>
    <y>some text</y>
    <x>
      world
      <a> also </a>
        foobar [/ID] something
      <a> these nodes </a>
    </x>
    <y>some text</y>
    <y>some text</y>
</x>';
echo $xml;
$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);
foreach($x->query('//text()[contains(.,\'[ID]\')]') as $elm){
        //if this node also contains [/ID], replace and be done:
        if(($startpos = strpos($elm->nodeValue,'[ID]'))!==false && $endpos = strpos($elm->nodeValue,'[/ID]',$startpos)){
                $elm->replaceData($startpos, $endpos-$startpos + 5,'');
                var_dump($d->saveXML($elm));
                continue;
        }
        //no [/ID] in this textnode: cut off everything from [ID] onwards
        $elm->deleteData($startpos, strlen($elm->nodeValue) - $startpos);
        //delete the following siblings until a textnode containing [/ID] is found
        while($elm->nextSibling){
                if(!($elm->nextSibling instanceof DOMText) || ($pos = strpos($elm->nextSibling->nodeValue,'[/ID]'))===false){
                        $elm->parentNode->removeChild($elm->nextSibling);
                } else {
                        //[/ID] found in a sibling textnode: keep the text after it and go to the next [ID]
                        $elm->appendData(substr($elm->nextSibling->nodeValue,$pos+5));
                        $elm->parentNode->removeChild($elm->nextSibling);
                        continue 2;
                }
        }
        //siblings of textnode deleted, string truncated to before [ID], now let's delete intermediate nodes
        while($sibling = $elm->parentNode->nextSibling){ // in case of example: other <y> elements:
                //loop though childnodes and search a textnode with [/ID]
                while($child = $sibling->firstChild){
                        //delete if not a textnode
                        if(!($child instanceof DOMText)){
                                $sibling->removeChild($child);
                                continue;
                        }
                        //we have text, check for [/ID]
                        if(($pos = strpos($child->nodeValue,'[/ID]'))!==false){
                                //add remaining text in textnode:
                                $elm->appendData(substr($child->nodeValue,$pos+5));
                                //remove current textnode with match:
                                $sibling->removeChild($child);
                                //sanity check: [ID] was in <y>, is [/ID]?
                                if($sibling->tagName != $elm->parentNode->tagName){
                                        trigger_error('[/ID] found in other tag than [ID]: '.$sibling->tagName.'<>'.$elm->parentNode->tagName, E_USER_NOTICE);
                                }
                                //add remaining childs of sibling to parent of [ID]:
                                while($sibling->firstChild){
                                        $elm->parentNode->appendChild($sibling->firstChild);
                                }
                                //delete the sibling that was found to hold [/ID]
                                $sibling->parentNode->removeChild($sibling);
                                //done: end both whiles
                                break 2;
                        }
                        //textnode, but no [/ID], so remove:
                        $sibling->removeChild($child);
                }
                //no child, no text, so no [/ID], remove:
                $elm->parentNode->parentNode->removeChild($sibling);
        }
}
var_dump($d->saveXML());
?>
Wrikken
Thanks for the example, but the XML can also look like this: `<x><y>some text</y><y>[ID]hello</y><y>world</y><y>[/ID]some text</y></x>`. "[ID]" and "[/ID]" are just start and end markers, and between them there can be many `<y></y>` elements whose node values contain neither "[ID]" nor "[/ID]". We still need to find and delete all of that text. Does your script handle this? As for the format: the tag in which we look for the text is always at the same level in the DOM, so there is no need to worry about that.
cru3l
"the text is always at same level in DOM" => I assume from your further phrasing and example that they're at the same level, but possibly in siblings, never higher or lower, and that nesting never occurs?
Wrikken
I mean that `[ID]` and `[/ID]` are always located inside `"<y></y>"` tags, and these tags are always on the same level. What I wanted to say is that the nodes we are looking for are not always adjacent. For example, the XML can be: `"<x><y>[ID]Lorem ipsum dolor sit amet,</y><y>onsectetur adipiscing elit.</y><y>Aenean placerat porttitor tristique[/ID]</y></x>"`, and we need to delete all text between the markers (in all "y" tags, in this case). Can your script handle this? Thanks!
cru3l
Edited the answer with the specific solution.
Wrikken
awesome! thanks a lot!
cru3l
The example in your post gives this output: http://goo.gl/MQT3 The script doesn't parse the starting [ID] tag correctly. Any suggestions?
cru3l
It works here: http://pastebin.com/qzymR5uq What's your PHP version?
Wrikken
PHP version is 5.3.1.
cru3l
Anyway, I checked the output in your pastebin.com link, and it did not work as I expected, sorry :( Your input $xml is not valid for my program. "[ID]" and "[/ID]" can be located ONLY (!) in "<y>" tags. While parsing, the script must change only "<y>" tags and should not change "<a>", "<x>" or any other tags. This is an example of the parsing I want: http://goo.gl/jvDT Main rule: only "<y>" tags may be changed (even descendants of "<y>" tags should not be parsed). This is what I needed from the beginning, and hopefully this problem can be solved. Thanks a lot for your time!
cru3l
*sigh* That's why I first asked you to define in more detail how the nodes should change. I've taken a lot of time to provide you with a decent example, including a description of what it does, and I'm sure you're able to alter the code to your liking (for instance: only delete tags when tagName=='y', or only textnodes on a certain level). If the description of the required functionality keeps changing, I'm not going to do the work for you for free, sorry. I hope the example will get you to your goal.
Wrikken
thanks anyway. your example is very useful!
cru3l
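
For the requirement as finally stated in the comments (markers appear only in the direct text of "<y>" tags, and nothing but that text may change), a much shorter approach might be enough. The following is only a sketch under those assumptions, for PHP 7 or later; the name `stripMarkedTextInY` is hypothetical, and the expected output behind the goo.gl links could not be checked:

```php
<?php
// Hypothetical helper: removes everything between [ID] and [/ID] from the
// direct text of <y> elements. Other elements, including children of <y>,
// are left untouched.
function stripMarkedTextInY(string $xml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);
    $inside = false; // true while we are between [ID] and [/ID]
    // Direct text children of <y> elements, visited in document order.
    foreach ($xpath->query('//y/text()') as $text) {
        $in  = $text->nodeValue;
        $out = '';
        while ($in !== '') {
            // Look for the marker that ends the current state.
            $marker = $inside ? '[/ID]' : '[ID]';
            $pos = strpos($in, $marker);
            if ($pos === false) {
                // No marker left in this node: keep the rest only when outside a span.
                if (!$inside) {
                    $out .= $in;
                }
                break;
            }
            if (!$inside) {
                $out .= substr($in, 0, $pos); // text before [ID] is kept
            }
            $in = (string) substr($in, $pos + strlen($marker));
            $inside = !$inside;
        }
        $text->nodeValue = $out;
    }
    return $doc->saveXML($doc->documentElement);
}

$xml = '<x><y>[ID]Lorem ipsum dolor sit amet,</y><y>onsectetur adipiscing elit.</y>'
     . '<y>Aenean placerat porttitor tristique[/ID]</y></x>';
echo stripMarkedTextInY($xml), "\n";
```

Because the state flag is carried from one "<y>" to the next, the markers do not have to be in adjacent tags, and no node is ever removed from the tree.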