I am trying to use MediaWiki's API to get articles in XML format and include them on my page. I wrote a simple script that fetches the XML representation of an article using ?action=parse&page=Page_Name&format=xml requests. The code is as follows:

if($_GET["page"]=='') die("Page not specified (possibly direct call)");
$pagename = $_GET["page"];
$buffer = ''; // initialize before appending to avoid an undefined-variable notice
$handle = @fopen("mediawiki/api.php?action=parse&page=".$pagename."&format=xml", "r");
if ($handle) {
    // read the whole API response into $buffer
    while (!feof($handle)) {
        $buffer = $buffer.fgets($handle);
    }
    $buffer = html_entity_decode($buffer);
    /*
    echo $buffer;
    */
    $xml = simplexml_load_string($buffer);
    foreach($xml->parse->children() as $child){
        switch($child->getName()){
            case "text":
                echo $child->asXML()."<br/>";
                break;
            case "categories":
                echo "<h3>Categories this project is related to: </h3><br/>";
                foreach($child->children() as $grandChild){
                    echo $grandChild." | ";
                }
                break;
        }
    }
    fclose($handle);
}

Now the problem is that I'm getting very strange output. Any <a name="" href=""></a> gets converted to <a name="" href=""/>, which turns all the following text into a link (I guess because there is no closing </a> tag). I see this in both Mozilla Firefox and Google Chrome. I suspect that $buffer = html_entity_decode($buffer); is causing the problem. Is there a parameter for html_entity_decode() that I should specify to avoid this? Or is it caused by some other error or misuse of html_entity_decode() in my code?

(To see the XML output of the Wiki's API, you can try http://en.wikipedia.org/w/api.php?action=parse&page=No_Such_Page&format=xml with different page parameters)

POSSIBLE SOLUTION: I didn't want to switch to JSON, as Jordan suggested, so I came up with this solution. I simply moved html_entity_decode into the case "text": block, so that block now reads echo html_entity_decode($child->asXML())."<br/>";. Do you think this is a reasonable approach?

+1  A: 

The problem isn't with html_entity_decode(). The problem is that SimpleXML is treating the contents of the <text> element as XML instead of text. By default, SimpleXML compresses empty elements (<a></a> to <a />). One way to get around this is to import the SimpleXML object into a DOM object, and use the LIBXML_NOEMPTYTAG option when saving the output. The problem with this option is that any <br /> elements will be output as <br></br>.
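A minimal sketch of the DOM approach described above, assuming a tiny hand-written XML string in place of the real API response (the sample markup and the variable names are hypothetical). It shows SimpleXML's collapsed output next to the output of saveXML() with LIBXML_NOEMPTYTAG, including the <br></br> side effect:

```php
<?php
// Hypothetical stand-in for the <text> element of an API response.
$xml = simplexml_load_string('<text><a name="top"></a><br/>Hello</text>');

// SimpleXML serializes empty elements in their short form.
$collapsed = $xml->asXML();

// Import the same node into DOM and serialize it with LIBXML_NOEMPTYTAG,
// which writes every empty element as an explicit open/close pair.
$node = dom_import_simplexml($xml);
$expanded = $node->ownerDocument->saveXML($node, LIBXML_NOEMPTYTAG);

echo $collapsed, "\n";
echo $expanded, "\n";
```

Because the flag applies to every empty element, void HTML elements such as <br> would still need separate handling before the result is sent to a browser.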

The simpler alternative is to use a different response format from the API. I would suggest using the json response format and use the json_decode() function to parse the response.
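A rough sketch of the JSON route, assuming the response has the shape of the legacy format=json output of action=parse, where the rendered HTML and each category name sit under a "*" key. The sample string below is hypothetical and hard-coded so the snippet runs on its own; check the structure against a real response from your wiki before relying on it:

```php
<?php
// Hypothetical sample shaped like a legacy action=parse JSON response
// (an assumption, not captured from a real wiki).
$json = '{"parse":{"title":"Example","text":{"*":"<p><a name=\"top\"></a>Hello</p>"},'
      . '"categories":[{"sortkey":"","*":"Projects"},{"sortkey":"","*":"PHP"}]}}';

// Decode into associative arrays; json_decode() returns null on invalid JSON.
$data = json_decode($json, true);
if ($data === null || !isset($data['parse'])) {
    die('Unexpected API response');
}

// The article HTML arrives as a plain string, so empty tags are left untouched.
$text = $data['parse']['text']['*'];

// Collect the category names.
$categories = array();
foreach ($data['parse']['categories'] as $category) {
    $categories[] = $category['*'];
}

echo $text, "<br/>";
echo "<h3>Categories this project is related to: </h3>";
echo htmlspecialchars(implode(' | ', $categories));
```

Since the HTML never passes through an XML serializer, <a name=""></a> stays exactly as the wiki rendered it.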

Jordan Ryan Moore
Thanks for your answer. I think you are right.
Azimuth
+1  A: 

That's not strange output, that's valid XML. When you have an empty tag, XML lets you use a short self-closing syntax that's not always valid in HTML or XHTML. These two forms are equivalent in XML:

<foo></foo>
<foo />

The html_entity_decode() function only converts HTML entities back into the characters they represent; for example, &gt; is converted to >. It does not change how tags are written.

You'll need to post-process your XML fragment and convert it into proper HTML. The easiest way to do this is with the DOMDocument API.

$foo = new DOMDocument();
// parse the fragment with the HTML parser, then serialize it back as HTML
$foo->loadHTML('<p> Testing <a href="" /> </p>');
echo $foo->saveHTML();

This will take an XML fragment and convert it into an HTML document, which includes fixing all the self-closing tags. You'll still need to pull the content out of the <body> element, but that's a lot easier than fixing all the self-closing tags yourself.
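One possible way to pull the content back out of the <body> element, sketched with the same test fragment as above. The variable names are illustrative, and passing a node to saveHTML() assumes PHP 5.3.6 or later:

```php
<?php
$doc = new DOMDocument();
// loadHTML() wraps the fragment in <html><body>...</body></html>.
$doc->loadHTML('<p> Testing <a href="" /> </p>');

// Serialize only the children of <body>, dropping the wrapper elements.
$body = $doc->getElementsByTagName('body')->item(0);
$html = '';
foreach ($body->childNodes as $child) {
    $html .= $doc->saveHTML($child);
}

echo $html;
```

Note that the HTML parser treats <a href="" /> as an opening tag and closes it at the end of the enclosing element, so the result should be checked against what you expect the link to contain.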

Alan Storm
@Alan, please read my comment to the first answer
Azimuth