ansaurus

Question

Answer 1

+1 A:

Never use regex to parse HTML. In PHP, use DOM.

Here's a simpler one: http://simplehtmldom.sourceforge.net/

Ruel 2010-10-10 16:00:35

Thanks for your response, Ruel. The problem with using DOM is that the HTML isn't complete. It's just a partial HTML fragment that makes up part of a page. I guess I could get around that by wrapping it all in `<html></html>` tags.

Toggo 2010-10-10 16:13:37

Answer 2

+1 A:

If the markup were consistent, you could use a simplistic regex. But it will fail if your anchors have classes or any other attributes. It also doesn't correctly encode the title= attribute:

$html = preg_replace('#<(a\s+href="[^"]+")>([^<>]+)</a>#ims', '<$1 title="$2">$2</a>', $html);

Therefore phpQuery/querypath is likely the more robust approach:

$html = phpQuery::newDocument($html);
foreach ($html->find("a") as $a) {
    $a = pq($a);  // iterating yields plain DOM nodes; wrap them to get attr()/text()
    if (!$a->attr("title")) {
        $a->attr("title", $a->text());
    }
}
print $html->getDocument();

mario 2010-10-10 16:04:55

Thanks for your response, mario. I was hoping not to have to install any additional libraries, but it looks like I may have to!

Toggo 2010-10-10 16:15:14

True, that's the disadvantage there. However it can deal with partial HTML files more easily. But btw, QueryPath is the smaller of the two libraries.

mario 2010-10-10 16:25:11

Answer 3

+1 A:

Use PHP's built-in DOM parsing. Much more reliable than regex. Be aware that loading HTML into the PHP DOM will normalize it.

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress parsing errors with @

$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
    if ($link->getAttribute('title') == '') {
        $link->setAttribute('title', $link->nodeValue);
    }
}
$html = $doc->saveHTML();

Brent Baisley 2010-10-10 16:40:01
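
The partial-HTML concern raised in the comments and the normalization noted in Answer 3 can be handled together. The following is a minimal sketch, not any answerer's own method: it assumes PHP 5.3.6 or later (for the node argument to saveHTML()), the function name addLinkTitles is hypothetical, and character-encoding handling is left out. It lets loadHTML() add its <html> and <body> wrappers, then serializes only the children of <body> so the fragment comes back without them.

```php
<?php
// Hypothetical helper (name is illustrative): give every <a> that lacks a
// title attribute a title equal to its link text, for a partial HTML fragment.
function addLinkTitles($fragment)
{
    // Nothing to parse; an empty string would make loadHTML() complain.
    if (trim($fragment) === '') {
        return $fragment;
    }

    $doc = new DOMDocument();

    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $link) {
        if ($link->getAttribute('title') === '') {
            // The DOM serializer escapes the attribute value on output.
            $link->setAttribute('title', trim($link->textContent));
        }
    }

    // loadHTML() wrapped the fragment in <html><body>; emit only the
    // children of <body> so the wrappers do not leak into the result.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $fragment;
    }

    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Calling addLinkTitles('<p>See <a href="/x">docs</a></p>') should return the paragraph with a title attribute added to the link and no surrounding <html> or <body> tags. The parser may still normalize the fragment in other ways, for example by wrapping bare leading text in a <p>.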


Parsing HTML and replacing strings
