I would like to load an HTML document and modify its text in PHP. For example, if I have a document like this:

<html>
<head><title>Test - Example.com</title></head>
<body>
<p><a href="http://www.example.com">Link number 1: Example.com</a></p>
<p>Link number 2: Example.com - some random text</p>
</body>
</html>

I would like to turn the Example.com text in the second paragraph into an active link, without touching the other places where the Example.com string occurs, such as the first paragraph or the title of the document. I cannot use regular expressions for this, because I need to take the structure of the document into account. Any ideas on how to tackle this problem? Also, the HTML documents I will be receiving might be live webpages, so they might contain errors, JavaScript code, etc.

+1  A: 

The "proper" way to do it would be via PHP's DOM extension (DOMDocument), which can import HTML; you can then use XPath to dig down to the exact node you want. Of course, DOM is highly picky about invalid markup and can barf on quite simple errors that browsers handle nicely. You may have to massage the input to fix the worst of the errors before you can round-trip the content through DOM.
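As a rough sketch of that approach (not a definitive implementation): the function name `linkifyText` and its parameters are made up for illustration, and it assumes the input is ASCII or declares its own charset. It links only the first match in each text node that sits inside `<body>` and is not already inside an `<a>`, `<script>` or `<style>` element.

```php
<?php
// Hypothetical helper: wrap $needle in a link, but only in body text
// that is not already part of a link, script or stylesheet.
function linkifyText(string $html, string $needle, string $url): string
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = '//body//text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::style)]';

    // Copy the matches into an array before modifying the tree.
    foreach (iterator_to_array($xpath->query($query)) as $textNode) {
        $text = $textNode->nodeValue;
        $pos = strpos($text, $needle);
        if ($pos === false) {
            continue;
        }

        // Build the replacement: text before, the new link, text after.
        $link = $doc->createElement('a');
        $link->setAttribute('href', $url);
        $link->appendChild($doc->createTextNode($needle));

        $parent = $textNode->parentNode;
        $parent->insertBefore($doc->createTextNode(substr($text, 0, $pos)), $textNode);
        $parent->insertBefore($link, $textNode);
        $parent->insertBefore($doc->createTextNode(substr($text, $pos + strlen($needle))), $textNode);
        $parent->removeChild($textNode);
    }

    return $doc->saveHTML();
}

$sample = <<<HTML
<html>
<head><title>Test - Example.com</title></head>
<body>
<p><a href="http://www.example.com">Link number 1: Example.com</a></p>
<p>Link number 2: Example.com - some random text</p>
</body>
</html>
HTML;

$out = linkifyText($sample, 'Example.com', 'http://www.example.com');
echo $out;
```

With the sample document from the question, only the second paragraph should gain a link; the title is outside `<body>` and the first paragraph's text is already inside an `<a>`, so both are skipped by the XPath filter.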

The worst stop-dead-in-DOM's-tracks error I've found is having multiple html and/or body blocks (e.g. a stupid server inserting a self-contained <html> block before the actual page contents).
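Before deciding what to massage, it can help to see what the parser actually complains about. The sketch below uses the real `libxml_use_internal_errors()` and `libxml_get_errors()` functions; the helper name `collectParseErrors` is hypothetical, and which messages come back (if any) depends on the libxml2 version PHP was built against.

```php
<?php
// Hypothetical helper: parse $html and return libxml's messages
// as strings instead of letting them surface as PHP warnings.
function collectParseErrors(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError carrying level, code, line and message.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Example input with two self-contained html blocks, as described above.
$messages = collectParseErrors(
    '<html><body><p>one</p></body></html><html><body><p>two</p></body></html>'
);
foreach ($messages as $message) {
    echo $message, PHP_EOL;
}
```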

Marc B