A friend is writing an advertisement script that puts links around select phrases in HTML code.

Naturally, if the phrase is already inside an <a> element (or anywhere else a link isn't allowed, such as inside an element's attribute), he doesn't want the script to write out a link, as that would break validation.

He asked me what I thought. After some bumbling around, I'm asking you all what you think.

Just to clarify, the input is a whole blog post in HTML. Example:

<p>This is a short blog post about ponies!</p>
<p>I have <a href="/ponies">written about ponies before</a>.</p>
<p><img src="/media/ponies.jpg" /></p>

For this example, say I want to replace ponies (any case) with <a href="http://www.ponies.com">ponies</a> (but with the original case).

The output from above should read:

<p>This is a short blog post about <a href="http://www.ponies.com">ponies</a>!</p>
<p>I have <a href="/ponies">written about ponies before</a>.</p>
<p><img src="/media/ponies.jpg" /></p>

We don't need full code but good ideas/regexes are immensely welcome. He's writing this in PHP but language-neutral is fine.

+3  A: 

I'm sorry, but I have to say:

Parsing Html The Cthulhu Way

astander
+1 for the suck-up-to-the-owner answer (which is a good answer btw)
Buggabill
Doesn't answer the question, merely points out one wrongity wrong way people sometimes approach this problem.
Adam Davis
Indeed. I'm not trying to parse the HTML so much as just check that a phrase is rendered text and isn't inside invalid elements. Sure, the answer may be to parse the HTML to find that out, but telling us what not to do doesn't get us any nearer to the best solution for this problem.
Oli
+5  A: 

Use an XPath expression that finds text nodes containing the string you want, but only if they're children of acceptable elements:

//p/text()[contains(.,'ponies')]

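That will give you text nodes that you know you can fiddle with directly. At this point, you can safely use a regular expression to find the keyword, but you're better off doing a direct search-and-replace instead of a pattern match. Note that contains() is case-sensitive, so as written this finds only the lowercase "ponies"; in XPath 1.0 you can lowercase the text first with translate(., 'PONIES', 'ponies').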

Used against the example input provided, the only match is "This is a short blog post about ponies!". The "ponies" in the <a> element is not matched, because this looks only for direct children of <p> elements. You can refine this to make it match other elements, such as <div>s, or only specific <p> elements (such as those with specific classes).

The nice bonus of using an XPath expression like this is that it only returns text nodes. That means "ponies" will never appear alongside any HTML elements, so you're definitely safe in using regular expressions after XPath has done its thing, without evoking Cthulhu's wrath.
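To make that last step concrete, here is a minimal sketch of the replacement applied to a single text node that the XPath query has already returned. It is written in Python purely for illustration (the asker's friend is using PHP), the name link_phrase is hypothetical, and it assumes the string is the node's raw text with no markup in it.

```python
import html
import re


def link_phrase(text, phrase, url):
    """Wrap every case-insensitive occurrence of `phrase` in a link.

    Assumes `text` is the content of a single text node, so there is
    no markup in it that the pattern could accidentally match.
    """
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    href = html.escape(url, quote=True)  # keep the attribute value valid
    # m.group(0) is the text exactly as it appeared, so the original case survives.
    return pattern.sub(lambda m: f'<a href="{href}">{m.group(0)}</a>', text)
```

Because the match itself is reused in the replacement, "Ponies" stays capitalised and "ponies" stays lowercase. Whether to add word boundaries (so that "ponies" inside a longer word is skipped) is a separate decision the question doesn't settle.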

XPath is a common method of dealing with XML and HTML. PHP has several XPath-capable libraries to choose from, such as the DOM extension's DOMXPath class and SimpleXML's xpath() method. Odds are you're already using a library that works with XPath.


An alternative method is to find all text nodes in the HTML document, and filter them by what their parents are. The result is exactly the same, but depending on your requirements this way might scale better:

//text()[parent::p and contains(.,'ponies')]

This expression reads like this:

//text()                  # Find all text nodes in the document
    [parent::p            # whose parent is a "p" element
    and                   # and
    contains(.,'ponies')] # contains the string "ponies"
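The same "visit every text node, then decide based on where it sits" idea can be sketched without XPath. The Python below is an illustration only, not the XPath approach itself: it uses the standard library's event-based html.parser, keeps a stack of open elements, and links the phrase only in text that has no blocked ancestor. It also inverts the test above (a deny-list of ancestors instead of an allowed parent). The class name PhraseLinker, the function link_phrase_in_html, and the default blocked list are assumptions made for this sketch; CDATA sections and processing instructions are not re-emitted.

```python
import html
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PhraseLinker(HTMLParser):
    """Re-emit the markup as-is, linking the phrase only in allowed text."""

    def __init__(self, phrase, url, blocked=("a", "script", "style", "textarea", "button")):
        # convert_charrefs=False reports entities like &amp; as separate
        # events, so they can be written back the way they appeared.
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(re.escape(phrase), re.IGNORECASE)
        self.href = html.escape(url, quote=True)
        self.blocked = set(blocked)
        self.stack = []  # names of the currently open elements
        self.out = []    # output fragments, joined at the end

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # raw tag, attributes untouched
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <img ... />

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating unclosed children.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Only touch text that has no blocked ancestor (for example <a>).
        if self.blocked.isdisjoint(self.stack):
            data = self.pattern.sub(
                lambda m: f'<a href="{self.href}">{m.group(0)}</a>', data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def link_phrase_in_html(source, phrase, url):
    parser = PhraseLinker(phrase, url)
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

Run against the three-paragraph example from the question, this links only the "ponies" in the first paragraph: the second occurrence is skipped because an <a> is open, and the one in the img src is never seen as text at all because start tags are copied through verbatim.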
Welbog
+1 Happy birthweek!
alex