ansaurus

Question

Check a phrase is not in an <a> (or other) element

Answer 1

+3 A:

I'm sorry, but I have to say:

Parsing Html The Cthulhu Way

astander 2009-11-18 17:40:51

+1 for the suck-up-to-the-owner answer (which is a good answer btw)

Buggabill 2009-11-18 17:44:56

Doesn't answer the question, merely points out one wrongity wrong way people sometimes approach this problem.

Adam Davis 2009-11-18 21:17:24

Indeed. I'm not trying to parse the HTML so much as just check that a phrase is rendered text and isn't inside invalid elements. Sure, the answer may be to parse the HTML to find that out, but telling us what not to do doesn't get us any nearer to the best solution for this problem.

Oli 2009-11-18 22:08:50

Answer 2

+5 A:

Use an XPath expression that finds text nodes containing the string you want, but only if they're children of acceptable elements:

//p/text()[contains(.,'ponies')]

That will give you text nodes that you know you can fiddle with directly. At this point, you can safely use a regular expression to find the keyword, but you're better off doing a direct search-and-replace instead of a pattern match.

Used against the example input provided, the only match is "This is a short blog post about ponies!". The "ponies" in the <a> element is not matched, because this looks only for direct children of <p> elements. You can refine this to make it match other elements, such as <div>s, or only specific <p> elements (such as those with specific classes).
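As a rough illustration of what that expression selects, here is a minimal Python sketch using only the standard library. It is not the XPath engine itself: ElementTree does not support text(), so the helper below approximates //p/text()[contains(.,'ponies')] by collecting each <p> element's own text and the tails of its children. The SAMPLE markup and the function names are made up for this example, and the input is assumed to be well-formed XHTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; the sample markup from the question is not reproduced here.
SAMPLE = (
    "<div>"
    "<p>This is a short blog post about ponies!</p>"
    "<p>Read more on <a href='/ponies'>ponies</a> and goats.</p>"
    "</div>"
)

def direct_text_nodes(element):
    """Yield the text nodes that are direct children of `element`."""
    if element.text:
        yield element.text      # text before the first child element
    for child in element:
        if child.tail:
            yield child.tail    # text following each child element

def matching_text_nodes(root, tag, phrase):
    """Approximate //tag/text()[contains(., phrase)] with ElementTree."""
    return [
        text
        for element in root.iter(tag)
        for text in direct_text_nodes(element)
        if phrase in text
    ]

root = ET.fromstring(SAMPLE)
matches = matching_text_nodes(root, "p", "ponies")
print(matches)
```

The "ponies" inside the <a> element belongs to the link's own text, not to the <p>, so it is skipped. With a real XPath library you would pass the expression straight to the library instead of writing these helpers.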

The nice bonus of using an XPath expression like this is that it only returns text nodes. That means "ponies" will never appear alongside any HTML elements, so you're definitely safe in using regular expressions after XPath has done its thing, without invoking Cthulhu's wrath.
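To make the difference between the two options concrete, here is a small Python sketch applied to a single text node. The replacement string is hypothetical, since the thread doesn't say what the keyword should become. For a literal keyword both approaches give the same result; the direct replacement just avoids having to escape anything.

```python
import re

# A text node as returned by the XPath query; it contains no markup.
text_node = "This is a short blog post about ponies!"
keyword = "ponies"
replacement = "PONIES"  # hypothetical replacement text

# Direct search-and-replace: the keyword is treated literally.
direct = text_node.replace(keyword, replacement)

# Regular-expression version: escape the keyword so it is matched literally,
# and use a function so the replacement is not parsed for backreferences.
via_regex = re.sub(re.escape(keyword), lambda match: replacement, text_node)

print(direct)
print(via_regex)
```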

XPath is a common method of dealing with XML and HTML. PHP has many XPath libraries for you to choose from. Odds are you're already using a library that works with XPath.

An alternative method is to find all text nodes in the HTML document, and filter them by what their parents are. The result is exactly the same, but depending on your requirements this way might scale better:

//text()[parent::p and contains(.,'ponies')]

This expression reads like this:

//text()                  # Find all text nodes in the document
    [parent::p            # whose parent is a "p" element
    and                   # and
    contains(.,'ponies')] # whose text contains the string "ponies"
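The same "visit every text node, then check its parent" idea can be sketched in Python with the standard library's html.parser, for cases where no XPath library is at hand. This is an assumption-laden sketch rather than a replacement for XPath: the class and function names are invented for the example, the sample markup is hypothetical, and the simple stack of open tags does not handle implicitly closed elements (such as a <p> with no closing tag).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}

class ParentFilter(HTMLParser):
    """Collect text nodes that contain `phrase` and whose parent is allowed."""

    def __init__(self, allowed_parents, phrase):
        super().__init__()
        self.allowed_parents = allowed_parents
        self.phrase = phrase
        self.open_tags = []   # stack of currently open elements
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        parent = self.open_tags[-1] if self.open_tags else None
        if parent in self.allowed_parents and self.phrase in data:
            self.matches.append(data)

def text_nodes_with_parent(html, allowed_parents, phrase):
    parser = ParentFilter(set(allowed_parents), phrase)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.matches

# Hypothetical input for illustration.
SAMPLE_HTML = (
    "<div><p>This is a short blog post about ponies!</p>"
    "<p>More about <a href='/ponies'>ponies</a> here.<br>Bye.</p></div>"
)
found = text_nodes_with_parent(SAMPLE_HTML, {"p"}, "ponies")
print(found)
```

Adding more tag names to the allowed set plays the same role as adding further parent:: tests to the XPath predicate.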

Welbog 2009-11-18 17:50:39

+1 Happy birthweek!

alex 2009-11-19 14:43:21
