I am working on a WordPress site where one of the pages lists excerpts about corporate clients.

Let's say I have a web page where the visible text looks like this:

"SuperAmazing.com, a subsidiary of Amazing, the leading provider of integrated messaging and collaboration services, today announced the availability of an enhanced version of its Enterprise Messaging Service (CMS) 2.0, a lower cost webmail alternative to other business email solutions such as Microsoft Exchange, GroupWise and LotusNotes offerings."

But let's say there can be an HTML link or image in this text, so the raw HTML might look like this:

"<img src="/images/corporate/logos/super_amazing.jpg" alt="Company logo for SuperAmazing.com" /> SuperAmazing.com, a subsidiary of <a href="http://www.amazing.com/">Amazing</a>, the leading provider of integrated messaging and collaboration services, today announced the availability of an enhanced version of its Enterprise Messaging Service (CMS) 2.0, a lower cost webmail alternative to other business email solutions such as Microsoft Exchange, GroupWise and LotusNotes offerings."

Here is what I need to do: find out if there is a link inside of the first 20 visible words.

These are the first 20 visible words:

"SuperAmazing.com, a subsidiary of Amazing, the leading provider of integrated messaging and collaboration services, today announced the availability of an"

I need to get the character count, including the HTML, up to and including the 20th visible word, which in this case would be "an", though of course it'll be different for each excerpt on the page.

(I'm willing to count "SuperAmazing.com" as 2 words if that makes things easier.)

I tried a number of regular expressions for counting words, but they all count the HTML, not just the visible words.

So what would be the correct regular expression for finding the full character count, including the HTML, for the first 20 visible words?

+1  A: 

I'm not sure about using PHP regular expressions to count words.

Assuming you can isolate the visible words in a variable, my initial approach would be to explode/split it at the spaces (or whatever gives what you regard as words) and put the results into an array.

After the split, limit the array to 20 elements.

Then apply a regular expression to each of the array elements and decide if any match a link.

To get the character count, join/implode the array of twenty words (without spaces) and find the length of the string.
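A minimal sketch of the steps above, assuming strip_tags() is an acceptable way to isolate the visible words (the sample string and variable names are made up for illustration). Note that once the tags are stripped, the link check has to be done against the raw HTML instead, and the length you get is that of the visible words only, not of the HTML:

```php
<?php
// Shortened sample excerpt (assumption: real excerpts come from the WordPress page).
$html = '<img src="/images/logo.jpg" alt="Logo" /> SuperAmazing.com, a subsidiary of '
      . '<a href="http://www.amazing.com/">Amazing</a>, the leading provider';

// Isolate the visible text; strip_tags() is one assumed way to do this.
$visible = trim(strip_tags($html));

// Split at runs of whitespace and keep at most 20 words.
$words = array_slice(preg_split('/\s+/', $visible, -1, PREG_SPLIT_NO_EMPTY), 0, 20);

// Character count of the words joined without spaces.
$length = strlen(implode('', $words));
```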

pavium
That's an interesting approach! I didn't think of split()
Fiarr
A: 

Regex and HTML do not mix. Counting using regex is unusual. Regex is the wrong solution to your problem. Use an HTML parsing library to extract the text. Then use some form of tokenizer to extract the words. You will save yourself a lot of headaches in the long run.

What headaches? Suppose you manage to construct a monstrous regex that does what you want. Now suppose two years later there's an edge case you didn't account for and you need to modify that monstrosity. You will at that point wish you had a coded solution that you could modify easily.
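As a rough sketch of what that could look like in PHP, assuming the DOM extension is available: the helper name hasLinkInFirstWords is made up for this example, and it only answers the "is there a link in the first 20 visible words" part. Words are counted per text node, so something like "Amazing" followed by a comma in a separate node counts as two words.

```php
<?php
// Hypothetical helper (not from any library): reports whether any of the
// first $limit visible words sits inside an <a> element.
function hasLinkInFirstWords(string $html, int $limit = 20): bool
{
    // Collect markup warnings internally instead of printing them.
    libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $seen = 0; // visible words counted so far

    // Visit text nodes in document order.
    foreach ($xpath->query('//text()') as $textNode) {
        $words = preg_split('/\s+/', $textNode->nodeValue, -1, PREG_SPLIT_NO_EMPTY);
        if (count($words) === 0) {
            continue; // whitespace-only node
        }

        // Fewer than $limit words precede this node, so its first word is
        // within the limit. Check whether the node lives inside a link.
        if ($xpath->query('ancestor::a', $textNode)->length > 0) {
            return true;
        }

        $seen += count($words);
        if ($seen >= $limit) {
            break;
        }
    }

    return false;
}
```

If the definition of a "word" changes later, only the preg_split() line needs to change, which is the kind of easy modification I mean.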

jmucchiello
A: 

The functions "getTextFromNode" and "getTextFromDocument" give you the text-only content of the HTML. The function "getFirstWords" returns the first $Count words of that text.

function getTextFromNode($Node, $Text = "") {
    // Non-element nodes (such as text nodes) have no tag name: append their content.
    if (!($Node instanceof DOMElement))
        return $Text.$Node->textContent;

    // Walk the element's children in document order, accumulating their text.
    // The loop also handles elements without children, where firstChild is null.
    for ($Child = $Node->firstChild; $Child != null; $Child = $Child->nextSibling)
        $Text = getTextFromNode($Child, $Text);

    return $Text;
}

function getTextFromDocument($DOMDoc) {
    return getTextFromNode($DOMDoc->documentElement);
}

function getFirstWords($Text, $Count = 1) {
    if (!($Count > 0))
        $Count = 1;

    $Text = trim($Text);

    // Split on runs of whitespace into at most $Count + 1 parts; the extra
    // last part holds the rest of the text. preg_split() is used because
    // split() was removed in PHP 7.
    $TextParts = preg_split('/\s+/', $Text, $Count + 1);
    if (count($TextParts) > $Count)
        array_pop($TextParts);

    return join(" ", $TextParts);
}

And you can use it like this:

$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");

$Text = getTextFromDocument($Doc);
echo "Text from HTML: ".$Text."\n";

$NewText = getFirstWords($Text, 20);
echo "First 20 words from HTML: ".$NewText."\n";

Hope this helps.

NawaMan
A: 

Here's a reasonably good regex for matching the first twenty visible words:

'~^(?:\s*+(?:(?:[^<>\s]++|</?\w[^<>]*+>)++)){1,20}~'

This matches one to twenty whitespace-separated tokens, where a token is defined as one or more words or tags not separated by whitespace (where a "word" is defined as one or more characters other than whitespace or angle brackets). For example, this would be one token:

<a href="http://www.amazing.com/">Amazing</a>

...but this is two tokens:

<a href="http://www.superduper.com/">Super Duper</a>

This will treat a standalone tag (like the <img> tag in your example, or any tag that's surrounded by whitespace) as a separate token, which throws off the count--it only matches up to the word "of" in your example. It also won't correctly handle <br> tags, or block-level tags like <p> and <table>, if they don't have any whitespace around them. Only you can know how much of a problem that will be.
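For what it's worth, here is a sketch of how the regex might be applied with preg_match() to get the character count and check for a link. The sample string is the (shortened) excerpt from the question, and the variable names are just for illustration:

```php
<?php
$pattern = '~^(?:\s*+(?:(?:[^<>\s]++|</?\w[^<>]*+>)++)){1,20}~';

// Shortened sample excerpt from the question.
$html = '<img src="/images/corporate/logos/super_amazing.jpg" alt="Company logo for SuperAmazing.com" /> '
      . 'SuperAmazing.com, a subsidiary of <a href="http://www.amazing.com/">Amazing</a>, the leading '
      . 'provider of integrated messaging and collaboration services, today announced the availability '
      . 'of an enhanced version of its Enterprise Messaging Service (CMS) 2.0';

$prefix = '';
if (preg_match($pattern, $html, $match)) {
    $prefix = $match[0]; // the first (up to) twenty tokens, HTML included
}

// strlen() counts bytes; multibyte text would need a character-aware count.
$charCount = strlen($prefix);

// Simple check for an opening <a> tag inside the matched prefix.
$hasLink = (bool) preg_match('~<a\b~i', $prefix);
```

With this sample the match ends at "availability of", because the standalone <img> tag uses up one of the twenty tokens.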

EDIT: If that isolated <img> tag is something you see a lot, you could preprocess the text to remove the whitespace following it. That would effectively merge it with the first subsequent "real" token, resulting in a more accurate character count. I know it only changes the count by one or two characters in this case, but if the twentieth word happened to be "supercalifragilisticexpialidocious" you'd probably notice the difference. :)
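Roughly, the preprocessing could look like this. It's only a sketch: the <img>-matching pattern is my own assumption about what your markup looks like, and it only handles <img> tags.

```php
<?php
$pattern = '~^(?:\s*+(?:(?:[^<>\s]++|</?\w[^<>]*+>)++)){1,20}~';

// Shortened sample excerpt from the question.
$html = '<img src="/images/corporate/logos/super_amazing.jpg" alt="Company logo for SuperAmazing.com" /> '
      . 'SuperAmazing.com, a subsidiary of <a href="http://www.amazing.com/">Amazing</a>, the leading '
      . 'provider of integrated messaging and collaboration services, today announced the availability '
      . 'of an enhanced version of its Enterprise Messaging Service (CMS) 2.0';

// Glue each <img> tag to the token that follows it by dropping the
// whitespace right after the tag.
$glued = preg_replace('~(<img\b[^<>]*+>)\s++~i', '$1', $html);

$prefix = '';
if (preg_match($pattern, $glued, $match)) {
    $prefix = $match[0]; // now covers the <img> tag plus twenty visible words
}
```

Now the match runs through "availability of an". Keep in mind that the length of $prefix is short by the whitespace that was removed, so add it back if you need an offset into the original string.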

Alan Moore
Thanks, Alan. I'm experimenting with this now. It looks very close to what I need. I think white space around HTML won't be too much of a problem. And I think it will be okay if the word count is off by 1.
cerhovice
I just realized the accuracy of the word count could be more important than I assumed at first. See my edit about that IMG tag.
Alan Moore