ansaurus

Question

PHP and regular expressions: how to get the character count of all characters in a string containing HTML, but measuring only 20 visible words?

Answer 1

A:

I'm not sure about using PHP regular expressions to count words.

Assuming you can isolate the visible words in a variable, my initial approach would be to explode/split it at the spaces (or whatever gives what you regard as words) and put the results into an array.

After the split, limit the array to 20 elements.

Then apply a regular expression to each of the array elements and decide if any match a link.

To get the character count, join/implode the array of twenty words (without spaces) and find the length of the string.
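A minimal sketch of the steps above, assuming the visible text has already been isolated into a plain string. The function name charCountOfFirstWords and its parameters are made up for illustration, and the link check on each word is left out:

```php
<?php
// Hypothetical helper: $visibleText is assumed to be plain text already
// (for example, the output of a DOM text extraction).
function charCountOfFirstWords(string $visibleText, int $limit = 20): int
{
    // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $words = preg_split('/\s+/', $visibleText, -1, PREG_SPLIT_NO_EMPTY);

    // Keep only the first $limit words.
    $words = array_slice($words, 0, $limit);

    // Join without spaces and measure the result.
    // Note: strlen() counts bytes, which equals characters only for
    // single-byte text.
    return strlen(implode('', $words));
}

// First three words are "The", "quick", "brown".
echo charCountOfFirstWords("The quick  brown fox", 3), "\n"; // prints 13
```

Splitting on whitespace is only one possible definition of a word; adjust the pattern to whatever you regard as a word boundary.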

pavium 2009-09-04 01:22:45

That's an interesting approach! I didn't think of split()

Fiarr 2009-09-04 02:11:41

Answer 2

A:

Regex and HTML do not mix. Counting using regex is unusual. Regex is the wrong solution to your problem. Use an HTML parsing library to extract the text. Then use some form of tokenizer to extract the words. You will save yourself a lot of headaches in the long run.

What headaches? Suppose you manage to construct a monstrous regex that does what you want. Now suppose two years later there's an edge case you didn't account for and you need to modify that monstrosity. You will at that point wish you had a coded solution that you could modify easily.

jmucchiello 2009-09-04 01:44:06

Answer 3

A:

The functions "getTextFromNode" and "getTextFromDocument" give you the text-only content of the HTML. The function "getFirstWords" returns the requested number of words from the start of the text.

function getTextFromNode($Node, $Text = "") {
    // Text nodes and other non-element nodes have no tag name;
    // take their content as-is.
    if (!($Node instanceof DOMElement))
        return $Text.$Node->textContent;

    // Walk the children in document order and accumulate their text.
    // The loop also covers elements with no children at all.
    for ($Child = $Node->firstChild; $Child != null; $Child = $Child->nextSibling)
        $Text = getTextFromNode($Child, $Text);

    return $Text;
}

function getTextFromDocument($DOMDoc) {
    return getTextFromNode($DOMDoc->documentElement);
}

function getFirstWords($Text, $Count = 1) {
    if (!($Count > 0))
        $Count = 1;

    $Text = trim($Text);

    // split() was removed in PHP 7, so use preg_split() instead.
    // Asking for one piece more than $Count leaves everything after
    // the first $Count words in the last piece.
    $TextParts = preg_split('/\s+/', $Text, $Count + 1);

    // Drop that trailing remainder if the text had more than $Count words.
    if (count($TextParts) > $Count)
        array_pop($TextParts);

    return join(" ", $TextParts);
}

You can use it like this:

$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");

$Text = getTextFromDocument($Doc);
echo "Text from HTML: ".$Text."\n";

$NewText = getFirstWords($Text, 20);
echo "First 20 words from HTML: ".$NewText."\n";

Hope this helps.

NawaMan 2009-09-04 02:47:55

Answer 4

A:

Here's a reasonably good regex for matching the first twenty visible words:

'~^(?:\s*+(?:(?:[^<>\s]++|</?\w[^<>]*+>)++)){1,20}~'

This matches one to twenty whitespace-separated tokens, where a token is defined as one or more words or tags not separated by whitespace (where a "word" is defined as one or more characters other than whitespace or angle brackets). For example, this would be one token:

<a href="http://www.amazing.com/">Amazing</a>

...but this is two tokens:

<a href="http://www.superduper.com/">Super Duper</a>

This will treat a standalone tag (like the <img> tag in your example, or any tag that's surrounded by whitespace) as a separate token, which throws off the count--it only matches up to the word "of" in your example. It also won't correctly handle <br> tags, or block-level tags like <p> and <table>, if they don't have any whitespace around them. Only you can know how much of a problem that will be.
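To turn the match into the character count the question asks for, the pattern can be run through preg_match() and the matched prefix measured. The sketch below is only an illustration: the helper name prefixLengthForTokens and the configurable upper bound are assumptions, not part of the original pattern.

```php
<?php
// Hypothetical helper: returns the length of the prefix of $html that spans
// at most $maxTokens whitespace-separated tokens (tags included).
function prefixLengthForTokens(string $html, int $maxTokens = 20): int
{
    // A bound below 1 would produce an invalid quantifier such as {1,0}.
    $maxTokens = max(1, $maxTokens);

    // Same pattern as above, with the upper bound made configurable.
    $pattern = '~^(?:\s*+(?:(?:[^<>\s]++|</?\w[^<>]*+>)++)){1,' . $maxTokens . '}~';

    if (preg_match($pattern, $html, $m) !== 1) {
        return 0; // nothing matched at the start of the string
    }

    // strlen() counts bytes; multibyte text would need a
    // multibyte-aware length function instead.
    return strlen($m[0]);
}

$html = '<a href="http://www.superduper.com/">Super Duper</a>';
echo prefixLengthForTokens($html, 1), "\n"; // length up to and including "Super"
echo prefixLengthForTokens($html, 2), "\n"; // length of the whole string
```

The returned length includes the markup inside the matched prefix, so it can be passed straight to substr() to cut the original HTML after the last counted token.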

EDIT: If that isolated <img> tag is something you see a lot, you could preprocess the text to remove the whitespace following it. That would effectively merge it with the first subsequent "real" token, resulting in a more accurate character count. I know it only changes the count by one or two characters in this case, but if the twentieth word happened to be "supercalifragilisticexpialidocious" you'd probably notice the difference. :)
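One way that preprocessing step might look, as a rough sketch: the helper name glueImgToNextToken is hypothetical, and the pattern assumes the tag contains no angle brackets inside its attribute values.

```php
<?php
// Hypothetical helper: removes whitespace that directly follows an <img> tag,
// so the tag is merged with the next token instead of counting on its own.
function glueImgToNextToken(string $html): string
{
    // Capture the whole <img ...> tag, then drop the whitespace after it.
    return preg_replace('~(<img\b[^<>]*+>)\s++~i', '$1', $html);
}

echo glueImgToNextToken('<img src="a.png"> Hello world'), "\n";
```

Note that this shortens the string by the removed whitespace, so a length measured afterwards refers to the preprocessed text, not the original.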

Alan Moore 2009-09-04 03:05:30

Thanks, Alan. I'm experimenting with this now. It looks very close to what I need. I think white space around HTML won't be too much of a problem. And I think it will be okay if the word count is off by 1.

cerhovice 2009-09-04 15:58:30

I just realized the accuracy of the word count could be more important than I assumed at first. See my edit about that IMG tag.

Alan Moore 2009-09-04 17:19:49
