I need a PHP script which takes a URL of a web page and then echoes how many times a word is mentioned.

Example

This is a generic HTML page:

<html>
<body>
<h1> This is the title </h1>
<p> some description text here, <b>this</b> is a word. </p>
</body>
</html>

This will be the PHP script:

<?php
$htmlurl = "generichtml.com";
// the script here
echo $result;
?>

So the output will be a table like this:

WORDS       Mentions
This        2
is          2
the         1
title       1
some        1
description 1
text        1
here        1
a           1
word        1

This is something like what search bots do when they crawl the web. Any idea how to begin, or even better, do you have a PHP script which already does this?

+10  A: 

The one line below will do a case insensitive word count after stripping all HTML tags from your string.


print_r(array_count_values(str_word_count(strip_tags(strtolower($str)), 1)));

To grab the source code of a page, you can use cURL or file_get_contents():

$str = file_get_contents('http://www.example.com/');

From inside out:

  1. Use strtolower() to make everything lower case.
  2. Strip the HTML tags using strip_tags().
  3. Create an array of the words used with str_word_count(). Passing 1 as the second argument makes it return an array containing all the words found inside the string.
  4. Use array_count_values() to count how often each value occurs in your array of words.
  5. Use print_r() to display the results.
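
The same steps can be spelled out one per line. This is only a sketch: the function name count_words() and the arsort() call are additions for illustration and are not part of the one-liner above. It is run here against the sample page from the question.

```php
<?php
// Hypothetical helper: count_words() is an illustrative name, not from the answer.
function count_words(string $html): array
{
    $lower  = strtolower($html);           // 1. lower-case everything
    $text   = strip_tags($lower);          // 2. drop the HTML tags (their content stays)
    $words  = str_word_count($text, 1);    // 3. format 1 = return the words as an array
    $counts = array_count_values($words);  // 4. word => number of occurrences
    arsort($counts);                       // extra step: most frequent words first
    return $counts;
}

// The sample page from the question.
$sample = '<html><body><h1> This is the title </h1>'
        . '<p> some description text here, <b>this</b> is a word. </p></body></html>';

// 5. Display the result as a two-column table instead of print_r().
foreach (count_words($sample) as $word => $mentions) {
    printf("%-12s %d\n", $word, $mentions);
}
```

To count a live page, pass in the string returned by file_get_contents() instead of $sample.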
Peter Ajtai
Nice and simple, but doesn’t take care of HTML tags...
Timwi
@Timwi - now it does
Peter Ajtai
+1 I would add a `strtolower()` in there too.
NullUserException
@NullU - Thanks, good idea.
Peter Ajtai
@DomingoSL - Live example with your sample code - http://codepad.org/7YJGYBVt
Peter Ajtai
Well yeah, but how 'bout `script` and `style` tags?
Yi Jiang
@Yi Jiang - If you want to deal with those separately, many HTML parsers already exist. There's no point in rewriting one, since they are fussy and complicated beasts.
Peter Ajtai
A: 

The previous code is a starting point. The next step is to delete the HTML tags with regular expressions; look at the ereg and eregi functions. Some other tricks are required for style and script tags (you have to remove their content). Periods and commas have to be removed too.

Charlie
`ereg`'s been deprecated and, to begin with, regexes are not an adequate tool for parsing arbitrary HTML.
Artefacto
How can regular expressions be deprecated if they have existed since Perl? O.O
Charlie
Answers are not always listed in chronological order on SO, so `previous code` isn't very helpful. A URL (each answer has a unique one) or an author reference is better.
Peter Ajtai
Regular expressions haven't been deprecated, only the ereg extension. Use PCRE instead (the `preg_` function family).
Artefacto
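
For anyone converting old `ereg` calls, here is a rough sketch of the `preg_` equivalents. The sample string is made up for illustration; the main differences are that PCRE patterns need delimiters and that case-insensitivity is the `i` modifier rather than a separate function.

```php
<?php
// Made-up sample text, for illustration only.
$text = 'Some description text here, THIS is a word.';

// ereg('this', $text) becomes preg_match() with delimiters around the pattern.
$caseSensitive = preg_match('/this/', $text);      // no lower-case "this" in $text

// eregi('this', $text) becomes the same pattern with the "i" modifier.
$caseInsensitive = preg_match('/this/i', $text);   // matches "THIS"

// Replace runs of punctuation (periods, commas, ...) with a single space.
$clean = preg_replace('/[[:punct:]]+/', ' ', $text);

var_dump($caseSensitive, $caseInsensitive, $clean);
```

This only covers cleaning up plain text; it is not a way to parse the HTML itself.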
Ah ok :) I misunderstood
Charlie
@ Peter Ajtai: Sorry I'm new here, thanks for the info ;)
Charlie
+1  A: 
ConroyP
This is a clean solution, but the content of style and script tags is still there. Also, the whole head of the page should be removed.
Charlie
If you use regular expressions, invalid HTML code can be analyzed too ;) Punctuation is still a problem.
Charlie
Please don't parse HTML with regular expressions.
Artefacto
@ConroyP - btw, strip_tags() (which you use) already removes multi line HTML comments and CDATA - http://codepad.org/gpdden0T http://php.net/manual/en/function.strip-tags.php .
Peter Ajtai
A: 

This is a complex job that you should not attempt on your own.

You have to extract text that is not part of tags/comments and is not a child of elements such as script and style. For this, you'll also need a lax HTML parser (like the one implemented in libxml2 and used in DOMDocument).
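
A minimal sketch of that extraction step with DOMDocument, assuming the dom extension is available. The function name visible_text() is hypothetical, and the XPath expression is just one way to skip script and style content. Note that loadHTML() guesses the encoding when the page does not declare one, so non-ASCII pages may need extra care.

```php
<?php
// Hypothetical helper: visible_text() is an illustrative name, not an existing API.
function visible_text(string $html): string
{
    $doc = new DOMDocument();

    // The libxml2 HTML parser tolerates sloppy markup but reports it as
    // warnings; collect them internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select text nodes only (comments are not text nodes) and skip anything
    // that sits inside a <script> or <style> element.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]');

    $parts = [];
    foreach ($nodes as $node) {
        $parts[] = $node->nodeValue;
    }
    return implode(' ', $parts);
}

$page = '<html><head><style>p { color: red; }</style>'
      . '<script>var hidden = 1;</script></head>'
      . '<body><!-- note --><p>Hello <b>world</b></p></body></html>';

echo visible_text($page), "\n";
```

The returned string still has to be tokenized and counted.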

Then you have to tokenize the text, which presents its own challenges. Finally, you'd be interested in some form of stemming before proceeding to counting the terms.
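
As a rough illustration of the tokenizing step only (no stemming), here is a sketch of a Unicode-aware tokenizer. It assumes UTF-8 input and the mbstring extension; tokenize() is a hypothetical name, and the sample sentence is made up. It is deliberately naive: apostrophes and hyphens split words, and languages without spaces between words are not handled.

```php
<?php
// Hypothetical helper: tokenize() is an illustrative name, not an existing API.
function tokenize(string $text): array
{
    // Case-fold with mbstring so that non-ASCII letters are lowered too.
    $lower = mb_strtolower($text, 'UTF-8');

    // \p{L} = any Unicode letter, \p{M} = combining marks;
    // the "u" modifier makes PCRE treat the subject as UTF-8.
    preg_match_all('/[\p{L}\p{M}]+/u', $lower, $matches);

    return $matches[0];
}

// Made-up sample sentence with accented words.
$counts = array_count_values(tokenize('Città, città e perché: naïve Naïve.'));
print_r($counts);
```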

I recommend you use specialized tools for this. I haven't used any of these, but you can try HTMLParser for parsing and Lucene for tokenization/stemming (the purpose of Lucene is Text Retrieval, but those operations are necessary for building the index).

Artefacto
A complex job? ConroyP's code works well and does a big part of what you listed. HTML has a very regular syntax.
Charlie
@Charlie There are so many things missing... Dealing with encodings that are not ASCII, proper handling of HTML (I could easily build an HTML document with a bible transcription that would yield no words whatsoever with his code), a proper tokenizer (`str_word_count` is very basic and only handles ASCII), a stemmer, ...
Artefacto
A stemmer? First, why add a stemmer that will not be able to find the roots in every language? (What is the purpose? The original question asked for a simple HTML parser, not a language analyzer.)
Charlie
You can [find](http://snowball.tartarus.org/) stemmers for several languages. The OP didn't say he wants stemming, but it's legitimate to assume he does, especially since there's already some form of term normalization in his question ("This" and "this" are counted as the same). And I suppose you concede the other points...
Artefacto
Yes, my doubts are still about the stemmer. The Italian one on the list you pointed to doesn't correctly match 30% of Italian words, and the vocabulary it contains is just 1% of Italian words (I'm not kidding). Martin Porter has written an algorithm that is good for English (perhaps) but not for other, more complex languages.
Charlie