ansaurus

Question

What data should I care about when retrieving only the contents of an HTML webpage?

Answer 1

+1 A:

Getting Your Text...

Headings (<h1> through <h6>), images (<img />), paragraphs (<p>), and links (<a>). Not much more than that, unless you want to count tables too.
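As a rough illustration of pulling out just those elements, here is a minimal sketch using PHP's built-in DOMDocument instead of phpQuery. The sample markup and the helper name textOf are made up for the example; a real page would be fetched first.

```php
<?php
// Hypothetical sample markup; in practice this would be the fetched page source.
$html = '<html><body><h1>Title</h1><p>First paragraph.</p>'
      . '<img src="cat.png" alt="A cat"><a href="/about">About us</a></body></html>';

// Real-world HTML is often malformed; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

// Hypothetical helper: trimmed text of every element with the given tag name.
function textOf(DOMDocument $doc, string $tag): array {
    $out = [];
    foreach ($doc->getElementsByTagName($tag) as $node) {
        $out[] = trim($node->textContent);
    }
    return $out;
}

// Headings are gathered level by level, so they are grouped by tag, not by page order.
$headings = [];
foreach (['h1', 'h2', 'h3', 'h4', 'h5', 'h6'] as $tag) {
    $headings = array_merge($headings, textOf($doc, $tag));
}

$paragraphs = textOf($doc, 'p');

// For links and images the attributes matter as much as the text.
$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = ['href' => $a->getAttribute('href'), 'text' => trim($a->textContent)];
}

$images = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    $images[] = ['src' => $img->getAttribute('src'), 'alt' => $img->getAttribute('alt')];
}

print_r(compact('headings', 'paragraphs', 'links', 'images'));
```

Image alt text and link text are often the only hints available when a page has very little body text.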

If you want to pull all of the text from the <body>, you can do so easily with a scraping tool like phpQuery (requires PHP):

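require_once "phpQuery/phpQuery.php"; // path is an assumption; adjust to your install
// Load the page and make it the default document for pq()
phpQuery::newDocument(file_get_contents("http://www.somesite.com"));
// Grab the text of <body> with all tags stripped
$body = pq("body")->text();
print $body;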

In that example, $body would hold all of the text content of the page. You could then search it for keywords to help you determine what the page is about.

Scanning Your Text for Keywords...

As you stated in your comment, you want to guard against porn URLs being submitted. Using this method, you can get the text. Once you have the text, you could scan it and build a list of keywords with the number of times each one occurs. That list should give you a good idea about the content/subject of the page (unless the page is just a video of some sort).

To learn how to build this list of keywords and instance counts, see the following question: Quickly Build List of Keywords from Text, Including # of Instances
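For a rough idea of what such a keyword count could look like, here is a minimal sketch in plain PHP. It is not taken from the linked question; the function name keywordCounts, the minimum word length, and the blocklist terms are all assumptions made for illustration.

```php
<?php
/**
 * Count how often each word occurs in a block of text.
 * Returns an array of word => count, sorted by count, highest first.
 * (Hypothetical helper, not part of phpQuery.)
 */
function keywordCounts(string $text, int $minLength = 4): array {
    // Lowercase, then split on anything that is not a letter, digit, or apostrophe.
    $words = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // Drop very short words, which are mostly articles and prepositions.
    $words = array_filter($words, function ($w) use ($minLength) {
        return strlen($w) >= $minLength;
    });

    $counts = array_count_values($words);
    arsort($counts);
    return $counts;
}

// Stand-in for the page text extracted from <body>.
$text = "Cheap flights and cheap hotels: book flights today.";
$counts = keywordCounts($text);
print_r($counts);

// Hypothetical blocklist: total occurrences of any flagged term on the page.
$blocklist = ['flights', 'casino'];
$hits = array_sum(array_intersect_key($counts, array_flip($blocklist)));
print "Blocklisted term occurrences: $hits\n";
```

A threshold on the blocklist total, or on its share of all words, could then decide whether a submitted URL gets flagged for review. This only handles ASCII text; pages in other languages would need a Unicode-aware split.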

Jonathan Sampson 2009-07-02 21:13:52

should be *to* learn how ...

a_m0d 2009-07-27 04:20:09
