We all know that the content of an HTML page isn't just the text between opening and closing tags such as <p></p>.
Beyond the image "alt" attribute and any "title" attributes, what else does HTML offer that I should treat as content?
Any suggestions?
Headings (<h1> - <h6>), images (<img />), paragraphs (<p>) and links (<a>). Not much more than that, unless you want to count tables too.
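If you want to collect those pieces separately, here is a rough sketch using PHP's built-in DOM extension rather than phpQuery. The function name extractContent and the array keys are made up for illustration, and it assumes you already have the page's HTML in a string:

```php
<?php
// Hypothetical helper: gather headings, paragraphs, link text and the
// "alt"/"title" attribute values from an HTML string.
function extractContent(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $content = [
        'headings'   => [],
        'paragraphs' => [],
        'links'      => [],
        'alt'        => [],
        'title'      => [],
    ];

    // Visible text of headings, paragraphs and links.
    foreach ($xpath->query('//h1|//h2|//h3|//h4|//h5|//h6') as $node) {
        $content['headings'][] = trim($node->textContent);
    }
    foreach ($xpath->query('//p') as $node) {
        $content['paragraphs'][] = trim($node->textContent);
    }
    foreach ($xpath->query('//a') as $node) {
        $content['links'][] = trim($node->textContent);
    }

    // Attribute values that also carry content.
    foreach ($xpath->query('//img/@alt') as $attr) {
        $content['alt'][] = $attr->value;
    }
    foreach ($xpath->query('//@title') as $attr) {
        $content['title'][] = $attr->value;
    }

    return $content;
}
```

You would call it as extractContent($html) and then inspect, say, the 'alt' and 'title' entries alongside the visible text.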
If you want to pull all of the text from the <body>, you can do so easily with a scraping tool like phpQuery (requires PHP):
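// Load the remote page into phpQuery (assumes the phpQuery library is already included).
phpQuery::newDocument(file_get_contents("http://www.somesite.com"));
// Grab the text of everything inside <body>, with the tags stripped.
$body = pq("body")->text();
print $body;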
In that example, $body would hold the text content of your entire page. You could then search it for keywords to help you determine what the page is about.
As you stated in your comment, you want to guard against porn URLs being submitted. Using this method, you can get the text of the page. Once you have the text, you can scan it and build a list of keywords and the number of times each occurs. That list should give you a good idea of the content/subject of the page (unless the page is just a video of some sort).
To learn how to build such a keyword/instance list, see the following question: Quickly Build List of Keywords from Text, Including # of Instances
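As a rough idea of what that keyword-counting step could look like, here is a minimal sketch. It is not the approach from the linked question; the function name keywordCounts and the $minLength cutoff are assumptions made for illustration, and lower-casing only handles ASCII letters:

```php
<?php
// Hypothetical helper: count how often each word occurs in a block of text,
// most frequent first.
function keywordCounts(string $text, int $minLength = 4): array
{
    // Lower-case (ASCII only), then split on anything that is not a letter or digit.
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Drop very short words ("and", "the", ...) that say little about the subject.
    $words = array_filter($words, fn($w) => strlen($w) >= $minLength);

    // Map each remaining word to its number of occurrences.
    $counts = array_count_values($words);

    // Sort by count, highest first, keeping the word as the key.
    arsort($counts);

    return $counts;
}
```

You could feed it the $body text from the phpQuery example and look at the first few entries of the returned array to judge what the page is about.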