Is there a way to extract specific data from raw HTML that was written unsemantically, with no IDs or classes? Suppose I have a saved HTML file of a web page (a profile) and want to extract a field such as 'hobbies'. Is it possible to do this using PHP?

+3  A: 

BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/), maybe?

empc
Isn't this for Python? The OP is looking for PHP.
echo
+1  A: 

Sounds like you're looking for a PHP DOM Parser, such as this one. It'll probably be a bit tricky to pull out the data you need if the HTML is truly devoid of semantic structure, but a DOM parser is the place to start.
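
As a minimal sketch of that starting point, here is one way it could look with PHP's bundled DOM extension (an assumption on my part, not necessarily the parser linked above). The markup and the findValueByLabel() helper are hypothetical; the idea is to locate the cell whose text is the label and read the cell next to it, since there are no IDs or classes to hook into.

```php
<?php
// Hypothetical saved profile markup: no ids, no classes, only structure.
$html = <<<MARKUP
<html><body>
<table>
  <tr><td>Name</td><td>Jane Doe</td></tr>
  <tr><td>Hobbies</td><td>Chess, hiking</td></tr>
</table>
</body></html>
MARKUP;

/**
 * Hypothetical helper: return the text of the cell that follows the cell
 * whose text equals $label, or null when no such cell exists.
 */
function findValueByLabel(string $html, string $label): ?string
{
    $doc = new DOMDocument();
    // Saved pages are rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('td') as $cell) {
        if (trim($cell->textContent) !== $label) {
            continue;
        }
        // Skip whitespace text nodes until the next element sibling.
        $next = $cell->nextSibling;
        while ($next !== null && $next->nodeType !== XML_ELEMENT_NODE) {
            $next = $next->nextSibling;
        }
        return $next === null ? null : trim($next->textContent);
    }
    return null;
}

echo findValueByLabel($html, 'Hobbies'), PHP_EOL; // prints "Chess, hiking"
```

Matching on the visible label text tends to survive small layout changes better than counting elements, but it only works if the page actually shows a label next to the value.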

echo
+1  A: 

Yes, the technique is called web scraping. You could use the DOM if it's valid HTML. If the page is dynamically generated, the generator will have used some consistent structure, and in my experience you can always isolate the elements of interest.

If the DOM does not work for you, you can just use regular expressions (that's what I always used to do when writing web spiders). Regular expressions are more effective and quicker than writing scraping logic against a DOM hierarchy. So you need to open a few of the profile pages and analyze the static structure. Then just write a regular expression to isolate the fields of interest.
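
A rough sketch of the regex route, assuming a hypothetical profile page where each label sits in one table cell and its value in the next. The sample markup and the extractField() helper are made up for illustration; a real pattern would come from inspecting a few of the actual pages.

```php
<?php
// Hypothetical fragment of a saved profile page.
$html = '<td><b>Name:</b></td><td>Jane Doe</td>'
      . '<td><b>Hobbies:</b></td><td>Chess, <i>hiking</i></td>';

/**
 * Hypothetical helper: capture the contents of the first <td> that follows
 * the given label, or return null when the pattern does not match.
 */
function extractField(string $html, string $label): ?string
{
    // Label, optional colon, any run of closing tags, then the value cell.
    $pattern = '~' . preg_quote($label, '~')
             . '\s*:?\s*(?:</[^>]+>\s*)*<td[^>]*>(.*?)</td>~is';

    if (preg_match($pattern, $html, $m) !== 1) {
        return null;
    }
    // Drop inline markup inside the value, then decode entities.
    return trim(html_entity_decode(strip_tags($m[1])));
}

echo extractField($html, 'Hobbies'), PHP_EOL; // prints "Chess, hiking"
```

A pattern like this is tied to the exact layout it was written against, so it needs rechecking whenever the page template changes.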

Hassan Syed
+2  A: 

Use regex! I kid, I kid. If you know the state of the page, and the format is guaranteed to remain similar enough, then you can try writing a manual parser. Alternatively, there are a lot of libraries out there that will parse HTML for you. I'm not familiar enough with PHP to recommend one, but I'm sure some Googling could take you a long way. I've had luck with John Resig's pure JavaScript HTML parser before.

At the end of the day, if you need semantic information from an HTML page that isn't constructed semantically, you're probably doomed programmatically and your best bet may be a Mechanical Turk.

Chris Clark
A: 

There are two approaches to take with PHP. The first is to clean your document up using the tidy extension so that it's valid XHTML, and therefore well-formed XML, which can be parsed using XML tools.
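
A minimal sketch of the tidy route, assuming the tidy extension is installed (the repairToDom() helper is hypothetical and returns null when it isn't). The 'output-xhtml' and 'numeric-entities' options ask tidy for XHTML with entities like &nbsp; written numerically, so the XML parser does not stumble on them.

```php
<?php
/**
 * Hypothetical helper: repair messy HTML with tidy and load the result as XML.
 * Returns null when tidy is unavailable or the repaired markup fails to parse.
 */
function repairToDom(string $messyHtml): ?DOMDocument
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not installed
    }

    $xhtml = tidy_repair_string($messyHtml, [
        'output-xhtml'     => true, // emit well-formed XHTML
        'numeric-entities' => true, // avoid named entities XML does not define
    ], 'utf8');

    if ($xhtml === false) {
        return null;
    }

    $doc = new DOMDocument();
    return $doc->loadXML($xhtml) ? $doc : null;
}

// Unclosed tags, as often found in hand-written pages.
$doc = repairToDom('<p>Hobbies: <b>chess<p>Age: 30');
if ($doc !== null) {
    echo $doc->getElementsByTagName('p')->length, " paragraph(s) after repair", PHP_EOL;
}
```

Note that tidy's XHTML output puts the elements in the XHTML namespace, so XPath queries against it need that namespace registered on the DOMXPath object.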

The second is to use the PHP release of the html5lib parser, which attempts to implement the parsing rules that the HTML5 effort worked out by studying how current browsers parse HTML. If it displays in a browser, html5lib can parse it.

Using either approach you'll end up with a DOM object you can query using XPath expressions. Since your theoretical documents lack semantic structure, you'll want to look at the document parts from a "the 5th span inside the 3rd p" mindset.
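
To make the "5th span inside the 3rd p" idea concrete, here is a small sketch using PHP's built-in DOMDocument and DOMXPath on a made-up fragment. The markup and the queryFirst() helper are hypothetical, and loadHTML() stands in here for whichever parser produced your DOM.

```php
<?php
// Hypothetical fragment: the value of interest is the 5th span in the 3rd p.
$html = '<body><p>intro</p><p>contact</p>'
      . '<p><span>a</span><span>b</span><span>c</span><span>d</span>'
      . '<span>Chess, hiking</span></p></body>';

/**
 * Hypothetical helper: return the trimmed text of the first node matching
 * $xpath, or null when nothing matches.
 */
function queryFirst(string $html, string $xpath): ?string
{
    $doc = new DOMDocument();
    // Suppress warnings about imperfect markup while parsing.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($xpath);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// XPath positions are 1-based: third <p> under <body>, fifth <span> inside it.
echo queryFirst($html, '//body/p[3]/span[5]'), PHP_EOL; // prints "Chess, hiking"
```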

More information here (self-link warning).

Alan Storm