views: 452
answers: 4
I'm working on a system that requires parsing HTML documents in PHP.

My question is simply this:

What's the best method of parsing content for the relevant information?

When I parse a site I don't want random content; I want to find the relevant content, such as blocks of text, images, and links, but obviously I don't want header or footer links.

So is there anything you can advise me to look at? Tips and tricks are also welcome :)

Regards

+4  A: 

Try Simple HTML Dom Parser:

http://simplehtmldom.sourceforge.net/

// Requires simple_html_dom.php from the library above
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');

// Find all images
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';
NAVEED
I know about Simple HTML DOM, but I was just looking for some more professional approaches +1
RobertPitt
What would make an approach more "professional"?
donut
How is using a tested library with 75,000 downloads and many active users unprofessional? I'm curious :)
Erik
Well, firstly there are things I need to prepare for, such as bad DOMs and invalid code, plus JS analysis against a DNSBL engine; this will also be used to look out for malicious sites/content. Also, as I have built my site around a framework of my own, the parser needs to be clean, readable, and well structured. Simple HTML DOM is great, but the code is slightly messy.
RobertPitt
As I said, I have used Simple HTML DOM many times before and it's excellent; I'm just looking for a system with cleaner code that's highly extensible, OO(P|D)-wise etc.
RobertPitt
@Robert you might also want to check out http://htmlpurifier.org/ for the security related things.
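Following up on that pointer, HTML Purifier filters untrusted markup against a whitelist, which covers the malicious-content concern. A minimal sketch, assuming the library has been downloaded from htmlpurifier.org and its `HTMLPurifier.auto.php` autoloader is on the include path (the allowed-element policy shown is just an illustrative choice):

```php
<?php
// Sketch only: requires the HTML Purifier library (htmlpurifier.org).
// HTMLPurifier.auto.php registers the library's autoloader.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
// Allow only a small set of elements/attributes (assumed policy, adjust to taste).
$config->set('HTML.Allowed', 'p,a[href],img[src|alt]');

$purifier = new HTMLPurifier($config);

// Script tags and event handlers are stripped from the output.
$untrustedHtml = '<p onclick="evil()">hi</p><script>bad()</script>';
$clean = $purifier->purify($untrustedHtml);
echo $clean;
```

Purifying after parsing (rather than before) keeps the DOM you analyze identical to the raw input.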
Gordon
He's got one valid point: Simple HTML DOM is hard to extend unless you use the decorator pattern, which I find unwieldy. I've found myself *shudder* just making changes to the underlying class(es) themselves.
Erik
+14  A: 

I prefer using one of the native XML extensions, like DOM, which can be combined with DOMXPath for querying.

If you prefer a 3rd-party lib, I'd suggest not using SimpleHtmlDom, but a lib that actually uses DOM/libxml underneath instead of string parsing, for instance phpQuery or Zend_Dom.

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows, so for HTML5 you want to consider using a dedicated parser.

Or use a web service that does the parsing for you.

If you want to spend some money, there are commercial offerings to look at as well.

Last, and least recommended: you can, to a very limited degree, extract data from HTML with regular expressions. In general, using regular expressions on HTML is discouraged, because HTML is not a regular language and the aforementioned libraries do a much better job at it.
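For the native-extension route above, a minimal sketch using DOM with DOMXPath. The markup and the `content`/`footer` ids are illustrative, but they show how an XPath query targets the content block while skipping footer links, which is exactly what the question asked for:

```php
<?php
// Illustrative markup: a content block and a footer block.
$html = '<html><body>'
      . '<div id="content"><a href="/a">A</a><a href="/b">B</a></div>'
      . '<div id="footer"><a href="/c">C</a></div>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect real-world markup
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Only anchors inside the content block; footer links are never matched.
foreach ($xpath->query('//div[@id="content"]//a') as $a) {
    echo $a->getAttribute('href'), "\n"; // prints /a then /b
}
```

For a live page, `$dom->loadHTMLFile('http://www.example.com/')` works the same way, since the DOM extension accepts URLs as well as local files.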

Gordon
Which of them is best, in your opinion?
NAVEED
@Naveed that depends on your needs. I have no need for CSS Selector queries, which is why I use DOM with XPath exclusively. phpQuery aims to be a jQuery port. Zend_Dom is lightweight. You really have to check them out to see which one you like best.
Gordon
+1 For nice collection.
NAVEED
I never knew Zend had created Zend_Dom :) +1
RobertPitt
I selected yours as the best answer because you actually posted many alternatives, some of which I never knew about. I'll be doing some benchmarks on Zend_Dom and see how that goes. Thanks!
RobertPitt
+2  A: 

This is commonly referred to as screen scraping, by the way. The library I have used for this is Simple HTML Dom Parser.

Joel Verhagen
Not strictly true (http://en.wikipedia.org/wiki/Screen_scraping#Screen_scraping). The clue is in "screen"; in the case described, there's no screen involved. Although, admittedly, the term has suffered an awful lot of recent misuse.
Bobby Jack
I'm not screen scraping; the content that will be parsed will be authorized by the content supplier under my agreement.
RobertPitt
I love it when I learn something from answering a question :)
Joel Verhagen
A: 

You should also look at http://stackoverflow.com/questions/3603511/html-scraping-and-css-queries

Quamis