views: 452
answers: 4
I'm working on a system that requires parsing HTML documents in PHP.

My question is simply this:

What's the best method of parsing content for the relevant information?

When I parse a site I don't want random content; I want to find the relevant content, such as blocks of text, images, and links, but obviously I don't want header or footer links.

So is there anything you can advise me to look at? Tips and tricks are also welcome :)

Regards

+4  A: 

Try Simple HTML Dom Parser:

http://simplehtmldom.sourceforge.net/

// Requires simple_html_dom.php from the library above
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');

// Find all images
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';
NAVEED
I know about Simple HTML DOM, but I was just looking for some more professional approaches +1
RobertPitt
What would make an approach more "professional"?
donut
How is using a tested library with 75,000 downloads and many active users unprofessional? I'm curious :)
Erik
Well, firstly there are things I need to prepare for, such as bad DOMs and invalid code, plus JS analysis against a DNSBL engine; this will also be used to look out for malicious sites/content. Also, as I have built my site around a framework of my own, the parser needs to be clean, readable, and well structured. Simple HTML DOM is great, but the code is slightly messy.
RobertPitt
As I said, I have used Simple HTML DOM many times before and it's excellent; I'm just looking for a system with cleaner code that's highly extensible, OO(P|D)-wise etc.
RobertPitt
@Robert you might also want to check out http://htmlpurifier.org/ for the security related things.
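Following up on that pointer, HTML Purifier filters untrusted markup against a whitelist, which covers the malicious-content concern. A minimal sketch, assuming the library has been downloaded from htmlpurifier.org and its `HTMLPurifier.auto.php` autoloader is on the include path (the allowed-element policy shown is just an illustrative choice):

```php
<?php
// Sketch only: requires the HTML Purifier library (htmlpurifier.org).
// HTMLPurifier.auto.php registers the library's autoloader.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
// Allow only a small set of elements/attributes (assumed policy, adjust to taste).
$config->set('HTML.Allowed', 'p,a[href],img[src|alt]');

$purifier = new HTMLPurifier($config);

// Script tags and event handlers are stripped from the output.
$untrustedHtml = '<p onclick="evil()">hi</p><script>bad()</script>';
$clean = $purifier->purify($untrustedHtml);
echo $clean;
```

Purifying after parsing (rather than before) keeps the DOM you analyze identical to the raw input.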
Gordon
He's got one valid point: Simple HTML DOM is hard to extend unless you use the decorator pattern, which I find unwieldy. I've found myself *shudder* just making changes to the underlying class(es) themselves.
Erik
+14  A: 

I prefer using one of the native XML extensions, like DOM, which can be combined with DOMXPath for querying.

If you prefer a 3rd-party lib, I'd suggest not using SimpleHtmlDom, but a lib that actually uses DOM/libxml underneath instead of string parsing, for instance phpQuery or Zend_Dom.

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows, so for HTML5 you want to consider using a dedicated parser.

Or use a web service that does the parsing for you.

If you want to spend some money, there are commercial offerings to look at as well.

Last, and least recommended: you can, to a very limited degree, extract data from HTML with regular expressions. In general, using regular expressions on HTML is discouraged, because HTML is not a regular language and the aforementioned libraries do a much better job at it.
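For the native-extension route above, a minimal sketch using DOM with DOMXPath. The markup and the `content`/`footer` ids are illustrative, but they show how an XPath query targets the content block while skipping footer links, which is exactly what the question asked for:

```php
<?php
// Illustrative markup: a content block and a footer block.
$html = '<html><body>'
      . '<div id="content"><a href="/a">A</a><a href="/b">B</a></div>'
      . '<div id="footer"><a href="/c">C</a></div>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect real-world markup
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Only anchors inside the content block; footer links are never matched.
foreach ($xpath->query('//div[@id="content"]//a') as $a) {
    echo $a->getAttribute('href'), "\n"; // prints /a then /b
}
```

For a live page, `$dom->loadHTMLFile('http://www.example.com/')` works the same way, since the DOM extension accepts URLs as well as local files.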

Gordon
Which of them is best, in your opinion?
NAVEED
@Naveed that depends on your needs. I have no need for CSS Selector queries, which is why I use DOM with XPath exclusively. phpQuery aims to be a jQuery port. Zend_Dom is lightweight. You really have to check them out to see which one you like best.
Gordon
+1 For nice collection.
NAVEED
I never knew Zend had created Zend_Dom :) +1
RobertPitt
I selected yours as the best answer because you actually posted many alternatives, some of which I never knew about. I'll be doing some benchmarks on Zend_Dom and see how that goes. Thanks!
RobertPitt
+2  A: 

This is commonly referred to as screen scraping, by the way. The library I have used for this is Simple HTML Dom Parser.

Joel Verhagen
Not strictly true (http://en.wikipedia.org/wiki/Screen_scraping#Screen_scraping). The clue is in "screen"; in the case described, there's no screen involved. Although, admittedly, the term has suffered an awful lot of recent misuse.
Bobby Jack
I'm not screen scraping; the content that will be parsed will be authorized by the content supplier under my agreement.
RobertPitt
I love it when I learn something from answering a question :)
Joel Verhagen
A: 

You should also look at http://stackoverflow.com/questions/3603511/html-scraping-and-css-queries

Quamis