When I try to load an HTML file as XML using simplexml_load_string, I get many errors and warnings about the HTML, and the load fails. Is there a way to properly load an HTML file using SimpleXML?

The HTML file may contain unneeded whitespace and possibly other errors that I would like SimpleXML to ignore.

A: 

Check this man page; one of those options (LIBXML_NOERROR, for example) might help you. But keep in mind that HTML is not necessarily valid XML, so parsing it as XML might not work.
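
Since HTML is often not well-formed XML, one common workaround is to let the DOM extension's lenient HTML parser read the markup and then hand the resulting tree to SimpleXML. The following is only a minimal sketch of that idea: `loadHtmlAsSimpleXml()` is a hypothetical helper name, and the sample markup is made up for illustration.

```php
<?php
// Sketch: parse lenient HTML with DOMDocument, then convert the tree to SimpleXML.
// loadHtmlAsSimpleXml() is a hypothetical helper name, not part of any library.
function loadHtmlAsSimpleXml(string $html): ?SimpleXMLElement
{
    if ($html === '') {
        return null; // DOMDocument::loadHTML() rejects empty input
    }

    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded || $dom->documentElement === null) {
        return null;
    }

    $xml = simplexml_import_dom($dom);
    return $xml instanceof SimpleXMLElement ? $xml : null;
}

// Made-up sample markup with unclosed tags.
$xml = loadHtmlAsSimpleXml('<html><body><p>Hello<br><img src="a.png"></body></html>');
if ($xml !== null) {
    foreach ($xml->xpath('//img') as $img) {
        echo $img['src'], "\n";
    }
}
```

The warnings are suppressed here rather than fixed, so very badly broken markup may still produce a tree that differs from what a browser would build.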

kgb
+2  A: 

I would suggest using PHP Simple HTML DOM. I've used it myself for everything from page scraping to manipulating HTML template files. It's very simple and quite powerful, and it should suit your needs just fine.

Here are a few examples from their docs that show the kind of things you can do:

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
       echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>'; 
seengee
I've had problems with Simple DOM because it uses the PHP DOM extension internally, and it won't load completely broken HTML pages.
Quamis
You could always clean the content first with PHP Tidy: http://php.net/manual/en/book.tidy.php
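
A rough sketch of that clean-up step, assuming the tidy extension is installed: the markup is repaired into XHTML first and only then passed to SimpleXML. `tidyToSimpleXml()` is a hypothetical helper name, and the two Tidy options shown are just one plausible configuration.

```php
<?php
// Sketch: repair HTML with Tidy, then parse the XHTML output with SimpleXML.
// tidyToSimpleXml() is a hypothetical helper name; it needs the tidy extension.
function tidyToSimpleXml(string $html): ?SimpleXMLElement
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not available
    }

    $xhtml = tidy_repair_string($html, [
        'output-xhtml'     => true, // emit well-formed XHTML
        'numeric-entities' => true, // avoid named entities such as &nbsp; that an XML parser does not know
    ], 'utf8');
    if ($xhtml === false) {
        return null;
    }

    // Keep any remaining parser complaints out of the PHP warning output.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($xhtml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $xml === false ? null : $xml;
}
```

Note that the XHTML produced this way declares a default namespace on the root element, so XPath queries against the result need that namespace registered first.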
seengee