ansaurus

Question

How to extract all text from an HTML file using PHP?

Answer 1

A:

First remove the script and style elements together with their full content, then apply your current way of stripping tags and you'll be left with the text.

Superfilin 2009-10-02 08:26:39

Answer 2

+7 A:

I personally think you should switch to an XML reader of some sort (SimpleXML, Document Object Model or XMLReader) to parse the HTML document. I'd go for a mix of DOM, SimpleXML and XPath to extract what you need - everything else will fail miserably when parsing arbitrary documents:

$dom = new DOMDocument();
$dom->loadHTML($html_content); // use DOMDocument because it can load HTML
$xml = simplexml_import_dom($dom); // switch to SimpleXML because it's easier to use.
$pTags = $xml->xpath('/html/body//p');
$tagsWithAltAttribute = $xml->xpath('/html/body//*[@alt]');
// ...

Stefan Gehrig 2009-10-02 08:29:15

I would go for this solution too. However, it'll break if the HTML content itself isn't valid (has broken tags, etc.).

rubayeet 2009-10-02 08:33:08

You're right - but building a parser using string and regex functions that can cope with arbitrary AND possibly malformed or invalid documents will be a lot more complicated. One solution would be to run the HTML string through HTML Tidy (http://de3.php.net/manual/en/book.tidy.php) before passing it to the XML reader. If the OP will parse well-known structured HTML (same structure all the time), he should probably go for the regex solution.

Stefan Gehrig 2009-10-02 08:42:17
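The Tidy step mentioned above might look roughly like the sketch below. It assumes the tidy extension may or may not be installed, so the hypothetical helper repairHtml() simply hands the input back unchanged when it is missing; the configuration options shown are one plausible choice, not a recommendation from the thread.

```php
<?php
// Hypothetical helper: repair markup with Tidy when the extension is loaded,
// otherwise return the input untouched.
function repairHtml(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        return $html; // tidy extension not available
    }

    $config = [
        'output-xhtml' => true, // ask Tidy for well-formed XHTML output
        'wrap'         => 0,    // do not re-wrap long lines
    ];

    $repaired = tidy_repair_string($html, $config, 'utf8');

    return $repaired === false ? $html : $repaired;
}

// Hypothetical sample: neither <b> nor <p> is closed.
$repaired = repairHtml('<p>Hello <b>world');
echo $repaired, "\n";
```

The repaired string can then be passed to the XML reader of your choice.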

@Stefan Gehrig: Thanks, this will work fine, but I still need to figure out how to fetch all the text, not only the "alt" attributes. It is a lot easier and safer than regular expressions.

ahmed 2009-10-02 08:47:55
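One way to fetch all the text, rather than specific elements or attributes, might be to query the text nodes directly with XPath. The sketch below is an assumption about how that could be done, not code from the thread: extractText() is a hypothetical helper name, the sample markup is made up, and the XPath expression skips anything inside script and style elements.

```php
<?php
// Hypothetical helper: collect all text from an HTML string,
// ignoring the content of <script> and <style> elements.
function extractText(string $html): string
{
    // Buffer parser warnings instead of emitting them as PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Every text node that has no <script> or <style> ancestor.
    $nodes = $xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]');

    $parts = [];
    foreach ($nodes as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $parts[] = $text; // keep only non-empty pieces
        }
    }

    return implode(' ', $parts);
}

// Hypothetical sample document.
$html = '<html><head><title>Demo</title><style>p { color: red; }</style></head>'
      . '<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>';

echo extractText($html), "\n";
```

Changing the expression to start at //body instead of // would leave out the page title.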

Unless you're using XHTML (which is a bad idea for various reasons atm) or XHTML-compatible HTML (which is mostly pointless), the document will never be well-formed XML (unless the source contains no meta tags, no links, no images ...). If you need that level of cleanliness, you're better off using a full-blown HTML sanitiser.

Alan 2009-10-02 08:49:15

@Alan: That's why we use the `DOMDocument::loadHTML()` method. It can deal with pure HTML and does not need XHTML to build the DOM tree. It cannot, however, deal with invalid HTML documents...

Stefan Gehrig 2009-10-02 08:53:38
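How `DOMDocument::loadHTML()` behaves on a particular broken document is easy to check. The underlying libxml parser generally tries to recover and reports problems as warnings, which `libxml_use_internal_errors()` lets you collect instead of displaying. The sketch below uses hypothetical sample markup; which messages are reported, if any, depends on the installed libxml version.

```php
<?php
// Hypothetical sample: an unclosed <b> and a stray </div>.
$broken = '<html><body><p>Hello <b>world</div></body></html>';

libxml_use_internal_errors(true); // buffer parser messages
$dom = new DOMDocument();
$loaded = $dom->loadHTML($broken);

foreach (libxml_get_errors() as $error) {
    // Each entry is a LibXMLError object with message, line and code.
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
libxml_clear_errors();

// Read the text from the root element if a tree was built.
$text = ($loaded && $dom->documentElement !== null)
    ? trim($dom->documentElement->textContent)
    : '';

echo $text, "\n";
```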

Answer 3

A:

First you can search for the <script> and <style> blocks and remove them from the HTML.

I have this function that I use a lot:

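// Return every block of $string delimited by $start and $end.
// With $borders = true the delimiters are included in each match.
function search($start, $end, $string, $borders = true) {
    // "!" is the regex delimiter; "s" lets "." span newlines, "i" ignores case
    $reg = "!" . preg_quote($start, "!") . "(.*?)" . preg_quote($end, "!") . "!is";
    preg_match_all($reg, $string, $matches);

    return $borders ? $matches[0] : $matches[1];
}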

The function returns the matching blocks as an array.

$array = search("<script>", "</script>", $html);

Once the script and style blocks are gone, use strip_tags to get the text.

Sabeen Malik 2009-10-02 08:33:32

This won't work if your script and style tags carry type attributes, as 95% of them do.

Alan 2009-10-02 08:39:29

That was just an example; you can use search("<script", "</script>", $html) instead.

Sabeen Malik 2009-10-02 08:40:31
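Putting the pieces of this answer together, a sketch might look like the following. The sample markup is hypothetical, and search() is repeated (lightly tidied) so the block runs on its own. Using "<script" without the closing ">" as the start delimiter is what lets it match tags that carry attributes.

```php
<?php
// search() from the answer: return all blocks between $start and $end.
function search($start, $end, $string, $borders = true)
{
    $reg = "!" . preg_quote($start, "!") . "(.*?)" . preg_quote($end, "!") . "!is";
    preg_match_all($reg, $string, $matches);

    return $borders ? $matches[0] : $matches[1];
}

// Hypothetical sample input with attributes on both tags.
$html = '<p>Hi</p><script type="text/javascript">var a = 1;</script>'
      . '<style type="text/css">p { margin: 0; }</style><p>there</p>';

foreach (['script', 'style'] as $tag) {
    // Find each whole block, delimiters included, and cut it out.
    $blocks = search("<$tag", "</$tag>", $html);
    $html = str_replace($blocks, '', $html);
}

// Only ordinary markup is left, so strip_tags yields the text.
$text = trim(strip_tags($html));
echo $text, "\n";
```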

Answer 4

A:

Any kind of parsing is not an option as long as you can't be sure the source is 100% well-formed XML (which HTML4, by definition, is not).

A simple preg_replace should suffice. Something like

preg_replace('/<(script|style)\b[^>]*>.*?<\/\1\s*>/is', '', $html);

should be enough to replace all the script and style elements and their contents with an empty string (i.e. strip them).
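When adapting a pattern like this, two details seem worth checking: that the quantifiers are lazy, so a match stops at the nearest closing tag, and that the `s` modifier is set, so `.` can span line breaks. A minimal sketch, with a hypothetical function name and hypothetical sample markup:

```php
<?php
// Hypothetical helper: drop <script> and <style> elements including their content.
function stripScriptAndStyle(string $html): string
{
    // \b[^>]*  allows attributes such as type="text/javascript"
    // .*?      is lazy, so each match ends at the nearest closing tag
    // s        lets "." match newlines; i makes tag names case-insensitive
    $result = preg_replace('/<(script|style)\b[^>]*>.*?<\/\1\s*>/is', '', $html);

    return $result ?? $html; // preg_replace returns null on error
}

// Sample with an attribute, an upper-case tag and a multi-line block.
$html = "<p>A</p><script type=\"text/javascript\">one();</script><p>B</p>"
      . "<STYLE>\np { color: red; }\n</STYLE><p>C</p>";

$clean = stripScriptAndStyle($html);

echo $clean, "\n";             // markup without the script/style blocks
echo strip_tags($clean), "\n"; // remaining text only
```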

If you want to avoid XSS attacks, however, you're probably better off using an HTML sanitiser to normalise the HTML and then strip all the bad code.

Alan 2009-10-02 08:43:45

Answer 5

A:

I posted this as an answer to another post, but here it is again:

We've just launched a new natural language processing API over at repustate.com. Using a REST API (so just using curl will be fine), you can clean any HTML or PDF and get back just the text parts. Our API is free, so feel free to use it to your heart's content. Check it out and compare the results to readability.js - I think you'll find they're almost 100% the same.

Martin 2010-05-31 19:52:46
