Hello

How do I extract all text from an HTML file?

I want to extract all of the text, including the text in alt attributes, <p> tags, etc.

However, I don't want to extract the text between style and script tags.

Thanks

Right now I have the following code:

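    <?php
    // $html_content is assumed to already hold the HTML source as a string.
    $string = trim(clean(strtolower(strip_tags($html_content))));
    // Split on whitespace and drop empty tokens so "" is never counted as a word.
    $arr = preg_split('/\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);
    $count = array_count_values($arr);
    foreach ($count as $value => $freq) {
        echo $value . "---" . $freq . "<br>";
    }

    // Replace every run of non-letter characters with a single space.
    function clean($in) {
        return preg_replace("/[^a-z]+/i", " ", $in);
    }

    ?>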

This works great, but it also retrieves the contents of script and style tags, which I don't want. The other problem is that I am not sure whether it retrieves attribute text such as alt, since the strip_tags function might remove the HTML tags together with their attributes.

Thanks

A: 

First remove script and style tags with full content, then use your current way of cleaning tags and you'll get the text.
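
A minimal sketch of those two steps, assuming a regular expression is acceptable for the first one. The helper name extractVisibleText() is made up for illustration and is not from this thread; note that alt attributes are still lost, because strip_tags discards attributes along with the tags.

```php
<?php
// Sketch only: extractVisibleText() is a hypothetical helper name.
function extractVisibleText(string $html): string
{
    // Step 1: drop <script>...</script> and <style>...</style> with their contents.
    // "s" lets "." match newlines, "i" ignores case, ".*?" keeps the match non-greedy.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', ' ', $html) ?? $html;

    // Put a space before every tag so adjacent blocks do not get glued together.
    $html = str_replace('<', ' <', $html);

    // Step 2: remove the remaining tags, decode entities, collapse whitespace.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $text));
}

$html = '<p>Hello</p><script type="text/javascript">var x = 1;</script><p>World</p>';
echo extractVisibleText($html), "\n";
```

The result can then be passed to the word-counting code from the question in place of strip_tags($html_content).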

Superfilin
+7  A: 

I personally think you should switch to an XML parser of some sort (SimpleXML, the Document Object Model or XMLReader) to parse the HTML document. I'd go for a mix of DOM, SimpleXML and XPath to extract what you need; everything else will fail miserably when parsing arbitrary documents:

    $dom = new DOMDocument();
    $dom->loadHTML($html_content); // use DOMDocument because it can load HTML
    $xml = simplexml_import_dom($dom); // switch to SimpleXML because it's easier to use.
    $pTags = $xml->xpath('/html/body//p');
    $tagsWithAltAttribute = $xml->xpath('/html/body//*[@alt]');
    // ...

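
To fetch all text rather than only the p tags and alt attributes, the same idea can be extended with a single XPath query. This is a rough sketch, not a definitive implementation: it assumes the DOM extension is available, uses DOMXPath directly instead of SimpleXML, and collectText() is a hypothetical helper name.

```php
<?php
// Sketch only: collectText() is a hypothetical helper name.
function collectText(string $html): array
{
    $dom = new DOMDocument();
    // Real-world HTML often triggers libxml warnings; collect them instead of printing.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Text nodes that are not inside <script> or <style>, plus every alt attribute.
    $nodes = $xpath->query(
        '//text()[not(ancestor::script) and not(ancestor::style)] | //@alt'
    );

    $texts = [];
    foreach ($nodes as $node) {
        $value = trim($node->nodeValue);
        if ($value !== '') {
            $texts[] = $value; // keep only non-empty pieces of text
        }
    }
    return $texts;
}

$html = '<html><head><style>p { color: red; }</style></head>'
      . '<body><p>Hello</p><img src="a.png" alt="A picture">'
      . '<script>var x = 1;</script></body></html>';
print_r(collectText($html));
```

Joining the returned array with spaces gives a string that can be fed into the word-counting loop from the question.
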
Stefan Gehrig
I would go for this solution too. However, it'll break if the HTML content itself isn't valid (has broken tags, etc.).
rubayeet
You're right, but building a parser using string and regex functions that can cope with arbitrary AND possibly malformed or invalid documents will be a lot more complicated. One solution would be to run the HTML string through HTML Tidy (http://de3.php.net/manual/en/book.tidy.php) before passing it to the XML reader. If the OP will parse well-known structured HTML (same structure all the time), he should probably go for the regex solution.
Stefan Gehrig
@Stefan Gehrig: Thanks, this will work fine, but I will try to find out how to fetch all of the text, not only "alt". It is a lot easier and safer than regular expressions.
ahmed
Unless you're using XHTML (which is a bad idea for various reasons atm) or XHTML-compatible HTML (which is mostly pointless), the document will never be well-formed XML (unless the source contains no meta tags, no links, no images ...). If you need that level of cleanliness, you're better off using a full-blown HTML sanitiser.
Alan
@Alan: That's why we use the `DOMDocument::loadHTML()` method. It can deal with pure HTML and does not need XHTML to build the DOM tree. It cannot, however, deal with invalid HTML documents...
Stefan Gehrig
A: 

First you can search for the <script> and <style> blocks and remove them from the HTML.

I have this function that I use a lot:

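    // Return every block in $string that starts with $start and ends with $end.
    function search($start, $end, $string, $borders = true) {
        $reg = "!" . preg_quote($start) . "(.*?)" . preg_quote($end) . "!is";
        preg_match_all($reg, $string, $matches);

        // $matches[0] includes the delimiters, $matches[1] only the text between them.
        if ($borders) return $matches[0];
        else return $matches[1];
    }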

the function will return matching blocks in array.

    $array = search("<script>", "</script>", $html);

Once the script and style blocks are gone, use strip_tags to get the text.
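
Putting the pieces together might look roughly like this. It is only a sketch: the search() helper is repeated so the example runs on its own, the sample $html is invented, and the opening delimiters are "<script" and "<style" without the closing ">" so that tags carrying attributes are matched too.

```php
<?php
// search() is the helper from this answer, repeated so the sketch is self-contained.
function search($start, $end, $string, $borders = true)
{
    $reg = "!" . preg_quote($start, "!") . "(.*?)" . preg_quote($end, "!") . "!is";
    preg_match_all($reg, $string, $matches);
    return $borders ? $matches[0] : $matches[1];
}

// Invented sample input for illustration.
$html = '<p>Keep me</p><script type="text/javascript">alert(1);</script><style>p{}</style>';

// Collect every script and style block, including the surrounding tags.
$blocks = array_merge(
    search('<script', '</script>', $html),
    search('<style', '</style>', $html)
);

// Cut every matched block out of the document, then strip the remaining tags.
$text = strip_tags(str_replace($blocks, '', $html));
echo $text, "\n";
```
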

Sabeen Malik
This won't work if your script and style tags use type attributes, like 95% of them do.
Alan
That was just an example; you can use search("<script", "</script>", $html) instead.
Sabeen Malik
A: 

Any kind of parsing is not an option as long as you can't be sure the source is 100% well-formed XML (which HTML4, by definition, is not).

A simple preg_replace should suffice. Something like

preg_replace('/<(script|style)\b[^>]*>.*?<\/\1\s*>/is', '', $html);

should be enough to replace all the script and style elements and their contents with an empty string (i.e. strip them).

If you want to avoid XSS attacks, however, you're probably better off using an HTML sanitiser to normalise the HTML and then strip all the bad code.

Alan
A: 

I posted this as an answer to another post, but here it is again:

We've just launched a new natural language processing API over at repustate.com. Using a REST API (so just using curl will be fine), you can clean any HTML or PDF and get back just the text parts. Our API is free, so feel free to use it to your heart's content. Check it out and compare the results to readability.js; I think you'll find they're almost 100% the same.

Martin