Let's say I have an HTML document.

How can I remove everything from it except the letters?

I want to remove the HTML tags.

I want to remove any special characters.

I want to remove everything except letters

and extract the text.

Thanks

+4  A: 

You can use strip_tags and preg_replace to accomplish this:

function clean($in)
{
    // Remove HTML
    $out = strip_tags($in);
    // Filter all other characters
    return preg_replace("/[^a-z]+/i", "", $out);
}

[^a-z] matches any character other than a to z. The + sign makes it match a run of one or more such characters, and the /i modifier makes the match case-insensitive, so uppercase letters are kept as well. Every match is replaced with an empty string, leaving only the letters.

If you want to keep spaces, you can use [^a-z ] instead, and if you want to keep numbers as well, [^a-z0-9 ]. This lets you whitelist the allowed characters and discard the rest.
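As a rough sketch of those two whitelist variants (the function names clean_keep_spaces and clean_keep_alnum are made up for illustration, and the type hints assume PHP 7 or later):

```php
<?php
// Hypothetical helper: keep letters and spaces only.
function clean_keep_spaces(string $in): string
{
    return preg_replace('/[^a-z ]+/i', '', strip_tags($in));
}

// Hypothetical helper: keep letters, digits and spaces.
function clean_keep_alnum(string $in): string
{
    return preg_replace('/[^a-z0-9 ]+/i', '', strip_tags($in));
}

$html = '<p>Hello, R2D2!</p>';
echo clean_keep_spaces($html), "\n"; // Hello RD
echo clean_keep_alnum($html), "\n";  // Hello R2D2
```

Note that removing characters between two spaces leaves both spaces behind, so you may want to collapse repeated spaces afterwards.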

Emil H
While that answers the question in the title, it doesn't remove HTML tags, which the body suggests he wants removed...
Stobor
True. See Jeremy's answer as well.
Emil H
+2  A: 

Use strip_tags() to get rid of HTML first, then use Emil H's regex.

yjerem
+1. Missed that one. Of course.
Emil H
+2  A: 

Prepend a

$in = preg_replace("/<[^>]*>/", "", $in);

to Emil H's solution so that your tags get stripped. Otherwise, "<p>Hello World</p>" will come out as "pHelloWorldp".
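A small sketch to show the difference, running the letters-only filter with and without the tag-removal step (the variable names are just for illustration):

```php
<?php
$in = "<p>Hello World</p>";

// Letters-only filter applied directly: the tag names survive as letters.
$without = preg_replace("/[^a-z]+/i", "", $in);
echo $without, "\n"; // pHelloWorldp

// Remove anything that looks like a tag first, then filter.
$stripped = preg_replace("/<[^>]*>/", "", $in);
$with = preg_replace("/[^a-z]+/i", "", $stripped);
echo $with, "\n"; // HelloWorld
```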

craesh