Let's say I have an HTML document.
How can I remove everything from the document except the text?
I want to remove the HTML tags,
remove any special characters,
remove everything except letters,
and extract the text.
Thanks
You can use strip_tags and preg_replace to accomplish this:
function clean($in)
{
    // Remove HTML
    $out = strip_tags($in);
    // Filter all other characters
    return preg_replace("/[^a-z]+/i", "", $out);
}
[^a-z] matches any character other than a to z, the + sign makes it match a run of one or more such characters, and the /i modifier makes the match case-insensitive. Every match is replaced with an empty string, leaving only the letters.
If you want to keep spaces, you can use [^a-z ] instead, and if you want to keep digits as well, [^a-z0-9 ]. This lets you whitelist the allowed characters and discard the rest.
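To see how the three character classes differ, here is a small sketch run on a made-up input string (the sample text and variable names are only for illustration):

```php
<?php
// Sample input, chosen only for this illustration.
$text = strip_tags('<p>Hello World 42!</p>'); // tags removed: "Hello World 42!"

// Keep letters only.
$lettersOnly = preg_replace('/[^a-z]+/i', '', $text);
// Keep letters and spaces.
$withSpaces = preg_replace('/[^a-z ]+/i', '', $text);
// Keep letters, digits and spaces.
$withDigits = preg_replace('/[^a-z0-9 ]+/i', '', $text);

echo $lettersOnly, "\n"; // HelloWorld
echo $withSpaces, "\n";  // "Hello World " (the space before "42!" is kept)
echo $withDigits, "\n";  // Hello World 42
```

Note that the second variant leaves a trailing space behind, so you may want to trim() the result.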
Use strip_tags() to get rid of HTML first, then use Emil H's regex.
Prepend
$in = preg_replace("/<[^>]*>/", "", $in);
to Emil H's solution so that the tags get stripped. Otherwise, "<p>Hello World</p>" will come out as "pHelloWorldp".
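A quick way to check this, sketched with the example string from above (variable names are made up for the illustration): filtering without removing the tags keeps the tag names, while either the regex or strip_tags() gets rid of them first.

```php
<?php
$in = '<p>Hello World</p>';

// Letters-only filter applied directly: the "p" from each tag survives.
$unstripped = preg_replace('/[^a-z]+/i', '', $in);

// Remove tags with the regex first, then filter.
$viaRegex = preg_replace('/[^a-z]+/i', '', preg_replace('/<[^>]*>/', '', $in));

// Remove tags with strip_tags() first, then filter.
$viaStripTags = preg_replace('/[^a-z]+/i', '', strip_tags($in));

echo $unstripped, "\n";   // pHelloWorldp
echo $viaRegex, "\n";     // HelloWorld
echo $viaStripTags, "\n"; // HelloWorld
```

For this simple input both approaches give the same result. The regex is only a rough match for tags, so for real documents strip_tags() is probably the safer choice.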