I've created a function to get plain text from HTML by stripping out JavaScript, CSS, HTML tags, etc. For that I've relied on PHP's preg_replace function to remove certain patterns. The web pages are already stored on the hard disk, so I'm reading the source code from disk. The function works properly for the source code of a single file; however, if I append the source code of multiple files and pass it to my function, preg_replace fails and returns FALSE. I tried get_last_error but nothing was reported. I'm also trimming the source code before concatenating (to remove EOFs).

Please also tell me how regular expressions are implemented on Windows, because unlike Linux there is no grep on Windows.

+1  A: 

Did you look at PHP's built-in strip_tags() function?

Otherwise, we have no idea what your code is actually doing, so it's very hard to identify why it isn't working as you want.

Mark Baker
strip_tags has a known limitation of 1024 characters. Everything above that remains unstripped.
bisko
@bisko that limit is per-tag, not the entire input.
Matt
By my tests it's for the entire input. When I pass it a long, valid HTML document, it strips only the first 1024 chars if tags are found in them.
bisko
@bisko you might want to check your test again, 1024 characters is [no problem](http://codepad.org/Fd7N9cXu).
Matt
Hm, it seems I really need to recheck the results I got in testing. It seems you are right, Matt. Thanks for pointing this out!
bisko
A: 

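When you have long HTML files, the preg family of functions can fail because of the PCRE backtrack limit in PHP (the pcre.backtrack_limit ini setting; see http://bugs.php.net/bug.php?id=40846). Note that preg_replace returns NULL on error, which compares loosely equal to FALSE.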
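To see why a call failed, you can check preg_last_error() right after it. The following is only a minimal sketch: the helper name replace_or_explain is hypothetical, and it assumes the failure shows up as a NULL return from preg_replace.

```php
<?php
// Hypothetical helper: runs preg_replace and explains the failure, if any.
function replace_or_explain(string $pattern, string $replacement, string $subject): string
{
    $result = preg_replace($pattern, $replacement, $subject);

    // preg_replace returns NULL (not FALSE) when PCRE reports an error.
    if ($result === null) {
        $code = preg_last_error();
        $reason = ($code === PREG_BACKTRACK_LIMIT_ERROR)
            ? 'backtrack limit exhausted (see pcre.backtrack_limit)'
            : "PCRE error code $code";
        throw new RuntimeException("preg_replace failed: $reason");
    }

    return $result;
}

echo replace_or_explain('/<[^>]+>/', '', '<p>hello</p>'), "\n";
```

If the reported reason is the backtrack limit, raising pcre.backtrack_limit with ini_set() is one option, though simplifying the pattern is usually the safer fix.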

You could try to work on smaller portions of the files and concatenate them after stripping the tags.
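As a rough sketch of that idea, the function below strips each document separately and joins the results afterwards. The name strip_each_then_join is hypothetical, the input is assumed to be an array of HTML strings already read from disk, and the single tag pattern stands in for whatever patterns your own function uses.

```php
<?php
// Hypothetical helper: strip tags from each document on its own,
// then concatenate the plain-text results.
function strip_each_then_join(array $documents): string
{
    $plain = [];
    foreach ($documents as $index => $html) {
        $text = preg_replace('/<[^>]+>/', '', trim($html));
        if ($text === null) {
            // Fail loudly so the offending document can be identified.
            throw new RuntimeException(
                "preg_replace failed on document $index, error code " . preg_last_error()
            );
        }
        $plain[] = $text;
    }

    return implode("\n", $plain);
}

echo strip_each_then_join(['<p>first page</p>', '<div>second page</div>']), "\n";
```

Because each call sees only one file, the regex engine never has to process the whole concatenated input at once.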

You could also rewrite your regular expressions to backtrack less if you rely heavily on .* patterns. For example,

/<.*?>/

could be rewritten as

/<[^>]+>/

And so on.
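One thing to keep in mind is that the two patterns are not exactly equivalent. Without the s modifier, . does not match a newline, so the lazy version skips tags that span several lines, while the negated character class matches them. A small illustration with a made-up input string:

```php
<?php
// Sample input (made up for illustration): the opening tag spans two lines.
$html = "<a\nhref='x'>link</a>";

// Lazy dot: "." stops at the newline, so the opening tag is left in place.
$lazy = preg_replace('/<.*?>/', '', $html);

// Negated class: [^>] also matches newlines, so both tags are removed.
$negated = preg_replace('/<[^>]+>/', '', $html);

var_dump($lazy === "<a\nhref='x'>link"); // bool(true)
var_dump($negated === 'link');           // bool(true)
```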

bisko