Hi there guys, I'm using strpos to look for a string in web page bodies. About 50% of the time it fails, although the search string is present. I have tried applying strtolower to both the search string and the searched content, with the same results. Probably the problem arises when dealing with different charsets...

Assuming:
- search string charset is unknown
- searched content charset is unknown
- charset could be any ISOxx, UTF-8, Shift-JIS

Is there a bulletproof function to find a substring?

A: 

Yup, convert the HTML to UTF-8/Latin-1 first: grab the content encoding from the Content-Type header or from the meta tag, convert to UTF-8/Latin-1 using iconv, then stop worrying about it.
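A minimal sketch of that approach, assuming you already have the raw Content-Type header value (or null) and the page body as strings. The helper names find_declared_charset and to_utf8 are made up for illustration, and the regular expressions only cover the common header and meta forms, so treat this as a starting point rather than a complete parser:

```php
<?php
// Hypothetical helper: pull a charset name out of a Content-Type header value
// or, failing that, out of a <meta> tag in the HTML. Returns null if none is declared.
function find_declared_charset(?string $contentType, string $html): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentType, $m)) {
        return $m[1];
    }
    // Matches both <meta charset="..."> and the http-equiv/content form.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: convert to UTF-8 with iconv. The input is returned
// unchanged when no charset is known or when iconv rejects the conversion.
function to_utf8(string $text, ?string $charset): string
{
    if ($charset === null || strcasecmp($charset, 'UTF-8') === 0) {
        return $text;
    }
    $converted = @iconv($charset, 'UTF-8', $text);
    return $converted === false ? $text : $converted;
}

// Usage: a Latin-1 page body searched with a UTF-8 needle.
$html = "<html><head><meta charset=\"ISO-8859-1\"></head><body>caf\xE9</body></html>";
$utf8 = to_utf8($html, find_declared_charset(null, $html));
var_dump(strpos($utf8, "caf\xC3\xA9") !== false);
```

The search string has to go through the same conversion if its charset differs. Converting to UTF-8 rather than Latin-1 is the safer target here, since Latin-1 cannot represent Shift-JIS text.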

nathan
Some ill-behaved pages don't declare a charset in the Content-Type. Will it also work for SJIS pages?
Riccardo
+1  A: 

You could try using mb_detect_encoding to detect the encoding first, then convert to the encoding you would like to use (using iconv or mb_convert_encoding) and search for the pattern in that encoding.
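A rough sketch of this, assuming the mbstring extension is loaded. The helper names normalise_to_utf8 and contains_text and the candidate list are illustrative assumptions, not part of any library. Detection is a heuristic: single-byte ISO encodings accept almost any byte sequence, so they are listed last, and strict mode is used to discard candidates in which the bytes are invalid:

```php
<?php
// Hypothetical helper: guess the encoding from a fixed candidate list and
// normalise to UTF-8. Returns null when no candidate fits the bytes.
function normalise_to_utf8(string $text, array $candidates = ['UTF-8', 'SJIS', 'ISO-8859-1']): ?string
{
    // Third argument = strict mode: reject encodings in which the bytes are invalid.
    $encoding = mb_detect_encoding($text, $candidates, true);
    if ($encoding === false) {
        return null;
    }
    return mb_convert_encoding($text, 'UTF-8', $encoding);
}

// Hypothetical helper: case-insensitive substring test after normalising both sides.
function contains_text(string $haystack, string $needle): bool
{
    $h = normalise_to_utf8($haystack);
    $n = normalise_to_utf8($needle);
    if ($h === null || $n === null || $n === '') {
        return false;
    }
    return mb_stripos($h, $n, 0, 'UTF-8') !== false;
}

// Usage: Latin-1 haystack, UTF-8 needle in a different case.
var_dump(contains_text("caf\xE9 au lait", "CAF\xC3\x89"));
```

Short strings give the detector very little to work with, so a declared charset, when the page provides one, is probably more trustworthy than the guess.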

wimvds
I've read somewhere that mb_detect_encoding fails easily...
Riccardo