Converting is easy. Detecting is the hard part. You could try mb_detect_encoding(), but that is a very shaky method: it literally "guesses" the encoding and, as @troelskn highlights in the comments, can at best tell "rough" differences apart (is it a multi-byte encoding?) while failing to detect the nuances between similar character sets.
The proper way would be IMO:
- Interpreting any content-type meta tags in the page
- Interpreting any content-type headers sent by the server
- If that yields nothing, try to "sniff" the encoding using mb_detect_encoding()
- If that yields nothing, fall back to a defined default (maybe ISO-8859-1, maybe UTF-8).
Contrary to the guidelines outlined in @Gumbo's answer, I personally think meta tags should take priority over server headers: if a meta tag is present, I'm pretty sure it is a more reliable indicator of the page's actual encoding than a server setting that some site operators don't even know how to change. The correct way, however, seems to be to give content-type headers the higher priority.
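The cascade above could be sketched roughly like this. It is only an illustration: detect_charset() is a made-up name, the regex is a simplification of real meta-tag parsing, and the header charset is assumed to have been extracted by the caller already. The checks follow the order of the list (meta tag first); swap the first two blocks if you want the header to win. It needs the mbstring extension.

```php
<?php
// Hypothetical helper, not from any library: name, signature and regex are illustrative.
function detect_charset(string $html, ?string $headerCharset = null, string $default = 'UTF-8'): string
{
    // 1. A charset declared in a meta tag. The pattern covers both
    //    <meta charset="x"> and <meta http-equiv="Content-Type" content="...; charset=x">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?\s*([\w.:-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }

    // 2. The charset taken from the server's Content-Type header, if any.
    if ($headerCharset !== null && $headerCharset !== '') {
        return strtoupper($headerCharset);
    }

    // 3. Sniffing. Strict mode with UTF-8 as the only candidate effectively
    //    asks "is this valid UTF-8?" instead of guessing between look-alikes.
    if (mb_detect_encoding($html, ['UTF-8'], true) !== false) {
        return 'UTF-8';
    }

    // 4. Nothing found: fall back to the configured default.
    return $default;
}
```

Restricting the sniffing step to UTF-8 is a deliberate choice in this sketch: single-byte encodings such as ISO-8859-1 accept almost any byte sequence, so adding them to the candidate list would make the default unreachable.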
For the former, I think you can use get_meta_tags(). The latter you should already be getting from cURL; you would just have to parse it. Here is a full example of how to systematically process response headers served by cURL.
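As a rough sketch of the header side: cURL can hand you the Content-Type value via curl_getinfo() with CURLINFO_CONTENT_TYPE, and the charset parameter then has to be picked out of that string. The function name charset_from_content_type() below is made up for illustration, and the regex only handles the common "type/subtype; charset=x" shape.

```php
<?php
// Hypothetical helper (not part of cURL): extract the charset parameter
// from a Content-Type value such as "text/html; charset=ISO-8859-1".
function charset_from_content_type(?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $contentType, $m)) {
        return $m[1];
    }

    // No header, or a header without a charset parameter.
    return null;
}

// Usage with cURL, assuming $ch is a handle on which curl_exec() has already run:
// $charset = charset_from_content_type(curl_getinfo($ch, CURLINFO_CONTENT_TYPE));
```

A null result simply means the header gave no answer and the next detection step should be tried.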
The conversion would then be using iconv:
$new_content = iconv("incoming-charset", "utf-8", $content);
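Since iconv() returns false when it cannot convert (for example when the detected charset name is unknown to it), it may be worth wrapping the call. A minimal sketch, with to_utf8() being a hypothetical name chosen for this example:

```php
<?php
// Hypothetical wrapper around iconv(); the name is illustrative.
function to_utf8(string $content, string $charset): string
{
    // Nothing to do if the content is already declared as UTF-8.
    if (strcasecmp($charset, 'UTF-8') === 0) {
        return $content;
    }

    // @ silences the notice/warning iconv() raises on failure;
    // the return value is checked explicitly instead.
    $converted = @iconv($charset, 'UTF-8', $content);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $charset to UTF-8");
    }

    return $converted;
}
```

Throwing here is just one option; falling back to the default charset and retrying would fit the detection steps above equally well.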