I'm currently scraping a website for various pieces of textual data (with permission, of course). The issue I'm seeing is that certain characters aren't encoded correctly in the process. This is particularly noticeable with apostrophes ('), which come out as garbled characters.

Currently, I use the following code to convert various HTML entities from the scraped data:

htmlentities($content, ENT_COMPAT, 'UTF-8', FALSE)

Is there a better way to handle this sort of thing?

A: 
Jeremy Morgan
notJim
The code posted is meant to be an example, not a complete solution. He can use preg_replace to clean up any entities that are present and make sure they are encoded properly. The example I posted turns a few named entities into their decimal equivalents, but you could use the same method for literals as well. The reason for using preg_replace is that it's more efficient: you can create a list of patterns and replacements, and also use regular expressions to speed things up. In fact, with enough regular expressions you could do this filtering in one step, as opposed to calling str_replace 40 times.
Jeremy Morgan
@Jeremy - str_replace() is much more efficient than preg_replace(), as it doesn't have to go through the regular expression engine. It also accepts arrays of search and replace values, just as you have used them here (see http://us.php.net/manual/en/function.substr-replace.php). If you're not actually using a regex (as your example clearly does not), you should always use str_replace(), as notJim pointed out.
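
To illustrate the array form of str_replace() described in this comment, here is a minimal sketch. The entity list and the helper name namedToDecimal are made up for illustration; this is not the code from the original answer.

```php
<?php
// Hypothetical lookup table: a few named entities and their decimal references.
$named   = ['&rsquo;', '&lsquo;', '&ndash;'];
$decimal = ['&#8217;', '&#8216;', '&#8211;'];

// Hypothetical helper: swap each named entity for its decimal equivalent.
function namedToDecimal(string $text, array $search, array $replace): string
{
    // str_replace() pairs the two arrays by position; no regex engine is involved.
    return str_replace($search, $replace, $text);
}

echo namedToDecimal('It&rsquo;s here &ndash; now', $named, $decimal), "\n";
// prints: It&#8217;s here &#8211; now
```

A single call handles the whole list, so there is no need to call str_replace() once per entity.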
JamesArmes
A: 

It's a little bit difficult to suggest things based on the information provided. Can you provide an example snippet of text maybe?

Failing that, I'll employ the shotgun approach (i.e., suggesting a bunch of things and hoping one of them hits).

First of all, are you sure the page you're accessing is encoded in UTF-8? What does mb_detect_encoding say?
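
A rough sketch of that check, assuming the mbstring extension is loaded. The candidate list and the helper name guessEncoding are assumptions for illustration. Detection is only a guess, so the sketch also validates with mb_check_encoding().

```php
<?php
// Hypothetical helper: guess the encoding of scraped bytes from a short candidate list.
function guessEncoding(string $bytes): string
{
    // Strict mode (third argument) rules out candidates in which the bytes are invalid.
    // ISO-8859-1 accepts any byte string, so it is listed last.
    $guess = mb_detect_encoding($bytes, ['UTF-8', 'ISO-8859-1'], true);
    return $guess === false ? 'unknown' : $guess;
}

$utf8   = "It\u{2019}s"; // right single quotation mark, encoded as UTF-8
$latin1 = "caf\xE9";     // "café" as ISO-8859-1 bytes, which is not valid UTF-8

echo guessEncoding($latin1), "\n";
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // is this a valid UTF-8 string?
var_dump(mb_check_encoding($latin1, 'UTF-8'));
```

If the page declares a charset in its headers or markup, that declaration is usually a better source of truth than detection.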

One option (may not work depending on your needs) would be to use iconv with the TRANSLIT option to convert the characters into something easier to handle using PHP. You could also look at using the mb_* functions for working with multibyte strings.
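
As a sketch of the iconv idea, assuming the input really is UTF-8 (toAscii is a hypothetical helper name). What //TRANSLIT produces depends on the iconv implementation PHP was built against, and some implementations do not support it at all, so treat the result as best effort.

```php
<?php
// Hypothetical helper: reduce UTF-8 text to ASCII where iconv knows an approximation.
function toAscii(string $utf8)
{
    // //TRANSLIT asks iconv to approximate characters the target charset lacks.
    // iconv() returns false if the conversion fails or is not supported.
    return iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
}

// Curly apostrophe in, implementation-dependent approximation out.
var_dump(toAscii("It\u{2019}s"));
```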

Are you sure htmlentities is the problem? If the content is UTF-8, and your site is set to serve ISO-8859-1, you're going to see odd characters. Check the encoding your browser is using to make sure it matches the encoding of the characters you're producing.
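
One way to rule out that mismatch is to declare the charset explicitly when serving the page. A minimal sketch, assuming the output is HTML and the target is UTF-8 (contentTypeHeader is a hypothetical helper name):

```php
<?php
// Hypothetical helper: build the header line that declares the response charset.
function contentTypeHeader(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Send it before any output so the browser decodes the page with the right charset.
if (!headers_sent()) {
    header(contentTypeHeader('UTF-8'));
}
```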

notJim
A: 

I don't see any issue with using htmlentities() as long as you pass false as the last parameter. This will ensure that you don't encode anything twice (such as turning &amp; into &amp;amp;).
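
A quick way to see the difference that last parameter makes (the sample string is made up for illustration):

```php
<?php
$text = 'Fish &amp; chips & "salt"';

// Fourth argument true (the default): the existing &amp; is encoded a second time.
$twice = htmlentities($text, ENT_COMPAT, 'UTF-8', true);
// Fourth argument false: existing entities are left alone.
$once = htmlentities($text, ENT_COMPAT, 'UTF-8', false);

echo $twice, "\n"; // Fish &amp;amp; chips &amp; &quot;salt&quot;
echo $once, "\n";  // Fish &amp; chips &amp; &quot;salt&quot;
```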

JamesArmes
+2  A: 

HTML entities have two goals:

  • Escape characters that have a special meaning in HTML, such as angle brackets, so they can be used as literals.
  • Display characters that are not supported by the character set you are using, such as the euro symbol in an ISO-8859-1 document.

They are not exactly an encoding tool.
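
The two goals map roughly onto two PHP functions. A small sketch, assuming the input string is UTF-8 (the sample string is made up):

```php
<?php
$snippet = "<b>5 \u{20AC}</b>"; // contains the euro sign

// Goal 1: escape only the characters that are special in HTML.
$escaped = htmlspecialchars($snippet, ENT_QUOTES, 'UTF-8');
// Goal 2: also replace every character that has a named entity, such as the euro sign.
$entities = htmlentities($snippet, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n";  // &lt;b&gt;5 €&lt;/b&gt;
echo $entities, "\n"; // &lt;b&gt;5 &euro;&lt;/b&gt;
```

Neither call changes the charset of the text; both only decide which characters get written as entities.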

If you want to convert from one charset into another one, I suggest you use iconv(). However, you must know both the source and the target charset. The source charset should be mentioned in the Content-Type response header and the target charset is something you decided when you started the site (although in your case it looks like UTF-8 is the most reasonable option).
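
A minimal sketch of that approach. The helper names, the regular expression, and the UTF-8 fallback when no charset is declared are all assumptions for illustration, not part of any library.

```php
<?php
// Hypothetical helper: pull the charset out of a Content-Type header value.
function charsetFromContentType(string $header, string $fallback = 'UTF-8'): string
{
    if (preg_match('/charset\s*=\s*"?([\w.:-]+)"?/i', $header, $m)) {
        return strtoupper($m[1]);
    }
    return $fallback; // assumption: treat undeclared content as UTF-8
}

// Hypothetical helper: convert a scraped body to the charset the site uses.
function toTargetCharset(string $body, string $contentType, string $target = 'UTF-8')
{
    $source = charsetFromContentType($contentType);
    // iconv() returns false when the bytes are not valid in the source charset.
    return $source === $target ? $body : iconv($source, $target, $body);
}

$body = "caf\xE9"; // "café" as ISO-8859-1 bytes
$converted = toTargetCharset($body, 'text/html; charset=ISO-8859-1');
echo bin2hex($converted), "\n"; // 636166c3a9, i.e. "café" in UTF-8
```

Once the text is in the target charset, htmlentities() or htmlspecialchars() can be applied for escaping only.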

Álvaro G. Vicario