What character encoding should I use for a web page containing mostly Arabic text?

Is utf-8 okay?

+4  A: 

UTF-8 is fine, yes. It can encode any code point in the Unicode standard.


Edited to add

To make the answer more complete, your realistic choices are:

  • UTF-8
  • UTF-16
  • UTF-32

Each comes with tradeoffs and advantages.

UTF-8

As Joe Gauterin points out, UTF-8 is very efficient for European text but gets less efficient the "farther" from the Latin alphabet you get. Letters in the basic Arabic block take two bytes each in UTF-8, the same as in UTF-16, but the Arabic Presentation Forms take three, so all-Arabic text can end up larger than the equivalent text in UTF-16. In practice this is rarely a problem in these days of cheap and plentiful RAM, unless you have a lot of text to deal with. More of a problem is that the variable length of the encoding makes some string operations difficult and slow. For example, you can't easily get the fifth Arabic character in a string, because some characters might be one byte long (ASCII punctuation, say) while others are two or three. This makes processing raw UTF-8 strings slow and error-prone.
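To make the indexing problem concrete, here is a minimal Python sketch. The sample string and the helper names `utf8_widths` and `byte_offset` are made up for illustration; they are not from any library.

```python
# Four Arabic letters (seen, lam, alef, meem) followed by ASCII ", hi".
# Escapes are used so the source reads the same in any editor.
sample = "\u0633\u0644\u0627\u0645, hi"


def utf8_widths(text):
    """Number of UTF-8 bytes used by each character of text."""
    return [len(ch.encode("utf-8")) for ch in text]


def byte_offset(text, index):
    """Byte position in the UTF-8 encoding where character `index` starts.

    There is no shortcut: every earlier character has to be measured.
    """
    return sum(utf8_widths(text[:index]))


print(utf8_widths(sample))     # per-character byte counts
print(byte_offset(sample, 4))  # where the fifth character starts
```

The fifth character does not start at byte 4, which is exactly why "give me character N" needs a scan when you work on the raw bytes.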

On the other hand, UTF-8 is likely your best choice if you're doing a lot of mixed European/Arabic text. The more European text in your documents, the better the UTF-8 choice will be.

UTF-16

UTF-16 can give you better space efficiency than UTF-8 if you're using predominantly Arabic text, particularly if it contains presentation forms. I don't know the Arabic code points well, however, so I can't say for certain whether you risk variable-length encodings here. (My guess is that this is not an issue: any code point below U+10000 fits in a single 16-bit code unit, and the commonly used Arabic blocks are all below that.) If you do, in fact, have variable-length encodings, all the string processing problems of UTF-8 apply here as well. If not, no problems.

On the other hand, if you have mixed European and Arabic text, UTF-16 will be less space-efficient. Also, if you find yourself expanding to other scripts like, say, Chinese, you can end up back with variable-length forms (surrogate pairs for characters outside the Basic Multilingual Plane) and the associated problems.
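A quick way to check whether a given character needs one or two UTF-16 code units is to encode it and count. This is only a sketch; `utf16_units` is a hypothetical helper and the three sample characters are my own picks.

```python
def utf16_units(ch):
    """Number of 16-bit code units needed for one character."""
    # "utf-16-le" is used so no byte order mark is added to the count.
    return len(ch.encode("utf-16-le")) // 2


meem = "\u0645"          # ARABIC LETTER MEEM, basic Arabic block
ligature = "\ufefb"      # a lam-alef ligature from Arabic Presentation Forms-B
rare_han = "\U00020000"  # a CJK ideograph outside the Basic Multilingual Plane

for ch in (meem, ligature, rare_han):
    print(f"U+{ord(ch):04X}", utf16_units(ch))
```

Anything that reports two units is a surrogate pair, and that is where the variable-length problems come back.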

UTF-32

UTF-32 will basically double your space requirements compared to UTF-16. On the other hand, it's constant-sized for all known (and, likely, unknown ;) scripts. For raw string processing it's your fastest, best option, without the problems that variable-length encoding will cause you. (This presupposes you have a string library that knows about 32-bit characters, naturally.)
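The constant-size property can be sketched in a few lines of Python. `nth_char_utf32` is a hypothetical helper, and it assumes little-endian UTF-32 data with no byte order mark.

```python
def nth_char_utf32(data, n):
    """Return character n from UTF-32 (little-endian, no BOM) bytes."""
    # Every code point occupies exactly four bytes, so the offset is 4 * n.
    return data[4 * n : 4 * n + 4].decode("utf-32-le")


# Four Arabic letters followed by ASCII ", hi" (8 characters in total).
sample = "\u0633\u0644\u0627\u0645, hi"
data = sample.encode("utf-32-le")

print(len(data))                # always 4 bytes per character
print(nth_char_utf32(data, 4))  # the fifth character, found by arithmetic
```

No scanning is needed, at the cost of four bytes for every character, including the ASCII ones.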

Recommendation

My own recommendation is that you use UTF-8 as your external format for storage, transmission, etc. (because everybody supports it), unless you really see a size benefit with UTF-16. So any time you read a string from the outside world it would be UTF-8, and any time you send one to the outside world it, too, would be UTF-8. Within your software, though, unless you're in the habit of manipulating massive strings (in which case I'd recommend different data structures anyway!), I'd recommend using UTF-16 or UTF-32 instead (depending on whether there are any variable-length encoding issues in your UTF-16 data) for the speed and simplicity of code.
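In code, that recommendation boils down to "decode at the edge, work on the decoded string, encode on the way out". A rough Python sketch, where `first_chars` is a made-up function and Python's own `str` type stands in for the internal format (it is indexed by code point, however it happens to be stored):

```python
def first_chars(raw_utf8, n):
    """Keep the first n characters of UTF-8 input and return UTF-8 output."""
    text = raw_utf8.decode("utf-8")   # external bytes -> internal string
    shortened = text[:n]              # plain indexing by code point
    return shortened.encode("utf-8")  # internal string -> external bytes


# Pretend these bytes arrived from a file or a socket.
incoming = "\u0633\u0644\u0627\u0645, hi".encode("utf-8")
outgoing = first_chars(incoming, 5)

print(len(incoming), len(outgoing))
```

Slicing the incoming bytes directly at position 5 would have cut an Arabic letter in half; slicing the decoded string cannot.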

JUST MY correct OPINION
About UTF-8 code points: according to this [Wikipedia](http://en.wikipedia.org/wiki/Arabic_alphabet#Unicode) page, Arabic and Arabic Supplement fall into the range 0600–077F, whereas the Presentation Forms fall into the range FB50–FEFF. A quick test using the [Unicode code converter](http://rishida.net/tools/conversion/) revealed that the former are represented in UTF-8 using two bytes and the latter take three bytes.
Marcel Korpel
+4  A: 

UTF-8 can store the full Unicode range, so it's fine to use for Arabic.


However, if you were wondering what encoding would be most efficient:

All Arabic characters can be encoded using a single UTF-16 code unit (2 bytes), but they may take up to 3 UTF-8 code units (1 byte each), so if you were just encoding Arabic, UTF-16 would be a more sensible option.

However, you're not just encoding Arabic. You're also encoding a significant number of characters that can be stored in a single byte in UTF-8 but take two bytes in UTF-16: all the HTML markup characters (<, &, >, =) and all the HTML element names.

It's a trade off and, unless you're dealing with huge documents, it doesn't matter.
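Here is a small Python sketch of that trade-off. The one-line "page" and the `sizes` helper are invented for the example; real pages carry far more markup than this.

```python
arabic = "\u0633\u0644\u0627\u0645"  # four letters from the basic Arabic block
page = '<p class="greeting">' + arabic + "</p>"


def sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


print("text only:", sizes(arabic))
print("with markup:", sizes(page))
```

Even with this tiny amount of markup, the ASCII tags are enough to tip the total in UTF-8's favour.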

Joe Gauterin
The second part you raised is often overlooked. It's worth a concrete example - here's one from Wikipedia: `"Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16. As a result, text in (for example) Chinese, Japanese or Hindi could take more space in UTF-8 if there are more of these characters than there are ASCII characters. This rarely happens in real documents, for example both the Japanese and the Korean UTF-8 article on Wikipedia take more space if saved as UTF-16 than the original UTF-8 version."` (Although I'd clarify this to *HTML* documents.)
Porges
@Porges: to add to that: according to [this Wikipedia page](http://en.wikipedia.org/wiki/Arabic_alphabet#Unicode), Arabic and Arabic Supplement fall into the range 0600–077F, whereas the Presentation Forms fall into the range FB50–FEFF. I suspect the former is more often used than the latter.
Marcel Korpel
+1  A: 

UTF-8 is the simplest way to go since it will work with almost everything:

UTF-8 can encode any Unicode character. Files in different languages can be displayed correctly without having to choose the correct code page or font. For instance Chinese and Arabic can be in the same text without special codes inserted to switch the encoding. (via wikipedia)

Of course keep in mind that:

UTF-8 often takes more space than an encoding made for one or a few languages. Latin letters with diacritics and characters from other alphabetic scripts typically take one byte per character in the appropriate multi-byte encoding but take two in UTF-8. East Asian scripts generally have two bytes per character in their multi-byte encodings yet take three bytes per character in UTF-8.

... but in most cases it's not a big issue. It would become one if you started handling huge documents.

marcgg
+1  A: 

Hello, I develop mostly Arabic websites and these are the two encodings I use:

1. Windows-1256

This is the most common encoding Arabic websites use. It works in most cases (90%) for Arabic users.

Here is one of the biggest Arabic web-development forums: http://traidnt.net/vb/. You can see that they are using this encoding.

The problem with this encoding is that if you are developing a website for international use, this encoding won't work with every user and they will see gibberish instead of the content.

2. UTF-8

This encoding solves the previous problem and also works in URLs. I mean that if you want to have Arabic words in your URL, they need to be in UTF-8 or it won't work.
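For the URL part, this is roughly what happens to an Arabic word when it is percent-encoded as UTF-8. It is just a sketch using Python's standard `urllib.parse`; the word is an arbitrary example.

```python
from urllib.parse import quote, unquote

word = "\u0633\u0644\u0627\u0645"  # four Arabic letters

encoded = quote(word)  # quote() percent-encodes the UTF-8 bytes by default
print(encoded)
print(unquote(encoded) == word)  # decoding gives the original word back
```

Each letter becomes two `%XX` escapes because it takes two bytes in UTF-8.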

The downside of this encoding is that if you are going to save Arabic content to a database (MySQL) using it (so the database will also be encoded with UTF-8), the content will take double the size it would if it were encoded with Windows-1256 (so the database will be encoded with latin-1).
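Both points (the size difference and the "won't work with every user" problem) can be checked with a short Python sketch. `encoded_size` and `can_encode` are hypothetical helpers; `cp1256` is the name Python uses for Windows-1256.

```python
word = "\u0633\u0644\u0627\u0645"  # four Arabic letters
han = "\u4e2d"                     # a Chinese character, for comparison


def encoded_size(text, encoding):
    """Size in bytes of text in the given encoding."""
    return len(text.encode(encoding))


def can_encode(text, encoding):
    """True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(encoded_size(word, "cp1256"), encoded_size(word, "utf-8"))
print(can_encode(han, "cp1256"), can_encode(han, "utf-8"))
```

So Windows-1256 stores Arabic letters in half the space, but it simply has no way to represent text outside its own character set.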

I suggest going with utf-8 if you can afford the size increase.

Maher4Ever
Ooh, excellent stuff, thanks for the info.
Paul D. Waite