views: 122
answers: 5

Scenario

You have lots of XML files stored as UTF-16 in a database or on a server where space is not an issue. You need to deliver a large majority of these files to other systems as XML files, and there it is critical that you use as little space as you can.

Issue

In reality, only about 10% of the files stored as UTF-16 actually need to be; the rest can safely be stored as UTF-8. If the files that need UTF-16 use it and the rest use UTF-8, we can use about 40% less space on the file system.

We have tried compressing the data, and while this helps, we find that we get the same compression ratio with UTF-8 as with UTF-16, and UTF-8 also compresses faster. So if as much of the data as possible is stored as UTF-8, we not only save space when it is stored uncompressed, we save even more space when it is compressed, and we save time on the compression itself.

Goal

To figure out when an XML file contains Unicode characters that require UTF-16, so that we only use UTF-16 when we have to.

Some Details about XML File and Data

While we control the schema for the XML itself, we do not control what kind of strings go into the values, as the source is free to provide any Unicode data to us. However, this is rare, so we would like to avoid using UTF-16 every time just to support something that is only needed 10% of the time.

Development Environment

We are using C# with the .NET Framework 4.0.

EDIT: Solution

The solution is just to use UTF-8.
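For what it's worth, a minimal sketch of forcing UTF-8 output in C# (the file name, element names, and sample text are placeholders of mine, not part of the original code): setting `XmlWriterSettings.Encoding` controls both the bytes written and the encoding declared in the XML header.

```csharp
using System.Text;
using System.Xml;

class Utf8XmlDemo
{
    public static void Main()
    {
        var settings = new XmlWriterSettings
        {
            Encoding = new UTF8Encoding(false), // UTF-8 without a byte-order mark
            Indent = true
        };

        // XmlWriter emits <?xml version="1.0" encoding="utf-8"?> automatically.
        using (XmlWriter writer = XmlWriter.Create("output.xml", settings))
        {
            writer.WriteStartElement("root");
            writer.WriteElementString("value", "any Unicode text is safe here: \u65E5\u672C\u8A9E");
            writer.WriteEndElement();
        }
    }
}
```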

The question was based on my misunderstanding of UTF and I appreciate everyone helping set me straight. Thank you!

+5  A: 

Edit: I didn’t realise that your question implies that you think that there are Unicode strings that cannot be safely encoded as UTF-8. This is not the case. The following answer assumes that what you really meant was that some strings will simply be longer (take more storage space) as UTF-8.


I would say even fewer than 10% of the files need to be stored as UTF-16. Even if your XML contains significant amounts of Chinese, Japanese, Korean, or another language that is larger in UTF-8 than in UTF-16, it is still only an issue if there is more text in that language than there is XML syntax.

Therefore, my initial intuition is “use UTF-8 until it’s a problem”. It makes for consistency, too.

If you have serious reason to believe that a large proportion of the XML will be East Asian, only then do you need to worry about it. In that case, I would apply a simple heuristic: go through the XML and count the number of characters greater than U+0800 (those are three bytes in UTF-8), and only if this is greater than the number of characters less than U+0080 (those are one byte in UTF-8), use UTF-16.
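A rough C# sketch of this heuristic (the class and method names are mine; surrogate pairs, which cost four bytes in both encodings, are deliberately ignored here):

```csharp
static class EncodingHeuristic
{
    // Returns true when UTF-16 is likely to be smaller: more characters that
    // cost 3 bytes in UTF-8 (>= U+0800) than characters that cost 1 byte (< U+0080).
    public static bool ShouldUseUtf16(string xmlText)
    {
        int threeByteChars = 0; // 3 bytes in UTF-8, 2 bytes in UTF-16
        int oneByteChars = 0;   // 1 byte in UTF-8, 2 bytes in UTF-16

        foreach (char c in xmlText)
        {
            if (c >= 0x0800) threeByteChars++;
            else if (c < 0x0080) oneByteChars++;
            // U+0080..U+07FF cost 2 bytes in both encodings, so they cancel out.
        }

        return threeByteChars > oneByteChars;
    }
}
```

For example, `ShouldUseUtf16("<a>text</a>")` returns `false`, since every character there is below U+0080.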

Timwi
I disagree with your heuristic. It should compare those less than or equal to U+007F (1 octet in UTF-8) with those that are greater than or equal to U+0800 and less than U+10000 (3 octets in UTF-8), as the others are equal in both (2 octets in both between U+0080 and U+07FF, and 4 in both at U+10000 and higher). That said, I'd be inclined just to go with UTF-8 all the time for the greater simplicity and consistency, unless a very large amount was not just East Asian but had such characters overwhelm lower-codepoint characters in the document.
Jon Hanna
+4  A: 

You never 'need' to use UTF-16 instead of UTF-8 and the choice is not about 'safety'. Both encodings have the same encodable character repertoire.

Juho Östman
The question says “it is critical that you use as little space as you can” and this answer doesn’t address that.
Timwi
Well, as for safety, if you know beforehand what you need to store and you have enough space for that, you are safe. If, however, the data can change arbitrarily and you have very limited storage, you are never safe; you could run out of space. It was this confusing language that distracted from the issue, and it seems I was not the only one here.
Juho Östman
+3  A: 

There is no such thing as a document that has to be UTF-16. Any UTF-16 document can also be encoded as UTF-8. It is theoretically possible to have a document which is larger as UTF-8 than as UTF-16, but this is vanishingly unlikely, and not worth stressing over.

Just encode everything as UTF-8 and stop worrying about it.

JSBangs
It is not vanishingly unlikely. It is true of any document written in Chinese, Japanese, Korean, Hindi, Gujarathi, Burmese, Thai, Khmer, ...
Timwi
Unless the XML tag names are in English.
dan04
@Timwi, I was under the impression that Chinese and Japanese only required 2 octets under UTF-8. Thanks for the correction.
JSBangs
@JSBangs: No problem. If they required only 2 octets, which would have to be of the binary form `110xxxxx 10xxxxxx` but not `1100000x 10xxxxxx`, you could have only 2^11−2^7 = 1920 characters. While admittedly Hiragana and Katakana may have just about fit in there (alongside Cyrillic, Greek, Armenian, Arabic, Hebrew, etc.etc.), certainly the Han ideographs are too numerous for that.
Timwi
+1  A: 

There are no characters that require UTF-16 rather than UTF-8. Both UTF-8 and UTF-16 (and for that matter, UTF-32 along with some other non-recommended formats) can encode the entire UCS (that's what UTF means).

There are some streams that will be smaller in UTF-16 than in UTF-8. However, in practice such streams will largely contain Asian ideographs which are linguistically very concise. At the same time, XML requires some characters in the 0x20–0x7F range with specific meanings, and element and attribute names quite often use alphabet-based scripts.

Because of the aforementioned concision of these ideographs, the ratio of XML tags (including the element and attribute names along with the less-thans and greater-thans) to human-targeted text will be much higher than in languages that use alphabets and syllabaries. For this reason, even in cases where plain text in UTF-16 would be appreciably smaller than the same text in UTF-8, when it comes to XML either this difference will be smaller, or the UTF-8 will still be smaller.

As a rule, use UTF-8 for transmission and storage.

Edit: Just noticed that you're compressing too. In which case, the balance is even less important, just use UTF-8 and be done with it.

Jon Hanna
“However, in practice such streams will largely contain Asian ideographs which are linguistically very concise.” True only of Chinese and Japanese, but not Korean, all non-Latin languages of India, Thai, Lao, Tibetan, Georgian, ...
Timwi
Indeed, to be more precise, true of some Korean and not true of all Japanese (don't know about Chinese). Unless such a case predominates in the data source (in which case I would say use UTF-16 consistently), I'd stand by what I say above.
Jon Hanna
+4  A: 

Encode everything in UTF-8. UTF-8 can handle anything UTF-16 can, and is almost surely going to be smaller in the case of an XML document. The only case in which UTF-8 would be larger than UTF-16 is a file largely composed of characters beyond the BMP, and in the best case (pure ASCII, i.e. every character you can type on a standard U.S. 104-key keyboard) a UTF-8 file would be half the size of its UTF-16 equivalent.

UTF-8 requires 2 bytes or less per character for all code points at or below U+07FF, and one byte for any character in the ASCII range; that means UTF-8 will be at least equal to UTF-16 in size (and probably far smaller) for any document in a modern-day language using the Latin, Greek, Cyrillic, Hebrew or Arabic alphabets, including most of the common symbols used in algebra and the IPA. That's known as the Base Multilingual Plane, and encompasses more than 90% of all official national languages outside of Asia.

UTF-16, as a general rule, will give you a smaller file for documents written primarily in the Devanagari (Hindi), Japanese, Chinese, or Hangul (Korean) alphabets, or any ancient or "esoteric" alphabet (Cherokee or Inuit anyone?), and MAY be smaller in cases of documents that heavily use specialized mathematical, scientific, engineering or game symbols. If the XML you're working with is for localization files for India, China and Japan, you MAY get a smaller file size with UTF-16, but you will have to make your program smart enough to know the localization file is encoded that way.
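The size comparisons above are easy to check with `Encoding.GetByteCount` (the sample strings are my own illustrations; note that `Encoding.Unicode` is UTF-16LE in .NET):

```csharp
using System;
using System.Text;

class SizeComparison
{
    public static void Main()
    {
        string ascii = "<item id=\"1\">hello</item>";
        string cjk = "<item>\u65E5\u672C\u8A9E\u306E\u30C6\u30AD\u30B9\u30C8</item>";

        // Pure ASCII: UTF-8 is exactly half the size of UTF-16.
        Console.WriteLine(Encoding.UTF8.GetByteCount(ascii));    // 25
        Console.WriteLine(Encoding.Unicode.GetByteCount(ascii)); // 50

        // CJK text inside ASCII markup: UTF-8 is still smaller here,
        // because the one-byte tag characters outweigh the 3-byte ideographs.
        Console.WriteLine(Encoding.UTF8.GetByteCount(cjk));    // 37
        Console.WriteLine(Encoding.Unicode.GetByteCount(cjk)); // 42
    }
}
```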

KeithS
To explain my choice for giving the check: technically all the answers are similar and correct. I didn't fully understand UTF, as my question shows, and this is the first answer that not only answers the question that really needed to be answered but also explained why. So I upvoted everyone who was in the ballpark on this issue and gave the check to Keith. I really appreciate all the help and everyone setting me straight. Thanks everyone!
Rodney Foley
There is a factual error in this answer though: The Basic Multilingual Plane extends from U+0000 to U+FFFF and includes all the modern daily-use East Asian scripts which require 3 bytes per character in UTF-8.
Timwi