To be more precise, I need to know whether (and, if possible, how) I can detect that a given string contains double-byte characters. Basically, I need to open a pop-up to display a given text, which can contain double-byte characters such as Chinese or Japanese. In that case, the window needs to be larger than it would be for English or ASCII text. Does anyone have a clue?

A: 

Why not let the window resize itself based on the runtime height/width?

Run something like this in your pop-up:

window.resizeTo(document.body.clientWidth, document.body.clientHeight);
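
If you also want to keep the pop-up within the screen, a minimal sketch could cap the requested size first. The helper name clampPopupSize is made up for illustration; screen.availWidth and screen.availHeight are the standard browser properties for the usable screen area.

```javascript
// Hypothetical helper (not part of any library): cap the requested
// pop-up size at the available screen area.
function clampPopupSize(contentWidth, contentHeight, availWidth, availHeight) {
    return {
        width: Math.min(contentWidth, availWidth),
        height: Math.min(contentHeight, availHeight)
    };
}

// In the pop-up itself (browser only), usage might look like:
// var size = clampPopupSize(document.body.clientWidth, document.body.clientHeight,
//                           screen.availWidth, screen.availHeight);
// window.resizeTo(size.width, size.height);
```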
Oli
Something like this should work in non-pathological cases; of course you'd need to make sure you're not exceeding the available screen space, or at least assume reasonable limits.
JasonTrue
+3  A: 

Actually, all of the characters are Unicode, at least from the JavaScript engine's perspective.

Unfortunately, the mere presence of characters in a particular Unicode range won't be enough to determine that you need more space. Many characters with codepoints well above the ASCII range take up roughly the same amount of space as ASCII characters. Typographic quotes, characters with diacritics, certain punctuation symbols, and various currency symbols are outside of the low ASCII range and are allocated in quite disparate places on the Unicode Basic Multilingual Plane.

Generally, projects that I've worked on elect to provide extra space for all languages, or sometimes use JavaScript to determine whether a window with auto-scrollbar CSS attributes actually has content tall enough to trigger a scrollbar.

If detecting the presence of, or count of, CJK characters will be adequate to determine that you need a bit of extra space, you could construct a regex using the following ranges: [\u3300-\u9fff\uf900-\ufaff], and use that to count the characters that match. (This is rather coarse: it misses all the non-BMP cases, probably excludes some other relevant ranges, and most likely includes some irrelevant characters, but it's a starting point.)
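
As a rough sketch of that counting idea (the function names and the threshold parameter are made up for illustration, and the ranges are just the coarse ones above):

```javascript
// Coarse CJK ranges from this answer; not a complete list.
// Note that kana (U+3040 to U+30FF) fall outside these ranges.
var CJK_PATTERN = /[\u3300-\u9fff\uf900-\ufaff]/g;

// Count the characters in the text that fall inside the ranges.
function countCjkCharacters(text) {
    var matches = text.match(CJK_PATTERN); // null when nothing matches
    return matches ? matches.length : 0;
}

// Hypothetical policy: ask for extra space once the count reaches a threshold.
function needsExtraSpace(text, threshold) {
    return countCjkCharacters(text) >= threshold;
}
```

Something like needsExtraSpace(message, 1) would then flag any text with at least one matching character; what threshold makes sense depends on your layout.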

Again, you're only going to be able to manage a rough heuristic without something along the lines of a full text rendering engine, because what you really want is something like GDI's MeasureString (or any other text rendering engine's equivalent). It's been a while since I've done so, but I think the closest HTML/DOM equivalent is setting a width on a div and requesting the height (cut and paste reuse, so apologies if this contains errors):

var o = document.getElementById("test");

document.defaultView.getComputedStyle(o, "").getPropertyValue("height");
JasonTrue
+1  A: 

JavaScript holds text internally as UCS-16, which can encode a fairly extensive subset of Unicode.

But that's not really germane to your question. One solution might be to loop through the string and examine the character codes at each position:

function isDoubleByte(str) {
    for (var i = 0, n = str.length; i < n; i++) {
        if (str.charCodeAt(i) > 255) { return true; }
    }
    return false;
}

This might not be as fast as you would like.

pcorcoran
I don't know JavaScript, but don't you mean UTF-16? There is no such thing as UCS-16; there were UCS-x encoding forms, now obsolete, in the ISO/IEC 10646 standard that's equivalent to Unicode. UCS-2 used exactly two bytes and could thus represent the first 2^16 Unicode characters. UTF-16, on the contrary, uses 16-bit units, but not necessarily a single one of those. All Unicode characters can be represented as UTF-16 byte sequences.
Arthur Reutenauer
I believe you mean UCS-2.
Paul Biggar
I did mean UCS-2, thank you.
pcorcoran
A: 

You can use a regular expression to figure out whether a string contains non-Latin codepoints:

  function containsNonLatinCodepoints(s) {
    return /[^\\u0000-\\u00ff]/.test(s);
  }
Mike Samuel
That also depends a bit on how you define "non-Latin". There are certainly several Latin based characters outside this range.
BalusC
Non-Latin means any codepoints outside the subset of Unicode that corresponds to ISO 8859-1. It's a pretty widely understood term. See http://en.wikipedia.org/wiki/ISO/IEC_8859-1
Mike Samuel
+1  A: 

I used Mike Samuel's answer for this one. However, I noticed (perhaps because of how this form handles escaping) that there should be only one backslash before the u, i.e. \u and not \\u, for this to work correctly.

function containsNonLatinCodepoints(s) {
    return /[^\u0000-\u00ff]/.test(s);
}

Works for me :)
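
For anyone wondering why the extra backslash matters: in a regex literal, \\ is a literal backslash, so the doubled form ends up describing a set of ordinary characters (a backslash, u, f, and the range from 0 to \) rather than the range U+0000 to U+00FF. A small comparison sketch (the variable names are just for illustration):

```javascript
// Double backslash: the class holds a literal backslash, 'u', 'f',
// and the range from '0' to '\', not the Latin-1 range.
var doubled = /[^\\u0000-\\u00ff]/;
// Single backslash: the class is the range U+0000 to U+00FF.
var single = /[^\u0000-\u00ff]/;

var samples = ["hello", "caf\u00e9", "\u65e5\u672c\u8a9e"];
var results = samples.map(function (s) {
    return { text: s, doubled: doubled.test(s), single: single.test(s) };
});
```

With the doubled form, even plain "hello" is reported as a match, while the single-backslash form only matches the last sample.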

james