ansaurus

Question

Javascript: How to find whether a particular string has unicode characters (esp. Double Byte characters)

Answer 1

A:

Why not let the window resize itself based on the runtime height/width?

Run something like this in your pop-up:

window.resizeTo(document.body.clientWidth, document.body.clientHeight);

Oli 2008-09-29 07:53:59

Something like this should work in non-pathological cases; of course you'd need to make sure you're not exceeding the available screen space, or at least assume reasonable limits.

JasonTrue 2008-09-29 08:12:08

Answer 2

+3 A:

Actually, all of the characters are Unicode, at least from the Javascript engine's perspective.

Unfortunately, the mere presence of characters in a particular Unicode range won't be enough to determine you need more space. A number of characters with Unicode codepoints well above the ASCII range take up roughly the same amount of space as ASCII characters. Typographic quotes, characters with diacritics, certain punctuation symbols, and various currency symbols are outside of the low ASCII range and are allocated in quite disparate places on the Unicode basic multilingual plane.

Generally, projects that I've worked on elect to provide extra space for all languages, or sometimes use javascript to determine whether a window with auto-scrollbar css attributes actually has content with a height which would trigger a scrollbar or not.

If detecting the presence of, or count of, CJK characters will be adequate to determine you need a bit of extra space, you could construct a regex using the following ranges: [\u3300-\u9fff\uf900-\ufaff], and use that to extract a count of the number of characters that match. (This is rather coarse: it misses all the non-BMP cases, probably excludes some other relevant ranges, and most likely includes some irrelevant characters, but it's a starting point.)
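A minimal sketch of that counting idea, using only the rough ranges quoted above. The function name countCjkCharacters is hypothetical, and the result is a heuristic, not a measurement of rendered width:

```javascript
// Hypothetical helper: counts characters that fall in the rough CJK ranges
// quoted in the answer ([\u3300-\u9fff\uf900-\ufaff]).
function countCjkCharacters(text) {
  // The g flag makes match() return every matching character, or null if none.
  var matches = text.match(/[\u3300-\u9fff\uf900-\ufaff]/g);
  return matches ? matches.length : 0;
}

// Three kanji plus some ASCII text.
var kanjiCount = countCjkCharacters("\u65e5\u672c\u8a9e text");

// Hiragana (U+3040 to U+309F) lies outside these ranges, so it is not counted.
var hiraganaCount = countCjkCharacters("\u3072\u3089\u304c\u306a");
```

The hiragana case is one example of the "excludes some other relevant ranges" caveat; widening the character class would be needed if kana or Hangul matter for your layout.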

Again, you're only going to be able to manage a rough heuristic without something along the lines of a full text rendering engine, because what you really want is something like GDI's MeasureString (or any other text rendering engine's equivalent). It's been a while since I've done so, but I think the closest HTML/DOM equivalent is setting a width on a div and requesting the height (cut and paste reuse, so apologies if this contains errors):

// "test" is the id of the div whose width has already been set
var o = document.getElementById("test");

document.defaultView.getComputedStyle(o, "").getPropertyValue("height");

JasonTrue 2008-09-29 08:10:28

Answer 3

+1 A:

JavaScript holds text internally as UCS-16, which can encode a fairly extensive subset of Unicode.

But that's not really germane to your question. One solution might be to loop through the string and examine the character codes at each position:

function isDoubleByte(str) {
    for (var i = 0, n = str.length; i < n; i++) {
        // charCodeAt(i) reads the 16-bit code unit at position i
        if (str.charCodeAt(i) > 255) { return true; }
    }
    return false;
}

This might not be as fast as you would like.

pcorcoran 2008-09-29 13:18:00

I don't know JavaScript, but don't you mean UTF-16? There is no such thing as UCS-16; there were UCS-x encoding forms, now obsolete, in the ISO/IEC 10646 standard that's equivalent to Unicode. UCS-2 used exactly two bytes and could thus represent the first 2^16 Unicode characters. UTF-16, on the contrary, uses 16-bit units, but not necessarily a single one of those. All Unicode characters can be represented as UTF-16 byte sequences.

Arthur Reutenauer 2009-11-08 20:21:21

I believe you mean UCS-2.

Paul Biggar 2010-02-08 04:06:13

I did mean UCS-2, thank you.

pcorcoran 2010-02-26 00:42:45
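To illustrate the point made in the comments above: a JavaScript string's length and charCodeAt work on 16-bit units, so a character outside the basic multilingual plane takes up two of them (a surrogate pair). A rough sketch of detecting that case, where the function name containsNonBmpCharacter is hypothetical:

```javascript
// Hypothetical helper: returns true if the string contains a surrogate pair,
// i.e. a character that needs two 16-bit units.
function containsNonBmpCharacter(str) {
  for (var i = 0; i + 1 < str.length; i++) {
    var unit = str.charCodeAt(i);
    var next = str.charCodeAt(i + 1);
    // A high surrogate (D800-DBFF) followed by a low surrogate (DC00-DFFF).
    if (unit >= 0xD800 && unit <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF) {
      return true;
    }
  }
  return false;
}

// U+20000, a CJK ideograph outside the BMP, written as its surrogate pair.
var supplementary = "\uD840\uDC00";
var unitCount = supplementary.length; // counts 16-bit units, not characters
```

A check like charCodeAt(i) > 255 still returns true for such a string, since each surrogate unit is itself above 255; the difference only matters when counting characters.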

Answer 4

A:

You can use a regular expression to figure out whether a string contains non-Latin codepoints

  function containsNonLatinCodepoints(s) {
    return /[^\\u0000-\\u00ff]/.test(s);
  }

Mike Samuel 2008-09-29 17:33:47

That also depends a bit on how you define "non-Latin". There are certainly several Latin based characters outside this range.

BalusC 2009-11-08 20:13:10

Non-latin means any codepoints outside the subset of unicode that corresponds to 8859-1. It's a pretty widely understood term. See http://en.wikipedia.org/wiki/ISO/IEC_8859-1

Mike Samuel 2009-11-10 02:11:24

Answer 5

+1 A:

I used Mike Samuel's answer for this one. However, I noticed (perhaps because of how this form escapes text) that there should be only one backslash before the u, i.e. \u and not \\u, for it to work correctly.

function containsNonLatinCodepoints(s) {
    return /[^\u0000-\u00ff]/.test(s);
}

Works for me :)

james 2009-11-08 20:06:34
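A few sample inputs show what this check does and does not flag, including the caveat raised in the comments on the previous answer about Latin-based letters outside the 8859-1 range. The sample strings and the checkAll helper are made up for illustration:

```javascript
function containsNonLatinCodepoints(s) {
  return /[^\u0000-\u00ff]/.test(s);
}

// Hypothetical helper: runs the check over a list of sample strings.
function checkAll(samples) {
  var results = [];
  for (var i = 0; i < samples.length; i++) {
    results.push(containsNonLatinCodepoints(samples[i]));
  }
  return results;
}

var results = checkAll([
  "hello",                    // plain ASCII
  "caf\u00e9",                // e-acute (U+00E9) is inside 8859-1
  "\u0141\u00f3d\u017a",      // L-stroke (U+0141) is a Latin letter outside 8859-1
  "\u201cquoted\u201d",       // typographic quotes (U+201C, U+201D)
  "\u65e5\u672c\u8a9e"        // kanji
]);
// results is [false, false, true, true, true]
```

So the test flags anything beyond U+00FF, which covers CJK text but also Polish letters and curly quotes that are no wider than ASCII.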
