views: 527
answers: 3

Hello,

I need to take a string of mixed Asian characters (for now, assume only Chinese hanzi or Japanese kanji/hiragana/katakana) and "alphanumeric" text (i.e., English, French), and count it in the following way:

1. count each Asian CHARACTER as 1;
2. count each alphanumeric WORD as 1.

a few examples:

株式会社myCompany = 4 chars + 1 word = 5 total
株式会社マイコ = 7 chars


my only idea so far is to use:

var wordArray=val.split(/\w+/);

and then check each element to see if its contents are alphanumeric (so count it as 1) or not (in which case count its characters). But I don't feel that's really very clever, and the text being counted might be up to 10,000 words, so it probably wouldn't be very quick either.
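
Roughly something like this, I suppose (just a sketch of the idea, untested; the split uses a capturing group so the alphanumeric runs are kept in the result):

// Split on runs of alphanumeric characters, keeping them via the capturing group.
var pieces = val.split(/(\w+)/);
var total = 0;
for (var i = 0; i < pieces.length; i++) {
  if (/^\w+$/.test(pieces[i])) {
    total += 1;                                      // an alphanumeric word counts as 1
  } else {
    total += pieces[i].replace(/\s+/g, '').length;   // otherwise count characters, ignoring whitespace
  }
}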

Ideas?

A: 

I think you want to loop over all characters, and increase a counter every time the current character is in a different word (according to your definition) than the previous one.

Thilo
A: 

You can iterate over the characters in the text, examining each one to look for word breaks. The following example does this, counting each Chinese/Japanese/Korean (CJK) ideograph as a single word and treating each run of alphanumeric characters as a single word.

Some notes on my implementation:

  1. It probably doesn't handle accented characters correctly. They will probably trigger word breaks. You can modify the wordBreakRegEx to fix this (see the sketch after these notes).

  2. cjkRegEx doesn't include some of the more esoteric code point ranges, since they require 5 hex digits to reference and JavaScript's regex engine doesn't seem to let you do that. But you probably don't need to worry about these, since I don't even think most fonts include them.

  3. I deliberately left Japanese Hiragana and Katakana out of cjkRegEx, since I'm not sure how you want to handle them. Depending on the type of text you're dealing with, it might make more sense to treat runs of them as single words. In that case, you'd need to add logic to recognize being in a "kana word" versus an "alphanumeric word". If you don't care, then you just need to add their code point ranges to cjkRegEx (see the sketch after these notes). Of course, you could try to recognize word breaks within kana strings, but that quickly becomes Very Hard.
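
For example, a rough sketch of both adjustments (untested, and the exact ranges are a judgement call):

// Hiragana (U+3040-U+309F) and Katakana (U+30A0-U+30FF) ranges added,
// so each kana character is counted like an ideograph.
var cjkRegEx = /[\u3040-\u309f\u30a0-\u30ff\u3400-\u4db5\u4e00-\u9fa5\uf900-\ufa2d]/;

// Treat accented Latin letters (roughly U+00C0-U+024F) as word characters
// rather than word breaks, so "café" stays one word. (That range also
// includes a couple of non-letters such as U+00D7, which is close enough
// for a sketch.)
var wordBreakRegEx = /[^\w\u00c0-\u024f]/;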

Example implementation:

function getWordCount(text) {
  // This matches all CJK ideographs.
  var cjkRegEx = /[\u3400-\u4db5\u4e00-\u9fa5\uf900-\ufa2d]/;

  // This matches all characters that "break up" words.
  var wordBreakRegEx = /\W/;

  var wordCount = 0;
  var inWord = false;
  var length = text.length;
  for (var i = 0; i < length; i++) {
    var curChar = text.charAt(i);
    if (cjkRegEx.test(curChar)) {
      // Character is a CJK ideograph: count it as a word, and if an
      // alphanumeric word was in progress, count that word too (hence the 2).
      wordCount += inWord ? 2 : 1;
      inWord = false;
    } else if (wordBreakRegEx.test(curChar)) {
      // Character is a "word-breaking" character.
      // If a word was started, increment the word count.
      if (inWord) {
        wordCount += 1;
        inWord = false;
      }
    } else {
      // All other characters are "word" characters.
      // Indicate that a word has begun.
      inWord = true;
    }
  }

  // If the text ended while in a word, make sure to count it.
  if (inWord) {
    wordCount += 1;
  }

  return wordCount;
}
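
Using the first example from the question, it should give:

getWordCount("株式会社myCompany");   // 5 (4 ideographs + 1 alphanumeric word)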

The Unihan Database is very helpful for learning about CJK in Unicode. The Unicode home page, of course, also has loads of info.

Annabelle
+1  A: 

Unfortunately JavaScript's RegExp has no support for Unicode character classes; \w only applies to ASCII characters (modulo some browser bugs).

You can use Unicode characters in character classes, though, so you can do it if you can isolate each set of characters you are interested in as a range, e.g.:

var r= new RegExp(
    '[A-Za-z0-9_]+|'+                              // ASCII letters (no accents)
    '[\u3040-\u309F]+|'+                           // Hiragana
    '[\u30A0-\u30FF]+|'+                           // Katakana
    '[\u4E00-\u9FFF\uF900-\uFAFF\u3400-\u4DBF]',   // Single CJK ideographs
'g');

var nwords= (str.match(r) || []).length;   // "|| []" guards against match() returning null when there are no matches

(This attempts to give a more realistic count of ‘words’ for Japanese, counting each run of one type of kana as a word. That's still not right, of course, but it's probably closer than treating each syllable as one word.)
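
For example, with the two strings from the question (the second comes out as 5 rather than 7 here, since the katakana run マイコ counts as one word):

'株式会社myCompany'.match(r);   // ["株", "式", "会", "社", "myCompany"] → 5
'株式会社マイコ'.match(r);      // ["株", "式", "会", "社", "マイコ"] → 5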

Obviously there are many more characters that would have to be accounted for if you wanted to ‘do it properly’. Let's hope you don't have characters outside the basic multilingual plane, for one!

bobince