views:

222

answers:

3

I'm using a Python library called Guess Language: http://pypi.python.org/pypi/guess-language/0.1

"justwords" is a string holding Unicode text. I pass it to the library, but it always returns English, even though the web page is in Japanese. Does anyone know why? Am I not encoding it correctly?

§ç©ºéå
¶ä»æ¡å°±æ²æéç¨®å¾                                é¤ï¼æ以ä¾é裡ç¶ç
éäºï¼åæ­¤ç°å¢æ°£æ°¹³åèµ·ä¾åªè½ç®âå¾å¥½âé常好âåå                 ¶æ¯è¦é»é¤ï¼é¨ä¾¿é»çé»ã飲æãä¸ææ²»ç­åä¸å                                     便å®ï¼æ¯æ´è¥ç   äºï¼æ³æ³é裡以å°é»ãæ¯è§ä¾èªªä¹è©²æpremiumï¼åªæ±é¤é»å¥½å就好äºã<br /><br />é¦åç¾ï¼æ以就é»åå®æ´ç         æ­£è¦åä¸ä¸å
ä¸ç                           å¥é¤å§ï¼å



import guess_language

justwords = justwords.encode('utf-8')
true_lang = str(guess_language.guessLanguage(justwords))
print true_lang

Edit: Thanks, guys, for your help. Here is an update on the problem.

I am trying to "guess" the language of this: http://feeds.feedburner.com/nchild

Basically, in Python, I fetch the htmlSource. Then I strip the tags using BeautifulSoup and pass the result to the library to get the language. If I do not call encode('utf-8'), ASCII errors come up, so this seems to be a must.

from BeautifulSoup import BeautifulStoneSoup
import guess_language

soup = BeautifulStoneSoup(htmlSource)
justwords = ''.join(soup.findAll(text=True))
justwords = justwords.encode('utf-8')
true_lang = str(guess_language.guessLanguage(justwords))
A: 

It looks like you should be able to pass your unicode as-is: guessLanguage decodes a str input as UTF-8, so your .encode('utf-8') is safe but unnecessary.

I skimmed the source code and assumed that it relied exclusively on the data in its "trigrams" directory for language detection, and that it would not handle Japanese because there is no "ja" subdirectory there. That is not correct, as pointed out by John Machin. So I have to assume your input is not what you think it is (which is hard to debug, since it doesn't show up correctly in your question).
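
For instance, a quick sanity check along these lines (the sample string below is made up for illustration) should come back as "ja" when the input really is Japanese:

import guess_language

sample = u"\u3053\u3093\u306b\u3061\u306f\u4e16\u754c"       # hiragana plus two kanji
print guess_language.guessLanguage(sample)                   # expect "ja"
print guess_language.guessLanguage(sample.encode('utf-8'))   # same result; the encode() is redundant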

mzz
+1 for pointing out that `encode` is unnecessary. -1 for the "needs some trigrams added for Japanese" remark.
John Machin
+5  A: 

Looking at the main page, it says """Detects over 60 languages; Greek (el), Korean (ko), Japanese (ja), Chinese (zh) and all the languages listed in the trigrams directory. """

It doesn't use trigrams for those 4 languages; it relies on what script blocks are present in the input text. Looking at the source code:

if "Katakana" in scripts or "Hiragana" in scripts or "Katakana Phonetic Extensions" in scripts:
    return "ja"

if "CJK Unified Ideographs" in scripts or "Bopomofo" in scripts \
        or "Bopomofo Extended" in scripts or "KangXi Radicals" in scripts:
    return "zh"

For a script name like Katakana or Hiragana to appear in scripts, such characters must comprise 40% or more of the input text (after normalisation, which removes non-alphabetic characters etc). It may be that some Japanese text needs a threshold of less than 40%. However, if that were the problem with your text, I would expect it to have more than 40% kanji (CJK Unified Ideographs) and thus to return "zh" (Chinese).
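
To get a rough feel for those proportions without modifying the library, a quick approximation that counts characters by Unicode code-point range might look like this (the hard-coded ranges only approximate guess-language's own block table):

def rough_script_percentages(text):
    # Rough code-point ranges; guess-language's block table is more complete.
    blocks = {
        'Basic Latin': (0x0000, 0x007F),
        'Hiragana': (0x3040, 0x309F),
        'Katakana': (0x30A0, 0x30FF),
        'CJK Unified Ideographs': (0x4E00, 0x9FFF),
        'Hangul Syllables': (0xAC00, 0xD7AF),
    }
    counts = dict((name, 0) for name in blocks)
    total = 0
    for ch in text:
        if not ch.isalpha():        # crude stand-in for the library's normalisation
            continue
        total += 1
        for name, (lo, hi) in blocks.items():
            if lo <= ord(ch) <= hi:
                counts[name] += 1
                break
    for name, n in counts.items():
        if n:
            print "%5.1f %s" % (n * 100.0 / total, name)

Running rough_script_percentages(justwords) on the unicode text should produce figures comparable to the ones shown below.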

Update after some experimentation, including inserting a print statement to show what script blocks were detected with what percentages:

A presumably typical news item from the Asahi newspaper website:

 49.3 Hiragana
  8.7 Katakana
 42.0 CJK Unified Ideographs
result ja

A presumably atypical ditto:

 35.9 Hiragana
 49.2 CJK Unified Ideographs
 13.3 Katakana
  1.6 Halfwidth and Fullwidth Forms
result zh

(Looks like it might be a good idea to base the test on the total (Hiragana + Katakana) content)

Result of shoving the raw front page (XML, HTML, everything) through the machinery:

  2.4 Hiragana
  6.1 CJK Unified Ideographs
  0.1 Halfwidth and Fullwidth Forms
  3.7 Katakana
 87.7 Basic Latin
result ca

The high percentage of Basic Latin is of course due to the markup. I haven't investigated what made it choose "ca" (Catalan) over any other language which uses Basic Latin, including English. However the gobbledegook that you printed doesn't show any sign of including markup.

End of update

Update 2

Here's an example (2 headlines and next 4 paragraphs from this link) where about 83% of the characters are East Asian and the rest are Basic Latin but the result is en (English).

 29.6 Hiragana
 18.5 Katakana
 34.9 CJK Unified Ideographs
 16.9 Basic Latin
result en

The Basic Latin characters are caused by the use of the English names of organisations etc in the text. The Japanese rule fails because neither Katakana nor Hiragana scores 40% (together they score 48.1%). The Chinese rule fails because CJK Unified Ideographs scores less than 40%. So the 83.1% East Asian characters are ignored, and the result is decided by the 16.9% minority. These "rotten borough" rules need some reform. In general terms, the fix could be expressed as:

If (total of script blocks used by only language X) >= X-specific threshold, then select language X.

As suggested above, Hiragana + Katakana >= 40% will probably do the trick for Japanese. A similar rule may well be needed for Korean.
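
A minimal sketch of that reformed Japanese rule, written as a standalone helper over per-block percentages (the pct dict here is an assumption for illustration; guess_language keeps similar counts internally):

def looks_japanese(pct):
    # pct maps Unicode block name -> percentage of the (normalised) text,
    # like the statistics printed above.
    kana = pct.get("Hiragana", 0) + pct.get("Katakana", 0)
    # Pool the two kana blocks: 40% combined is strong evidence of Japanese,
    # even when neither block reaches 40% on its own.
    return kana >= 40

On the example above (29.6 Hiragana + 18.5 Katakana = 48.1), this returns True where the current per-block rules fall through and end up with "en".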

Your gobbledegook did actually contain a few characters of markup (I didn't scroll far enough to the right to see it), but certainly not enough to depress all the East Asian scores below 40%. So we're still waiting to see what your actual input is and how and where you got it.

End of update 2

To aid with diagnosis of your problem, please don't print gobbledegook; use

print repr(justwords)

That way anyone who is interested in actually doing debugging has something to work on. It would also help if you gave the URL of the web page and showed the Python code that you used to get your unicode justwords. Please edit your question to show those 3 pieces of information.

Update 3 Thanks for the URL. Visual inspection indicates that the language is overwhelmingly Chinese. What gave you the impression that it is Japanese?

Semi-thanks for supplying some of your code. To avoid your correspondents having to do your work for you, and to avoid misunderstandings due to guessing, you should always supply (without being asked) a self-contained script that will reproduce your problem. Note that you say you got "ASCII errors" (no exact error message! no traceback!) if you didn't do .encode('utf8') -- my code (see below) doesn't have this problem.

No thanks for not supplying the result of print repr(justwords) (even after being asked). Inspecting what intermediate data has been created is a very elementary and very effective debugging technique. This is something you should always do before asking a question. Armed with this knowledge you can ask a better question.

Using this code:

# coding: ascii
import sys
sys.path.append(r"C:\junk\wotlang\guess-language\guess_language")
import guess_language
URL = "http://feeds.feedburner.com/nchild"
from BeautifulSoup import BeautifulStoneSoup
from pprint import pprint as pp
import urllib2
htmlSource = urllib2.urlopen(URL).read()
soup = BeautifulStoneSoup(htmlSource)
fall = soup.findAll(text=True)
# pp(fall)
justwords = ''.join(fall)
# justwords = justwords.encode('utf-8')
result = guess_language.guessLanguage(justwords)
print "result", result

I got these results:

 29.0 CJK Unified Ideographs
  0.0 Extended Latin
  0.1 Katakana
 70.9 Basic Latin
result en

Note that the URL content is not static; about an hour later I got:

 27.9 CJK Unified Ideographs
  0.0 Extended Latin
  0.1 Katakana
 72.0 Basic Latin

The statistics were obtained by fiddling with the code around line 361 of guess_language.py so that it reads:

for key, value in run_types.items():
    pct = (value*100.0) / totalCount # line changed so that pct is a float
    print "%5.1f %s" % (pct, key) # line inserted
    if pct >=40:
        relevant_runs.append(key)

The statistics are symptomatic of Chinese with lots of HTML/XML/Javascript stuff (see previous example); this is confirmed by looking at the output of the pretty-print obtained by un-commenting pp(fall) -- lots of stuff like:

<img style="float:left; margin:0 10px 0px 10px;cursor:pointer; cursor:hand
;" width="60px" src="http://2.bp.blogspot.com/_LBJ4udkQZag/Rm6sTn1b7NI/AAAAAAAAA
FA/bYkSJZ3i2bg/s400/hepinge169.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_507518
3283203730642" alt="\u548c\u5e73\u6771\u8def\u4e00\u6bb5169\u865f" title="\u548c
\u5e73\u6771\u8def\u4e00\u6bb5169\u865f"/>\u4eca\u5929\u4e2d\u5348\u8d70\u523
0\u516c\u53f8\u5c0d\u9762\u76847-11\u8cb7\u98f2\u6599\uff0c\u7a81\u7136\u770b\u5
230\u9019\u500b7-11\u602a\u7269\uff01\u770b\u8d77\u4f86\u6bd4\u6a19\u6e96\u62db\
u724c\u6709\u4f5c\u7528\u7684\u53ea\u6709\u4e2d\u9593\u7684\u6307\u793a\u71c8\u8
00c\u5df2\uff0c\u53ef\u537b\u6709\u8d85\u7d1a\u5927\u7684footprint\uff01<br /
><br /><a href="http://4.bp.blogspot.com/_LBJ4udkQZag/Rm6wHH1b7QI/AA

You need to do something about the markup. Steps: look at your raw "htmlSource" in an XML browser. Is the XML non-compliant? How can you avoid ending up with untranslated &lt; etc? What elements have text content that is "English" only by virtue of being a URL or similar? Is there a problem in Beautiful[Stone]Soup? Should you be using some other functionality of Beautiful[Stone]Soup? Should you use lxml instead?
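
For what it's worth, a rough sketch of the lxml route (this is an assumption about one workable approach, not a tested fix: it takes every text node from the feed, then re-parses each node as HTML so that any entity-escaped markup inside it is stripped before guessing):

import urllib2
import lxml.etree
import lxml.html
import guess_language

URL = "http://feeds.feedburner.com/nchild"
raw = urllib2.urlopen(URL).read()
tree = lxml.etree.fromstring(raw)

pieces = []
for text in tree.itertext():
    text = text.strip()
    if not text:
        continue
    # Entry bodies arrive as entity-escaped HTML; parse them again and keep
    # only their text content, discarding tags, attribute URLs, etc.
    try:
        pieces.append(lxml.html.fromstring(text).text_content())
    except lxml.etree.ParserError:
        pieces.append(text)

justwords = u' '.join(pieces)
print guess_language.guessLanguage(justwords)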

I'd suggest some research followed by a new SO question.

End of update 3

John Machin
Thanks for your help. I've updated my post to show the URL whose language I want determined.
TIMEX
Hi John Machin--how were you able to get the scores and percentages? Could you paste your code? Thanks.
TIMEX
+1 for the effort
egarcia
A: 

Google says your example is in Chinese. They have a (much more advanced) web service to translate text and guess the language.

They have an API and code examples for Python.
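
As a rough, untested sketch (the endpoint and response fields below are recalled from the AJAX Language API documentation of the time and should be double-checked against Google's current docs):

import json      # use simplejson on Python < 2.6
import urllib
import urllib2

def google_detect(text):
    # Assumed endpoint of the Google AJAX Language API's detect service.
    url = ("http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&q="
           + urllib.quote(text.encode('utf-8')))
    response = json.load(urllib2.urlopen(url))
    return response["responseData"]["language"]

print google_detect(u"\u4eca\u5929\u4e2d\u5348\u8d70\u5230\u516c\u53f8")   # expect something like "zh-TW"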

THC4k
I can't use Google's API because I will constantly be hitting it.
TIMEX