Along the lines of my previous question, http://stackoverflow.com/questions/1263796/how-do-i-convert-unicode-characters-to-floats-in-python , I would like to find a more elegant solution to calculating the value of a string that contains unicode numeric values.

For example, take the strings "1⅕" and "1 ⅕". I would like these to resolve to 1.2

I know that I can iterate through the string by character, check for unicodedata.category(x) == "No" on each character, and convert the unicode characters by unicodedata.numeric(x). I would then have to split the string and sum the values. However, this seems rather hacky and unstable. Is there a more elegant solution for this in Python?

+1  A: 
>>> import unicodedata
>>> b = '10 ⅕'
>>> int(b[:-1]) + unicodedata.numeric(b[-1])
10.2

def convert_dubious_strings(s):
    try:
        return int(s)
    except ValueError:  # UnicodeEncodeError is a subclass of ValueError
        # assumes the last character is the unicode fraction
        return int(s[:-1]) + unicodedata.numeric(s[-1])

And if the string might have no integer part, then another try-except sub-block needs to be added.
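
A rough sketch of that extra branch, assuming Python 3 (where every str is unicode) and assuming the fraction, if present, is the last character. The name convert_mixed is made up for this example and is not part of the answer above:

```python
import unicodedata

def convert_mixed(s):
    """Return the value of an optional integer followed by an optional
    trailing unicode fraction character (hypothetical helper)."""
    s = s.strip()
    if not s:
        raise ValueError("empty string")
    try:
        # Plain integers such as "11" are handled here.
        return int(s)
    except ValueError:
        pass
    # Assumption: the last character is the fraction, the rest is the integer part.
    head, last = s[:-1].strip(), s[-1]
    fraction = unicodedata.numeric(last)  # raises ValueError if not numeric
    whole = int(head) if head else 0      # "⅕" alone has no integer part
    return whole + fraction

print(convert_mixed("10 ⅕"))
print(convert_mixed("⅕"))
print(convert_mixed("11"))
```

It still trusts the position of the fraction, so a string with a fraction in the middle is rejected by int() rather than summed.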

SilentGhost
The problem with that solution is that it doesn't allow for multiple character numbers. What if, instead, the string was "10⅕"? That would then yield 1 + 0 + 0.2 = 1.2, where the correct answer is 10.2
Paul
2.0 == sum(unicodedata.numeric(i) for i in u"11"), so that doesn't solve the OP's problem of calculating the "value" of a string. Very good start nonetheless.
Mark Rushakoff
Wow... I did not even know that such a module existed.
D.Shawley
Your solution now requires advance knowledge of the string -- it assumes that there is a unicode character at the end, and would not work in the other example case of "11".
Paul
For goodness' sake! Your question explicitly states that you know that the unicode component is in the string and at the end of it!
SilentGhost
A: 

I think you'll need a regular expression, explicitly listing the characters that you want to support. Not all numerical characters are suitable for the kind of composition that you envision - for example, what should be the numerical value of

u"4\N{CIRCLED NUMBER FORTY TWO}2\N{SUPERSCRIPT SIX}"

???

Do

import unicodedata
for i in range(65536):
  if unicodedata.category(unichr(i)) == 'No':
      print hex(i), unicodedata.name(unichr(i))

and go through the list defining which ones you really want to support.
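
As an illustration of the regular-expression idea, here is a minimal sketch for Python 3. The whitelist FRACTIONS, the pattern, and the name parse_mixed_number are assumptions chosen for this example; the set of characters you actually accept is up to you:

```python
import re
import unicodedata

# Hypothetical whitelist: only the vulgar fractions we decide to support.
FRACTIONS = "¼½¾⅓⅔⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞"

# Optional ASCII integer, optional whitespace, optional single fraction.
NUMBER_RE = re.compile(r"\s*([0-9]+)?\s*([" + FRACTIONS + r"])?\s*")

def parse_mixed_number(text):
    match = NUMBER_RE.fullmatch(text)
    # Reject anything outside the pattern, and strings with neither part.
    if match is None or (match.group(1) is None and match.group(2) is None):
        raise ValueError("not a supported mixed number: %r" % (text,))
    whole = int(match.group(1)) if match.group(1) else 0
    frac = unicodedata.numeric(match.group(2)) if match.group(2) else 0.0
    return whole + frac

print(parse_mixed_number("1⅕"))
print(parse_mixed_number("1 ⅕"))
print(parse_mixed_number("10⅕"))
```

With this pattern, a string such as u"4\N{CIRCLED NUMBER FORTY TWO}2\N{SUPERSCRIPT SIX}" raises ValueError instead of producing a guess.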

Martin v. Löwis
A: 

This might be sufficient for you, depending on the strange edge cases you want to deal with:

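import unicodedata

val = 0
for c in my_unicode_string:
    # c is already a character, so pass it to category() directly
    if unicodedata.category(c) == 'No':
        cval = unicodedata.numeric(c)
    elif c.isdigit():
        cval = int(c)
    else:
        continue
    # a whole value is treated as the next digit; a fraction is just added
    if cval == int(cval):
        val *= 10
    val += cval
print val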

Whole digits are assumed to be another digit in the number, fractional characters are assumed to be fractions to add to the number. Doesn't do the right thing with spaces between digits, repeated fractions, etc.

Ned Batchelder
+2  A: 

I think this is what you want...

import unicodedata
def eval_unicode(s):
    #sum all the unicode fractions
    u = sum(map(unicodedata.numeric, filter(lambda x: unicodedata.category(x)=="No",s)))
    #eval the regular digits (with optional dot) as a float, or default to 0
    n = float("".join(filter(lambda x:x.isdigit() or x==".", s)) or 0)
    return n+u

or the "comprehensive" solution, for those who prefer that style:

import unicodedata
def eval_unicode(s):
    #sum all the unicode fractions
    u = sum(unicodedata.numeric(i) for i in s if unicodedata.category(i)=="No")
    #eval the regular digits (with optional dot) as a float, or default to 0
    n = float("".join(i for i in s if i.isdigit() or i==".") or 0)
    return n+u

But beware, there are many unicode characters that don't seem to have a numeric value assigned in Python (for example, ⅜ and ⅝ don't work... or maybe it's just a problem with my keyboard xD).
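
One way to rule out the keyboard or the source encoding is to spell the characters as escapes. A small check, assuming Python 3; U+215C and U+215D are the code points for ⅜ and ⅝, and the variable name values is only for this example:

```python
import unicodedata

# Look the characters up by code point so no typing or encoding is involved.
values = {}
for ch in ("\u215c", "\u215d"):
    values[ch] = unicodedata.numeric(ch)
    print(unicodedata.name(ch), unicodedata.category(ch), values[ch])
```

If this prints numeric values but the pasted characters fail, the input encoding is the more likely culprit than the unicode database.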

Another note on the implementation: it's "too robust". It will work even with malformed numbers like "123½3 ½" and will eval it to 1234.0... but it won't work if there is more than one dot.

fortran