ansaurus

Question

How do I calculate the numeric value of a string with unicode components in python?

Answer 1

+1 A:

>>> import unicodedata
>>> b = '10 ⅕'
>>> int(b[:-1]) + unicodedata.numeric(b[-1])
10.2

def convert_dubious_strings(s):
    try:
        return int(s)
    except ValueError:  # UnicodeEncodeError is a subclass of ValueError
        return int(s[:-1]) + unicodedata.numeric(s[-1])

And if it might have no integer part, then another try-except sub-block needs to be added.
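
A minimal Python 3 sketch of that extra case, assuming the string is an optional integer followed by at most one trailing numeric character. The name `convert_dubious_string` and the choice to treat a missing integer part as 0 are illustrative assumptions, not part of the original answer:

```python
import unicodedata

def convert_dubious_string(s):
    """Value of an optional integer followed by one optional fraction character."""
    s = s.strip()
    try:
        # Plain integers such as "11" need no special handling.
        return int(s)
    except ValueError:
        pass
    head, last = s[:-1].strip(), s[-1]
    # unicodedata.numeric raises ValueError if `last` has no numeric value.
    fraction = unicodedata.numeric(last)
    # An empty head means the string held only the fraction character.
    return (int(head) if head else 0) + fraction
```

It still assumes the fraction, if any, is the last character, so it shares the limitation raised in the comments below.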

SilentGhost 2009-08-12 16:44:53

The problem with that solution is that it doesn't allow for multiple character numbers. What if, instead, the string was "10⅕"? That would then yield 1 + 0 + 0.2 = 1.2, where the correct answer is 10.2

Paul 2009-08-12 16:47:32

2.0 == sum(unicodedata.numeric(i) for i in u"11"), so that doesn't solve the OP's problem of calculating the "value" of a string. Very good start nonetheless.

Mark Rushakoff 2009-08-12 16:48:40

Wow... I did not even know that such a module existed.

D.Shawley 2009-08-12 16:53:35

Your solution now requires advance knowledge of the string -- it assumes that there is a unicode character at the end, and would not work in the other example case of "11".

Paul 2009-08-12 16:56:32

For goodness' sake! Your question explicitly states that you know the unicode component is in the string, and at the end of it!

SilentGhost 2009-08-12 17:02:54

Answer 2

A:

I think you'll need a regular expression, explicitly listing the characters that you want to support. Not all numerical characters are suitable for the kind of composition that you envision - for example, what should be the numerical value of

u"4\N{CIRCLED NUMBER FORTY TWO}2\N{SUPERSCRIPT SIX}"

???

Do

import unicodedata

for i in range(65536):
    if unicodedata.category(unichr(i)) == 'No':
        print hex(i), unicodedata.name(unichr(i))

and go through the list defining which ones you really want to support.
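
One way the whitelist idea could look in Python 3, as a rough sketch. The set of supported fractions, the accepted layout (optional ASCII integer, optional whitespace, at most one fraction) and the names `SUPPORTED_FRACTIONS`, `NUMBER_RE` and `parse_mixed_number` are all assumptions made for illustration:

```python
import re
import unicodedata

# Hypothetical whitelist: only these fraction characters are accepted.
SUPPORTED_FRACTIONS = "¼½¾⅓⅔⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞"

# Optional ASCII integer part, optional whitespace, optional single fraction.
NUMBER_RE = re.compile(r"([0-9]+)?\s*([%s])?" % SUPPORTED_FRACTIONS)

def parse_mixed_number(s):
    match = NUMBER_RE.fullmatch(s.strip())
    # Reject anything off the whitelist, and the empty string.
    if match is None or not (match.group(1) or match.group(2)):
        raise ValueError("unsupported number: %r" % s)
    whole, fraction = match.groups()
    value = int(whole) if whole else 0
    if fraction:
        value += unicodedata.numeric(fraction)
    return value
```

With this pattern the ambiguous string above is rejected with a ValueError instead of being given a guessed value.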

Martin v. Löwis 2009-08-12 16:59:56

Answer 3

A:

This might be sufficient for you, depending on the strange edge cases you want to deal with:

import unicodedata

val = 0
for c in my_unicode_string:
    # c is already a character, so it can be passed to category() directly
    if unicodedata.category(c) == 'No':
        cval = unicodedata.numeric(c)
    elif c.isdigit():
        cval = int(c)
    else:
        continue
    if cval == int(cval):
        val *= 10
    val += cval
print val

Whole digits are assumed to be another digit in the number; fractional characters are assumed to be fractions to add to the number. This doesn't do the right thing with spaces between digits, repeated fractions, etc.
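
To make those edge cases concrete, here is the same loop wrapped in a function for Python 3. The name `accumulate_value` and the sample inputs are only illustrative:

```python
import unicodedata

def accumulate_value(text):
    """Fold the characters of `text` into one number, one character at a time."""
    val = 0
    for c in text:
        if unicodedata.category(c) == "No":
            cval = unicodedata.numeric(c)  # fractions, superscripts, circled numbers
        elif c.isdigit():
            cval = int(c)
        else:
            continue                       # spaces and anything else are skipped
        if cval == int(cval):
            val *= 10                      # whole values count as the next digit
        val += cval
    return val

print(accumulate_value("10⅕"))  # 10.2
print(accumulate_value("1 0"))  # 10   (the space is silently ignored)
print(accumulate_value("½½"))   # 1.0  (repeated fractions are simply added)
print(accumulate_value("⅕10"))  # 30.0 (a leading fraction gets multiplied up)
```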

Ned Batchelder 2009-08-12 17:11:06

Answer 4

+2 A:

I think this is what you want...

import unicodedata
def eval_unicode(s):
    #sum all the unicode fractions
    u = sum(map(unicodedata.numeric, filter(lambda x: unicodedata.category(x)=="No",s)))
    #eval the regular digits (with optional dot) as a float, or default to 0
    n = float("".join(filter(lambda x:x.isdigit() or x==".", s)) or 0)
    return n+u

or the "comprehensive" solution, for those who prefer that style:

import unicodedata
def eval_unicode(s):
    #sum all the unicode fractions
    u = sum(unicodedata.numeric(i) for i in s if unicodedata.category(i)=="No")
    #eval the regular digits (with optional dot) as a float, or default to 0
    n = float("".join(i for i in s if i.isdigit() or i==".") or 0)
    return n+u

But beware: there are many unicode values that seem not to have a numeric value assigned in Python (for example, ⅜ and ⅝ don't work... or maybe it's just a problem with my keyboard xD).

Another note on the implementation: it's "too robust". It will work even with malformed numbers like "123½3 ½" and will eval it to 1234.0... but it won't work if there is more than one dot.
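
A quick way to check both notes, sketched for Python 3 with the function repeated so the snippet runs on its own. On a current CPython the `unicodedata` tables do appear to carry values for ⅜ and ⅝, so the failure described above may well have been an input problem rather than missing data:

```python
import unicodedata

def eval_unicode(s):
    # Sum every character in category "No" (fractions and similar).
    u = sum(unicodedata.numeric(i) for i in s if unicodedata.category(i) == "No")
    # Join the remaining digits and dots and parse them as a float, default 0.
    n = float("".join(i for i in s if i.isdigit() or i == ".") or 0)
    return n + u

print(unicodedata.numeric("⅜"), unicodedata.numeric("⅝"))  # 0.375 0.625
print(eval_unicode("123½3 ½"))                             # 1234.0

try:
    eval_unicode("1.2.3")
except ValueError as exc:
    # float() refuses a string with more than one dot.
    print("rejected:", exc)
```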

fortran 2009-08-12 17:14:17
