I am parsing a webpage which has Unicode representations of fractions. I would like to be able to take those strings directly and convert them to floats. For example:

"⅕" would become 0.2

Any suggestions of how to do this in Python?

+1  A: 

Since there are only a fixed number of fractions defined in Unicode, a dictionary seems appropriate:

Fractions = {
    u'¼': 0.25,
    u'½': 0.5,
    u'¾': 0.75,
    u'⅕': 0.2,
    # add any other fractions here
}

Update: the unicodedata module is a much better solution.

Greg Hewgill
Specifically, you're looking at characters U+00BC through U+00BE (http://www.unicode.org/charts/PDF/U0080.pdf) and U+2153 through U+215E (http://www.unicode.org/charts/PDF/U2150.pdf). Just search the index (http://www.unicode.org/Public/UNIDATA/Index.txt) for "vulgar".
Ben Blank
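
The "search for vulgar" idea can also be done programmatically. This is a minimal sketch, not part of the original answers: it scans every code point, keeps the ones whose Unicode name contains "VULGAR FRACTION", and fills the lookup table automatically. The function name vulgar_fraction_table is hypothetical, and the exact set of characters found depends on the Unicode version bundled with your Python.

```python
import sys
import unicodedata


def vulgar_fraction_table():
    """Map each character named '... VULGAR FRACTION ...' to its float value."""
    table = {}
    for code_point in range(sys.maxunicode + 1):
        char = chr(code_point)
        # name() returns the given default ("") for unnamed code points.
        name = unicodedata.name(char, "")
        if "VULGAR FRACTION" in name:
            table[char] = unicodedata.numeric(char)
    return table


fractions = vulgar_fraction_table()
for char, value in sorted(fractions.items()):
    print(repr(char), unicodedata.name(char), value)
```

Building the table this way avoids typing the characters by hand, at the cost of a one-time scan over the whole code space.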
+15  A: 

You want to use the unicodedata module:

import unicodedata
unicodedata.numeric(u'⅕')

Evaluated at the interactive prompt, this returns:

0.20000000000000001

If the character has no numeric value, unicodedata.numeric(unichr[, default]) returns default; if default is not given, it raises ValueError.
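
A short sketch of both behaviours, written for Python 3 (so no u prefix is needed). The wrapper name to_float is hypothetical and only there for illustration; the one library call used is unicodedata.numeric.

```python
import unicodedata


def to_float(char, default=None):
    """Return the numeric value of one character, or `default` if it has none."""
    return unicodedata.numeric(char, default)


print(to_float("⅕"))        # a character with a numeric value
print(to_float("x"))        # no numeric value, so the default (None) comes back
print(to_float("x", -1.0))  # same, with an explicit fallback

# Without a default, a non-numeric character raises ValueError.
try:
    unicodedata.numeric("x")
except ValueError as exc:
    print("ValueError:", exc)
```

Passing a default is handy when scanning scraped text, where most characters are not numeric at all.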

Karl Voigtland
Hey, that's pretty cool!
Greg Hewgill
Python should get a new slogan by borrowing from Apple: "There's a module for that".
John Fouhy
Yup, batteries included.
Karl Voigtland
I didn't realize until I read the docs that ftp.unicode.org hosts a UnicodeData.txt file, which is where the unicodedata module gets all its data.
Karl Voigtland
I had no idea that you could do that!
mhawke
Neither did I. That's truly amazing.
Martin Beckett
For the morbidly curious, it seems the Python implementation of numeric is basically just a big lookup table; see python/trunk/Objects/unicodectype.c. Also, there are obviously a lot more Unicode characters with numeric values than just the standard fractions ... check out http://www.fileformat.info/info/unicode/char/0f2e/index.htm for example!
akent
+1  A: 

Maybe you could decompose the fraction using the "unicodedata" module, look for the FRACTION SLASH character, and then it's just a matter of simple division.

For example:

>>> import unicodedata
>>> unicodedata.lookup('VULGAR FRACTION ONE QUARTER')
u'\xbc'
>>> unicodedata.decomposition(unicodedata.lookup('VULGAR FRACTION ONE QUARTER'))
'<fraction> 0031 2044 0034'

Update: I'll leave this answer here for reference but using unicodedata.numeric() as per Karl's answer is a much better idea.
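
For reference, the decomposition idea could be completed roughly as follows. This is a sketch in Python 3 under the assumption that the character's decomposition has the form '<fraction> NUMERATOR 2044 DENOMINATOR', as in the ONE QUARTER example above; the function name fraction_via_decomposition is hypothetical.

```python
import unicodedata

FRACTION_SLASH = "\u2044"


def fraction_via_decomposition(char):
    """Convert a fraction character to a float by splitting its decomposition."""
    fields = unicodedata.decomposition(char).split()
    if not fields or fields[0] != "<fraction>":
        raise ValueError("%r has no <fraction> decomposition" % char)
    # Turn the hex code points back into text, e.g. '1', FRACTION SLASH, '4'.
    text = "".join(chr(int(code_point, 16)) for code_point in fields[1:])
    numerator, _, denominator = text.partition(FRACTION_SLASH)
    # int() raises ValueError if either side is missing or not a number.
    return int(numerator) / int(denominator)


print(fraction_via_decomposition("¼"))
print(fraction_via_decomposition("⅒"))
```

It handles multi-digit denominators such as ONE TENTH, but it is more code than unicodedata.numeric() for the same result.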

akent
+1  A: 

In Python 3.1, you don't need the 'u' prefix, and it will produce 0.2 instead of 0.20000000000000001.

>>> import unicodedata
>>> unicodedata.numeric('⅕')
0.2
Selinap
assert (0.2 == 0.20000000000000001) ... What you possibly meant to say is that the float produced by unicodedata.numeric() has NOT changed, but repr() has been enhanced to produce a less frightening but still computationally equivalent answer where possible.
John Machin
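
The point about repr() can be checked directly. A small sketch, assuming an interpreter with the shorter float repr (Python 3.1 or later): the value is the same float either way, and only the number of digits shown differs.

```python
import unicodedata

value = unicodedata.numeric("⅕")

# Both literals denote the same double, so the comparison holds.
print(value == 0.2 == 0.20000000000000001)

# repr() shows the shortest string that round-trips to the same float.
print(repr(value))

# Asking for 17 significant digits recovers the older, longer display.
print(format(value, ".17g"))
```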