When creating a UUID in Python, like so:

>>> uuid.uuid1()
UUID('a8098c1a-f86e-11da-bd1a-00112444be1e')

How could one map that UUID into a string made up of the capitalized alphabet A-Z minus the characters D, F, I, O, Q, and U, plus the numerical digits, plus the characters "+" and "="? That is, how would one map an integer or string onto this set of 32 (relatively OCR-friendly) characters:

[ABCEGHJKLMNPRSTVWXYZ1234567890+=]

I'll call this the OCRf set (for OCR friendly).

I'd like to have an isomorphic function:

def uuid_to_ocr_friendly_chars(uid):
    """Take uid, an integer, and transpose it into a string made
       of the OCRf set.
    """
    ...

My first thought is to go through the process of changing the uuid to base 32. e.g.

OCRf = "ABCEGHJKLMNPRSTVWXYZ1234567890+="

def uuid_to_ocr_friendly_chars(uid):
    ocfstr = ''
    while uid > 0:                 # stop only once every digit is consumed
        ocfstr += OCRf[uid % 32]   # least-significant digit first
        uid //= 32                 # integer division
    return ocfstr

However, I'd like to know if this method is the best and fastest way to go about this conversion - or if there's a simpler and faster method (e.g. a builtin, a smarter algorithm, or just a better method).

I'm grateful for your input. Thank you.

+2  A: 

How important is it to you to "squeeze" the representation by 18.75%, i.e., from 32 to 26 characters? Because, if saving this small percentage of bytes isn't absolutely crucial, something like uid.hex.upper().replace('D','Z') will do what you ask (not using the whole alphabet you make available, but the only cost of this is missing that 18.75% "squeezing").

If squeezing down every last byte is crucial, I'd work on substrings of 20 bits each -- that's 5 hex characters, 4 characters in your funky alphabet. There are 6 of those (plus 8 bits left over, for which you can take the hex.upper().replace as above since there's nothing to gain in doing anything fancier). You can easily get the substrings by slicing .hex and turn each into an int with an int(theslice, 16). Then, you can basically apply the same algorithm you're using above -- but the arithmetic is all done on much-smaller numbers, so the speed gain should be material. Also, don't build the string by looping on += -- make a list of all the "digits", and ''.join them all at the end -- that's also a performance improvement.
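A minimal sketch of the chunked approach described above, assuming Python 3 and a real `uuid.UUID` object. The names `OCRF` and `encode_chunked` are made up for illustration, and emitting each chunk most-significant digit first is an assumption, since the description doesn't fix a digit order.

```python
import uuid

OCRF = "ABCEGHJKLMNPRSTVWXYZ1234567890+="  # the 32-character OCRf set

def encode_chunked(u):
    """Encode a uuid.UUID as 24 OCRf characters plus a 2-character hex tail."""
    hexstr = u.hex  # 32 lowercase hex characters, no dashes
    digits = []
    # Six chunks of 5 hex characters (20 bits) each cover the first 120 bits.
    for start in range(0, 30, 5):
        n = int(hexstr[start:start + 5], 16)
        # 20 bits -> four 5-bit digits, most significant first (assumed order).
        digits.extend(OCRF[(n >> shift) & 31] for shift in (15, 10, 5, 0))
    # The remaining 8 bits stay as two hex characters with 'D' replaced by 'Z'.
    digits.append(hexstr[30:].upper().replace('D', 'Z'))
    return ''.join(digits)

print(encode_chunked(uuid.UUID('a8098c1a-f86e-11da-bd1a-00112444be1e')))
```

The result is always 26 characters long. Note that the two-character hex tail can still contain 'F', which the OCRf set excludes, so a stricter variant would need to remap that character as well.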

Alex Martelli
Agreed re. space squeezing, good point, though there's an (astronomically remote) possibility of collisions with a .replace('O','D')/etc. The more important point would be to have a reduced, albeit "funky", alphabet that uses fewer visually ambiguous characters (e.g. "D","O","Q", and "0").
Brian M. Hunt
@Brian, I don't see what "collisions" could occur if you just use `uid.hex.upper().replace('D', 'Z')`. 'D' is the only character in the hex set potentially confusable with another ('0', the digit zero)
Alex Martelli
@Alex: Oh sorry -- I was thinking the algorithm suggested in the second paragraph would apply `replace('D','Z')` to the 20 bit substrings.
Brian M. Hunt
+1  A: 
>>> OCRf = 'ABCEGHJKLMNPRSTVWXYZ1234567890+='
>>> uuid = 'a8098c1a-f86e-11da-bd1a-00112444be1e'
>>> binstr = bin(int(uuid.replace("-",""),16))[2:].zfill(130)
>>> ocfstr = "".join(OCRf[int(binstr[i:i+5],2)] for i in range(0,130,5))
>>> ocfstr
'HLBJJB2+ETCKSP7JWACGYGMVW+'

To convert back again

>>> "%x"%(int("".join(bin(OCRf.index(i))[2:].zfill(5) for i in ocfstr),2))
'a8098c1af86e11dabd1a00112444be1e'
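The same 5-bit grouping can also be sketched with shifts and masks on the UUID's integer value, which avoids building the intermediate binary string. This assumes Python 3 and a `uuid.UUID` object; the function names `uuid_to_ocrf` and `ocrf_to_uuid` are hypothetical. One caveat with the `"%x"` one-liner above is that it drops leading zeros, whereas rebuilding a `UUID` from the integer keeps all 32 hex digits.

```python
import uuid

OCRF = "ABCEGHJKLMNPRSTVWXYZ1234567890+="  # the 32-character OCRf set

def uuid_to_ocrf(u):
    """Encode a uuid.UUID as 26 OCRf characters, most significant group first."""
    n = u.int
    # 26 groups of 5 bits cover 130 bits; the top 2 bits are always zero.
    return "".join(OCRF[(n >> shift) & 31] for shift in range(125, -1, -5))

def ocrf_to_uuid(s):
    """Invert uuid_to_ocrf, rebuilding the UUID from its integer value."""
    n = 0
    for ch in s:
        n = (n << 5) | OCRF.index(ch)
    return uuid.UUID(int=n)

u = uuid.UUID('a8098c1a-f86e-11da-bd1a-00112444be1e')
encoded = uuid_to_ocrf(u)
print(encoded)                # HLBJJB2+ETCKSP7JWACGYGMVW+
print(ocrf_to_uuid(encoded))  # a8098c1a-f86e-11da-bd1a-00112444be1e
```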
gnibbler
There's no need for the fanciness with binstr - you can just fetch the .bytes property on a UUID to get its binary representation.
Nick Johnson
@Nick Johnson, Can you explain what you mean? I don't see how I can regroup the `.bytes` as base 32
gnibbler
Just encode it using base32, or any of the other encoding schemes suggested here. My point is that if you have a real UUID object, the third line of your snippet can be replaced with just "uuid.bytes".
Nick Johnson
+1  A: 
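# Python 2 only: relies on string.maketrans, str.decode('hex') and the print statement.
import base64
import string
import uuid

# Map the standard base32 alphabet onto the OCRf set.
transtbl = string.maketrans(
  'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567',
  'ABCEGHJKLMNPRSTVWXYZ1234567890+='
)

uuidstr = uuid.uuid1()

print base64.b32encode(str(uuidstr).replace('-', '').decode('hex')).rstrip('=').translate(transtbl)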

Yes, this method does make me a bit ill, thanks for asking.
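The snippet above relies on Python 2 features (`string.maketrans`, the `print` statement, `.decode('hex')`). A rough Python 3 sketch of the same idea follows, using the `.bytes` attribute of the UUID instead of the hex round trip. The function names are hypothetical, and the decoder is an addition that assumes the input came from the matching encoder.

```python
import base64
import uuid

B32 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"    # standard base32 alphabet
OCRF = "ABCEGHJKLMNPRSTVWXYZ1234567890+="   # the 32-character OCRf set
TO_OCRF = str.maketrans(B32, OCRF)
FROM_OCRF = str.maketrans(OCRF, B32)

def uuid_to_ocrf_b32(u):
    """base32-encode the 16 UUID bytes, strip padding, then swap alphabets."""
    raw = base64.b32encode(u.bytes).decode("ascii")
    # Strip the '=' padding before translating, because '=' is also an OCRf character.
    return raw.rstrip("=").translate(TO_OCRF)

def ocrf_b32_to_uuid(s):
    """Swap alphabets back, restore the base32 padding, and decode."""
    raw = s.translate(FROM_OCRF)
    raw += "=" * (-len(raw) % 8)  # base32 input length must be a multiple of 8
    return uuid.UUID(bytes=base64.b32decode(raw))

u = uuid.uuid1()
encoded = uuid_to_ocrf_b32(u)
print(encoded, ocrf_b32_to_uuid(encoded) == u)
```

Because base32 pads the final group on the right, the resulting 26-character string differs from one produced by zero-filling the binary representation on the left, so an encoder and decoder must come from the same scheme.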

Ignacio Vazquez-Abrams