I am trying to learn Python and couldn't figure out how to translate the following Perl script to Python:

#!/usr/bin/perl -w                     

use open qw(:std :utf8);

while(<>) {
  s/\x{00E4}/ae/;
  s/\x{00F6}/oe/;
  s/\x{00FC}/ue/;
  print;
}

The script just replaces Unicode umlauts with alternative ASCII spellings, so that the complete output is ASCII. I would be grateful for any hints. Thanks!

+1  A: 
  • Use the fileinput module to loop over standard input or a list of files,
  • decode the lines you read from UTF-8 to unicode objects,
  • then map any unicode characters you desire with the translate method.

translit.py would look like this:

#!/usr/bin/env python2.6
# -*- coding: utf-8 -*-

import fileinput

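# Maps Unicode code points to replacements; a value of None deletes the character.
table = {
          0xe4: u'ae',        # ä, given directly as a code point
          ord(u'ö'): u'oe',
          ord(u'ü'): u'ue',
          ord(u'ß'): None,    # ß is removed from the output
        }

for line in fileinput.input():
    s = line.decode('utf8')
    # Trailing comma: the line already ends with its own newline.
    print s.translate(table),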

And you could use it like this:

$ cat utf8.txt 
sömé täßt
sömé täßt
sömé täßt

$ ./translit.py utf8.txt 
soemé taet
soemé taet
soemé taet
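
The script above targets Python 2. As a rough sketch only, the same three steps could look like this in Python 3, where `str` is already Unicode. The names `TABLE`, `transliterate` and `to_ascii` are made up for illustration, and mapping ß to 'ss' (instead of deleting it) is an assumption about what is wanted, not something the original Perl script does:

```python
# Hypothetical Python 3 sketch; names and the 'ss' mapping are assumptions.
TABLE = str.maketrans({
    'ä': 'ae',
    'ö': 'oe',
    'ü': 'ue',
    'ß': 'ss',  # assumption: transliterate ß rather than delete it
})


def transliterate(text):
    """Replace the characters listed in TABLE; everything else passes through."""
    return text.translate(TABLE)


def to_ascii(text):
    """Transliterate, then drop whatever is still outside ASCII (e.g. 'é')."""
    return transliterate(text).encode('ascii', 'ignore').decode('ascii')


sample = 'sömé täßt'
print(transliterate(sample))  # soemé taesst
print(to_ascii(sample))       # soem taesst
```

To process files or standard input as in translit.py, one option is to loop over `fileinput.input(mode='rb')` and call `transliterate(line.decode('utf-8'))` on each line.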
hop
And to get ascii output the last line should be `print s.translate(table).encode('ascii', 'ignore')`, I guess.
Frank
Strictly speaking, the original .pl doesn't do that either, but yes, that would be one solution.
hop
The objective appears to be de-umlauting German text, leaving it understandable. The effect of `ord(u'ß'): None` in this code is to **delete** the ß ("eszett") character. It should be `ord(u'ß'): u'ss'`. Upvotes?? Accepted answer???
John Machin
oh. come. on. i tried to show the different possibilities for the map.
hop
You chose a very bad example of how to do something that the OP didn't indicate that he wanted or needed.
John Machin
@john: if you would take the OP's question literally together with his comment above ('ignore'), it would have the _exact_ _same_ outcome, so stop nitpicking already.
hop
+1  A: 

For converting to ASCII you might want to try ASCII, Dammit or this recipe, which boils down to:

>>> title = u"Klüft skräms inför på fédéral électoral große"
>>> import unicodedata
>>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'
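
The session above is Python 2, and the recipe drops ß entirely ('groe'). A possible Python 3 sketch that applies a German-specific table first and only then falls back to NFKD stripping is shown below. The names `GERMAN`, `strip_accents` and `german_to_ascii` are hypothetical, and treating every ä/ö/ü as German (the sample title mixes languages) is an assumption:

```python
import unicodedata

# Hypothetical table; assumes German spelling conventions for all umlauts.
GERMAN = str.maketrans({'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss'})


def strip_accents(text):
    """The recipe from the answer: decompose, then drop non-ASCII code points."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def german_to_ascii(text):
    """Transliterate German characters first, then strip remaining accents."""
    # NFC first, so that 'u' + combining diaeresis is seen as a single 'ü'.
    composed = unicodedata.normalize('NFC', text)
    return strip_accents(composed.translate(GERMAN))


title = "Klüft skräms inför på fédéral électoral große"
print(strip_accents(title))    # Kluft skrams infor pa federal electoral groe
print(german_to_ascii(title))  # Klueft skraems infoer pa federal electoral grosse
```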
Ian Bicking
which does not at all do what the original .pl does (mainly, properly transliterating German special characters)
hop