Hello, I'm learning Python and PyGTK, and have created a simple music organizer: http://pastebin.com/m2b596852. But when it renames songs whose tags contain the Norwegian letters æ, ø, and å, it replaces those letters with a weird character.

So is there a good way to read the names, or to encode them, as UTF-8 characters?

Two relevant places from the above code:

Read info from a file:

def __parse(self, filename):
    "parse ID3v1.0 tags from MP3 file"
    self.clear()
    self['artist'] = 'Unknown'
    self['title'] = 'Unknown'
    try:
        fsock = open(filename, "rb", 0)
        try:
            fsock.seek(-128, 2)
            tagdata = fsock.read(128)
        finally:
            fsock.close()
        if tagdata[:3] == 'TAG':
            for tag, (start, end, parseFunc) in self.tagDataMap.items():
                self[tag] = parseFunc(tagdata[start:end])
    except IOError:
        pass

Rename the files and print info to sys.stdout:

for info in files:
    try:
        os.rename(info['name'],
            os.path.join(self.dir, info['artist'])+' - '+info['title']+'.mp3')

        print 'From: '+ info['name'].replace(os.path.join(self.dir, ''), '')
        print 'To:   '+ info['artist'] +' - '+info['title']+'.mp3'
        print
        # float() avoids Python 2 integer division, which would truncate the fraction
        self.progressbar.set_fraction(float(i)/num)
        self.progressbar.set_text('File %d of %d' % (i, num))
        i += 1
    except OSError:  # os.rename raises OSError, not IOError
        print 'Rename fail'
+1  A: 

You'd need to convert the byte strings you read from the file into Unicode character strings. Looking at your code, I would do this in the parsing function, i.e. replace stripnulls with something like this:

import codecs

def stripnulls_and_decode(data):
    # codecs.utf_8_decode returns a (text, bytes_consumed) tuple; keep only the text
    return codecs.utf_8_decode(data.replace("\00", ""))[0].strip()

Note that this will only work if the strings in the file are in fact encoded in UTF-8 - if they're in a different encoding, you'd have to use the corresponding decoding function from the codecs module.
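To make that last point concrete, here is a minimal sketch of the same idea with the encoding passed in as a parameter. The name `strip_nulls_and_decode` and its `encoding` argument are hypothetical, not part of the original script, and the snippet uses byte literals so that it runs on both Python 2.6+ and Python 3.

```python
def strip_nulls_and_decode(data, encoding="utf-8"):
    """Drop NUL padding from a tag field, decode it to text, trim whitespace.

    `encoding` is an assumption the caller has to supply; the bytes
    themselves do not say which encoding they are in.
    """
    return data.replace(b"\x00", b"").decode(encoding).strip()


# "Bjørn" stored as UTF-8 bytes, padded with NULs the way ID3v1 fields are.
utf8_field = b"Bj\xc3\xb8rn" + b"\x00" * 10
# The same name stored as ISO-8859-1 bytes.
latin1_field = b"Bj\xf8rn" + b"\x00" * 10

from_utf8 = strip_nulls_and_decode(utf8_field)
from_latin1 = strip_nulls_and_decode(latin1_field, "iso-8859-1")
```

Both calls should yield the same Unicode string, as long as the encoding passed in matches the one the bytes were actually written in.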

David Zaslavsky
A: 

(Too long for a comment)
Hm... I tried this, but now I get this error when running it:

** (gui.py:24877): CRITICAL **: murrine_style_draw_focus: assertion `height >= -1' failed
Traceback (most recent call last):
  File "gui.py", line 80, in organize
    files = listDirectory(self.dir, [".mp3"])
  File "gui.py", line 63, in listDirectory
    return [getFileInfoClass(f)(f) for f in fileList]
  File "gui.py", line 23, in __init__
    self["name"] = filename
  File "gui.py", line 52, in __setitem__
    self.__parse(item)
  File "gui.py", line 46, in __parse
    self[tag] = parseFunc(tagdata[start:end])
  File "gui.py", line 16, in stripnulls
    return codecs.utf_8_decode(data.replace("\00", ""))
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 15-16: invalid data

I have also added # -*- coding: utf-8 -*- to the top of the file, and tried printing æøå, which worked. (Running the latest Ubuntu, English language, with Geany as editor.)

Is there any other way?

Terw
Either update your question or post a new one.
J.F. Sebastian
Answer area should be preferably used for answers.
J.F. Sebastian
A: 

I don't know what encodings are used for mp3 tags but if you are sure that it is UTF-8 then:

 tagdata[start:end].decode("utf-8")

The line # -*- coding: utf-8 -*- declares the encoding of your source code; it doesn't define the encoding used to read from or write to files.
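As a small illustration of that distinction (a sketch, not taken from the original script): the bytes below are written with escapes, so the coding line has no influence on them, and whether they decode depends only on the codec you name. The sample bytes are an assumption, chosen as "æøå" in ISO-8859-1.

```python
# -*- coding: utf-8 -*-
# The coding line above only tells Python how to read this source file.

# "æøå" as ISO-8859-1 bytes, e.g. as they might come out of fsock.read().
raw = b"\xe6\xf8\xe5"

# Decoding with the matching codec gives the expected text.
text = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 fails, regardless of the coding line.
try:
    raw.decode("utf-8")
    utf8_worked = True
except UnicodeDecodeError:
    utf8_worked = False
```

This is the same kind of failure as the UnicodeDecodeError in the traceback above: the data is simply not valid UTF-8.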

J.F. Sebastian
+3  A: 

You want to start by decoding the input FROM the charset it is in TO Unicode (in Python, decode means "take bytes in some charset to Unicode", and encode means "take Unicode to bytes in some charset, such as UTF-8").

Some googling suggests the Norwegian charset is plain old 'iso-8859-1'... I hope someone can correct me if I'm wrong on this detail. Regardless, substitute the actual name of the charset in the following example:

tagdata[start:end].decode('iso-8859-1')

In a real-world app, I realize you can't guarantee that the input is Norwegian, or in any particular charset. In this case, you will probably want to proceed through a series of likely charsets to see which one you can decode successfully. Both SO and Google have some suggestions on algorithms for doing this effectively in Python. It sounds scarier than it really is.
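One rough way to try a series of charsets might look like the sketch below. The function name `decode_with_fallback` and the default candidate list are assumptions for illustration. Note that 'iso-8859-1' maps every possible byte to a character, so it never fails and has to come last; a successful decode also does not prove the guess was right.

```python
def decode_with_fallback(data, encodings=("utf-8", "iso-8859-1")):
    """Try each candidate encoding in order; return (text, encoding_used).

    Strict codecs such as UTF-8 should come first, because a permissive
    one like ISO-8859-1 accepts any byte sequence.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing matched: keep the ASCII part and mark the rest as unknown.
    return data.decode("ascii", "replace"), None


# Valid UTF-8 input is picked up by the first candidate.
text_a, used_a = decode_with_fallback(b"bl\xc3\xa5b\xc3\xa6r")
# ISO-8859-1 input is rejected by UTF-8 and falls through to the second.
text_b, used_b = decode_with_fallback(b"bl\xe5b\xe6r")
```

Both inputs here spell "blåbær", so `text_a` and `text_b` should come out equal even though different codecs were used.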

Jarret Hardie
Editing back to the old stripnulls and using your code did the job, tnx :)
Terw
Glad it works. For my own curiosity, since I see from your profile that you are in Norway, is the Norwegian character set in fact covered by ISO-8859-1?
Jarret Hardie