The PyYAML package loads unmarked strings as either unicode or str objects, depending on their content.

I would like to use unicode objects throughout my program (and, unfortunately, can't switch to Python 3 just yet).

Is there an easy way to force PyYAML to always load strings as unicode objects? I do not want to clutter my YAML with !!python/unicode tags.

# Encoding: UTF-8

import yaml

menu = u"""---
- spam
- eggs
- bacon
- crème brûlée
- spam
"""

print yaml.load(menu)

Output: ['spam', 'eggs', 'bacon', u'cr\xe8me br\xfbl\xe9e', 'spam']

I would like: [u'spam', u'eggs', u'bacon', u'cr\xe8me br\xfbl\xe9e', u'spam']

+1  A: 

Here's a function you could use to replace str with unicode objects in the decoded output of PyYAML:

def make_str_unicode(obj):
    t = type(obj)

    if t in (list, tuple):
        if t == tuple:
            # Convert the tuple to a list so that
            # items can be reassigned while copying
            is_tuple = True
            obj = list(obj)
        else: 
            # Otherwise just do a quick slice copy
            obj = obj[:]
            is_tuple = False

        # Copy each item recursively
        for x in xrange(len(obj)):
            obj[x] = make_str_unicode(obj[x])

        if is_tuple: 
            # Convert back into a tuple again
            obj = tuple(obj)

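    elif t == dict:
        # Build a new dict: assigning obj[unicode(k)] in place would keep
        # the old str key, since 'k' == u'k' and both hash the same
        obj = dict((make_str_unicode(k), make_str_unicode(v))
                   for k, v in obj.iteritems())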

    elif t == str:
        # Convert strings to unicode objects
        obj = unicode(obj)
    return obj

print make_str_unicode({'blah': ['the', 'quick', u'brown', 124]})
David Morrissey
Not quite the answer I would like to see :( That function will probably work on most common YAML files, but not all. Dict keys might not be strings, and YAML allows storing custom types, which might contain strings.
Petr Viktorin
If the keys aren't of type `str`, then they won't be converted to `unicode` (if you look at the code). I agree it's not a fantastic solution, but it will work. Try `make_str_unicode({0: [u'the', u'quick', u'brown', 124]})` and it'll leave the integer alone. Also, if you look at the code further, it only processes `list`, `tuple`, `dict` and `str` (other types/classes stay as they were).
David Morrissey
If you use custom types, then the handlers might have to convert the `str` objects to `unicode` themselves (or add an `elif isinstance(obj, mycustomtype): ...` branch and handle them individually).
David Morrissey
Sorry, my mistake. Thanks for the solution.
Petr Viktorin
no problems, I think I'd probably use the other solution myself though just because it's shorter/faster :-)
David Morrissey
+2  A: 

Here's a version that overrides PyYAML's handling of strings so that it always outputs unicode. In practice it should give the same result as the other answer I posted, just in less code (i.e. if you use custom handlers, you still need to make sure yourself that strings in custom classes are converted to unicode, or that they are passed unicode strings):

# -*- coding: utf-8 -*-
import yaml
from yaml import Loader, SafeLoader

def construct_yaml_str(self, node):
    # Override the default string handling function 
    # to always return unicode objects
    return self.construct_scalar(node)
Loader.add_constructor(u'tag:yaml.org,2002:str', construct_yaml_str)
SafeLoader.add_constructor(u'tag:yaml.org,2002:str', construct_yaml_str)

print yaml.load(u"""---
- spam
- eggs
- bacon
- crème brûlée
- spam
""")

(The above gives [u'spam', u'eggs', u'bacon', u'cr\xe8me br\xfbl\xe9e', u'spam'])

I haven't tested it with LibYAML (the C-based parser), as I couldn't get it to compile, so I'll leave the other answer as it is.
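
If patching the shared Loader and SafeLoader classes is a concern (every other user of yaml.load in the same process sees the change), the same constructor could probably be registered on a subclass instead. The following is only a minimal sketch under that assumption: `UnicodeSafeLoader` is a made-up name, not something PyYAML provides, and the snippet avoids the print statement so that it also runs unchanged on Python 3, where all strings are already unicode.

```python
# -*- coding: utf-8 -*-
import yaml


class UnicodeSafeLoader(yaml.SafeLoader):
    """Hypothetical subclass (not part of PyYAML) that holds the override."""


def construct_yaml_str(loader, node):
    # construct_scalar returns the scalar's text as-is, skipping the
    # conversion of ASCII-only values to str that Python 2 PyYAML applies
    return loader.construct_scalar(node)


# add_constructor is a classmethod that copies the constructor table onto
# the class it is called on, so yaml.SafeLoader itself is left untouched
UnicodeSafeLoader.add_constructor(u'tag:yaml.org,2002:str', construct_yaml_str)

menu = yaml.load(u"- spam\n- cr\xe8me br\xfbl\xe9e\n", Loader=UnicodeSafeLoader)
```

Callers then opt in by passing `Loader=UnicodeSafeLoader`, while plain `yaml.load` and `yaml.safe_load` calls elsewhere keep their default behaviour.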

David Morrissey
This is perfect, thank you! It does work with strings inside custom classes, and with LibYAML's CLoader. And it looks much cleaner :) Thanks again!
Petr Viktorin