I'm using Python (Python 2.5.2 on Ubuntu 8.10) to parse JSON from (ASCII encoded) text files. When loading these files with json (simplejson), all my string values are cast to Unicode objects instead of string objects.

The problem is, I have to use the data with some libraries that only accept string objects.

Is it possible to get string objects instead of unicode ones from simplejson?
Any hints on how I can achieve this automatically?

Edit: I can't change the libraries nor update them. One - the csv module - is even in the Python standard library (the documentation says it will support Unicode in the future). I could write wrappers of course, but maybe there is a more convenient way?

The actual data I parse from the JSON files is rather nested and complex, so it would be a pain to look for every Unicode object therein and cast it manually...

Here's a small example:

>>> import simplejson as json
>>> l = ['a', 'b']
>>> l
['a', 'b']
>>> js = json.dumps(l)
>>> js
'["a", "b"]'
>>> nl = json.loads(js)
>>> nl
[u'a', u'b']

Update: I completely agree with Jarret Hardie and nosklo: since the JSON spec specifically defines strings as Unicode, simplejson should return Unicode objects.

But while searching the net, I came across some posts where people complain about simplejson actually returning string objects... I couldn't reproduce this behavior, but it seems to be possible. Any hints?

Workaround

Right now I use PyYAML to parse the files; it gives me string objects.
Since JSON is a subset of YAML, it works nicely.
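
For illustration, a minimal sketch of this workaround (assuming PyYAML is installed; in Python 2, PyYAML returns plain str objects for scalars that contain only ASCII characters, while non-ASCII values still come back as unicode):

import yaml

js = '["a", "b"]'        # the JSON produced by json.dumps(['a', 'b'])
nl = yaml.safe_load(js)  # JSON is, for practical purposes, valid YAML
print nl                 # ['a', 'b'] -- str objects, not unicode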

+7  A: 

That's because JSON makes no distinction between string objects and unicode objects: they're all strings in JavaScript.

I think simplejson is right to return unicode objects. In fact, I wouldn't accept anything less, since JavaScript strings are in fact Unicode (i.e. JSON (JavaScript) strings can store any kind of Unicode character), so it makes sense to create unicode objects when translating strings from JSON. Plain strings just wouldn't fit, since the library would have to guess the encoding you want.

It's better to use unicode string objects everywhere, so your best option is to update your libraries so that they can deal with unicode objects.

But if you really want bytestrings, just encode the results to the encoding of your choice:

>>> nl = json.loads(js)
>>> nl
[u'a', u'b']
>>> nl = [s.encode('utf-8') for s in nl]
>>> nl
['a', 'b']
nosklo
What on earth does Java have to do with this?
Javier
Thanks nosklo, that's what I did first. But as I said, the real data I use is pretty nested, so this introduced quite some overhead. I'm still looking for an automatic solution... There's at least one bug report out there where people complain about simplejson returning string objects instead of unicode.
Brutus
@Javier: Sorry, I meant Javascript. Fixed the text in the answer.
nosklo
@Brutus: I think simplejson is right to return unicode objects. In fact, I wouldn't accept anything less, since JavaScript strings are in fact Unicode objects. What I mean is that JSON (JavaScript) strings can store any kind of Unicode character, so it makes sense to create unicode objects when translating from JSON. You should really fix your libraries instead.
nosklo
+4  A: 

I'm afraid there's no way to achieve this automatically within the simplejson library.

The scanner and decoder in simplejson are designed to produce unicode text. To do this, the library uses a function called c_scanstring (if it's available, for speed), or py_scanstring if the C version is not available. The scanstring function is called several times by nearly every routine that simplejson has for decoding a structure that might contain text. You'd have to either monkeypatch the scanstring value in simplejson.decoder, or subclass JSONDecoder and provide pretty much your own entire implementation of anything that might contain text.
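
For illustration, the monkeypatch route might look something like the sketch below. This is not a supported simplejson API, and whether the patch takes effect is version-dependent: many versions bind scanstring into a decoder when the JSONDecoder is constructed, so it would have to be applied before any decoder (including the module-level default one) is created or used.

import simplejson.decoder

_original_scanstring = simplejson.decoder.scanstring

def _byte_scanstring(*args, **kwargs):
    # Delegate to the real scanner, then encode its unicode result.
    s, end = _original_scanstring(*args, **kwargs)
    return s.encode('utf-8'), end

simplejson.decoder.scanstring = _byte_scanstring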

The reason that simplejson outputs unicode, however, is that the json spec specifically mentions that "A string is a collection of zero or more Unicode characters"... support for unicode is assumed as part of the format itself. Simplejson's scanstring implementation goes so far as to scan and interpret unicode escapes (even error-checking for malformed multi-byte charset representations), so the only way it can reliably return the value to you is as unicode.

If you have an aged library that needs a str, I recommend you either laboriously search the nested data structure after parsing (which I acknowledge is what you explicitly said you wanted to avoid... sorry), or perhaps wrap your libraries in some sort of facade where you can massage the input parameters at a more granular level. The second approach might be more manageable than the first if your data structures are indeed deeply nested.
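
Since the question mentions the csv module, such a facade might look like this hypothetical sketch (EncodingWriter is an illustrative name, not a real API): unicode cells are encoded on their way into the legacy module, so the parsed structure never has to be walked up front.

import csv

class EncodingWriter(object):
    # Wraps csv.writer and encodes unicode cells before writing.
    def __init__(self, f, encoding='utf-8', **kwargs):
        self.writer = csv.writer(f, **kwargs)
        self.encoding = encoding

    def writerow(self, row):
        self.writer.writerow([
            cell.encode(self.encoding) if isinstance(cell, unicode) else cell
            for cell in row])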

Jarret Hardie
+1  A: 

This is late to the game, but I built this recursive caster. It works for my needs and I think it's relatively complete. It may help you.

def _parseJSON(self, obj):
    newobj = {}

    for key, value in obj.iteritems():
        key = str(key)

        if isinstance(value, dict):
            newobj[key] = self._parseJSON(value)
        elif isinstance(value, list):
            if key not in newobj:
                newobj[key] = []
                for i in value:
                    newobj[key].append(self._parseJSON(i))
        elif isinstance(value, unicode):
            val = str(value)
            if val.isdigit():
                val = int(val)
            else:
                try:
                    val = float(val)
                except ValueError:
                    val = str(val)
            newobj[key] = val

    return newobj

Just pass it a JSON object like so:

obj = json.loads(content, parse_float=float, parse_int=int)
obj = _parseJSON(obj)

I have it as a private member of a class, but you can repurpose the method as you see fit.

Wells
I've run into a problem where I'm trying to parse JSON and pass the resulting mapping to a function as **kwargs. It looks like function parameter names cannot be unicode, so your _parseJSON function is great. If there's an easier way, someone can let me know.
Neal S.
This code has a problem - you make a recursive call in the List piece, which is going to fail if the elements of the list are not themselves dictionaries.
I82Much
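
A sketch that addresses this: dispatch on the object's type first, so list elements of any type (not just dictionaries) are handled. It leaves out the int/float coercion of the original, uses encode('utf-8') so non-ASCII values don't crash, and parse_json is an illustrative name.

def parse_json(obj):
    if isinstance(obj, dict):
        return dict((parse_json(k), parse_json(v))
                    for k, v in obj.iteritems())
    elif isinstance(obj, list):
        return [parse_json(i) for i in obj]
    elif isinstance(obj, unicode):
        return obj.encode('utf-8')  # str(obj) would fail on non-ASCII
    return obj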
+1  A: 

So, I've run into the same problem. Guess what the first Google result was.

Because I need to pass all data to PyGTK, unicode strings aren't very useful to me either, so I have another recursive conversion function. It's actually also needed for type-safe JSON conversion: json.dump() would bail on any non-literals, like Python objects. It doesn't convert dict keys, though.

# removes any objects, turns unicode back into str
def filter_data(obj):
    if type(obj) in (int, float, str, bool):
        return obj
    elif type(obj) == unicode:
        return str(obj)
    elif type(obj) in (list, tuple, set):
        obj = list(obj)
        for i, v in enumerate(obj):
            obj[i] = filter_data(v)
    elif type(obj) == dict:
        for i, v in obj.iteritems():
            obj[i] = filter_data(v)
    else:
        print "invalid object in data, converting to string"
        obj = str(obj)
    return obj
mario
The only problem that might come up here is if you need the keys in a dictionary converted from unicode. Though this implementation will convert the values, it maintains the unicode keys. If you create a 'newobj', use newobj[str(i)] = ..., and assign obj = newobj when you're done, the keys will be converted as well.
Neal S.
+1  A: 

The gotcha is that simplejson and json are two different modules, at least in the way they handle unicode. json ships with Python 2.6+ and gives you unicode values, whereas simplejson (in some versions) returns string objects. Just try easy_install-ing simplejson in your environment and see if that works. It did for me.

ducu
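
A quick way to check what your installed version actually does (a sketch; the output differs between simplejson versions):

import simplejson
print simplejson.__version__
print type(simplejson.loads('["a"]')[0])  # str or unicode?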