I'm using Python (Python 2.5.2 on Ubuntu 8.10) to parse JSON from ASCII-encoded text files. When loading these files with json (simplejson), all my string values are cast to Unicode objects instead of string objects. The problem is that I have to use the data with some libraries that only accept string objects.
Is it possible to get string objects instead of Unicode ones from simplejson? Any hints on how I can achieve this automatically?
Edit: I can't change or update the libraries. One of them - the csv module - is even in the Python standard library (the documentation says it will support Unicode in the future). I could write wrappers, of course, but maybe there is a more convenient way?
The actual data I parse from the JSON files is rather nested and complex, so it would be a pain to look for every Unicode object therein and cast it manually...
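For reference, here is the kind of manual cast I mean: a recursive post-processing helper that walks the nested structure and encodes every text string. This is only a sketch (the name byteify is mine, not from any library), written so it also runs on Python 3, where it produces bytes instead of Python 2 str:

```python
import json

def byteify(obj):
    """Recursively encode every text string in a decoded JSON structure."""
    text_type = type(u'')  # unicode on Python 2, str on Python 3
    if isinstance(obj, dict):
        # dict((...) for ...) instead of a dict comprehension, for Python 2.5
        return dict((byteify(k), byteify(v)) for k, v in obj.items())
    if isinstance(obj, list):
        return [byteify(x) for x in obj]
    if isinstance(obj, text_type):
        return obj.encode('utf-8')
    return obj

# On Python 2, every text value in the result is now a byte string (str)
data = byteify(json.loads('["a", {"b": "c"}]'))
```

This works, but it is exactly the sort of extra pass over the whole structure I was hoping to avoid.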
Here's a small example:
>>> import simplejson as json
>>> l = ['a', 'b']
>>> l
['a', 'b']
>>> js = json.dumps(l)
>>> js
'["a", "b"]'
>>> nl = json.loads(js)
>>> nl
[u'a', u'b']
Update: I completely agree with Jarret Hardie and nosklo: since the JSON spec specifically defines strings as Unicode, simplejson should return Unicode objects.
But while searching the net, I came across some posts where people complain about simplejson actually returning string objects... I couldn't reproduce this behavior, but it seems to be possible. Any hints?
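One semi-automatic route I've looked at is the object_hook parameter that both simplejson and the stdlib json module accept: it post-processes every decoded dict. A sketch (the helper name encode_dict is mine; note that the hook only sees dicts, so bare lists and top-level strings are not converted):

```python
import json  # simplejson exposes the same object_hook parameter

def encode_dict(d):
    # Hypothetical helper: encode the text keys and values of one decoded
    # dict. Nested dicts are already converted, since the decoder calls
    # the hook innermost-first.
    text_type = type(u'')  # unicode on Python 2, str on Python 3
    def enc(x):
        return x.encode('utf-8') if isinstance(x, text_type) else x
    return dict((enc(k), enc(v)) for k, v in d.items())

result = json.loads('{"a": "b"}', object_hook=encode_dict)
```

It covers the dict case without an extra pass, but it's not a complete answer for data where the top level is a list or a string.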
Workaround
Right now I use PyYAML to parse the files; it gives me string objects. Since JSON is a subset of YAML, it works nicely.