I have experimental data expressed as dicts of key-value pairs for each experiment. A set of related experiments is serialized as a list of these dicts in JSON. This is parseable in R via the rjson package, but the data is loaded in a form which is challenging to analyze:

data <- fromJSON('[{"k1":"v1","k2":"v2"}, {"k1":"v3","k2":"v4"}]')

yields

[[1]]
[[1]]$k1
[1] "v1"

[[1]]$k2
[1] "v2"


[[2]]
[[2]]$k1
[1] "v3"

[[2]]$k2
[1] "v4"

Attempting to turn this into a data.frame directly with as.data.frame(data) yields:

  k1 k2 k1.1 k2.1
1 v1 v2   v3   v4

clearly viewing the sequence of key/value pairs across all experiments as a flat 1-dimensional list.

What I want is a more conventional table with a row for each experiment, and a column for each unique key:

  k1 k2
1 v1 v2
2 v3 v4

How can I cleanly express this transform in R?

+2  A: 

This is interesting. The easiest way would be to fix the Python code so that the dict can be transformed more easily.

But, how about this?

k1 <- unlist(lapply(data,FUN=function(x){return(x[[1]])}))
k2 <- unlist(lapply(data,FUN=function(x){return(x[[2]])}))
data.frame(k1,k2)

You will need to cast k1 and k2 into the correct data type still, but this should accomplish what you are looking for.
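
As a minimal sketch of the same idea without writing one line per column, the extraction can loop over the key names instead. This assumes every record has the same keys as the first one; the names `keys`, `cols` and `df` are only illustrative, and the input is written as a plain R list so the example does not depend on rjson.

```r
# Same structure that fromJSON() returns for the question's input.
data <- list(list(k1 = "v1", k2 = "v2"), list(k1 = "v3", k2 = "v4"))

# Take the key names from the first record
# (assumption: all records share these keys).
keys <- names(data[[1]])

# For each key, pull that element out of every record.
# `[[` returns the value itself; `[` would return a one-element list.
cols <- lapply(keys, function(k) sapply(data, `[[`, k))
names(cols) <- keys

# One column per key, one row per record.
df <- as.data.frame(cols, stringsAsFactors = FALSE)
df
```

Extracting by name rather than by position also keeps the columns correct if the keys appear in a different order in some records.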

Ryan Rosario
A cleaner generalization if you have a lot of columns would be: `newdata <- lapply(1:length(data[[1]]), function(x) unlist(lapply(data, "[[", x))); newdata <- as.data.frame(newdata); names(newdata) <- names(data[[1]])`
brentonk
I clearly can preprocess the JSON to transpose it before loading, but the problem is that I don't view this as "fixing" it at all: a list of dicts _is_ the most natural way to think about this data. A dict of lists is just the more convenient way for row-oriented software to load it densely, not the best way to think about it. And manually unpacking every entry is untenable. brentonk's method, however, works. (I clearly need to better grok the meaning of `[[` as opposed to plain subset (`[`), among other things.)
jrk
My solution works for two columns, which is clearly what your question implied. If you have several columns, then of course you need to use a generalization, such as brentonk's method.
Ryan Rosario
You're right—I misread that (without running it) as being a *row-wise* operation, requiring an invocation on every data element, not just on every column. More explanation would have made that clearer. Still, for large numbers of columns, the further generalization is useful. Thanks, both. If you want to add an explanation of the generalization over many columns I'd gladly mark it "accepted". I think that would be useful for future viewers, rather than keeping it buried in the comments.
jrk
+4  A: 

The l*ply functions can be your best friend when doing list processing. Try this:

> library(plyr)
> ldply(data, data.frame)
  k1 k2
1 v1 v2
2 v3 v4

plyr does some very nice processing behind the scenes to deal with things like irregular lists (e.g. when each list doesn't contain the same number of elements). This is very common with JSON and XML, and is tricky to handle with the base functions.

Or alternatively using base functions:

> do.call("rbind", lapply(data, data.frame))

You can use rbind.fill (from plyr) instead of rbind if you have irregular lists, but I'd advise just using plyr from the beginning to make your life easier.
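
If adding a dependency is not an option, the NA-filling behaviour can be approximated in base R. The following is only a sketch under stated assumptions: `rows_to_df` is a hypothetical helper name (not part of plyr or base R), and it assumes every value is a length-one atomic value, as in the question's data.

```r
# Hypothetical helper: build a data frame from a list of named lists,
# filling keys that are missing from a record with NA.
rows_to_df <- function(records) {
  # Union of all keys, in order of first appearance.
  keys <- unique(unlist(lapply(records, names)))
  # One column per key; a missing key yields NULL from `[[`, mapped to NA.
  cols <- lapply(keys, function(k) {
    sapply(records, function(r) if (is.null(r[[k]])) NA else r[[k]])
  })
  names(cols) <- keys
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Records with keys in different orders and with missing keys.
x <- list(list(k1 = 2, k2 = 3), list(k2 = 100, k1 = 200), list(k1 = 5, k3 = 9))
df <- rows_to_df(x)
df
```

This builds the table column by column rather than binding rows, so it avoids growing a data frame one row at a time, but it does not handle nested or vector-valued entries.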

Edit:

Regarding your more complicated example, using Hadley's suggestion deals with this easily:

> x<-list(list(k1=2,k2=3),list(k2=100,k1=200),list(k1=5, k3=9))
> ldply(x, data.frame)
   k1  k2 k3
1   2   3 NA
2 200 100 NA
3   5  NA  9
Shane
I like the plyr solution, since it can deal with the variables appearing in a different order for each observation. Call me paranoid, but I was worried about some observations not having some variables. Here is a variation that does not break even for very bad cases: `x <- list(list(k1=2,k2=3), list(k2=100,k1=200), list(k1=5)); ldply(x, function(z) as.data.frame(t(unlist(z))))`
Jyotirmoy Bhattacharya
I think a better solution is `ldply(x, data.frame)`
hadley
I'd always choose the plyr solution :)
Ryan Rosario
Brilliant. This is exactly what I want. Thanks, all.
jrk