ansaurus

Question

Best way to store variable-length data in an R data.frame?

Answer 1

+1 A:

Since the R data frame structure is based loosely on the SQL table, having each element of the data frame be anything other than an atomic data type is uncommon. However, it can be done, as you've shown, and this linked post describes such an application implemented on a larger scale.

An alternative is to store your data as a delimited string and parse it on retrieval, or to keep the data inside a closure and look it up using indices stored in your data frame.

> ## alternative 1
> tokens <- function(x,i=TRUE) Map(as.numeric,strsplit(x[i],","))
> d <- data.frame(id=c(1,2,3), token_lengths=c("5,5", "9", "4,2,2,4,6"))
> 
> tokens(d$token_lengths)
[[1]]
[1] 5 5

[[2]]
[1] 9

[[3]]
[1] 4 2 2 4 6

> tokens(d$token_lengths,2:3)
[[1]]
[1] 9

[[2]]
[1] 4 2 2 4 6

> 
> ## alternative 2
> retrieve <- local({
+   token_lengths <- list(c(5,5), 9, c(4,2,2,4,6))
+   function(i) token_lengths[i]
+ })
> 
> d <- data.frame(id=c(1,2,3), token_lengths=1:3)
> retrieve(d$token_lengths[2:3])
[[1]]
[1] 9

[[2]]
[1] 4 2 2 4 6

Stephen 2010-02-24 02:38:08

I thought of the pack-as-string solution, but then it complicated working with the variable length data. For now, I'm going with the column-of-arrays solution, and using `mapply()` liberally. For example, if I want the mean token length per phrase that's just `mapply(mean, d$token_lengths)`. If I want the max of all token lengths it's `max(mapply(max, d$token_lengths))`.

Nick 2010-02-24 17:43:47
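
A minimal sketch of the column-of-vectors approach described in the comment above, assuming the list column is attached in a second step after the data frame is created. With a single argument, `sapply()` gives the same per-row results as the `mapply()` calls quoted above.

```r
# Data frame with one numeric vector per row in token_lengths.
# The list column is attached after construction (an assumption of this sketch).
d <- data.frame(id = 1:3)
d$token_lengths <- list(c(5, 5), 9, c(4, 2, 2, 4, 6))

# Mean token length per phrase: one value per row.
mean_len <- sapply(d$token_lengths, mean)

# Largest token length across all phrases.
max_len <- max(sapply(d$token_lengths, max))
```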

Answer 2

+2 A:

Trying to shoehorn the data into a data frame seems hackish to me. Far better to consider each row as an individual object, then think of the dataset as an array of these objects.

This function converts your data strings to an appropriate format. (This is S3 style code; you may prefer to use one of the 'proper' object oriented systems.)

as.mydata <- function(x)
{
   UseMethod("as.mydata")
}

as.mydata.character <- function(x)
{
   convert <- function(x)
   {
      md <- list()
      md$phrase = x
      spl <- strsplit(x, " ")[[1]]
      md$num_words <- length(spl)
      md$token_lengths <- nchar(spl)
      class(md) <- "mydata"
      md
   }
   lapply(x, convert)
}

Now your whole dataset looks like

mydataset <- as.mydata(c("hello world", "greetings", "take me to your leader"))

mydataset
[[1]]
$phrase
[1] "hello world"

$num_words
[1] 2

$token_lengths
[1] 5 5

attr(,"class")
[1] "mydata"

[[2]]
$phrase
[1] "greetings"

$num_words
[1] 1

$token_lengths
[1] 9

attr(,"class")
[1] "mydata"

[[3]]
$phrase
[1] "take me to your leader"

$num_words
[1] 5

$token_lengths
[1] 4 2 2 4 6

attr(,"class")
[1] "mydata"

You can define a print method to make this look prettier.

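print.mydata <- function(x, ...)
{
   # Accept ... to match the print() generic, and return x invisibly as print methods conventionally do
   cat(x$phrase, "consists of", x$num_words, "words, with", paste(x$token_lengths, collapse=", "), "letters.")
   invisible(x)
}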
mydataset
[[1]]
hello world consists of 2 words, with 5, 5 letters.
[[2]]
greetings consists of 1 words, with 9 letters.
[[3]]
take me to your leader consists of 5 words, with 4, 2, 2, 4, 6 letters.

The sample operations you wanted to do are fairly straightforward with data in this format.

sapply(mydataset, function(x) nchar(x$phrase) > 10)
[1]  TRUE FALSE  TRUE

Richie Cotton 2010-02-24 12:01:59

I was going to suggest this list-based solution too. It's certainly what you'd do in something other than R. But there's a way in which *all* R programming is "hackish", in a good way, and the (over)use of data.frames is one of those ways. Arguably, a long-format data.frame may be the most programmer-efficient option, even if it's a little silly from a data structures perspective.

Harlan 2010-02-24 16:43:06

So what's the efficient way to calculate the mean number of tokens? In my original example, it's simply `mean(mydata$num_tokens)`. For the list-based solution, you'd have to do something like `mean(sapply(mydataset, function(x) x$num_words))`. With helper functions, that could be prettier of course.

Nick 2010-02-24 17:33:02

@Nick: Yes, the syntax is a little clunkier this way. I'd put the sapply statement in a function like `get_num_words <- function(x) sapply(x, function(el) el$num_words)`. Then you use `mean(get_num_words(mydataset))`.

Richie Cotton 2010-02-25 16:06:34
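
The helper idea can be generalised a little. The sketch below assumes records shaped like the ones produced by `as.mydata()` above, with the word count stored in `num_words`. `get_field` is a hypothetical helper name, not something from the original answer.

```r
# Two hand-built records with the same fields as the "mydata" objects above.
records <- list(
  list(phrase = "hello world", num_words = 2L, token_lengths = c(5L, 5L)),
  list(phrase = "greetings",   num_words = 1L, token_lengths = 9L)
)

# Hypothetical helper: pull one numeric field out of every record.
# vapply() fails loudly if a record does not yield exactly one number.
get_field <- function(x, field) {
  vapply(x, function(rec) as.numeric(rec[[field]]), numeric(1))
}

mean_words <- mean(get_field(records, "num_words"))
```

Using `vapply()` rather than `sapply()` means a misspelled field name raises an error instead of silently returning a list of `NULL`s.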

Answer 3

+3 A:

I would just use the data in the "long" format.

E.g.

> d1 <- data.frame(id=1:3, num_words=c(2,1,5), phrase=c("hello world", "greetings", "take me to your leader"))
> d2 <- data.frame(id=c(rep(1,2), rep(2,1), rep(3,5)), token_length=c(5,5,9,4,2,2,4,6))
> d2$tokenid <- with(d2, ave(token_length, id, FUN=seq_along))
> d <- merge(d1,d2)
> subset(d, nchar(phrase) > 10)
  id num_words                 phrase token_length tokenid
1  1         2            hello world            5       1
2  1         2            hello world            5       2
4  3         5 take me to your leader            4       1
5  3         5 take me to your leader            2       2
6  3         5 take me to your leader            2       3
7  3         5 take me to your leader            4       4
8  3         5 take me to your leader            6       5
> with(d, tapply(token_length, id, mean))
  1   2   3 
5.0 9.0 3.6

Once the data is in the long format, you can use sqldf or plyr to extract what you want from it.

Eduardo Leoni 2010-02-24 14:51:31
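
As a rough illustration of summarising long-format data, here is a sketch that uses base R's `aggregate()` as a stand-in for sqldf or plyr. The data frame mirrors `d2` above; the column name `num_tokens` is introduced here for illustration.

```r
# Long format: one row per token, keyed by phrase id.
d2 <- data.frame(id = rep(1:3, times = c(2, 1, 5)),
                 token_length = c(5, 5, 9, 4, 2, 2, 4, 6))

# Mean token length per phrase.
mean_len <- aggregate(token_length ~ id, data = d2, FUN = mean)

# Number of tokens per phrase, recovered by counting rows per id.
n_tokens <- aggregate(token_length ~ id, data = d2, FUN = length)
names(n_tokens)[2] <- "num_tokens"

# Mean number of tokens per phrase, computed on the one-row-per-phrase summary.
mean_tokens <- mean(n_tokens$num_tokens)
```

Summarising to one row per phrase first avoids the repeated per-phrase values that make averages awkward on the merged long table.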

I actually have my data in this long format already and I'm trying to shorten it because I'm finding working with it awkward. For instance, to compute the mean number of tokens I have to write something like: `mean(unique(d[c('id', 'num_tokens')])$num_tokens)`. If the data is not long, I can just write `mean(d$num_tokens)` which is much more readable. The primary item of interest here is the phrase and it just happens to have variable-length data associated with it; expanding out that data makes it awkward.

Nick 2010-02-24 17:40:34

You could shorten it to `mean(subset(d, tokenid==1)$num_tokens)`, but I get your point. If you want to stick with a data frame, I think you can. A data frame is a list of vectors of the same length, and one of those vectors can itself be a list: `df <- data.frame(a=1:3); df$b <- list(1:3,1:2,1:3)`. R doesn't like this, though. (It complains if you create the data frame in one step.) Not sure why.

Eduardo Leoni 2010-02-24 18:12:06
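
On the one-step complaint: `data.frame()` expands a plain list argument into one column per list element, so elements of different lengths clash. Wrapping the list in `I()` keeps it as a single column. The sketch below illustrates both cases; the variable names are only for illustration.

```r
# Passing a bare list: data.frame() treats each element as its own column,
# and the differing lengths (3, 2, 3) trigger an error, captured here as text.
one_step_bare <- tryCatch(
  data.frame(a = 1:3, b = list(1:3, 1:2, 1:3)),
  error = function(e) conditionMessage(e)
)

# Wrapping the list in I() marks it "as is", so it becomes one list column.
df <- data.frame(a = 1:3, b = I(list(1:3, 1:2, 1:3)))

# Each row of b holds its own vector.
row_lengths <- sapply(df$b, length)
```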

The tokenid subset is at least a little more attractive. :)

Nick 2010-02-24 19:59:18

Answer 4

+4 A:

Another option would be to convert your data frame into a matrix of mode list, in which each element of the matrix is a list. Standard array operations (slicing with `[`, `apply()`, etc.) would then be applicable.

> d <- data.frame(id=c(1,2,3), num_tokens=c(2,1,5), token_lengths=as.array(list(c(5,5), 9, c(4,2,2,4,6))))
> m <- as.matrix(d)
> mode(m)
[1] "list"
> m[,"token_lengths"]
[[1]]
[1] 5 5

[[2]]
[1] 9

[[3]]
[1] 4 2 2 4 6

> m[3,]
$id
[1] 3

$num_tokens
[1] 5

$token_lengths
[1] 4 2 2 4 6

Stephen 2010-02-24 18:08:13

Answer 5

A:

I would also store the variable-length data as strings, but written as R expressions, e.g. "c(5,5)" for the first phrase. You then need eval(parse(text=...)) to carry out computations.

For example, the mean can be computed as follows:

sapply(data$token_lengths,function(str) mean(eval(parse(text=str))))

teucer 2010-03-04 14:26:11
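
A self-contained sketch of this approach, assuming a data frame `d` whose `token_lengths` column holds R expressions as character strings (the data frame itself is made up for illustration):

```r
# token_lengths stores each vector as the text of an R expression.
# stringsAsFactors = FALSE keeps the column as character in older R versions.
d <- data.frame(id = 1:3,
                token_lengths = c("c(5,5)", "c(9)", "c(4,2,2,4,6)"),
                stringsAsFactors = FALSE)

# Parse and evaluate each string, then take the mean of the resulting vector.
means <- sapply(d$token_lengths,
                function(s) mean(eval(parse(text = s))),
                USE.NAMES = FALSE)
```

Note that `eval(parse(text = ...))` runs whatever code the string contains, so this is only reasonable for data you trust.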
