



I'm trying to serialize each item in a list and write it to a CSV file alongside a key, producing a text file of key/value pairs. Ultimately this is going to run through Hadoop streaming, so before you ask, I think it really does need to be a text file (but I'm open to other ideas). This all seemed pretty straightforward at first, but I still can't quite get serialization to work the way I want.

If I do this:

> rawToChar(serialize("blah", NULL, ascii=T))
[1] "A\n2\n133888\n131840\n16\n1\n9\n4\nblah\n"

Then I have those pesky \n which screw up my CSV parsing later. I could go in and replace the \n with some other string, which I'm not opposed to doing. This seems a little messy, however.

The other option that came to mind is omitting the rawToChar() call and pumping the raw ascii into a text file:

> serialize("blah", NULL, ascii=T)
 [1] 41 0a 32 0a 31 33 33 38 38 38 0a 31 33 31 38 34 30 0a 31 36 0a 31 0a 39 0a
[26] 34 0a 62 6c 61 68 0a

Well if I just dump that to a text file I'll get \n after each element in the list. So I tried doing a little paste/collapse:

> ser <- serialize("blah", NULL, ascii=T)
> ser2 <- paste(ser, collapse="")
> ser2
[1] "410a320a3133333838380a3133313834300a31360a310a390a340a626c61680a"

Now that's a value I can write to a CSV text file! Only... how do I turn that back into raw again later? Take just the first hex element, 41: I can't even figure out how to create a raw vector and put the hex value 41 into one of its elements. When I try, I end up with something like this:

> r <- raw(1)
> r[1] <- 41
Error in r[1] <- 41 : 
  incompatible types (from double to raw) in subassignment type fix
> r[1] <- as.raw(41)
> r[1]
[1] 29 

Crap! 29!=41 (except for really large values of 29 and really small values of 41, of course)

Any ideas on how to crack this nut?


Maybe you wanted as.raw(65) instead, since 65 in decimal is 41 in hex. Raw values print in hex, so as.raw(41) shows 29 because decimal 41 is hex 29.

 > as.hexmode(65)
[1] "41"
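
To go the other way, from the collapsed hex string in the question back to a raw vector, one option is to split the string into two-character pairs and parse each pair as a base-16 number. This is only a minimal sketch using base R; the variable names (`hex`, `starts`, `pairs`, `bytes`) are illustrative and not from the original posts.

```r
# Serialize to raw, then flatten to a single hex string (no newlines, CSV-safe)
ser <- serialize("blah", NULL, ascii = TRUE)
hex <- paste(ser, collapse = "")

# Split the string back into two-character hex pairs
starts <- seq(1, nchar(hex), by = 2)
pairs <- substring(hex, starts, starts + 1)

# Parse each pair as base 16, then convert the integers to raw bytes
bytes <- as.raw(strtoi(pairs, base = 16L))

# The rebuilt vector should match the original serialization byte for byte
identical(bytes, ser)
unserialize(bytes)
```

The hex form doubles the size of the payload, so a denser text encoding such as Base64 may be preferable for large objects.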

As for the encoding, can you work with binary data within Hadoop streaming?

Dirk Eddelbuettel
Hadoop can work with binary, but the streaming mode, from what I can figure out, requires text.
JD Long
+2  A: 

The package caTools has a Base64 encoder-decoder that you can use:

> library(caTools)
> s<-base64encode(serialize("blah",NULL))
> s
> unserialize(base64decode(s,"raw"))
[1] "blah"
Jyotirmoy Bhattacharya
This looks very promising! Will test it out this morning. Thanks!
JD Long
I tested it and it works... but I am sometimes getting results that don't exactly match. Possibly a floating point issue. I will ask that specific question in another post.
JD Long
follow up question added here:
JD Long
ascii=T in serialize causes imprecise binary-to-decimal conversions. Use the binary serialization format (ascii=F); base64encode can encode raw vectors.
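
A quick way to see the difference described in the comment above is to round-trip a few doubles through both formats and compare. This is a sketch with made-up test values; whether the ASCII round trip differs for a particular value can depend on the value and the R version, so only the binary comparison is expected to be exact.

```r
# Some doubles that have no short exact decimal representation
x <- c(1/3, pi, 0.1 + 0.2)

# Round-trip through the binary format and through the ASCII format
bin_back <- unserialize(serialize(x, NULL, ascii = FALSE))
asc_back <- unserialize(serialize(x, NULL, ascii = TRUE))

# Binary serialization stores the bytes of each double unchanged
identical(bin_back, x)

# The ASCII format writes doubles as decimal text, so this may be FALSE
identical(asc_back, x)

# Size of any discrepancy introduced by the decimal conversion
max(abs(asc_back - x))
```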
Jyotirmoy Bhattacharya
+1  A: 

Thanks to jmoy for his great answer. I used his recommendation and it works great. For future hitchhikers who end up here, I'm leaving my functions for turning a list into a serialized CSV text file and then turning that file back into a list. I'm marking this post as community wiki. Feel free to edit it if there is a cleaner way of doing any of this:

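library(caTools)  # provides base64encode() and base64decode()

# Write each list item as one "index,base64(serialized item)" line.
listToCsv <- function(inList, outFileName){
  if (!is.list(inList))
    stop("listToCsv: The input list fails the is.list() check.")
  # Create (or truncate) the output file before appending lines
  cat("", file=outFileName, append=FALSE)

  i <- 1
  for (item in inList) {
    # Binary format (ascii=FALSE) avoids the lossy decimal conversion of doubles
    myLine <- paste(i, ",", base64encode(serialize(item, NULL, ascii=FALSE)), "\n", sep="")
    cat(myLine, file=outFileName, append=TRUE)
    i <- i+1
  }
}

# Read the file back and rebuild the list, one item per line.
csvToList <- function(inFileName){
  linesIn <- readLines(inFileName, n=-1)
  outList <- list()

  i <- 1
  for (line in linesIn){
    # Base64 text contains no commas, so the second field is the whole payload
    outList[[i]] <- unserialize(base64decode(strsplit(line, split=",")[[1]][[2]], "raw"))
    i <- i+1
  }
  outList
}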
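
For anyone who cannot install caTools, the same round trip can be sketched in base R by swapping Base64 for the hex encoding from the question. The helper names below (`to_hex`, `from_hex`, `list_to_csv_hex`, `csv_to_list_hex`) are hypothetical, and this is a vectorized variant rather than the functions above.

```r
# Hypothetical helper: serialize an object and flatten it to one hex string
to_hex <- function(obj) paste(serialize(obj, NULL), collapse = "")

# Hypothetical helper: parse a hex string back into raw bytes and unserialize
from_hex <- function(hex) {
  starts <- seq(1, nchar(hex), by = 2)
  unserialize(as.raw(strtoi(substring(hex, starts, starts + 1), base = 16L)))
}

# Write one "index,hex" line per list item
list_to_csv_hex <- function(in_list, path) {
  stopifnot(is.list(in_list))
  values <- vapply(in_list, to_hex, character(1))
  writeLines(paste(seq_along(values), values, sep = ","), path)
}

# Read the lines back, drop the key field, and decode each value
csv_to_list_hex <- function(path) {
  lines <- readLines(path)
  values <- sub("^[^,]*,", "", lines)
  lapply(values, from_hex)
}

# Usage: round-trip a small list through a temporary file
items <- list("blah", 1:3, c(a = 1/3, b = pi))
path <- tempfile(fileext = ".csv")
list_to_csv_hex(items, path)
restored <- csv_to_list_hex(path)
identical(restored, items)
```

Hex output is roughly 50% larger than Base64 for the same bytes, so this trades file size for having no package dependency.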
JD Long