In my previous question about using serialize() to create a CSV of objects, I got a great answer from jmoy, who recommended base64-encoding my serialized text. That was exactly what I was looking for. Oddly enough, when I put this into practice, I get results that look right but don't exactly match what I ran through the serialize/encode process.

The example below takes a list with 3 vectors and serializes each vector. Then each vector is base64 encoded and written to a text file along with a key. The key is simply the index number of the vector. I then reverse the process and read each line back from the csv. At the very end you can see some items don't exactly match. Is this a floating point issue? Something else?

require(caTools)

randList <- NULL
set.seed(2)

randList[[1]] <- rnorm(100)
randList[[2]] <- rnorm(200)
randList[[3]] <- rnorm(300)

#delete file contents
fileName <- "/tmp/tmp.txt"
cat("", file=fileName, append=F)

i <- 1
for (item in randList) {
  myLine <- paste(i, ",", base64encode(serialize(item, NULL, ascii=T)), "\n", sep="")
  cat(myLine, file=fileName, append=T) 
  i <- i+1
}

linesIn <- readLines(fileName, n=-1)

parsedThing <- NULL
i <- 1
for (line in linesIn){
  parsedThing[[i]] <- unserialize(base64decode(strsplit(linesIn[[i]], split=",")[[1]][[2]], "raw"))
  i <- i+1
  }

#floating point issue?
identical(randList, parsedThing)

for (i in 1:length(randList[[1]])) {
  print(randList[[1]][[i]] == parsedThing[[1]][[i]])
}

i<-3
randList[[1]][[i]] == parsedThing[[1]][[i]]

randList[[1]][[i]]
parsedThing[[1]][[i]]

Here's the abridged output:

> #floating point issue?
> identical(randList, parsedThing)
[1] FALSE
> 
> for (i in 1:length(randList[[1]])) {
+   print(randList[[1]][[i]] == parsedThing[[1]][[i]])
+ }
[1] TRUE
[1] TRUE
[1] FALSE
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE
[1] TRUE
[1] FALSE
[1] FALSE
...
> 
> i<-3
> randList[[1]][[i]] == parsedThing[[1]][[i]]
[1] FALSE
> 
> randList[[1]][[i]]
[1] 1.587845
> parsedThing[[1]][[i]]
[1] 1.587845
> 
+2  A: 

JD: I ran your code snippet on my Linux box, then looked at the differences computed by randList[[1]][[i]] - parsedThing[[1]][[i]].

Yes, the values are different, but only at the level of my machine's floating-point tolerance. A typical difference was -4.440892e-16, which is pretty tiny. Some differences were zero.

It does not surprise me that the save/restore introduced that (tiny) level of change. Any significant data conversion runs the risk of "bobbling" the least significant digit.
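To see why two values can both print as 1.587845 and still compare unequal, here is a small sketch using a stand-in pair of doubles (0.1 + 0.2 and 0.3, not the values from the question). It only illustrates that R's default printing shows about 7 significant digits, so a difference in the last bits is hidden unless you ask for more digits or look at the difference directly.

```r
# Stand-in values (an assumption for illustration, not the question's data):
# two doubles that print the same by default but differ in the last bits.
x <- 0.1 + 0.2
y <- 0.3

print(x)                   # default printing uses 7 significant digits
print(y)
x == y                     # exact comparison of the stored doubles
sprintf("%.17g", c(x, y))  # 17 significant digits are enough to tell doubles apart
x - y                      # size of the discrepancy
.Machine$double.eps        # machine epsilon, for scale
```

The same sprintf() call applied to randList[[1]][[3]] and parsedThing[[1]][[3]] should show where the two values part ways.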

pteetor
Welcome to SO, Paul. Glad to have you on board.
Shane
That's exactly what made me think this was a floating point rounding type of error. I was a little surprised at the introduction of noise, albeit very small noise. It was just outside of my experience so I thought I'd ask as to the cause. I'm rolling forward in my code assuming this is "close enough." Glad you're in SO!
JD Long
Glad to be here!
pteetor
+2  A: 

Ok, now that you show the output I can explain to you what you're doing (following Paul's lead here).

You are comparing floating-point values with ==, which tests for exact equality. As that is a known issue (see e.g. this R FAQ entry), you should buckle up and use any one of

  • identical()
  • all.equal()
  • functions from the RUnit package such as checkEquals
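As a rough sketch of how these comparisons behave, the snippet below round-trips one vector through serialize(..., ascii = TRUE) using base R only (no base64 step, which is an intentional simplification of the question's code). Note that identical() tests exact equality, so it can report FALSE for the same reason == does, as in the question's output; all.equal() compares within a numeric tolerance.

```r
set.seed(2)
original <- rnorm(100)

# ASCII serialization writes doubles as decimal text, then reads them back.
restored <- unserialize(serialize(original, NULL, ascii = TRUE))

identical(original, restored)           # exact, bit-for-bit test
all.equal(original, restored)           # comparison within a tolerance
isTRUE(all.equal(original, restored))   # the form to use inside if()
max(abs(original - restored))           # largest absolute discrepancy
```

all.equal() returns a character description of the differences rather than FALSE when its check fails, which is why it is wrapped in isTRUE() whenever a single logical value is needed.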

In sum, there seems nothing wrong with the base64 encoding you are using. You simply employed the wrong definition of exactly. But hey, we're economists, and anything below a trillion or two is rounding error anyway...

Dirk Eddelbuettel
I worked for an accounting firm for a period in my life so there's something that tickles inside my ear canal when I can't reproduce certain things *exactly*. Thank you for giving me a sanity check.
JD Long
Please see the (classic) [What Every Computer Scientist Should Know About Floating-Point Arithmetic](http://docs.sun.com/source/806-3568/ncg_goldberg.html). You and I ain't computer scientists, but we occasionally play one on telly.
Dirk Eddelbuettel
I've actually read big chunks of that before. I'm afraid the ideas got replaced with beer. I'd like to blame my kid for erasing my brain, but that's probably not realistic. It was beer.
JD Long
+2  A: 

The ascii=T in your call to serialize makes R do imprecise binary-decimal-binary conversions when serializing and unserializing, which causes the values to differ. If you remove ascii=T, you get exactly the same numbers back, because a binary representation is written out instead.

base64encode can encode raw vectors so it doesn't need ascii=T.

The binary representation used by serialize is architecture independent, so you can happily serialize on one machine and unserialize on another.
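A minimal sketch of the question's round trip with ascii=T dropped might look like the following. It assumes the caTools package from the question is installed; the tempfile() path and the use of vapply()/lapply() in place of the original loops are choices made for this illustration, not part of the original code.

```r
library(caTools)  # provides base64encode() and base64decode(), as in the question

set.seed(2)
randList <- list(rnorm(100), rnorm(200), rnorm(300))

fileName <- tempfile(fileext = ".txt")  # scratch file chosen for this sketch

# One line per vector: "<index>,<base64 of the binary serialization>".
# serialize(x, NULL) returns a raw vector, which base64encode() accepts.
linesOut <- vapply(seq_along(randList), function(i) {
  paste0(i, ",", base64encode(serialize(randList[[i]], NULL)))
}, character(1))
writeLines(linesOut, fileName)

# Reverse the process: split off the key, decode to raw, unserialize.
linesIn <- readLines(fileName)
parsedThing <- lapply(linesIn, function(line) {
  payload <- strsplit(line, ",", fixed = TRUE)[[1]][2]
  unserialize(base64decode(payload, "raw"))
})

identical(randList, parsedThing)
```

Because the doubles never pass through a decimal text form here, the final identical() check should agree with the claim above that the same numbers come back.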

Reference: http://cran.r-project.org/doc/manuals/R-ints.html#Serialization-Formats

Jyotirmoy Bhattacharya
Great answer! Thanks!
JD Long