views:

87

answers:

3

I'm trying to create an R API for StackOverflow. The output is gzipped. For example:

readLines("http://api.stackoverflow.com/0.9/stats/", warn = FALSE)
[1] "\037‹\b"                                                                                                                                                                                                                                                                                         
[2] "\030\002úØÛy°óé½\036„iµXäË–[<üt—Zu[\\VmÎHî=ÜÛݹ×ýz’Í.äûû÷>ý´\a\177Ýh÷\017îÝÛÙwßÚáÿþ«¼þý\027ÅrÝæÔlgüÀëA±\017›ìŽï{M¤û.\020\037�Ë\"¿’\006³ì\032„Úß9¸ÿ`¼ç÷³*~ÿKêˆð¡\006v¦ð²ýô£�ñÃ�ì+ôU�_\026滽�]êt¼·?ÞûÈ4ù%\016~S0^>àe¶ÀG\037½n³éÛôKê缬®‚\016Êê¢úý×u‰fó¶]=º{·aΚŽ—y{·©î\026‹‹»h5^-/‚W1 |9[UŲõ^§�Ç"
[3] ":¬´¿1M\177ð\"0íö¹ñ…YÞLëbÕ*!~â\027\036§çU�®êê¢ÎˆµhòýæÅ´Zn\036S¶Z•ùv[­§óm´î�"                                                                                                                                                                                                                      
[4] "Í™t˪^d¥£·üÂ?¾ÿ\033'¿$ù\177"  

Is there a good way to gunzip this in R, short of writing the output to file, gunzip'ing it, and reading it back in?

+6  A: 

You could do:

conn <- gzcon(url("http://api.stackoverflow.com/0.9/stats/"))
data <- readLines(conn)
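A connection wrapped by gzcon stays open until it is closed explicitly, so it may help to put the read and the close in one place. Here is a minimal sketch of that idea; `read_gz_lines` is a hypothetical helper name, not something from base R, and a temporary gzip file stands in for the API URL so the example runs offline.

```r
# Hypothetical helper: read all lines from a gzip-compressed connection
# and make sure the connection is closed afterwards, even on error.
read_gz_lines <- function(con) {
  gz <- gzcon(con)      # wrap the connection in a gzip decompressor
  on.exit(close(gz))    # closing the wrapper also closes the wrapped connection
  readLines(gz, warn = FALSE)
}

# Offline stand-in for the API: write two lines to a temporary gzip file.
tmp <- tempfile(fileext = ".gz")
out <- gzfile(tmp, "w")
writeLines(c("first line", "second line"), out)
close(out)

# Read the compressed bytes back through the helper.
lines <- read_gz_lines(file(tmp, "rb"))
unlink(tmp)
print(lines)
```

With the real API the call would presumably be `read_gz_lines(url("http://api.stackoverflow.com/0.9/stats/"))`, assuming the endpoint is still reachable.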
nico
Thanks! Don't forget to close the connection when you're finished.
Shane
Why is the double `readLines` needed? [mbq's answer](http://stackoverflow.com/questions/3128422/gunzip-a-file-stream-in-r/3128738#3128738) works too.
Marek
@Marek: corrected. That was just me trying different things and I must have pasted some extra command. Thanks for pointing that out.
nico
+3  A: 

Try:

p <- gzcon(url("http://api.stackoverflow.com/0.9/stats/"))
readLines(p)
mbq
+3  A: 

Ideally we should tell the server that we can handle gzipped content, find out from the HTTP headers whether the content is actually gzip encoded, and decompress only if it is. The RCurl package can do this:

library(RCurl)
# Advertise both plain and gzip encodings; curl decodes the response if needed
getURL("http://api.stackoverflow.com/0.9/stats/",
       .opts = list(encoding = "identity,gzip"))
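If the response headers cannot be relied on, one fallback is to inspect the body itself. This is a different technique from header negotiation: a gzip stream starts with the magic bytes 0x1f 0x8b (the "\037‹" at the start of the output in the question). The sketch below is only an illustration using base R; `is_gzip` and `decode_body` are hypothetical names, and the raw bytes come from a temporary file rather than from the API.

```r
# Hypothetical helper: does this raw vector start with the gzip magic bytes?
is_gzip <- function(bytes) {
  length(bytes) >= 2 && identical(bytes[1:2], as.raw(c(0x1f, 0x8b)))
}

# Hypothetical helper: decompress only when the body is actually gzipped.
decode_body <- function(bytes) {
  if (!is_gzip(bytes)) {
    return(rawToChar(bytes))           # plain body: pass through unchanged
  }
  con <- gzcon(rawConnection(bytes))   # decompress in memory, no temp file
  on.exit(close(con))
  paste(readLines(con, warn = FALSE), collapse = "\n")
}

# Build a gzipped sample body offline (stand-in for an API response).
tmp <- tempfile(fileext = ".gz")
out <- gzfile(tmp, "w")
writeLines('{"ok": true}', out)
close(out)
gz_bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
unlink(tmp)

from_gzip  <- decode_body(gz_bytes)
from_plain <- decode_body(charToRaw("plain text"))
```

Note that gzcon itself passes non-compressed input through when reading (its `allowNonCompressed` argument defaults to TRUE), so the explicit check mainly makes the decision visible.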
Jyotirmoy Bhattacharya
That would be the right way to do it, but be aware that the Stack Overflow API team has [decided against obeying the HTTP protocol](http://stackapps.com/questions/729) in this regard. On a slightly related note, we won't see [proper HTTP/1.1 cache control](http://stackapps.com/questions/1028) for the time being either ...
Steffen Opel