I have a large data.frame that displays some weird properties when plotted. I'd like to ask a question about it on Stack Overflow. To do that, I'd like to write the data.frame out in a form that I can paste into SO, so that somebody else can easily run it and get a data.frame object back. Is there an easy way to accomplish this? Also, if it is really long, should I use a pastebin instead of pasting it directly here?
To answer your question directly, the easiest thing to do would be to use summary() or head() to display information about the data frame. I would suggest not pasting the actual data into an SO question, but rather providing a public link to the data for the community to play with. If you have not seen it, the box.net service provides a lot of free space for online collaboration.
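As a rough illustration of the summary()/head() suggestion, here is a minimal sketch. The data frame df is a hypothetical stand-in (the built-in mtcars), not the asker's data:

```r
# 'df' is a hypothetical stand-in for the real data frame;
# mtcars ships with R, so anyone can run this as-is.
df <- mtcars

first_rows  <- head(df, 5)   # first five rows, all columns
col_summary <- summary(df)   # per-column min/quartiles/mean/max

print(first_rows)
print(col_summary)
str(df)                      # compact view of column types and sizes
```

Pasting that output into a question shows the shape of the data without forcing readers to load all of it.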
Finally, if the data is exhibiting odd behavior when plotted, why not provide the code you are using to do the plots, along with some example plots?
This is an excellent question.
Here's my attempt at an answer, in the form of recommendations for asking better questions with respect to presenting the data that accompanies the question. I've probably violated every one of the suggestions below, but at least I've got something to refer to in the future, and perhaps it's useful for others as well.
First, I suspect that anyone who asks a question prefers an answer
with enough abstraction that in the future they can solve the general class of problems to which the current problem belongs; and
with enough practical guidance (usually this means actual R code) to solve the problem that's in front of them right now.
Again: abstraction in your question (usually) results in abstraction in the answer. That makes the answer more useful, and it also increases the likelihood that you'll get an acceptable answer at all: it's unlikely that the community has seen that exact data set before, but it's far more likely that someone here will recognize a pattern. That pattern can be obscured by too much data.
Second, the amount of data that's needed to adequately explain a question is not really what matters. What matters is how long it takes the people attempting to answer the question to get that data into their R environment. There are data sets provided in the base R distribution that are 50,000 rows; that doesn't matter, because I can get the data into R in a few keystrokes. What's more, if you can refer to one of those data sets, then you don't have to bother cutting and pasting stuff inside the question window. By contrast, I really try to avoid forcing people to scrape even a few lines of data off the SO page just so they can properly understand my question (except for Dirk, he does the calculations in his head).
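For instance, a question can lean on a data set that already ships with R. This is only a sketch: mtcars and the columns picked here are illustrative choices, not anything from the original question.

```r
# Names of the data sets bundled with the 'datasets' package.
available <- data(package = "datasets")$results[, "Item"]

# Built-in data sets are reachable by name: no download, no pasting.
# Here: the four-cylinder cars, keeping only two columns of interest.
four_cyl <- mtcars[mtcars$cyl == 4, c("mpg", "wt")]
print(four_cyl)
```

Anyone reading the question can reproduce four_cyl exactly by running those lines.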
Third, cutting and pasting the entire width of the data set (all of the columns) into a question, unless it's absolutely required, is just lazy. The data is rarely a substitute for a concise problem description. I would prefer that OPs spend a minute or two trimming their actual data set so that they provide no more data than is actually required to illustrate the question.
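Trimming usually takes one line of indexing. A minimal sketch, assuming a hypothetical wide data frame called full in which only x and y matter for the plot:

```r
# 'full' is a made-up stand-in for a large, wide data frame.
set.seed(42)
full <- data.frame(
  id   = 1:1000,
  x    = rnorm(1000),
  y    = rnorm(1000),
  note = "not needed for the plot",
  stringsAsFactors = FALSE
)

# Keep only the columns the question is about, and a small slice of rows.
trimmed <- full[1:20, c("x", "y")]
print(dim(trimmed))
```

Which rows and columns to keep depends on where the odd behavior shows up; the first 20 rows here are just a placeholder choice.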
Fourth, if the data can be 'provided' by a formula or algorithm, then just provide that. E.g., if a question relates to a random walk, we don't need the data, just say "random walk" and nearly everyone here will be able to generate the data in a short line of code.
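As a sketch of that random walk case (the step distribution, length, and seed are my own assumptions, picked only for illustration):

```r
# A simple symmetric random walk: 100 steps of +1 or -1.
set.seed(123)                      # fixed seed so everyone gets the same walk
steps <- sample(c(-1, 1), size = 100, replace = TRUE)
walk  <- cumsum(steps)             # running position after each step
print(head(walk))
```

Posting the seed together with the generating code gives every reader identical data without pasting a single number.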
First, Drew's ideas are very good.
In addition, once you have reduced the data and isolated the "weird" part, use dput(). That's the most straightforward way to allow others to load it, although you need to reduce your data to a reasonable amount first.
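A minimal sketch of the dput() round trip, using a small slice of the built-in mtcars as a stand-in for the reduced data:

```r
# 'small' stands in for the already-reduced data frame.
small <- head(mtcars[, c("mpg", "cyl")], 3)

# dput() prints an R expression that rebuilds the object;
# that printed text is what gets pasted into the question.
dput(small)

# What a reader does on the other end: evaluate the pasted text.
pasted   <- paste(capture.output(dput(small)), collapse = "\n")
restored <- eval(parse(text = pasted))
print(all.equal(small, restored))
```

In practice the reader would just write restored <- followed by the pasted structure(...) text; the capture.output() step above only simulates the copy and paste.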
Otherwise, post it as a CSV file in a location that's accessible over HTTP, and people can read it directly with read.csv. That said, it's unreasonable to ask people to help you with a very large dataset.
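A sketch of the CSV route. A temporary local file stands in for the hosted one here so the example runs offline; read.csv also accepts an http:// URL string as its first argument, so a reader would pass the real link instead of csv_path.

```r
# Small numeric slice of a built-in data set, standing in for the real data.
small_df <- iris[1:10, 1:4]

# Write it out; in practice this file would be uploaded somewhere public.
csv_path <- tempfile(fileext = ".csv")
write.csv(small_df, csv_path, row.names = FALSE)

# What a reader runs (with the public URL in place of csv_path).
from_csv <- read.csv(csv_path)
print(from_csv)
```

Setting row.names = FALSE keeps write.csv from adding an extra unnamed column of row labels.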
Lastly, look at the answers to this question: http://stackoverflow.com/questions/1434897/how-do-i-load-example-datasets-in-r/1434927