I want to split a data frame into several smaller ones. This looks like a trivial question, but I cannot find a solution through web search.

Can anyone help?

Also, can you recommend a simple R package for experimental design or surveys?

Many thanks.

Leo

A: 

The key is that you can subset a data frame using vector indices. A quick example to get you thinking about what exactly you want to learn how to do:

> A <- data.frame(matrix(1:16, nrow=4, ncol=4))
> A
  X1 X2 X3 X4
1  1  5  9 13
2  2  6 10 14
3  3  7 11 15
4  4  8 12 16
> left.half <- c(1, 2)
> right.half <- c(3, 4)
> A.lh <- A[ , left.half]
> A.rh <- A[ , right.half]
> A.lh
  X1 X2
1  1  5
2  2  6
3  3  7
4  4  8
> A.rh
  X3 X4
1  9 13
2 10 14
3 11 15
4 12 16

A quick search at rseek.org found this, which is a great tutorial:

http://www.r-bloggers.com/select-operations-on-r-data-frames/

Also, Grant Farnsworth has a great guide to econometrics in R that I find helpful. You can find that with rseek.org's search engine too.

richardh
+2  A: 

If you want to split a dataframe according to values of some variable, I'd suggest using daply() from the plyr package.

library(plyr)
x <- daply(df, .(splitting_variable), function(x)return(x))

Now, x is an array of dataframes. To access one of the dataframes, you can index it with the name of the level of the splitting variable.

x$Level1
#or
x[["Level1"]]

I'd be sure that there aren't other more clever ways to deal with your data before splitting it up into many dataframes though.
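
For comparison, base R's split() can do the same kind of grouping without loading a package. This is only a minimal sketch: the data frame `df` and its column `group` are made up for illustration and stand in for your own data and splitting variable.

```r
# Hypothetical data; `group` plays the role of splitting_variable
df <- data.frame(group = c("a", "b", "a", "c", "b"), value = 1:5)

# split() returns a named list with one data frame per level of `group`
pieces <- split(df, df$group)

names(pieces)    # "a" "b" "c"
pieces[["a"]]    # the rows of df where group == "a"
```

Because the result is an ordinary named list, both `pieces$a` and `pieces[["a"]]` retrieve a piece.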

JoFrhwld
Please state up front which package a non-base function comes from. Presumably you mean daply from the plyr package?
mdsumner
I loaded plyr in my code snippet, so I thought it was clear, but I'll edit the answer prose for clarity.
JoFrhwld
Don't you mean `dlply`?
hadley
I suggested `dlply` first, but it didn't automatically name the entries by the grouping variable. I don't know what I did at first, but apparently `daply` doesn't work unless a function is specified. I edited the answer so that it works.
JoFrhwld
+1  A: 

subset() is also useful

subset(DATAFRAME, COLUMNNAME == "")
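
To make the template above concrete, here is a small sketch with made-up data. The names `df`, `region` and `sales` are assumptions for illustration only.

```r
# Hypothetical data frame
df <- data.frame(region = c("north", "south", "north"),
                 sales  = c(10, 20, 30))

# Keep only the rows where region equals "north"
north <- subset(df, region == "north")

# The optional `select` argument also restricts the columns
north_sales <- subset(df, region == "north", select = sales)
```

Note that subset() evaluates the condition inside the data frame, so the column is written as `region` rather than `df$region`.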

For a survey package, maybe the "survey" package is pertinent?

http://faculty.washington.edu/tlumley/survey/

apeescape
+3  A: 

You may also want to cut the data frame into an arbitrary number of smaller dataframes. Here, we cut into two dataframes.

x = data.frame(num = 1:26, let = letters, LET = LETTERS)
set.seed(10)
split(x, sample(rep(1:2, 13)))

gives

$`1`
   num let LET
3    3   c   C
6    6   f   F
10  10   j   J
12  12   l   L
14  14   n   N
15  15   o   O
17  17   q   Q
18  18   r   R
20  20   t   T
21  21   u   U
22  22   v   V
23  23   w   W
26  26   z   Z

$`2`
   num let LET
1    1   a   A
2    2   b   B
4    4   d   D
5    5   e   E
7    7   g   G
8    8   h   H
9    9   i   I
11  11   k   K
13  13   m   M
16  16   p   P
19  19   s   S
24  24   x   X
25  25   y   Y
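
The same idea can be extended to more than two pieces. The following is a sketch rather than a definitive recipe: `k` is a hypothetical chunk count, and rep_len() is used so that the number of rows does not have to be a multiple of `k`.

```r
x <- data.frame(num = 1:26, let = letters, LET = LETTERS)
k <- 4  # hypothetical number of chunks

set.seed(10)
# rep_len() recycles the labels 1..k up to nrow(x); sample() shuffles them
labels <- sample(rep_len(seq_len(k), nrow(x)))
chunks <- split(x, labels)

# Chunk sizes differ by at most one row
sapply(chunks, nrow)
```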
Greg
Greg, your solution works! Thanks.
Leo5188
No problem. I'm glad it did.
Greg
A: 

The answer you want depends very much on how and why you want to break up the data frame.

For example, if you want to leave out some variables, you can create new data frames from specific columns of the original data frame. The subscripts in brackets after the data frame refer to row and column numbers. Check out S Poetry for a complete description.

newdf <- mydf[,1:3]

Or, you can choose specific rows.

newdf <- mydf[1:3,]

And these subscripts can also be logical tests, such as choosing rows that contain a particular value, or factors with a desired value.
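
A short sketch of logical subscripts, using a made-up data frame (`mydf` and its columns `id`, `grp` and `score` are assumptions for illustration):

```r
# Hypothetical data frame with a factor column
mydf <- data.frame(id    = 1:6,
                   grp   = factor(c("a", "b", "a", "c", "b", "a")),
                   score = c(5, 8, 2, 9, 4, 7))

# Rows where a numeric test is TRUE
high <- mydf[mydf$score > 4, ]

# Rows whose factor level is one of a desired set
ab <- mydf[mydf$grp %in% c("a", "b"), ]
```

The expression before the comma is a logical vector with one entry per row, and only the rows marked TRUE are kept.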

What do you want to do with the chunks left over? Do you need to perform the same operation on each chunk of the data frame? Then you'll want to ensure that the subsets end up in a convenient object, such as a list, that lets you run the same command on each chunk.
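
As one possible illustration of the list approach, the sketch below splits a made-up data frame and applies the same function to every piece. The names `mydf`, `grp` and `score` are hypothetical.

```r
# Hypothetical data frame
mydf <- data.frame(grp = c("a", "b", "a", "b"), score = c(1, 2, 3, 4))

# One data frame per group, stored in a named list
chunks <- split(mydf, mydf$grp)

# Apply the same operation to each chunk; sapply() simplifies to a named vector
means <- sapply(chunks, function(d) mean(d$score))
means    # a = 2, b = 3
```

Use lapply() instead of sapply() if each result is itself a data frame or model object and you want to keep a list.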

Ben M
A: 

I just posted something of an RFC that might help you: http://stackoverflow.com/questions/3318333/split-a-vector-into-chunks-in-r

x = data.frame(num = 1:26, let = letters, LET = LETTERS)
## number of chunks
n <- 2
dfchunk <- split(x, factor(sort(rank(row.names(x))%%n)))
dfchunk
$`0`
   num let LET
1    1   a   A
2    2   b   B
3    3   c   C
4    4   d   D
5    5   e   E
6    6   f   F
7    7   g   G
8    8   h   H
9    9   i   I
10  10   j   J
11  11   k   K
12  12   l   L
13  13   m   M

$`1`
   num let LET
14  14   n   N
15  15   o   O
16  16   p   P
17  17   q   Q
18  18   r   R
19  19   s   S
20  20   t   T
21  21   u   U
22  22   v   V
23  23   w   W
24  24   x   X
25  25   y   Y
26  26   z   Z
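
If you would rather fix the size of each chunk than the number of chunks, one alternative sketch is to build the grouping index with ceiling(). Here `chunk_size` is a hypothetical parameter, not something from the code above.

```r
x <- data.frame(num = 1:26, let = letters, LET = LETTERS)
chunk_size <- 10  # hypothetical rows per chunk

# Rows 1-10 get index 1, rows 11-20 get index 2, and so on
idx <- ceiling(seq_len(nrow(x)) / chunk_size)
dfchunk <- split(x, idx)

sapply(dfchunk, nrow)    # 10 10 6
```

The last chunk simply holds whatever rows are left over.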

Cheers, Sebastian

Sebastian