ansaurus

Question

Answer 1

+3 A:

One quick approach is to reshape the data.frame so that the male and female halves share the same column names, and then stack the two halves with the rbind() function.

dataNEW <- data.frame(bnames[,1],c("m"), bnames[,c(2,3)], c("f"), bnames[,4])
colnames(dataNEW) <- c("name", "gender", "value", "name", "gender", "value")

This will give you:

          name gender value      name gender value
1        Jacob      m 22272      Emma      f 18587
2      Michael      m 20298  Isabella      f 18377
3        Ethan      m 20004     Emily      f 17217
4       Joshua      m 18924   Madison      f 16853
5       Daniel      m 18717       Ava      f 16850
6    Alexander      m 18423    Olivia      f 16845
7      Anthony      m 18158    Sophia      f 15887
8      William      m 18149   Abigail      f 14901
9  Christopher      m 17783 Elizabeth      f 11815
10     Matthew      m 17337     Chloe      f 11699

Now you can use rbind():

dataNGV <- rbind(dataNEW[1:3],dataNEW[4:6])

which leads to:

          name gender value
1        Jacob      m 22272
2      Michael      m 20298
3        Ethan      m 20004
4       Joshua      m 18924
5       Daniel      m 18717
6    Alexander      m 18423
7      Anthony      m 18158
8      William      m 18149
9  Christopher      m 17783
10     Matthew      m 17337
11        Emma      f 18587
12    Isabella      f 18377
13       Emily      f 17217
14     Madison      f 16853
15         Ava      f 16850
16      Olivia      f 16845
17      Sophia      f 15887
18     Abigail      f 14901
19   Elizabeth      f 11815
20       Chloe      f 11699
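As a variation on the same idea, the two halves can be built as separate data frames with identical column names, which avoids having duplicated column names in an intermediate table. This is only a sketch: it uses the first three rows of the question's data, and the names `males` and `females` are made up for illustration.

```r
# First three rows of the table from the question, for brevity.
bnames <- data.frame(
  male.name   = c("Jacob", "Michael", "Ethan"),
  n.male      = c(22272, 20298, 20004),
  female.name = c("Emma", "Isabella", "Emily"),
  n.female    = c(18587, 18377, 17217),
  stringsAsFactors = FALSE
)

# One block per gender, each with the columns name / gender / value.
# `males` and `females` are hypothetical helper names.
males <- data.frame(name = bnames$male.name, gender = "m",
                    value = bnames$n.male, stringsAsFactors = FALSE)
females <- data.frame(name = bnames$female.name, gender = "f",
                      value = bnames$n.female, stringsAsFactors = FALSE)

# Stack the two blocks; rbind() matches the columns by name.
dataNGV <- rbind(males, females)
```

The result should have the same long layout as the table above, with one row per name and gender.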

mropa 2010-02-03 07:10:20

Answer 2

+3 A:

I think (if I have understood correctly) that mropa's solution needs one more step to get what you want:

library(plyr)
data <- ddply(dataNGV, .(name,gender), 
      function(x) data.frame(name=rep(x[,1],x[,3]),gender=rep(x[,2],x[,3])))
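If you would rather not load plyr, the same expansion can probably be done in base R by repeating row indices. A minimal sketch, assuming a table shaped like dataNGV (only two names per gender here); `idx` and `raw` are names I made up:

```r
# Long table in the shape of dataNGV (two names per gender only).
dataNGV <- data.frame(
  name   = c("Jacob", "Michael", "Emma", "Isabella"),
  gender = c("m", "m", "f", "f"),
  value  = c(22272, 20298, 18587, 18377),
  stringsAsFactors = FALSE
)

# Repeat each row index `value` times, then keep only name and gender.
idx <- rep(seq_len(nrow(dataNGV)), times = dataNGV$value)
raw <- dataNGV[idx, c("name", "gender")]
rownames(raw) <- NULL  # drop the row labels left over from indexing
```

Here the rows stay in the order of dataNGV, so the row order may differ from the ddply result, but the contents should be the same.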

gd047 2010-02-03 08:19:45

That's right, I need only two variables, that is, I want to lose frequencies and get only "raw" data. Thanks for this one. That's just what I've been looking for!

aL3xa 2010-02-03 17:42:52

Answer 3

+2 A:

A direct vector-based solution (one that replaces the loop) is:

# your data:
bnames <- read.table(textConnection(
"male.name n.male female.name n.female
Jacob 22272 Emma 18587
Michael 20298 Isabella 18377
Ethan 20004 Emily 17217
Joshua 18924 Madison 16853
Daniel 18717 Ava 16850
Alexander 18423 Olivia 16845
Anthony 18158 Sophia 15887
William 18149 Abigail 14901
Christopher 17783 Elizabeth 11815
Matthew 17337 Chloe 11699
"), sep=" ", header=TRUE, stringsAsFactors=FALSE)

# how to avoid the loop
bnames$male.name[ rep(1:nrow(bnames), times=bnames$n.male) ]

It relies on the fact that rep() can do in a single call what you are doing in the loop.
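To see what rep() does with a vector of counts, here is a tiny illustration. The counts are hypothetical toy values (not the real frequencies), chosen so the result is easy to read:

```r
# Hypothetical toy counts, kept tiny so the result is easy to inspect.
toy_names  <- c("Jacob", "Michael", "Ethan")
toy_counts <- c(2, 1, 3)

# With a vector `times`, each element is repeated by its own count.
expanded <- rep(toy_names, times = toy_counts)

# expanded is now c("Jacob", "Jacob", "Michael", "Ethan", "Ethan", "Ethan")
```

Indexing with `rep(1:nrow(bnames), times = ...)` as in the line above works the same way, only on row positions instead of on the names themselves.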

For the final result, though, you should combine mropa's and gd047's answers.

Or, using my solution:

data_final <- data.frame(
  name = c(
    bnames$male.name[ rep(1:nrow(bnames), times=bnames$n.male) ],
    bnames$female.name[ rep(1:nrow(bnames), times=bnames$n.female) ]
  ),
  gender = rep(
    c("m", "f"),
    times = c(sum(bnames$n.male), sum(bnames$n.female))
  ),
  stringsAsFactors = FALSE
)

[EDIT] Simplified version:

data_final <- data.frame(
  name = rep(
    c(bnames$male.name, bnames$female.name),
    times = c(bnames$n.male, bnames$n.female)
  ),
  gender = rep(
    c("m", "f"),
    times = c(sum(bnames$n.male), sum(bnames$n.female))
  ),
  stringsAsFactors = FALSE
)
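One way to convince yourself that the expansion is right is to count the raw rows again with table() and compare against the original frequencies. A sketch on the first three rows of the data; `counts` is a name introduced here:

```r
# First three rows of the table from the question, for brevity.
bnames <- data.frame(
  male.name   = c("Jacob", "Michael", "Ethan"),
  n.male      = c(22272, 20298, 20004),
  female.name = c("Emma", "Isabella", "Emily"),
  n.female    = c(18587, 18377, 17217),
  stringsAsFactors = FALSE
)

# Same construction as the simplified version above.
data_final <- data.frame(
  name = rep(
    c(bnames$male.name, bnames$female.name),
    times = c(bnames$n.male, bnames$n.female)
  ),
  gender = rep(
    c("m", "f"),
    times = c(sum(bnames$n.male), sum(bnames$n.female))
  ),
  stringsAsFactors = FALSE
)

# Counting the raw rows should give back the original frequencies.
counts <- table(data_final$name)
stopifnot(
  nrow(data_final) == sum(bnames$n.male) + sum(bnames$n.female),
  all(counts[bnames$male.name] == bnames$n.male),
  all(counts[bnames$female.name] == bnames$n.female)
)
```

If the lengths of `name` and `gender` ever disagreed, data.frame() would already stop with an error, so this check is mainly about the counts per name.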

Marek 2010-02-03 10:21:47

Actually, this approach will throw an error, because the lengths of the `name` and `gender` variables differ. This seems to be the problem in your code:

### error lies here, I've suppressed code prior to error
name = c(bnames$male.name[...

### should be replaced with
data_final <- data.frame(
  name = c(rep(bnames$male.names, bnames$n.male), rep(bnames$female.names, bnames$n.female)),
  gender = c(rep('m', bnames$n.male), rep('f', bnames$n.female)),
  stringsAsFactors=FALSE)

I think that running `ddply`, as @gd047 suggested, is a better solution!

aL3xa 2010-02-03 17:39:22

Marek 2010-02-03 21:23:10

I apologize for my previous comment. Your code does indeed work. I've renamed the column names (a potential cause of the error) and copy/pasted the code into the R interpreter... it shows no errors. Thanks a lot, and sorry once again!

aL3xa 2010-02-04 00:49:41

Answer 4

+1 A:

Alternatively, download the full (cleaned up) baby names dataset from http://github.com/hadley/data-baby-names.

hadley 2010-02-03 14:38:54

To be honest, I became acquainted with this dataset through one of the (awesome) `plyr` tutorial slides available on your site. I wanted to use this method to analyse naming trends in Serbia. I know that a dataset is already available on github, but I wanted to do this just for practice... Thanks for replying, though!

aL3xa 2010-02-03 20:33:58


Getting "raw" data from frequency table
