Hi there,

I've been looking around for data on naming trends in the USA. I managed to get the top 1000 names for babies born in 2008. The data is formatted in this manner:

 male.name n.male female.name n.female
 Jacob 22272 Emma 18587
 Michael 20298 Isabella 18377
 Ethan 20004 Emily 17217
 Joshua 18924 Madison 16853
 Daniel 18717 Ava 16850
 Alexander 18423 Olivia 16845
 Anthony 18158 Sophia 15887
 William 18149 Abigail 14901
 Christopher 17783 Elizabeth 11815
 Matthew 17337 Chloe 11699

I want to get a data.frame with 2 variables, name and gender, with each name repeated as many times as its count. This can be done with a loop, but I consider that a rather inefficient way of solving the problem. I reckon that some reshape function will suit my needs.

Let's suppose that this tab-delimited data is saved in a data.frame named bnames. For the male names, the loop looks like this:

 tmp <- character()
 for (i in 1:nrow(bnames)) {
   tmp <- c(tmp, rep(bnames[i,1], bnames[i,2]))
 }

But I want to achieve this with a vector-based approach. Any suggestions?

+3  A: 

So one quick version would be to transform the data.frame and use the rbind() function to get what you want.

dataNEW <- data.frame(bnames[,1],c("m"), bnames[,c(2,3)], c("f"), bnames[,4])
colnames(dataNEW) <- c("name", "gender", "value", "name", "gender", "value")

This will give you:

          name gender value      name gender value
1        Jacob      m 22272      Emma      f 18587
2      Michael      m 20298  Isabella      f 18377
3        Ethan      m 20004     Emily      f 17217
4       Joshua      m 18924   Madison      f 16853
5       Daniel      m 18717       Ava      f 16850
6    Alexander      m 18423    Olivia      f 16845
7      Anthony      m 18158    Sophia      f 15887
8      William      m 18149   Abigail      f 14901
9  Christopher      m 17783 Elizabeth      f 11815
10     Matthew      m 17337     Chloe      f 11699

Now you can use rbind():

dataNGV <- rbind(dataNEW[1:3],dataNEW[4:6])

which leads to:

          name gender value
1        Jacob      m 22272
2      Michael      m 20298
3        Ethan      m 20004
4       Joshua      m 18924
5       Daniel      m 18717
6    Alexander      m 18423
7      Anthony      m 18158
8      William      m 18149
9  Christopher      m 17783
10     Matthew      m 17337
11        Emma      f 18587
12    Isabella      f 18377
13       Emily      f 17217
14     Madison      f 16853
15         Ava      f 16850
16      Olivia      f 16845
17      Sophia      f 15887
18     Abigail      f 14901
19   Elizabeth      f 11815
20       Chloe      f 11699
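
The same stacking step can also be written with explicit column names instead of column positions. This is only a minimal sketch: it assumes `bnames` has the four columns shown in the question, and the three-row `bnames` below is a shortened sample used purely for illustration (`males` and `females` are hypothetical helper names).

```r
# Shortened sample of the question's data (first three rows only)
bnames <- data.frame(
  male.name   = c("Jacob", "Michael", "Ethan"),
  n.male      = c(22272, 20298, 20004),
  female.name = c("Emma", "Isabella", "Emily"),
  n.female    = c(18587, 18377, 17217),
  stringsAsFactors = FALSE
)

# Build one block per gender with identical column names ...
males <- data.frame(name = bnames$male.name, gender = "m",
                    value = bnames$n.male, stringsAsFactors = FALSE)
females <- data.frame(name = bnames$female.name, gender = "f",
                      value = bnames$n.female, stringsAsFactors = FALSE)

# ... then stack them; rbind() matches the columns by name
dataNGV <- rbind(males, females)
```

Naming the columns up front means the result does not depend on the order of the columns in `bnames`.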
mropa
+3  A: 

I think (if I have understood correctly) that mropa's solution needs one more step to get what you want:

library(plyr)
data <- ddply(dataNGV, .(name,gender), 
      function(x) data.frame(name=rep(x[,1],x[,3]),gender=rep(x[,2],x[,3])))
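
If you would rather not load `plyr`, the expansion step can probably be done in base R by repeating row indices. This is a sketch rather than a drop-in replacement: the toy `dataNGV` below uses small made-up counts so the result is easy to inspect, and `idx` and `raw` are hypothetical names.

```r
# Toy version of dataNGV with small, made-up counts
dataNGV <- data.frame(
  name   = c("Jacob", "Emma"),
  gender = c("m", "f"),
  value  = c(3, 2),
  stringsAsFactors = FALSE
)

# Repeat each row index `value` times, then keep only name and gender
idx <- rep(seq_len(nrow(dataNGV)), times = dataNGV$value)
raw <- dataNGV[idx, c("name", "gender")]

# Row names come out as 1, 1.1, 1.2, ...; reset them to 1..n
rownames(raw) <- NULL
```

Unlike `ddply`, this keeps the rows in their original order instead of sorting them by group.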
gd047
That's right, I need only two variables; that is, I want to drop the frequencies and get only the "raw" data. Thanks for this one. That's just what I've been looking for!
aL3xa
+2  A: 

A direct vector-based solution (replacing the loop) would be:

# your data:
bnames <- read.table(textConnection(
"male.name n.male female.name n.female
Jacob 22272 Emma 18587
Michael 20298 Isabella 18377
Ethan 20004 Emily 17217
Joshua 18924 Madison 16853
Daniel 18717 Ava 16850
Alexander 18423 Olivia 16845
Anthony 18158 Sophia 15887
William 18149 Abigail 14901
Christopher 17783 Elizabeth 11815
Matthew 17337 Chloe 11699
"), sep=" ", header=TRUE, stringsAsFactors=FALSE)

# how to avoid loop
bnames$male.name[ rep(1:nrow(bnames), times=bnames$n.male) ]

This relies on the fact that rep is vectorized: it can do in one call what you do in the loop.
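
To make that concrete, here is a small sketch comparing the loop with a single `rep` call. The vectors `x` and `n` are made-up toy values, not the real data.

```r
# Toy input: two names and how often each should be repeated
x <- c("Jacob", "Emma")
n <- c(3, 2)

# Loop version, growing the result one name at a time
by_loop <- character()
for (i in seq_along(x)) {
  by_loop <- c(by_loop, rep(x[i], n[i]))
}

# Vectorized version: `times` takes one count per element of x
by_rep <- rep(x, times = n)

# Both approaches give the same character vector
same <- identical(by_loop, by_rep)
```

The vectorized call also avoids growing `by_loop` inside the loop, which means copying the vector on every iteration.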

But for the final result you should combine mropa's and gd047's answers.

Or with my solution:

data_final <- data.frame(
  name = c(
    bnames$male.name[ rep(1:nrow(bnames), times=bnames$n.male) ],
    bnames$female.name[ rep(1:nrow(bnames), times=bnames$n.female) ]
  ),
  gender = rep(
    c("m", "f"),
    times = c(sum(bnames$n.male), sum(bnames$n.female))
  ),
  stringsAsFactors = FALSE
)

[EDIT] Simplify:

data_final <- data.frame(
  name = rep(
    c(bnames$male.name, bnames$female.name),
    times = c(bnames$n.male, bnames$n.female)
  ),
  gender = rep(
    c("m", "f"),
    times = c(sum(bnames$n.male), sum(bnames$n.female))
  ),
  stringsAsFactors = FALSE
)
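
A quick way to convince yourself that `name` and `gender` come out the same length is to run the simplified version on a tiny input. This is just a sanity-check sketch; the two-row `bnames` below uses made-up counts.

```r
# Tiny stand-in for bnames with made-up counts
bnames <- data.frame(
  male.name   = c("Jacob", "Michael"),
  n.male      = c(3, 1),
  female.name = c("Emma", "Isabella"),
  n.female    = c(2, 2),
  stringsAsFactors = FALSE
)

data_final <- data.frame(
  name = rep(
    c(bnames$male.name, bnames$female.name),
    times = c(bnames$n.male, bnames$n.female)
  ),
  gender = rep(
    c("m", "f"),
    times = c(sum(bnames$n.male), sum(bnames$n.female))
  ),
  stringsAsFactors = FALSE
)

# Both columns have one entry per baby, so the row count is the total count
total <- sum(bnames$n.male) + sum(bnames$n.female)
stopifnot(nrow(data_final) == total)

# Per-gender counts should match the column sums
counts <- table(data_final$gender)
```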
Marek
Actually, this approach will throw an error, 'cause the lengths of the `name` and `gender` variables differ. This seems to be the problem in your code:

### error lies here, I've suppressed code prior to error
name = c(bnames$male.name[...
### should be replaced with
data_final <- data.frame(
  name = c(rep(bnames$male.names, bnames$n.male), rep(bnames$female.names, bnames$n.female)),
  gender = c(rep('m', bnames$n.male), rep('f', bnames$n.female)),
  stringsAsFactors=FALSE)

I think that running `ddply`, as @gd047 suggested, is a better solution!
aL3xa
Marek
I apologize for my previous comment. Your code does indeed work. I had renamed the columns (a potential cause of the error); when I copied and pasted your code into the R interpreter, it showed no errors. Thanks a lot, and sorry once again!
aL3xa
+1  A: 

Alternatively, download the full (cleaned up) baby names dataset from http://github.com/hadley/data-baby-names.

hadley
To be honest, I came across this dataset in one of the (awesome) `plyr` tutorial slides available on your site. I wanted to use this method to analyse naming trends in Serbia. I know that the dataset is already available on GitHub, but I wanted to do this just for practice... Thanks for replying, though!
aL3xa