Dear Stackers,

I have to merge two data frames in R. The two data frames share a common id variable, the name of the subject. However, the names in one data frame are partly capitalized, while in the other they are in lowercase. Furthermore, the parts of each name appear in reverse order. Here is a sample from the data frames:

DataFrame1$Name:
"Van Brempt Kathleen"
"Gräßle Ingeborg"
"Gauzès Jean-Paul"
"Winkler Iuliu" 

DataFrame2$Name:
"Kathleen VAN BREMPT" 
"Ingeborg GRÄSSLE"
"Jean-Paul GAUZÈS"
"Iuliu WINKLER"

Is there a way in R to make these two variables usable as an identifier for merging the data frames?

Best, Thomas

A: 

You could add an additional column to each data frame that holds a lowercase version of the original name:

DataFrame1$NameLower <- tolower(DataFrame1$Name)
DataFrame2$NameLower <- tolower(DataFrame2$Name)

Then perform a merge on this:

MergedDataFrame <- merge(DataFrame1, DataFrame2, by="NameLower")
Joel
Sorry, I missed the bit about the reversed order too. You'll have to write a custom function that moves all capitalized text to the beginning of a name and lowercases it.
Joel
Thanks Joel, do you have an idea of how to write such a function?
Thomas Jensen
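
A minimal sketch of the kind of custom function described in the comment above, assuming that surnames are exactly the words written entirely in capitals. The name reorder_name is hypothetical and not from the thread, and a capitalized initial or title (such as "J." or "DR") would be treated as part of the surname.

```r
# Hypothetical helper: move all-caps words to the front, then lowercase.
reorder_name <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE), function(words) {
    # A word counts as "capitalized" if uppercasing leaves it unchanged
    is_upper <- words == toupper(words)
    tolower(paste(c(words[is_upper], words[!is_upper]), collapse = " "))
  }, character(1))
}

reordered <- reorder_name(c("Kathleen VAN BREMPT", "Iuliu WINKLER"))
print(reordered)
```

Because the split is by word rather than at the first space, this also copes with two given names separated by a space, as long as only the surname is in capitals.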
+3  A: 

You can use gsub to convert the names around:

> names
[1] "Kathleen VAN BREMPT" "jean-paul GAULTIER" 
> gsub("([^\\s]*)\\s(.*)","\\2 \\1",names,perl=TRUE)
[1] "VAN BREMPT Kathleen" "GAULTIER jean-paul" 
> 

This works by capturing everything up to the first space and then everything after it, and swapping the two groups around. Then add tolower() or toupper() if you want, and use match() for joining your data frames.
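
Put together, the gsub, tolower and match steps might look like the sketch below. The data frames df1 and df2 and the columns key and Name2 are made up for illustration, using two of the names from the question.

```r
# Illustrative data: same people, different order and different name format
df1 <- data.frame(Name = c("Van Brempt Kathleen", "Winkler Iuliu"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Name = c("Iuliu WINKLER", "Kathleen VAN BREMPT"),
                  stringsAsFactors = FALSE)

# Move the first word to the end, then lowercase both sides
df2$key <- tolower(gsub("([^\\s]*)\\s(.*)", "\\2 \\1", df2$Name, perl = TRUE))
df1$key <- tolower(df1$Name)

# For each row of df1, the position of the matching row in df2 (NA if none)
idx <- match(df1$key, df2$key)
df1$Name2 <- df2$Name[idx]
print(df1)
```

Rows without an exact match after normalization get NA in idx, which makes it easy to list the names that still need attention by hand.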

Good luck matching Grassle with Graßle though. Lots of other things will probably bite you too, such as people with two first names separated by space, or someone listed with a title!
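
One possible way to narrow the ß versus SS gap before matching is to rewrite ß as ss and then lowercase. This is only a sketch: the helper name normalize_name is hypothetical, and it assumes the strings are UTF-8 and that tolower() in the current locale handles accented letters such as Ä.

```r
# Hypothetical helper: build a comparison key from a name.
# "\u00df" is the letter ß, written as an escape so the pattern
# does not depend on the encoding of the script file.
normalize_name <- function(x) {
  tolower(gsub("\u00df", "ss", x, fixed = TRUE))
}

key1 <- normalize_name("Gräßle Ingeborg")
key2 <- normalize_name("GRÄSSLE Ingeborg")

# Compare the two keys; whether Ä is lowercased depends on the locale
print(key1 == key2)
```

Other accented or special characters may need similar treatment, so this reduces the manual work rather than removing it.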

Barry

Spacedman
Thanks Barry, you are right that this is not unproblematic, and I will most likely have to do some stuff by hand, but hopefully this will reduce the work load a bit :)
Thomas Jensen
A: 

In addition to the answer using gsub to rearrange the names, you might also want to look at the agrep function, which looks for approximate matches. You can use it with sapply to find the rows in one data frame that match those in the other, e.g.:

> sapply( c('newyork', 'NEWJersey', 'Vormont'), agrep, x=state.name, ignore.case=TRUE )
  newyork NEWJersey   Vormont 
       32        30        45 
Greg Snow
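
Note that agrep can return no index or several for a single pattern, in which case sapply gives back a list rather than a vector. A small sketch of one way to guard against that, where first_agrep is a hypothetical wrapper (not part of base R) that keeps the first hit or NA:

```r
# Hypothetical wrapper: index of the first approximate match, or NA if none
first_agrep <- function(pattern, x, ...) {
  hits <- agrep(pattern, x, ignore.case = TRUE, ...)
  if (length(hits) == 0) NA_integer_ else hits[1]
}

queries <- c("newyork", "Vormont", "zzzzzzzz")

# vapply guarantees exactly one integer per query
idx <- vapply(queries, first_agrep, integer(1), x = state.name)
print(idx)
```

Taking the first hit is an arbitrary choice; with real name data it may be safer to inspect every pattern that returns more than one candidate.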
+1  A: 

Here's a complete solution that combines the two partial methods offered so far (and overcomes the fears expressed by Spacedman about "matching Grassle with Graßle"):

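# Move the first word (the given name) to the end of each name
DataFrame2$revname <- gsub("([^\\s]*)\\s(.*)", "\\2 \\1", DataFrame2$Name, perl = TRUE)
# For each reversed name, find the row of DataFrame1 that approximately matches
DataFrame2$agnum <- sapply(tolower(DataFrame2$revname), agrep, tolower(DataFrame1$Name))
# Row numbers of DataFrame1 serve as the merge key
DataFrame1$num <- 1:nrow(DataFrame1)
merge(DataFrame1, DataFrame2, by.x = "num", by.y = "agnum")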

Output:

  num              Name.x              Name.y             revname
1   1 Van Brempt Kathleen Kathleen VAN BREMPT VAN BREMPT Kathleen
2   2     Gräßle Ingeborg    Ingeborg GRÄSSLE    GRÄSSLE Ingeborg
3   3    Gauzès Jean-Paul    Jean-Paul GAUZÈS    GAUZÈS Jean-Paul
4   4       Winkler Iuliu       Iuliu WINKLER       WINKLER Iuliu

The third step would not be necessary if DataFrame1 had row names that were still sequentially numbered (as they would be by default). The merge statement would then be:

merge(DataFrame1, DataFrame2, by.x="row.names", by.y="agnum")

-- David.

DWin