ansaurus

Question

Merging two Data Frames using Fuzzy/Approximate String Matching in R

Answer 1

+3 A:

One quick suggestion: try to do some matching on the different fields separately before using merge. The simplest approach is with the pmatch function, although R has no shortage of text matching functions (e.g. agrep). Here's a simple example:

pmatch(c("med", "mod"), c("mean", "median", "mode"))

For your dataset, this matches all the fund names out of a:

> nrow(merge(a,b,x.by="Fund.Name", y.by="Fund.name"))
[1] 58
> length(which(!is.na(pmatch(a$Fund.Name, b$Fund.name))))
[1] 238

Once you create matches, you can easily merge them together using those instead.

Shane 2010-02-09 20:50:52

Thanks Shane, your suggestions are always helpful - I'll take a look at these two and let you know how it worked out.

Brandon Bertelsen 2010-02-09 21:22:07

Yes definitely, great tips Shane.

Jay 2010-02-11 16:29:03

The problem is more when the names are dirtier then that. Say when the first word may or may not be short-formed and when the rest of the words may or may not be shortformed either. This shortforming is not consistent inside a single name or accross names.

Jay 2010-02-14 15:33:43

Answer 2

+1 A:

Hi Brandon,

I'm a Canada local as well, recognize the fund names.

This is a difficult one as each of the data providers picks their own form for the individual fund names. Some use different structure like all end in either Fund or Class others are all over the place. Each seems to choose their own short-forms as well and these change regularly.

That's why so many people like you are doing this by hand on a regular basis. Some of the consulting firms do list indexes to link various sources, not sure if you've explored that route?

As Shane and Marek pointed out this is a matching task more than a straight join. Many companies are struggling with this one. I'm in the middle of my work on this...

Jay

Jay 2010-02-09 21:11:23

It's frustrating. I'm really surprised that they don't have codes assigned to them. The nomenclature is not at all standardized and even the asset classes don't match from one supplier to the next for the same year, let alone the same month. If you're working on this, I'd love an opportunity to discuss it in more detail: brandon AT bertelsen dot ca is me.

Brandon Bertelsen 2010-02-09 21:19:44

Definitely is, some have codes but they are indentifiers only in their own systems... I would like the oppurtunity to discuss and collaborate as well. I will pop you off an email shortly.

Jay 2010-02-09 21:32:37

It seems I can only comment on my own answer....so I concur with jmoy one should contruct an index (id) to use to join the different sources. The key is creating these indices ;) some of the standard stuff works some doesn't. It's a hard one. Are there any experianced data integrators out there who perhaps know something some of us may not?Approximate matching and all surronding that are the most automated way I have found to begin to tackle problems like these.

Jay 2010-02-10 04:57:46

Answer 3

+1 A:

Approximate string matching is not a good idea since an incorrect match would invalidate the whole analysis. If the names from each source is the same each time, then building indexes seems the best option to me too. This is easily done in R:

Suppose you have the data:

a<-data.frame(name=c('Ace','Bayes'),price=c(10,13))
b<-data.frame(name=c('Ace Co.','Bayes Inc.'),qty=c(9,99))

Build an index of names for each source one time, perhaps using pmatch etc. as a starting point and then validating manually.

a.idx<-data.frame(name=c('Ace','Bayes'),idx=c(1,2))
b.idx<-data.frame(name=c('Ace Co.','Bayes Inc.'), idx=c(1,2))

Then for each run merge using:

a.rich<-merge(a,a.idx,by="name")
b.rich<-merge(b,b.idx,by="name")
merge(a.rich,b.rich,by="idx")

Which would give us:

  idx name.x price     name.y qty
1   1    Ace    10    Ace Co.   9
2   2  Bayes    13 Bayes Inc.  99

Jyotirmoy Bhattacharya 2010-02-10 04:13:38

The problem is more when the names are dirtier then that. Say when the first word may or may not be short-formed and when the rest of the words may or may not be shortformed either. This shortforming is not consistent inside a single name or accross names.

Jay 2010-02-14 15:32:49

ansaurus

tags:

views:

answers:

Merging two Data Frames using Fuzzy/Approximate String Matching in R

related questions