I have 20,000 company addresses on various documents, which are all formatted differently. For example:

  • Company A 12345 street US

  • CompanyA, Inc box2, 12345 street WA, US

  • The Company B company Ltd 123 happy street UK

  • company B, Ltd 123, happy street, london, S1 1AA

I'd like to combine the records for each company (i.e. separate the above into two categories, one per company).

I'm not sure how to approach this. I assume any clustering will be probabilistic in nature: it will probably work well for the easier matches, but the less likely or more uncertain matches will need manual review.
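As a rough illustration of what I mean by "probabilistic with manual review", here is a minimal sketch using only Python's standard `difflib`. The helper names (`normalize`, `similarity`, `triage`) and the two thresholds are my own assumptions for illustration, not tuned values: pairs scoring above `accept` would be merged automatically, pairs below `reject` would be discarded, and everything in between would go to a human.

```python
import re
from difflib import SequenceMatcher


def normalize(address: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]", " ", address.lower())
    return " ".join(text.split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized addresses."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def triage(a: str, b: str, accept: float = 0.85, reject: float = 0.5) -> str:
    """Sort a pair into 'match', 'no match', or 'review'.

    The thresholds are illustrative assumptions and would need tuning
    against a hand-labelled sample of real records.
    """
    score = similarity(a, b)
    if score >= accept:
        return "match"
    if score < reject:
        return "no match"
    return "review"
```

Comparing every pair among 20,000 records is about 200 million comparisons, so in practice some cheap pre-grouping (for example by postcode or country) would probably be needed before scoring pairs like this.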

Can anyone name any techniques suitable for this type of task?

Many thanks!

+1  A: 

Perhaps automatic grammar induction would yield results here. You could attempt to infer a grammar for each address and then cluster the inferred grammars using some kind of comparison metric.
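The sketch below does not implement grammar induction. It is a much simpler stand-in that illustrates only the "compare, then cluster" step, swapping the inferred grammars for plain token sets and using Jaccard overlap as the comparison metric. The stopword list, the threshold of 0.4, and the function names are assumptions chosen for illustration.

```python
import re

# Assumed list of uninformative words; a real one would be much longer.
STOPWORDS = {"the", "inc", "ltd", "company", "co"}


def tokens(address: str) -> frozenset:
    """Lowercased alphanumeric tokens of an address, minus stopwords."""
    words = re.findall(r"[a-z0-9]+", address.lower())
    return frozenset(w for w in words if w not in STOPWORDS)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(addresses, threshold=0.4):
    """Greedy single-pass clustering.

    Each address joins the existing cluster containing its most similar
    member, provided that similarity reaches the threshold; otherwise it
    starts a new cluster.
    """
    clusters = []
    for addr in addresses:
        toks = tokens(addr)
        best, best_score = None, threshold
        for members in clusters:
            score = max(jaccard(toks, tokens(m)) for m in members)
            if score >= best_score:
                best, best_score = members, score
        if best is None:
            clusters.append([addr])
        else:
            best.append(addr)
    return clusters


sample = [
    "Company A 12345 street US",
    "CompanyA, Inc box2, 12345 street WA, US",
    "The Company B company Ltd 123 happy street UK",
    "company B, Ltd 123, happy street, london, S1 1AA",
]
groups = cluster(sample)
```

On the four example addresses from the question this puts the two Company A records in one group and the two Company B records in another, but the result is sensitive to the threshold and the stopword list. Greedy clustering also depends on input order, so borderline records would still need manual review.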

Gian