I have 20,000 company addresses on various documents, which are all formatted differently. For example:
Company A 12345 street US
CompanyA, Inc box2, 12345 street WA, US
The Company B company Ltd 123 happy street UK
company B, Ltd 123, happy street, london, S1 1AA
I'd like to combine the records for each company (i.e. separate the above into two groups, one per company).
I don't know how to approach this. I assume any clustering will be probabilistic in nature: it will probably work well for the easy matches, but the less certain ones will need manual review.
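To make that concrete, here is a rough sketch of the kind of thing I imagine, not a method I know to be right. It assumes character-trigram Jaccard similarity as the score, and the noise-word list and both thresholds are made-up placeholders that would need tuning on real data:

```python
import re
from itertools import combinations

# Hypothetical noise words to ignore; a real list would be much longer.
NOISE_WORDS = {"the", "inc", "ltd"}

ADDRESSES = [
    "Company A 12345 street US",
    "CompanyA, Inc box2, 12345 street WA, US",
    "The Company B company Ltd 123 happy street UK",
    "company B, Ltd 123, happy street, london, S1 1AA",
]


def normalize(address: str) -> str:
    """Lowercase, drop punctuation and noise words, then join without spaces."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return "".join(t for t in tokens if t not in NOISE_WORDS)


def trigrams(text: str) -> set:
    """Return the set of overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the trigram sets of two normalized addresses."""
    ta, tb = trigrams(normalize(a)), trigrams(normalize(b))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def classify(score: float, match_at: float = 0.8, review_at: float = 0.4) -> str:
    """Bucket a score: confident match, needs manual review, or different."""
    if score >= match_at:
        return "match"
    if score >= review_at:
        return "review"
    return "different"


if __name__ == "__main__":
    # Compare every pair; fine for a handful of rows, far too slow for 20,000.
    for a, b in combinations(ADDRESSES, 2):
        score = similarity(a, b)
        print(f"{classify(score):9s} {score:.2f}  {a!r} <-> {b!r}")
```

Comparing every pair is quadratic, so for 20,000 records I assume some way of only comparing likely candidates would be needed, and the pairs that land in the "review" bucket are the ones I would expect to check by hand.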
Can anyone name any techniques suitable for this type of task?
Many thanks!