We have an Excel file with a column named "address". Each cell in that column contains a single line, for example:

The Accounts Department, National Bank Ltd, 20 Lombard Str., London 3 WRS, England

We need to split this information into separate cells. That is, we want cells such as:

"country": England "city": London "street": Lombard Str. ... and so on

So we need to analyze the contents of each cell and divide it into logical parts. Can you tell me where to start?

+1  A: 

There is no sure-fire way to do this. Assuming (and this is a big assumption) that commas are only used to separate fields, you can go to the Data menu, select Text To Columns, and select comma as your delimiter.

That should give you something like the following:

A1                      | B1                | C1              | D1           | E1     
The Accounts Department | National Bank Ltd | 20 Lombard Str. | London 3 WRS | England

From there, in cell F1, you could do the following to try and extract the street name:

=RIGHT(TRIM(C1),LEN(TRIM(C1))-FIND(" ",TRIM(C1)))

You can use this to find the city:

=LEFT(TRIM(D1),FIND(" ",TRIM(D1))-1)

You'll probably find exceptions to both my formulas, and you'll just have to work around that.

If my first assumption is wrong, and there are commas in the text other than the field delimiter, I'd ask to get the file back with a different delimiter (pipe for example).
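
If you would rather script it than use Text To Columns, the same idea can be sketched in Python. This is only a rough sketch under the same big assumption (commas appear only between fields); the function names `split_address`, `street_name` and `city_name` are made up for illustration.

```python
def split_address(line):
    """Split one address line on commas and trim each part.

    Assumes commas are used only as field separators.
    """
    return [part.strip() for part in line.split(",")]


def street_name(field):
    """Drop the first word (assumed to be the house number).

    Mirrors the RIGHT/FIND formula; returns the field unchanged
    if it contains no space.
    """
    text = field.strip()
    _, sep, rest = text.partition(" ")
    return rest if sep else text


def city_name(field):
    """Return the first word of the field, mirroring the LEFT/FIND formula."""
    return field.strip().split(" ", 1)[0]


line = ("The Accounts Department, National Bank Ltd, "
        "20 Lombard Str., London 3 WRS, England")
parts = split_address(line)
print(parts)
print(street_name(parts[2]))
print(city_name(parts[3]))
```

The same exceptions apply here as with the formulas: a street field without a house number, or a city name made of two words, will come out wrong.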

LittleBobbyTables
The proposed formulas for extracting the city and street name should be treated with care. Depending on how well the data is normalised, a dictionary lookup is usually needed to identify names.
belisarius
What about addresses that have different numbers of lines or different address components? Your columns won't line up then.
Richard
+3  A: 

This really depends on whether your "logical parts" are delimited in some way such that you can id each part separately. I doubt you can assume a comma "," as a delimiter, as address components may themselves contain commas (e.g. the name of a firm/business). Additionally, you may have issues with data cleanliness: commas may be missing or in the wrong place.

If you have delimited data your job is simplified somewhat, in that you'll be able to id each field independently. However that's still not straightforward. If you do not have delimited data, it's going to be much harder. Anyway, identification of fields will probably be along these lines:

1) Postcode (there's a well known regex for this - however again you may need to cope with malformed or invalid postcodes or typos)

2) Country & town, city - you can get these with a dictionary of UK towns & cities. Have a Google.

3) Villages - harder, but again a dictionary will get you 98% of the way there.

4) Streets, Roads etc: can't really use a dictionary for this. You'll need to do some kind of recognition based on keywords - if the field ends in street, road, lane or whatever. However there are a lot of these. You may find a Bayesian approach works well for this.

5) Company name, department etc. Harder still. Again certain keywords can flag these (e.g. "ltd") but I'm guessing most of your entries are not guaranteed to include a legal entity suffix. And departments can be anything.
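
To make the steps above a bit more concrete, here is a rough Python sketch of keyword and dictionary based classification of already-separated fields. Everything in it is an assumption for illustration: the postcode pattern is a simplified one (not the full official pattern), the word lists are tiny stand-ins for real dictionaries, and the `classify` function is a made-up name. It uses plain keyword matching for streets, not a Bayesian approach.

```python
import re

# Simplified UK postcode shape, e.g. "EC3V 9AA". An assumption: it does not
# cover every special case and does not validate that the postcode exists.
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b", re.IGNORECASE)

# Tiny illustrative word lists; real ones would come from proper dictionaries.
COUNTRIES = {"england", "scotland", "wales", "northern ireland"}
CITIES = {"london", "manchester", "leeds"}
STREET_WORDS = {"street", "str", "str.", "road", "rd", "rd.", "lane", "avenue"}
COMPANY_WORDS = {"ltd", "ltd.", "limited", "plc", "llp"}


def classify(field):
    """Return a best-guess label for one address field.

    Only the first matching rule wins, so a field holding both a city
    and a postcode is reported as "postcode".
    """
    text = field.strip()
    lower = text.lower()
    words = lower.split()
    if lower in COUNTRIES:
        return "country"
    if POSTCODE_RE.search(text):
        return "postcode"
    if words and words[-1] in STREET_WORDS:
        return "street"
    if any(w in COMPANY_WORDS for w in words):
        return "company"
    if any(w in CITIES for w in words):
        return "city"
    return "unknown"


for f in ["The Accounts Department", "National Bank Ltd",
          "20 Lombard Str.", "London 3 WRS", "England", "EC3V 9AA"]:
    print(f, "->", classify(f))
```

Note that the department field falls through to "unknown", which is exactly the hard part: departments and people's names have no reliable keywords.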

Also - what about people's names? Can you recognise those?

In short, this is quite a big and involved job to get done correctly. There is no easy/simple answer.

BTW - if you have access to the PAF (Royal Mail's Postcode Address File), that might help you: http://www.royalmail.com/portal/rm/jump2?mediaId=400085&catId=400084&campaignid=paf_redirect

But that still won't help you with departments, business or people names.

Richard
+1 Shows the difficult part of the job. A naive approach is doomed.
belisarius