ansaurus

Question

Answer 1

+1 A:

There is no sure-fire way to do this. Assuming (and this is a big assumption) that commas are only used to separate cells, you can go to the Data menu, select Text To Columns, and select comma as your delimiter.

That should give you something like the following:

A1                      | B1                | C1              | D1           | E1     
The Accounts Department | National Bank Ltd | 20 Lombard Str. | London 3 WRS | England

From there, in cell F1, you could do the following to try and extract the street name:

=RIGHT(TRIM(C1),LEN(TRIM(C1))-FIND(" ",TRIM(C1)))

You can use this to find the city:

=LEFT(TRIM(D1),FIND(" ",TRIM(D1))-1)

You'll probably find exceptions to both my formulas, and you'll just have to work around that.
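If you end up doing this outside Excel, the same idea can be sketched in Python. This is only a rough equivalent of the two formulas, not a robust parser: the function names are made up for illustration, and the sample row is simply the example cells above joined back together with commas.

```python
def split_address(cell):
    """Split one address cell on commas (like Text To Columns) and trim each part."""
    return [part.strip() for part in cell.split(",")]


def street_name(street_field):
    """Drop everything up to the first space (the house number), like the RIGHT formula."""
    _, _, rest = street_field.strip().partition(" ")
    return rest


def city(city_field):
    """Keep everything before the first space, like the LEFT formula."""
    return city_field.strip().partition(" ")[0]


# Assumed input: the example cells joined with commas.
row = "The Accounts Department, National Bank Ltd, 20 Lombard Str., London 3 WRS, England"
fields = split_address(row)

print(street_name(fields[2]))  # Lombard Str.
print(city(fields[3]))         # London
```

The same exceptions apply here: a street with no house number loses its first word, and a two-word town such as "St Albans" comes back as just "St".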

If my first assumption is wrong, and there are commas in the text other than the field delimiter, I'd ask to get the file back with a different delimiter (pipe for example).

LittleBobbyTables 2010-10-10 14:50:47

The proposed formulas for extracting the city and street name should be used with care. Depending on how well the data is normalised, you will usually need a dictionary lookup to identify names.

belisarius 2010-10-10 14:58:39

What about addresses that have different numbers of lines or different address components? Your columns won't line up then.

Richard 2010-10-10 15:35:34

Answer 2

+3 A:

This really depends on whether your "logical parts" are delimited in some way such that you can identify each part separately. I doubt you can assume a comma "," as a delimiter, as address components may themselves contain commas (e.g. the name of a firm/business). Additionally you may have issues with data cleanliness: commas may be missing, or in the wrong place, or whatever.
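To see why the comma assumption is fragile, here is a small Python sketch. The second address is invented for illustration, and `rows_with_unexpected_field_count` is a hypothetical helper, not part of any library. It only flags rows whose comma count differs from what you expect so you can review them by hand.

```python
def split_on_commas(cell):
    """Naive split: every comma is treated as a field separator."""
    return [part.strip() for part in cell.split(",")]


def rows_with_unexpected_field_count(rows, expected):
    """Return the rows that do not split into exactly `expected` fields."""
    return [row for row in rows if len(split_on_commas(row)) != expected]


rows = [
    "National Bank Ltd, 20 Lombard Str., London",
    # Hypothetical firm name that itself contains a comma.
    "Smith, Jones and Co Ltd, 20 Lombard Str., London",
]

for row in rows:
    print(len(split_on_commas(row)), split_on_commas(row))

suspicious = rows_with_unexpected_field_count(rows, expected=3)
print(suspicious)
```

The first row splits into three fields and the second into four, so anything that relies on "column C is the street" breaks for the second row.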

If you have delimited data your job is simplified somewhat, in that you'll be able to id each field independently. However that's still not straightforward. If you do not have delimited data, it's going to be much harder. Anyway, identification of fields will probably be along these lines:

1) Postcode (there's a well-known regex for this; however, again you may need to cope with malformed or invalid postcodes or typos)

2) Country & town, city - you can get these with a dictionary of UK towns & cities. Have a Google.

3) Villages - harder, but again a dictionary will get you 98% of the way there.

4) Streets, roads etc.: you can't really use a dictionary for this. You'll need to do some kind of recognition based on keywords, e.g. whether the field ends in street, road, lane or whatever. However there are a lot of these. You may find a Bayesian approach works well for this.

5) Company name, department etc. Harder still. Again certain keywords can flag these (e.g. "ltd") but I'm guessing most of your entries are not guaranteed to include the legal entity. And departments can be anything.

Also - what about people names? Can you recognise those?
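As a very rough illustration of the steps listed above, here is a Python sketch that labels one already-separated field at a time. Everything in it is an assumption made for the example: the postcode pattern is a simplified shape check rather than the full well-known regex, and the word lists are tiny stand-ins for real dictionaries of towns, street suffixes and company suffixes.

```python
import re

# Simplified UK postcode shape (assumption, not the full pattern).
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}$", re.IGNORECASE)

# Tiny stand-in dictionaries; real ones would be far larger.
COUNTRIES = {"england", "scotland", "wales", "northern ireland"}
TOWNS = {"london", "leeds", "york"}
STREET_WORDS = {"street", "str.", "st.", "road", "rd.", "lane", "avenue"}
COMPANY_WORDS = {"ltd", "ltd.", "plc", "limited"}


def classify(field):
    """Guess what kind of address component a single field is."""
    text = field.strip()
    lower = text.lower()
    words = lower.split()
    if not words:
        return "empty"
    if POSTCODE_RE.match(text):
        return "postcode"
    if lower in COUNTRIES:
        return "country"
    if lower in TOWNS:
        return "town"
    # Keyword checks look only at the last word of the field.
    if words[-1] in STREET_WORDS:
        return "street"
    if words[-1] in COMPANY_WORDS:
        return "company"
    return "unknown"


for field in ["The Accounts Department", "National Bank Ltd",
              "20 Lombard Str.", "London 3 WRS", "England", "SW1A 1AA"]:
    print(field, "->", classify(field))
```

Note that "London 3 WRS" and "The Accounts Department" both come back as "unknown": the first mixes a town with something that is not a valid postcode, and the second has no keyword to latch onto. Those are exactly the cases that need manual rules or a smarter (e.g. Bayesian) classifier.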

In short, this is quite a big and involved job to get done correctly. There is no easy/simple answer.

BTW - if you have access to the PAF (Royal Mail's Postcode Address File) that might help you: http://www.royalmail.com/portal/rm/jump2?mediaId=400085&catId=400084&campaignid=paf_redirect

But that still won't help you with departments, business or people names.

Richard 2010-10-10 15:33:24

+1 Shows the difficult part of the job. A naive approach is doomed.

belisarius 2010-10-10 17:20:36


Parsing excel cell. How?
