views:

377

answers:

3

I need to sort a list of UK postcodes in to order.

Is there a simple way to do it?

UK postcodes are made up of letters and numbers:

see for full info of the format: http://en.wikipedia.org/wiki/UK_postcodes

But my problem is this a simple alpha sort doesn't work because each code starts with 1 or two letters letters and then is immediately followed by a number , up to two digits, then a space another number then a letter. e.g. LS1 1AA or ls28 1AA, there is also another case where once the numbers in the first section exceed 99 then it continues 9A etc.

Alpha sort cause the 10s to immediately follow the 1:

...
LS1 9ZZ
LS10 1AA
...
LS2

I'm looking at creating a SQL function to convert the printable Postcode into a sortable postcode e.g. 'LS1 9ZZ' would become 'LS01 9ZZ', then use this function in the order by clause.

Has anybody done this or anything similar already?

+1  A: 

I'd be tempted to store the normalised postcode in the database along with the real postcode - that way you only do the string manipulation once, and you can use an index to help you with the sort.

anon
+3  A: 

You need to think of this as a tokenization issue so SW1A 1AA should tokenize to:

  • SW
  • 1
  • A
  • 1AA

(although you could break the inward part down into 1 and AA if you wanted to)

and G12 8QT should tokenize to:

  • G
  • 12
  • (empty string)
  • 8QT

Once you have broken the postcode down into those component parts then sorting should be easy enough. There is an exception with the GIR 0AA postcode but you can just hardcode a test for that one

edit: some more thoughts on tokenization

For the sample postcode SW1A 1AA, SW is the postcode area, 1A is the postcode district (which we'll break into two parts for sorting purposes), 1 is the postcode sector and AA is the unit postcode.

These are the valid postcode formats (source: Royal Mail PAF user guide page 8 - link at bottom of this page):

AN NAA
AAN NAA
ANN NAA
ANA NAA
AAA NAA (only for GIR 0AA code)
AANN NAA
AANA NAA

So a rough algorithm would be (assuming we want to separate the sector and unit postcode):

  • code = GIR 0AA? Tokenize to GI/R/ /0/AA (treating R as the district simplifies things)
  • code 5 letters long e.g G1 3AF? Tokenize to G/1/ /3/AF
  • code 6 letters long with 3rd character being a letter e.g. W1P 1HQ? Tokenize to W/1/P/1/HQ
  • code 6 letters long with 2nd character being a letter e.g. CR2 6XH? Tokenize to CR/2/ /6/XH
  • code 7 letters long with 4th character being a letter e.g. EC1A 1BB? Tokenize to EC/1/A/1/BB
  • otherwise e.g. TW14 2ZZ, tokenize to TW/14/ /2/ZZ

If the purpose is to display a list of postcodes for the user to choose from then I would adopt Neil Butterworth's suggestion of storing a 'sortable' version of the postcode in the database. The easiest way to create a sortable version is to pad all entries to nine characters:

  • two characters for the area (right-pad if shorter)
  • two for the district number (left-pad if shorter)
  • one for the district letter (pad if missing)
  • space
  • one for the sector
  • two for the unit

and GIR 0AA is a slight exception again. If you pad with spaces then the sort order should be correct. Examples using # to represent a space:

  • W1#1AA => W##1##1AA
  • WC1#1AA => WC#1##1AA
  • W10#1AA => W#10##1AA
  • W1W#1AA => W##1W#1AA
  • GIR#0AA => GI#R##0AA
  • WC10#1AA => WC10##1AA
  • WC1W#1AA => WC#1W#1AA

You need to right-pad the area if it's too short: left-padding produces the wrong sort order. All of the single letter areas - B, E, G, L, M, N, S, W - would sort before all of the two-letter areas - AB, AL, ..., ZE - if you left-padded

The district number needs to be left padded to ensure that the natural W1, W2, ..., W9, W10 order remains intact

barrowc
Although the 'tokenisation' is not straight forward because there is not the usual separators....
Adrian
I've added a rough algorithm for tokenizing and some ideas on how to store the results
barrowc
thanks - i can't mark you answer up any more, hopefully someone else will.
Adrian