I have a regular expression that validates UK postal codes, but now I would like to extract the constituent parts of a code and I'm getting confused. For those who don't know, examples of UK postal codes are 'WC1 1AA', 'WC11 1AA' and 'M1 1AA'.

The regular expression below (apologies for the formatting) tolerates a missing space between the left and right parts (this is the \s{0,} bit) and still validates, which is great.

(?:(?:A[BL]|B[ABDHLNRST]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[CHNX]?|F[KY]|G[LUY]?|H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTWY]?|T[ADFNQRSW]|UB|W[ACDFNRSV]?|YO|ZE)\d(?:\d|[A-Z])?\s{0,}\d[A-Z]{2})

I'd now like to extract the left- and right-hand sides. I know that brackets are used for this, but there are already brackets in there and the regex is not easy to read. So I guess the existing brackets need replacing. Can anyone help me rework my brackets?

I can see other people finding this regex useful, so please feel free to use it for validating UK postal codes.

Thanks in advance

Ryan

+5  A: 

Actually, parentheses are used for extraction, not brackets. The (?: constructs in your expression are how you prevent parentheses from performing extraction. You would want:

(?:((?:A[BL]|B[ABDHLNRST]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[CHNX]?|F[KY]|G[LUY]?|H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTWY]?|T[ADFNQRSW]|UB|W[ACDFNRSV]?|YO|ZE)\d(?:\d|[A-Z])?)\s{0,}(\d[A-Z]{2}))

Incidentally, I would also make this change:

(?:((?:A[BL]|B[ABDHLNRST]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[CHNX]?|F[KY]|G[LUY]?|H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTWY]?|T[ADFNQRSW]|UB|W[ACDFNRSV]?|YO|ZE)\d(?:\d|[A-Z])?)\s*(\d[A-Z]{2}))

because \s{0,} is a goofy way to write \s*.
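
To make the extraction concrete, here is a minimal C# sketch using the \s* version of the pattern above. The class name PostcodeParts and the helper TrySplit are made up for illustration; only Regex, Match and Groups come from .NET. Group 0 is the whole match, and the two plain ( ) pairs become groups 1 and 2, because the (?: ) pairs are not numbered.

```csharp
using System;
using System.Text.RegularExpressions;

static class PostcodeParts
{
    // Pattern copied from the answer above. The only capturing groups are
    // the two plain (...) pairs: group 1 = left part, group 2 = right part.
    const string Pattern =
        @"(?:((?:A[BL]|B[ABDHLNRST]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[CHNX]?|F[KY]|G[LUY]?|H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTWY]?|T[ADFNQRSW]|UB|W[ACDFNRSV]?|YO|ZE)\d(?:\d|[A-Z])?)\s*(\d[A-Z]{2}))";

    static readonly Regex PostcodeRegex = new Regex(Pattern);

    // Hypothetical helper: returns true and fills both parts on a match.
    public static bool TrySplit(string input, out string left, out string right)
    {
        Match m = PostcodeRegex.Match(input);
        left = m.Success ? m.Groups[1].Value : string.Empty;
        right = m.Success ? m.Groups[2].Value : string.Empty;
        return m.Success;
    }

    static void Main()
    {
        // "M11AA" has no space; backtracking still yields "M1" and "1AA",
        // because the right part must be exactly digit + two letters.
        foreach (string code in new[] { "M1 1AA", "M11AA", "WC11 1AA" })
        {
            if (TrySplit(code, out string left, out string right))
                Console.WriteLine($"{code} -> '{left}' + '{right}'");
        }
    }
}
```

You get one Match per postal code, and the two halves are read from that match's Groups collection rather than arriving as separate matches.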

chaos
You're right that "\s{0,}" means \s*. But I think \s should not be optional, or you'll split the two parts incorrectly. Maybe the author meant "{1,}" (or "+").
Marko Dumic
This may be a language issue: Brits call this character "(" a bracket.
Chas. Owens
To be honest, I get very confused with regex. I meant 'any number of spaces, including zero'. Also, brackets / parentheses are synonymous in UK English, apologies for the confusion.
Ryan ONeill
That is not quite working for me, I am using M1 1AA and I get 1 match, not 2 as I had hoped.
Ryan ONeill
The additional results will be in the Groups property of your 1 match.
Chris Shaffer
@Ryan ONeill: You did fine. It's just that regex has a standard shorthand metacharacter for zero-or-more: * (as with ? for zero-or-one and + for one-or-more).
chaos
@Chas. Owens: Ah, right. Ryan, bracket means [] to me. As I managed to note anyhow, though, your existing () aren't an issue because they're (?:).
chaos
@Marko Dumic: One of the desired features of the regex he talks about is that it works even on postal codes where people have left out the space.
chaos
Thanks a lot, very much appreciated. I hope others can get some use out of that regex now as it was bloody hard to code.
Ryan ONeill
I don't doubt they will. UK postal code validation/analysis is one of those recurring questions, and well done putting together a solution (especially if regex confuses you :).
chaos
+5  A: 

Additionally, I'd recommend against trying to check the postcode so thoroughly. The list of valid postcodes can change, so you'll have to maintain the expression every time the Post Office updates the PAF.

You're also missing some of the “special postcodes” like BFPO, GIR, the non-geographic postcodes and overseas territories. See wiki for an overview of what's out there you might have to deal with.

In general for most purposes a “does it look plausible?” check is better than trying to nail it down completely. There's nothing worse than telling customers they can't use your service because their address doesn't exist.
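
As a rough illustration of a “does it look plausible?” check, here is a C# sketch. The pattern is my own assumption about the general shape (one or two letters, a digit, an optional digit or letter, an optional space, then a digit and two letters), not an official format, and the names LoosePostcodeCheck and LooksPlausible are made up. It still rejects special cases such as GIR and BFPO codes, so you would have to decide separately how to treat those.

```csharp
using System;
using System.Text.RegularExpressions;

static class LoosePostcodeCheck
{
    // Assumed general shape only; no list of valid area letters to maintain.
    static readonly Regex Plausible = new Regex(
        @"^\s*[A-Z]{1,2}[0-9][A-Z0-9]?\s*[0-9][A-Z]{2}\s*$",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: true if the input has a postcode-like shape.
    public static bool LooksPlausible(string input)
    {
        return input != null && Plausible.IsMatch(input);
    }

    static void Main()
    {
        Console.WriteLine(LooksPlausible("M1 1AA"));   // True
        Console.WriteLine(LooksPlausible("wc1a1aa"));  // True (case and space ignored)
        Console.WriteLine(LooksPlausible("ZZ9 9ZZ"));  // True (right shape, area not checked)
        Console.WriteLine(LooksPlausible("12345"));    // False
    }
}
```

The trade-off is the one described above: a made-up area like ZZ gets through, but nobody with a newly issued postcode is turned away.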

bobince
I wish I could mark this as an answer too, good point.
Ryan ONeill
+1  A: 

When dealing with a large regex like this you should use the /x option (which I think is called RegexOptions.IgnorePatternWhitespace in C#). (?:) is non-capturing, so all you need to do is put () around the parts you want. Another benefit of the /x option is that you can comment the regex with end-of-line comments (they start with #). You may also need to be careful with \d and \s, which may match more than you expect: \s matches all whitespace, not just spaces, and, at least in Perl 5.8 and later, \d matches all Unicode digit characters, not just [0-9].

using System.Text.RegularExpressions;

Regex exp = new Regex(@"
    (?:
        ( #capture first part
            (?:
                A[BL]        | B[ABDHLNRST]? | C[ABFHMORTVW]      |
                D[ADEGHLNTY] | E[CHNX]?      | F[KY]              |
                G[LUY]?      | H[ADGPRSUX]   | I[GMPV]            |
                JE           | K[ATWY]       | L[ADELNSU]?        |
                M[EKL]?      | N[EGNPRW]?    | O[LX]              |
                P[AEHLOR]    | R[GHM]        | S[AEGKLMNOPRSTWY]? |
                T[ADFNQRSW]  | UB            | W[ACDFNRSV]?       |
                YO           | ZE
            )
            \d
            (?:
                \d | [A-Z]
            )?
        ) #end capture of first part
        \s{0,}
        ( #capture second part
            \d[A-Z]{2}
        ) #end capture of second part
    )",
    RegexOptions.IgnorePatternWhitespace
);
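
On the \d and \s caveat: the same applies in .NET, where \d matches any Unicode decimal digit by default. The small sketch below (the class name DigitClassDemo is made up) shows the difference, plus two ways to restrict matching to ASCII digits. Note that RegexOptions.ECMAScript can only be combined with a few other options and not with IgnorePatternWhitespace, so for the commented regex above, writing [0-9] explicitly is probably the simpler route.

```csharp
using System;
using System.Text.RegularExpressions;

static class DigitClassDemo
{
    static void Main()
    {
        string arabicIndicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE

        // Default .NET behaviour: \d covers all Unicode decimal digits.
        Console.WriteLine(Regex.IsMatch(arabicIndicThree, @"\d"));    // True

        // An explicit range only accepts ASCII digits.
        Console.WriteLine(Regex.IsMatch(arabicIndicThree, @"[0-9]")); // False

        // ECMAScript mode narrows \d to [0-9].
        Console.WriteLine(
            Regex.IsMatch(arabicIndicThree, @"\d", RegexOptions.ECMAScript)); // False

        // \s accepts a tab as well as a space between the two parts.
        Console.WriteLine(Regex.IsMatch("M1\t1AA", @"M1\s1AA"));      // True
    }
}
```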
Chas. Owens
Brilliant, that makes maintenance so much easier.
Ryan ONeill
Just one quibble: "(?:\d|[A-Z])?" can be replaced with "[\dA-Z]?"
Alan Moore