ansaurus

Question

How would I modify this regex to extract the left and right hand parts of a UK postal code?

Answer 1

+5 A:

Actually, parentheses are used for extraction, not brackets. The (?: constructs in your expression are how you prevent parentheses from performing extraction. You would want:

(?:((?:A[BL]|B[ABDHLNRST]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[CHNX]?|F[KY]|G[LUY]?|H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTWY]?|T[ADFNQRSW]|UB|W[ACDFNRSV]?|YO|ZE)\d(?:\d|[A-Z])?)\s{0,}(\d[A-Z]{2}))

Incidentally, I would also make this change:

(?:((?:A[BL]|B[ABDHLNRST]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[CHNX]?|F[KY]|G[LUY]?|H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTWY]?|T[ADFNQRSW]|UB|W[ACDFNRSV]?|YO|ZE)\d(?:\d|[A-Z])?)\s*(\d[A-Z]{2}))

because \s{0,} is a goofy way to write \s*.

chaos 2009-03-26 15:14:47

You're right about "\s{0,}" meaning \s*. But I think \s should not be optional, or you'll split the two parts incorrectly. Maybe the author meant "{1,}" (or "+").

Marko Dumic 2009-03-26 15:19:18

This may be a language issue: Brits call this character "(" a bracket.

Chas. Owens 2009-03-26 15:19:33

To be honest, I get very confused with regex. I meant 'any number of spaces, including zero'. Also, brackets / parentheses are synonymous in UK English, apologies for the confusion.

Ryan ONeill 2009-03-26 15:20:48

That is not quite working for me, I am using M1 1AA and I get 1 match, not 2 as I had hoped.

Ryan ONeill 2009-03-26 15:22:40

The additional results will be in the Groups property of your 1 match.

Chris Shaffer 2009-03-26 15:25:51
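
As a minimal sketch of the comment above, assuming the \s* version of the pattern from this answer and the .NET Regex class: one Match carries both halves in its Groups collection, where index 0 is the whole match and indexes 1 and 2 are the two capturing groups. The names PostcodeGroups and Split are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PostcodeGroups
{
    // The \s* variant of the pattern from the answer above, unchanged.
    private static readonly Regex Pattern = new Regex(
        @"(?:((?:A[BL]|B[ABDHLNRST]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[CHNX]?|F[KY]|G[LUY]?|H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTWY]?|T[ADFNQRSW]|UB|W[ACDFNRSV]?|YO|ZE)\d(?:\d|[A-Z])?)\s*(\d[A-Z]{2}))");

    // Returns { outward, inward } for the first postcode found, or null if none.
    public static string[] Split(string input)
    {
        Match m = Pattern.Match(input);
        if (!m.Success) return null;
        // Groups[0] is the whole match; capturing groups are numbered
        // from 1 in the order of their opening parentheses.
        return new[] { m.Groups[1].Value, m.Groups[2].Value };
    }

    public static void Main()
    {
        foreach (string s in new[] { "M1 1AA", "M11AA" })
        {
            string[] parts = Split(s);
            Console.WriteLine(parts == null
                ? s + ": no match"
                : s + ": " + parts[0] + " / " + parts[1]);
        }
    }
}
```

Both inputs should come back as M1 and 1AA, because the whitespace between the two halves is optional in the pattern.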

@Ryan ONeill: You did fine. It's just that regex has a standard shorthand metacharacter for zero-or-more: * (as with ? for zero-or-one and + for one-or-more).

chaos 2009-03-26 15:27:14

@Chas. Owens: Ah, right. Ryan, bracket means [] to me. As I managed to note anyhow, though, your existing () aren't an issue because they're (?:).

chaos 2009-03-26 15:29:12

@Marko Dumic: One of the desired features of the regex he talks about is that it works even on postal codes where people have left out the space.

chaos 2009-03-26 15:30:08

Thanks a lot, very much appreciated. I hope others can get some use out of that regex now as it was bloody hard to code.

Ryan ONeill 2009-03-26 15:38:14

I don't doubt they will. UK postal code validation/analysis is one of those recurring questions, and well done putting together a solution (especially if regex confuses you :).

chaos 2009-03-26 15:40:51

Answer 2

+5 A:

Additionally, I'd recommend against trying to check the postcode so thoroughly. The list of valid postcodes can change, so you'll have to maintain the expression every time the Post Office updates the PAF (Postcode Address File).

You're also missing some of the “special postcodes” like BFPO, GIR, the non-geographic postcodes and overseas territories. See wiki for an overview of what's out there you might have to deal with.

In general for most purposes a “does it look plausible?” check is better than trying to nail it down completely. There's nothing worse than telling customers they can't use your service because their address doesn't exist.

bobince 2009-03-26 15:58:17
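
A rough sketch of what a "does it look plausible?" check might look like in C#. The shape used here (one or two letters, a digit, an optional digit or letter, optional whitespace, then a digit and two letters) is an assumption chosen for illustration, not an official rule and not something taken from this answer. The names LoosePostcode and LooksPlausible are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LoosePostcode
{
    // Assumed "plausible shape" only: it does not check that the area
    // letters exist, so it accepts many codes that are not in use.
    private static readonly Regex Shape = new Regex(
        @"^([A-Z]{1,2}[0-9][0-9A-Z]?)\s*([0-9][A-Z]{2})$");

    public static bool LooksPlausible(string input)
    {
        if (input == null) return false;
        // Normalise case and surrounding whitespace before matching.
        return Shape.IsMatch(input.Trim().ToUpperInvariant());
    }

    public static void Main()
    {
        foreach (string s in new[] { "M1 1AA", "sw1a 2aa", "EC1A1BB", "HELLO", "GIR 0AA" })
        {
            Console.WriteLine(s + ": " + LooksPlausible(s));
        }
    }
}
```

This deliberately trades precision for low maintenance. Note that it rejects GIR 0AA, so the special postcodes mentioned above would still need separate handling.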

I wish I could mark this as an answer too, good point.

Ryan ONeill 2009-03-28 17:41:45

Answer 3

+1 A:

When dealing with a large regex like this you should use the /x option (which I think is called RegexOptions.IgnorePatternWhitespace in C#). (?:) is not capturing, so all you need to do is put () around the parts you want. Another benefit of the /x option is that you can comment the regex with end-of-line comments (they start with #). You might also need to be careful with \d and \s. They may match more than you expect: \s matches all whitespace, not just spaces, and, at least in Perl 5.8 and later, \d matches all Unicode digit characters, not just [0-9].

Regex exp = new Regex(@"
    (?:
        ( #capture first part
            (?:
                A[BL]        | B[ABDHLNRST]? | C[ABFHMORTVW]      |
                D[ADEGHLNTY] | E[CHNX]?      | F[KY]              |
                G[LUY]?      | H[ADGPRSUX]   | I[GMPV]            |
                JE           | K[ATWY]       | L[ADELNSU]?        |
                M[EKL]?      | N[EGNPRW]?    | O[LX]              |
                P[AEHLOR]    | R[GHM]        | S[AEGKLMNOPRSTWY]? |
                T[ADFNQRSW]  | UB            | W[ACDFNRSV]?       |
                YO           | ZE
            )
            \d
            (?:
                \d | [A-Z]
            )?
        ) #end capture of first part
        \s{0,}
        ( #capture second part
            \d[A-Z]{2}
        ) #end capture of second part
    )",
    RegexOptions.IgnorePatternWhitespace
);

Chas. Owens 2009-03-26 16:12:58
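
To make the \d caveat concrete for .NET, here is a small sketch comparing \d with an explicit [0-9]. By default, .NET's \d matches any Unicode decimal digit, so a character such as U+0663 (ARABIC-INDIC DIGIT THREE) passes \d but not [0-9]. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DigitClasses
{
    // Default .NET behaviour: \d matches any Unicode decimal digit.
    public static bool MatchesBackslashD(string s)
    {
        return Regex.IsMatch(s, @"^\d$");
    }

    // An explicit range matches ASCII digits only.
    public static bool MatchesAsciiDigit(string s)
    {
        return Regex.IsMatch(s, "^[0-9]$");
    }

    public static void Main()
    {
        string arabicIndicThree = "\u0663";
        Console.WriteLine(MatchesBackslashD("3"));              // True
        Console.WriteLine(MatchesAsciiDigit("3"));              // True
        Console.WriteLine(MatchesBackslashD(arabicIndicThree)); // True
        Console.WriteLine(MatchesAsciiDigit(arabicIndicThree)); // False
    }
}
```

RegexOptions.ECMAScript also narrows \d to [0-9], but as far as I know it cannot be combined with RegexOptions.IgnorePatternWhitespace, so writing [0-9] in the commented pattern is probably the simpler route.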

Brilliant, that makes maintenance so much easier.

Ryan ONeill 2009-03-26 16:41:47

Just one quibble: "(?:\d|[A-Z])?" can be replaced with "[\dA-Z]?"

Alan Moore 2009-03-27 10:07:31
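
A quick spot check of the quibble above, as a sketch only: both spellings are anchored and run against a handful of sample inputs to see that they accept and reject the same strings. This compares behaviour on the samples and is not a proof of equivalence; the names ClassVsAlternation and Agree are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ClassVsAlternation
{
    // True when the alternation and the character class give the same
    // verdict for the whole input string.
    public static bool Agree(string input)
    {
        bool alternation = Regex.IsMatch(input, @"^(?:\d|[A-Z])?$");
        bool charClass = Regex.IsMatch(input, @"^[\dA-Z]?$");
        return alternation == charClass;
    }

    public static void Main()
    {
        foreach (string s in new[] { "", "7", "B", "b", "-", "7B" })
        {
            Console.WriteLine("\"" + s + "\" agree: " + Agree(s));
        }
    }
}
```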
