I've got addresses I need to clean up for matching purposes. Part of the process is trimming unwanted suffixes from house numbers, e.g.:

mainstreet 4a --> mainstreet 4. 

However I don't want:

618 5th Ave SW  --> 618 5 Ave SW 

In other words, there are some suffixes (for now: st, nd, rd, th) which I don't want to strip. What would be the best method of doing this (regex or otherwise)?

A working regex without the exceptions would be:

a = a.replaceAll("(^| )([0-9]+)[a-z]+($| )","$1$2$3"); //replace 1a --> 1

I thought about first searching for the special cases and substituting them with special characters while keeping the references in a map, then applying the above regex, and then reversing the substitution using the reference map, but I'm looking for a simpler solution.

Thanks

A: 

You could probably do this with negative lookahead:

a = a.replaceAll("(^| )([0-9]+)(?!st|nd|rd|th)[a-z]+($| )","$1$2$3"); //replace 1a --> 1, but leave 5th alone

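Or do it all with lookarounds (a lookbehind for the leading boundary, a negative lookahead for the exceptions, and a lookahead for the trailing boundary):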

a = a.replaceAll("(?<=^| )([0-9]+)(?!st|nd|rd|th)[a-z]+(?= |$)", "$1"); //replace 1a --> 1 but not 2nd --> 2
Avi
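For anyone who wants to try this out, here is a minimal sketch of the lookaround version wrapped in a small class. The class and method names are made up for illustration, and the exception list (st, nd, rd, th) is just the one from the question; extend it as needed.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class HouseNumberCleaner {
    // Digits at the start of a token, followed by lowercase letters that do
    // not begin with an ordinal suffix, up to the end of the token.
    // The ordinal list is an assumption taken from the question.
    private static final Pattern SUFFIX = Pattern.compile(
            "(?<=^| )([0-9]+)(?!st|nd|rd|th)[a-z]+(?= |$)");

    // Strips the letter suffix from house numbers, keeping only the digits.
    static String clean(String address) {
        return SUFFIX.matcher(address).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(clean("mainstreet 4a"));   // mainstreet 4
        System.out.println(clean("618 5th Ave SW"));  // 618 5th Ave SW
    }
}
```

Note that the negative lookahead only checks how the suffix starts, so a suffix such as "tha" would also be left alone. If that matters, the lookahead could be tightened to require a space or the end of the string after the ordinal.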
Going with the first, I think, because I can wrap my head around it. Any advantage in using the latter? As an aside: how do you post with formatted code examples?
Geert-Jan
accepted btw, thanks!
Geert-Jan
I agree, the first is a little easier to read :-). The second lets me match just the part you want to replace, and uses the zero-width lookbehind, lookahead, and negative lookahead to assert other things about where it appears, which is what they are meant for.
Avi
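One practical consequence of the zero-width assertions, sketched below: because the first pattern consumes the trailing space, a second suffixed number directly after the first has no leading space left to match against, while the lookaround pattern leaves the space in place. The class and method names are hypothetical, and the input with two suffixed numbers in a row is an invented example, not one from the question.

```java
// Hypothetical comparison of the two patterns from the answer.
class SuffixPatternComparison {
    // First variant: the surrounding spaces are part of the match.
    static String consuming(String a) {
        return a.replaceAll("(^| )([0-9]+)(?!st|nd|rd|th)[a-z]+($| )", "$1$2$3");
    }

    // Second variant: the surrounding spaces are only asserted, not consumed.
    static String lookaround(String a) {
        return a.replaceAll("(?<=^| )([0-9]+)(?!st|nd|rd|th)[a-z]+(?= |$)", "$1");
    }

    public static void main(String[] args) {
        String input = "1a 2b";
        System.out.println(consuming(input));   // 1 2b
        System.out.println(lookaround(input));  // 1 2
    }
}
```

For addresses with a single house number both variants give the same result, so this only matters if suffixed numbers can appear back to back.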
If you indent a line or paragraph with 4 spaces, it will be formatted as code. See http://stackoverflow.com/editing-help for more details on editing Markdown.
Avi