Looking for a quick and dirty way to parse Australian street addresses into their parts:
3A/45 Jindabyne Rd, Oakleigh, VIC 3166

should split into:
"3A", 45, "Jindabyne Rd", "Oakleigh", "VIC", 3166

Suburb names can have multiple words, as can street names.


See: http://stackoverflow.com/questions/1739746/parse-a-steet-address-into-components

Has to be in Java, cannot make http requests (e.g. to web APIs).


EDIT: Assume that format specified is always followed. I have no issue with spitting incorrectly formatted strings back at the user with a message telling them to follow the format (which I've described above).

+1  A: 

You could use `String.split`, first with `,`, then with `.` or `/`.
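A minimal sketch of that two-pass split, assuming the strict format from the question. The class and method names are made up for illustration, and the second pass splits on whitespace and `/` because the sample address contains no periods. Note that `String.split` takes a regular expression, so a literal `.` would have to be written as `\\.`.

```java
import java.util.Arrays;

class SplitAddressSketch {

    /**
     * Splits "unit/number street, suburb, state postcode" into
     * {unit, number, street, suburb, state, postcode}.
     * The unit slot is null when there is no "unit/" prefix.
     */
    static String[] splitAddress(String address) {
        // Pass 1: commas separate the street line, the suburb, and "state postcode".
        String[] sections = address.trim().split("\\s*,\\s*");
        if (sections.length != 3) {
            throw new IllegalArgumentException("Expected 3 comma-separated sections: " + address);
        }

        // Pass 2a: a limit of 2 keeps a multi-word street name in one piece.
        String[] streetLine = sections[0].split("\\s+", 2);
        if (streetLine.length != 2) {
            throw new IllegalArgumentException("Expected a number and a street name: " + address);
        }

        // Pass 2b: an optional unit sits before the slash.
        String[] unitAndNumber = streetLine[0].split("/", 2);
        String unit = unitAndNumber.length == 2 ? unitAndNumber[0] : null;
        String number = unitAndNumber[unitAndNumber.length - 1];

        // Pass 2c: state and postcode are separated by whitespace.
        String[] statePostcode = sections[2].split("\\s+");
        if (statePostcode.length != 2) {
            throw new IllegalArgumentException("Expected a state and a postcode: " + address);
        }

        return new String[] {
            unit, number, streetLine[1], sections[1], statePostcode[0], statePostcode[1]
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                splitAddress("3A/45 Jindabyne Rd, Oakleigh, VIC 3166")));
    }
}
```

Anything that does not break into exactly three comma-separated sections is rejected, which fits the "spit it back at the user" requirement in the question.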

Valentin Rocher
+1  A: 

Hm, probably quite difficult because the format is not well defined.

A regex would certainly work as a quick&dirty solution. The problem is that it will probably fail (produce incorrect results) in special cases.

Best bet is probably to hack up a small regex, then run that over a realistic dataset (ideally everything you have in production), and check if it gives good results. May be a lot of manual work, but probably the best you can do...
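As a rough sketch of that checking step: the helper below (a hypothetical name, not from any library) runs a candidate pattern over a list of addresses and collects the ones it rejects. The pattern and the two sample strings are placeholders; in practice you would load your real data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RegexDatasetCheck {

    /** Returns the addresses that the candidate pattern fails to match in full. */
    static List<String> findRejects(Pattern candidate, List<String> addresses) {
        List<String> rejects = new ArrayList<>();
        for (String address : addresses) {
            if (!candidate.matcher(address).matches()) {
                rejects.add(address);
            }
        }
        return rejects;
    }

    public static void main(String[] args) {
        // Deliberately naive placeholder pattern; swap in the real candidate.
        Pattern candidate = Pattern.compile("[^,]+, [^,]+, [A-Z]+ \\d{4}");
        // Hypothetical sample; load real production data here instead.
        List<String> sample = Arrays.asList(
                "3A/45 Jindabyne Rd, Oakleigh, VIC 3166",
                "Richmond 3121 VIC");
        System.out.println(findRejects(candidate, sample));
    }
}
```

This only surfaces strings the regex rejects outright. Addresses that match but are split into the wrong parts still need the manual review mentioned above.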

Edit: BTW, to use regexes in Java, use the methods from package java.util.regex. Just thought I'd mention it...

sleske
+4  A: 

Honestly, you're setting yourself a rather Sisyphean challenge here, and I'm not sure if it's worthwhile. Unless your data comes from a known source, with a very well specified format, you're going to get data that's completely useless. If you're dealing with free text, people screw up their addresses in ways you wouldn't believe.

Do you really want to try (yourself) to parse every possible combination of Richmond, Victoria, 3121 and Richmond 3121 VIC and Richmond VIC, 3121 etc? And that's just suburb granularity!

Addresses are even worse. Sure, most people would put 7/21 Smith St for a unit, or 29-33 Jones St for a location spanning multiple street numbers, but people aren't consistent. Is 1-5 Brown St unit 1 at number 5, or a location spanning #1 to #5 on that street? Is 7A a separate subdivided street address, or Unit A at #7?

Address matching is not a simple problem and if your data set is end-user-entered free text, I seriously wouldn't bother unless you have a trivial amount of data or don't care about accuracy that much (or, alternatively, have a lot of time for manual cleanups). If not, hand it off to a piece of software that does this work for you.

Australia Post have something called the Postal Address File (PAF) which contains every valid delivery location in Australia. There are a number of software libraries which will do the parsing + matching for you, and either give you a definitive answer (including all the individual address components, as you're after) or provide a list of potential matches for you to choose from if the address is non-existent or ambiguous. One example I'm aware of is QAS Batch (not affiliated with them in any way, evaluated their software in the past but didn't end up using it) but that's just one example; there's a list of others accessible through the PAF website.

Cannot recommend strongly enough that you don't waste your time on this unless it's at a trivial scale.

If it is, hey, yeah, regex.

Cowan
@Cowan, thanks for a well reasoned answer. However you may assume that the input string will conform to strict format. E.g. it will always be `Richmond, VIC 3121`, not any of the other formats.
bguiz
+2  A: 

Given your reply to my other answer, this should do for the strictly-formatted case you specify:

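    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // The statements below belong inside a method.
    String sample = "3A/45 Jindabyne Rd, Oakleigh, VIC 3166";
    // Groups: 2 = unit (optional), 3 = number, 4 = street, 5 = suburb, 6 = state, 7 = postcode
    Pattern pattern = Pattern.compile("(([^/ ]+)/)?([^ ]+) ([^,]+), ([^,]+), ([^ ]+) (\\d+)");
    Matcher m = pattern.matcher(sample);
    // matches() requires the whole string to fit the format; find() would accept trailing junk
    if (m.matches()) {
        System.out.println("Unit: " + m.group(2));
        System.out.println("Number: " + m.group(3));
        System.out.println("Street: " + m.group(4));
        System.out.println("Suburb: " + m.group(5));
        System.out.println("State: " + m.group(6));
        System.out.println("Postcode: " + m.group(7));
    } else {
        throw new IllegalArgumentException("Address is not in the expected format: " + sample);
    }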

This works if you remove the '3A/' (in which case m.group(2) will be null), if the street number is '45A' or '45-47', if we add a space to the road ('Jindabyne East Rd') or to the suburb ('Oakleigh South').

Just to explain that regex further, if you're not familiar with regular expressions:

(([^/ ]+)/)? is the equivalent of just ([^/ ]+/)? -- that is, 'anything not including a forward slash or a space, followed by a slash'. The question mark makes it optional (so the whole clause can be missing), and the extra parentheses in the final version are to create a smaller inner group, without the slash, for later extraction.

([^ ]+) is 'capture anything that's not a space (which is followed by a space)' -- this is the street number.

([^,]+), is 'capture anything that's not a comma (which is followed by comma and space)' -- this is the street name. Anything is valid in the street name as long as it's not a comma.

([^,]+), is the same again, in this case to capture the suburb.

([^ ]+) captures the next non-space string (state abbreviation) and skips the space after it.

(\\d+) rounds off by capturing any number of digits (the postcode).
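To see the pieces working together, here is a sketch of the same pattern built up group by group. It uses named groups and a non-capturing `(?:...)` wrapper for the optional unit; both are my substitutions for readability, and the class, method, and group names are only illustrative.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressRegexSketch {

    // Same structure as the pattern explained above, one piece per line.
    private static final Pattern ADDRESS = Pattern.compile(
            "(?:(?<unit>[^/ ]+)/)?"    // optional unit; the slash is not captured
            + "(?<number>[^ ]+) "      // street number, up to the first space
            + "(?<street>[^,]+), "     // street name, anything except a comma
            + "(?<suburb>[^,]+), "     // suburb, anything except a comma
            + "(?<state>[^ ]+) "       // state abbreviation
            + "(?<postcode>\\d+)");    // postcode digits

    /** Returns {unit, number, street, suburb, state, postcode}; unit may be null. */
    static String[] parse(String address) {
        Matcher m = ADDRESS.matcher(address);
        // matches() requires the entire input to fit the pattern.
        if (!m.matches()) {
            throw new IllegalArgumentException("Address is not in the expected format: " + address);
        }
        return new String[] {
            m.group("unit"), m.group("number"), m.group("street"),
            m.group("suburb"), m.group("state"), m.group("postcode")
        };
    }

    public static void main(String[] args) {
        String[] samples = {
            "3A/45 Jindabyne Rd, Oakleigh, VIC 3166",
            "45-47 Jindabyne East Rd, Oakleigh South, VIC 3166"
        };
        for (String sample : samples) {
            System.out.println(Arrays.toString(parse(sample)));
        }
    }
}
```

The second sample has no unit, a street number range, and multi-word street and suburb names, so it exercises the variations mentioned earlier.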

Hope that's helpful.

Cowan
+1 and check: Nice regex + use of `Matcher#group(int)`
bguiz
A: 

For a commercial solution, you could give address-parser.com a try.

Mike Warner