To ensure data privacy, I have to publish a list of addresses after removing the street numbers.

So, for example:

1600 Amphitheatre Parkway, Mountain View, CA

needs to be published as

Amphitheatre Parkway, Mountain View, CA

What's the best way to do this in Java? Does this require regex?

+1  A: 

One possibility is to use a CASS (Coding Accuracy Support System) address-validation service, which will typically parse the address and return its components as XML. You can then easily grab the street name, city, and state, ignoring the street number.

pkananen
+2  A: 

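EDIT: How about this in Java (note `replaceFirst`: `String.replace` treats its first argument as literal text, not a regex, and strings are immutable, so the result must be assigned)...

addressString = addressString.replaceFirst("^\\s*[0-9]+\\s+", "");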
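For anyone who wants to try the Java version end to end, here is a minimal sketch. The class name `StreetNumberStripper` and the helper `stripStreetNumber` are made up for illustration; the only assumption about the data is that the street number, if present, is a run of digits at the start of the line followed by whitespace.

```java
import java.util.regex.Pattern;

class StreetNumberStripper {
    // Optional leading whitespace, one or more digits, then at least one whitespace character.
    private static final Pattern LEADING_NUMBER = Pattern.compile("^\\s*[0-9]+\\s+");

    // Hypothetical helper: returns a new string with the leading number removed.
    // Addresses that do not start with "digits + whitespace" come back unchanged.
    static String stripStreetNumber(String address) {
        return LEADING_NUMBER.matcher(address).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(stripStreetNumber("1600 Amphitheatre Parkway, Mountain View, CA"));
        // prints: Amphitheatre Parkway, Mountain View, CA
    }
}
```

Precompiling the pattern is only a convenience when looping over many address lines; a plain `replaceFirst` on the string does the same thing.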

or JavaScript...

addressString.replace(/^\s*[0-9]+\s+/,'');

My original suggestion was (JavaScript)...

addressString.replace(/^\s*[0-9]+\s*(?=.*$)/,'');
El Ronnoco
Be careful not to call it twice on '123 2nd Street, Nowhereville'
Wrikken
I did intend that it was only called once per line :D
El Ronnoco
In fact `/^\s*[0-9]+\s+/` is simpler and probably works better. The lookahead isn't necessary. Also, this will ensure that '7th Street' doesn't get turned into 'th Street'.
El Ronnoco
@Wrikken My updated answer will be safe to use on this as it insists on a following whitespace character.
El Ronnoco
The OP has already asked a separate question asking for this to be translated into valid Java code, but would you care to fix it here for posterity? It should be `addressString.replaceFirst("^\\s*[0-9]+\\s*(?=.*$)", "");`
Mark Peters
Also @El Ronnoco: I'm not sure that makes it idempotent. Is it universally accepted that no road begins with a number not followed by "th" or "st", etc.? For example, I can think of a local road named "Twenty Road", and I can imagine somebody listing their address as "415 20 Road, ...". I think your solution is exactly what the OP asked for; I just think the real solution here is to use an existing library that takes locales etc. into consideration and even looks the address up in a database like Google Maps before stripping the street number.
Mark Peters
@Mark Peters: Apologies - I put my code in JavaScript syntax. I shall update. With regard to the second point - I don't think I have ever seen an address of the form "1st High Street" - certainly not in the UK (although the OP is apparently from the US). '415 20 Road' would become '20 Road', as the regex insists on matching a following whitespace character. However, '20 Road' would then be changed to 'Road' if the replacement were run again. I'm not sure exactly how critical the OP's problem is, but for a quick-fix initial data cleanse this seems simpler than sourcing and plugging into an existing library solution.
El Ronnoco
@Mark Peters - Further apologies - I've just looked up what idempotent means :D I think the OP will just be iterating address lines and performing one operation per line.
El Ronnoco
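To make the edge cases from the comments above concrete, here is a small Java sketch that runs the `^\s*[0-9]+\s+` pattern once and then a second time over the addresses mentioned. The class name `StreetNumberEdgeCases` and the method `strip` are hypothetical, and the town name on the last sample is a placeholder.

```java
class StreetNumberEdgeCases {
    // Same pattern as in the answer: leading digits must be followed by whitespace.
    static String strip(String address) {
        return address.replaceFirst("^\\s*[0-9]+\\s+", "");
    }

    public static void main(String[] args) {
        String[] samples = {
            "7th Street, Nowhereville",      // ordinal: digits followed by letters, not whitespace
            "123 2nd Street, Nowhereville",  // number followed by an ordinal street name
            "415 20 Road, Somewhere"         // number followed by a purely numeric street name
        };
        for (String s : samples) {
            String once = strip(s);
            String twice = strip(once);
            System.out.println(s + " | once: " + once + " | twice: " + twice);
        }
        // "7th Street, ..."     is unchanged by either pass.
        // "123 2nd Street, ..." becomes "2nd Street, ..." and stays that way.
        // "415 20 Road, ..."    becomes "20 Road, ..." and then "Road, ..." on the second pass.
    }
}
```

So the replacement is safe for ordinal street names but not idempotent for numeric ones, which is why it should be applied exactly once per line.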
+2  A: 

This is a technically difficult problem to solve. But I don't think that matters.

You say you want to strip out the street number from the address to ensure data privacy. How in the world do you think that ensures privacy? I mean, it might give a little privacy to those who live on a street with a few thousand homes, but on a medium street it narrows it down to a few hundred people; on a small street there are maybe a few choices and on some rural roads it may tell you exactly which house the address corresponds to.

This is not sanitization.

The problem is then compounded greatly if you are associating any other data with that address.

Mark Peters
+1 because even though the regex answer technically addresses the question, THIS answer seems much more relevant.
Jake