Hi,

I would like to extract a portion of a text using a regular expression. For example, I have an address and want to return just the number and streets, excluding the rest:

2222 Main at King Edward Vancouver BC CA

But the addresses vary in format most of the time. I tried using a lookahead regex (I had been calling it a lookbehind) and came up with this expression:

.*?(?=\w* \w* \w{2}$)

The above expression handles the example nicely, but it gets way too messy as soon as commas come into the text, or postal codes, which can be a 6-character string or two 3-character strings with a space in the middle, etc.
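To make the problem concrete, here is a minimal Python sketch of the expression from the question, run against the example address and a comma-separated variant. The function name `street_part` is made up for illustration, and the `.strip()` call is an addition to drop the trailing space the pattern leaves behind.

```python
import re

# The expression from the question: lazily consume text until the remainder
# looks like "<word> <word> <two word characters>" at the end of the string.
PATTERN = re.compile(r".*?(?=\w* \w* \w{2}$)")

def street_part(address):
    """Return the text before the trailing three fields, or None if no match."""
    match = PATTERN.match(address)
    return match.group(0).strip() if match else None

print(street_part("2222 Main at King Edward Vancouver BC CA"))
# 2222 Main at King Edward
print(street_part("2222 Main at King Edward Vancouver, BC, CA"))
# None
```

The second call returns `None` because `\w*` cannot match the commas, so the lookahead never succeeds.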

Is there a more elegant way of extracting a portion of text than a lookaround regex?

Any suggestion or a point in another direction is greatly appreciated.

Thanks!

+2  A: 

Regular expressions are for data that is REGULAR, that follows a pattern. So if your data is completely random, no, there's no elegant way to do this with regex.

On the other hand, if you know what values you want, you can probably write a few simple regexes, and then just test them all on each string.

Ex. regex1 = address-number grabber, regex2 = street-type grabber, regex3 = name grabber.

Attempt a match on string1 with regex1, regex2, and finally regex3. Move on to the next string.
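A rough Python sketch of that idea might look like the following. The three patterns are illustrative assumptions, not a tested address grammar: the street-type list is arbitrary, and the postal-code pattern assumes the Canadian letter-digit-letter layout, with or without a space, in place of the name grabber. `PATTERNS` and `grab_fields` are made-up names.

```python
import re

# Hypothetical "grabbers": each one is simple and only looks for one thing.
PATTERNS = {
    "number": re.compile(r"^\d+"),
    "street_type": re.compile(
        r"\b(?:St|Street|Ave|Avenue|Rd|Road|Blvd)\b", re.IGNORECASE
    ),
    # Assumes a Canadian-style code such as "V5K 0A1" or "V5K0A1".
    "postal_code": re.compile(r"\b[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d\b"),
}

def grab_fields(address):
    """Try every grabber on one string and keep whatever matched."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(address)
        if match:
            found[name] = match.group(0)
    return found

for line in ["2222 Main St Vancouver BC CA V5K 0A1",
             "2222 Main at King Edward Vancouver BC CA"]:
    print(grab_fields(line))
```

Each string is tested against every pattern in turn, so a field that is missing from one address simply does not appear in that address's result.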

ryansstack
That's what I thought. Oh well, I guess I have to go do the messy stuff. Thanks Ryan!
Jaime
A: 

Well, I thought I'd throw my hat into the ring:

.*(?=,? ([a-zA-Z]+,?\s){3}([\d-]*\s)?)

You might want ^ or \d+ at the front for good measure.
I didn't bother specifying lengths for the postal codes; this one just accepts any number of digits and hyphens.

It works for these inputs so far, and for variations on commas within the city/state/country area:

  • 2222 Main at King Edward Vancouver, BC, CA, 333-333
  • 555 road and street place CA US 95000
  • 2222 Main at King Edward Vancouver BC CA 333
  • 555 road and street place CA US

It is counting on there being three words at the end for the city, state and country, but other than that it's like ryansstack said: if it's random, it won't work. If the city is two words, like New York, it won't work. Yeah... regex isn't the tool for this one.
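For anyone trying the expression outside regexhero.net, here is a small Python sketch that runs it on the inputs listed above. The names `VICTOR` and `street_before_region` are made up for illustration. One thing to watch: the lookahead requires whitespace after each of the three trailing words, so an address that ends right after the country only behaves as intended if it is followed by whitespace such as a newline.

```python
import re

# The expression from this answer, unchanged.
VICTOR = re.compile(r".*(?=,? ([a-zA-Z]+,?\s){3}([\d-]*\s)?)")

def street_before_region(address):
    """Return the part matched before the three trailing words, or None."""
    match = VICTOR.match(address)
    return match.group(0) if match else None

print(street_before_region("2222 Main at King Edward Vancouver, BC, CA, 333-333"))
# 2222 Main at King Edward
print(street_before_region("555 road and street place CA US 95000"))
# 555 road and street
print(street_before_region("555 road and street place CA US"))
# 555 road and      (no whitespace after "US", so the match shifts left)
print(street_before_region("555 road and street place CA US\n"))
# 555 road and street
```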

btw: tested on regexhero.net

Victor
Thanks Victor! I'll try and test it with more data on my end.
Jaime
A: 

I can think of 2 ways you can do this:

1) If you know that "the rest" of your data after the address is exactly 2 fields, i.e. BC and CA, you can split your string using space as the delimiter and remove the last 2 items.

2) Split on the delimiter /[A-Z][A-Z]/ and store the result in an array, then print out the first element of the array (this is provided that the address itself doesn't contain 2 or more consecutive capital letters).
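Both options can be sketched in a few lines of Python. The function names are made up for illustration, and taking the first piece of the split in option 2 is one reading of "print out the array".

```python
import re

def drop_last_fields(address, n=2):
    """Option 1: split on whitespace and drop the last n fields (n >= 1)."""
    fields = address.split()
    return " ".join(fields[:-n])

def before_two_capitals(address):
    """Option 2: split on any two consecutive capitals and keep the first piece."""
    parts = re.split(r"[A-Z][A-Z]", address)
    return parts[0].strip()

sample = "2222 Main at King Edward Vancouver BC CA"
print(drop_last_fields(sample))     # 2222 Main at King Edward Vancouver
print(before_two_capitals(sample))  # 2222 Main at King Edward Vancouver
```

Note that both versions keep the city name, since only the two-letter province and country fields are removed.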

ghostdog74
Thanks for the input. Appreciate it!
Jaime