I am trying to extract a US address from a text.

So if I have the following variations of text, I'd like to extract the address portion:

Today is a good day to meet up at a bar. the address is 123 fake street, NY, 23423-3423

 just came from 423 Elm Street, kk, 34223 ...had awesome time

blah blah bleh blah 23414 Fake Terrace, MM something else

 experimented my teleporter to get to work but reached at 2423 terrace NY

If someone can provide some starting points then I can mold it for other variations.

+1  A: 

Good question, but you cannot reliably extract an address with a regex or any similar technique.

You can extract a mobile number or an email address, but you cannot properly extract an address.

AjmeraInfo
+2  A: 

At some point, you'd have to clarify what you consider an address to be.

Does an address just have a street number and street name?

Does an address have a street name, and a city name?

Does an address have a city name, a state name?

Does an address have a city name, a state abbreviation, and a zip code? What format is the zip code in?

It's easy to see how you can run into trouble quickly.

This obviously wouldn't catch everything, but maybe you could match strings that start with a street number, have a state abbreviation somewhere in the middle, and end in a zip code. The reliability of this would depend greatly on knowing what sort of text you are using as input. I.e., if there are a lot of other numbers in the text, this could be completely useless.

possible regex

\d+.+(?=AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY)[A-Z]{2}[, ]+\d{5}(?:-\d{4})?

sample input

hello world this is me posting an address. please go to 312 N whatever st., New York NY 10001.

If you can find me there. I might be at 123 Invalid address.

Please send all letters to 115A Address Street, Suite 100, Google KS, 66601

42 NE Another Address, Some City with 9 digit zip, AK 55555-2143

Hope this helps!

matches

312 N whatever st., New York NY 10001
115A Address Street, Suite 100, Google KS, 66601
42 NE Another Address, Some City with 9 digit zip, AK 55555-2143
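
To check the matches above yourself, here is a minimal sketch that runs the regex over the sample input. It assumes Python's standard `re` module, which the original post does not mention; the names `STATES`, `ADDRESS_RE`, `SAMPLE` and `find_addresses` are my own and purely illustrative.

```python
import re

# Two-letter codes copied verbatim from the regex above.
STATES = (
    "AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|"
    "MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|"
    "SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY"
)

# Same pattern as in the answer, assembled from pieces for readability.
ADDRESS_RE = re.compile(
    r"\d+.+(?=" + STATES + r")[A-Z]{2}[, ]+\d{5}(?:-\d{4})?"
)

SAMPLE = """hello world this is me posting an address. please go to 312 N whatever st., New York NY 10001.

If you can find me there. I might be at 123 Invalid address.

Please send all letters to 115A Address Street, Suite 100, Google KS, 66601

42 NE Another Address, Some City with 9 digit zip, AK 55555-2143

Hope this helps!
"""


def find_addresses(text):
    """Return every substring of text that the pattern matches."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return ADDRESS_RE.findall(text)


matches = find_addresses(SAMPLE)
for match in matches:
    print(match)
```

Because `.` does not match a newline by default, each match stays on a single line, so an address that wraps across lines in the input would be missed.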

regex explanation

\d+                      digits (0-9) (1 or more times (matching the most amount possible))
.+                       any character except \n (1 or more times (matching the most amount possible))
(?=                      look ahead to see if there is:
  AL|AK|AS|...             'AL', 'AK', 'AS', ... (valid state abbreviations)
)                        end of look-ahead
[A-Z]{2}                 any character of: 'A' to 'Z' (2 times)
[, ]+                    any character of: ',', ' ' (1 or more times (matching the most amount possible))
\d{5}                    digits (0-9) (5 times)
(?:                      group, but do not capture (optional (matching the most amount possible)):
  -                        '-'
  \d{4}                    digits (0-9) (4 times)
)?                       end of grouping
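
The explanation above can also live inside the pattern itself. The following is only a sketch, assuming Python and its `re.VERBOSE` flag (not something the original post uses); `ADDRESS_VERBOSE_RE`, `QUESTION_SAMPLES` and `first_address` are hypothetical names. It runs the pattern against the four sample texts from the question to show where this approach stops working.

```python
import re

# Two-letter codes copied verbatim from the regex above.
STATES = (
    "AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|"
    "MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|"
    "SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY"
)

# In verbose mode, whitespace outside character classes is ignored and
# everything after a '#' is a comment, so the pattern documents itself.
ADDRESS_VERBOSE_RE = re.compile(
    r"""
    \d+             # street number: one or more digits
    .+              # anything up to the state code (greedy, single line)
    (?=%s)          # look ahead: the next two letters must be a known code
    [A-Z]{2}        # consume the two-letter code
    [, ]+           # commas and/or spaces before the zip
    \d{5}           # five-digit zip
    (?:-\d{4})?     # optional four-digit extension
    """ % STATES,
    re.VERBOSE,
)

# The four sample texts from the question.
QUESTION_SAMPLES = [
    "Today is a good day to meet up at a bar. the address is 123 fake street, NY, 23423-3423",
    " just came from 423 Elm Street, kk, 34223 ...had awesome time",
    "blah blah bleh blah 23414 Fake Terrace, MM something else",
    " experimented my teleporter to get to work but reached at 2423 terrace NY",
]


def first_address(text):
    """Return the first match in text, or None if nothing matches."""
    found = ADDRESS_VERBOSE_RE.search(text)
    return found.group(0) if found else None


results = [first_address(sample) for sample in QUESTION_SAMPLES]
for sample, result in zip(QUESTION_SAMPLES, results):
    print(repr(result), "<-", sample.strip())
```

Only the first sample matches. The others fail because the pattern requires both an uppercase code from the list and a zip code, which is exactly the kind of constraint you would need to relax for the other variations.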
macek
Yeah, I'd have to come up with constraints at some point. But for now, I am assuming that there won't be many numbers in the text, so the text after the numbers will be the street name. What follows (with or without a comma) would be the city, and after that (again with or without a comma) the state or its abbreviation.
drake
drake, I updated my answer to provide an example. Hope this helps :)
macek
AjmeraInfo, as noted in my answer, I said, *"This obviously wouldn't catch everything,"* followed by, *"The reliability of this would greatly depend on knowing what sort of text you were using as the input."*
macek
Awesome. Thanks, smotchkkiss.
drake
+1 @smotchkiss good answer.
KandadaBoggu