
Is there a library that specializes in parsing such data?

+5  A: 

You could use something like Google Maps. Geocode the address and, if successful, Google's API will return an XML representation of the address with all of the elements separated (and corrected or completed).
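As a sketch of that approach: the Geocoding API's XML output returns the address broken into separate, corrected components. The request URL below follows the documented format (a real call needs an API key), and the sample response is abbreviated and illustrative, not a live result.

```python
# Sketch: geocode an address via Google's Geocoding API (XML output),
# then pull the separated address components out of the response.
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

def geocode_url(address, api_key):
    """Build the request URL; an API key is required for real calls."""
    query = urlencode({"address": address, "key": api_key})
    return "https://maps.googleapis.com/maps/api/geocode/xml?" + query

# Abbreviated, illustrative sample of a successful geocode response:
sample_xml = """<GeocodeResponse>
  <status>OK</status>
  <result>
    <formatted_address>1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA</formatted_address>
    <address_component>
      <long_name>1600</long_name><type>street_number</type>
    </address_component>
    <address_component>
      <long_name>Mountain View</long_name><type>locality</type>
    </address_component>
    <address_component>
      <long_name>94043</long_name><type>postal_code</type>
    </address_component>
  </result>
</GeocodeResponse>"""

def parse_components(xml_text):
    """Map each address_component's type to its long_name."""
    root = ET.fromstring(xml_text)
    parts = {}
    for comp in root.iter("address_component"):
        parts[comp.findtext("type")] = comp.findtext("long_name")
    return parts

print(parse_components(sample_xml))
# → {'street_number': '1600', 'locality': 'Mountain View', 'postal_code': '94043'}
```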

EDIT:

I'm being voted down and not sure why. Parsing addresses can be a little difficult. Here's an example of using Google to do this:

http://blog.nerdburn.com/entries/code/how-to-parse-google-maps-returned-address-data-a-simple-jquery-plugin

I'm not saying this is the only way or necessarily the best way. Just a way to parse addresses on a web site.

marcc
Up-voted you. Treating "in-the-cloud" services like the Google Maps API as a library (which is what the poster asked for) is valid, IMHO.
Chris Simmons
:) thank you for the vote
marcc
Maybe the downvotes are for not addressing how to get the addresses out of the HTML page in the first place? Just a guess.
ysth
A: 

This isn't really an answer, but here are some pointers that may be useful.

Dive into Python has some info on address parsing. http://www.diveintopython.org/regular_expressions/street_addresses.html

Similar questions on SO:

http://stackoverflow.com/questions/518210/where-is-a-good-address-parser

http://stackoverflow.com/questions/16413/parse-usable-street-address-city-state-zip-from-a-string

hashable
+2  A: 

There are two parts to this: extracting the complete address from the page, and parsing that address into something you can use (storing the various parts in a DB, for example).

For the first part you will need a heuristic, most likely country-dependent: for US addresses, [A-Z][A-Z],?\s*\d\d\d\d\d should give you the end of an address, provided the two letters turn out to be a state. Finding the beginning of the string is left as an exercise.
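That end-of-address heuristic can be sketched in a few lines of Python. The state list below is abbreviated for illustration; in practice you would use all 50 abbreviations (plus DC and the territories).

```python
# Sketch of the end-of-address heuristic: find "XX 12345" or "XX, 12345"
# and keep it only if the two letters are a real state abbreviation.
import re

# Abbreviated for illustration; use the full list of abbreviations in practice.
US_STATES = {"CA", "NY", "TX", "WA", "IL"}

END_OF_ADDRESS = re.compile(r"\b([A-Z][A-Z]),?\s*(\d{5})\b")

def find_address_ends(text):
    """Return (state, zip, end_offset) for each plausible address ending."""
    hits = []
    for m in END_OF_ADDRESS.finditer(text):
        if m.group(1) in US_STATES:
            hits.append((m.group(1), m.group(2), m.end()))
    return hits

print(find_address_ends("Visit us at 1 Main St, Springfield, IL 62701 today"))
# → [('IL', '62701', 44)]
```

Finding where the address *begins* is the harder half, as noted above; this only anchors the end.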

The second part can be done either through a call to Google Maps or, as usual in Perl, with a CPAN module: Lingua::EN::AddressParse (test it on your data to see if it works well enough for you).

In any case this is a difficult task, and you will most likely never get it 100% right, so plan for manually checking the addresses before using them.

mirod
A: 

Here is a link to my street address parser, written in Python using pyparsing.

Paul McGuire
A: 

You don't need regular expressions (yet) or a general parser like pyparsing (at all). Look at something like Beautiful Soup, which will parse even bad HTML into something like a tree of tags. From there, look at the source of the page and work out which tags to drill down through to reach the data. Then you can search the tree for those nodes (recent versions support CSS selectors via the select() method) and loop directly over the tags you're interested in, getting at the actual data easily. From there, a quick regex or something similar can parse out the address itself. This will be more flexible, more future-proof, and possibly less head-exploding than trying to do the whole thing in pure regular expressions.
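A minimal sketch of that approach, assuming hypothetical markup where each address lives in a div with class "listing" (the sample HTML and class name are made up for illustration):

```python
# Sketch: let Beautiful Soup parse the HTML, select the tags of interest
# with a CSS selector, then apply a quick regex to each tag's text.
import re
from bs4 import BeautifulSoup

# Hypothetical page markup; real pages will need their own selector.
SAMPLE_HTML = """
<html><body>
  <div class="listing">Acme Corp, 1 Main St, Springfield, IL 62701</div>
  <div class="listing">Widget Co, 9 Oak Ave, Portland, OR 97205</div>
</body></html>
"""

# street, city, state, zip -- deliberately rough, per the answer above.
ADDRESS_RE = re.compile(r"(\d+\s[^,]+),\s*([^,]+),\s*([A-Z]{2}),?\s*(\d{5})")

def extract_addresses(html):
    """Drill down to the interesting tags, then regex their text content."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for tag in soup.select("div.listing"):  # CSS selector via select()
        m = ADDRESS_RE.search(tag.get_text())
        if m:
            found.append(m.groups())
    return found

for street, city, state, zipcode in extract_addresses(SAMPLE_HTML):
    print(street, "|", city, "|", state, "|", zipcode)
```

The division of labor is the point: Beautiful Soup handles the messy HTML, and the regex only ever sees the short text of a single tag.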

Lee B