views: 471
answers: 9

What are the best algorithms for recognizing structured data on an HTML page?

For example, Google will recognize a home or company address in an email and offer a map to that address.

A: 

A regular expression would be suited to do this matching.

Mitch Wheat
+2  A: 

If you also have to handle international addresses, you're in for a world of headaches; international address formats are amazingly varied.

Michael Borgwardt
A: 

What I meant is the recognition of a postal address on an HTML page. The address can appear in many forms; the city can come before the street, etc.

There is a large body of text and I am looking for addresses in that text. My algorithm should recognize whether there is an address on the page and parse it into a data structure, storing the street, the city, and the postal code in separate fields.
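As a concrete target for what is described here, the parsed result might be stored in something like this (a minimal sketch; the field names are my own, not taken from any standard):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PostalAddress:
    """Target structure for a recognized address.

    Field names are illustrative; real address schemas vary by country.
    """
    street: str
    city: str
    postal_code: str
    country: Optional[str] = None  # often unknown when scraping a page
```

A recognizer would then emit `PostalAddress` instances rather than raw matched strings, which makes downstream steps (deduplication, geocoding) much easier.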

gyurisc
You should edit the original post.
John the Statistician
A: 

Again, regular expressions should do the trick.

Because of the wide variety of address formats, you can only guess whether a string is an address by using an expression like "(number), (name) Street|Boulevard|Main", etc.

You could also look into some Firefox extensions that aim to map addresses found in text, to see how they work.
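A minimal sketch of the guessing pattern described above, in Python. The pattern, the suffix list, and the function name are all illustrative and deliberately loose; as the other answers note, no single regex will get every address right:

```python
import re

# US-centric sketch: a house number, a capitalized street name, and a
# common street suffix. Real-world addresses will defeat any one regex.
ADDRESS_RE = re.compile(
    r"\b(\d{1,5})\s+"                             # house number
    r"([A-Z][A-Za-z]*(?:\s[A-Z][A-Za-z]*)*)\s+"   # street name word(s)
    r"(Street|St|Avenue|Ave|Boulevard|Blvd|Road|Rd|Lane|Ln)\b",
    re.IGNORECASE,
)

def find_candidate_addresses(text):
    """Return (number, street, suffix) tuples for every loose match."""
    return ADDRESS_RE.findall(text)
```

For example, `find_candidate_addresses("Visit us at 123 Main Street tomorrow")` yields `[("123", "Main", "Street")]` — but it will also match plenty of non-addresses, which is why the verification ideas in the later answers matter.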

antileet
+1  A: 

What you're asking is really quite a hard problem if you want to get it perfect. While a simple regexp will get it mostly right most of the time, writing one that gets it exactly right every time is fiendishly hard. There are plenty of strange corner cases, and in several of them there is no single unambiguous answer. Most web sites I've seen do a pretty bad job handling all but the simplest URLs.

If you want to go down the regexp route, your best bet is probably to check out the source code of http://search.cpan.org/~abigail/Regexp-Common-2.122/lib/Regexp/Common/URI/http.pm

+2  A: 

Do not use regular expressions to parse HTML; use an existing HTML parser. In Python, for example, I strongly recommend BeautifulSoup, even if you then run a regular expression over the text elements BeautifulSoup grabs.

If you roll your own regexes, you not only have to worry about finding the data you require, you also have to worry about things like invalid HTML and lots of other very non-obvious problems you'll stumble over.
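To show the parser-first idea in code without pulling in a third-party package, here is a sketch using the standard library's html.parser to reduce a page to visible text before any address regexes run (BeautifulSoup, as recommended above, would do the same with less code and more robustness against invalid HTML):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text of a page, skipping script/style blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # >0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def page_text(html):
    """Return the page's visible text as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Address matching then runs over `page_text(html)` instead of the raw markup, sidestepping tags, attributes, and scripts entirely.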

dbr
+3  A: 

I'd guess that Google takes a two-step approach to the problem (at least that's what I would do). First they use some fairly general search pattern to pick out everything that could be an address, and then they look that string up in their map database to see if they get any matches. If they do, it's probably an address; if they don't, it probably isn't. If you can use a map database in your code, that will probably make your life easier.

Unless you can limit the geographic location of the addresses, I'm guessing that it's pretty much impossible to identify a string as an address just by parsing it, simply due to the huge variation of address formats used around the world.
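The two-step approach described above might look like this in Python. Here `geocode` stands in for whatever map-database lookup you actually have access to, and both the loose pattern and the function names are assumptions, not a real API:

```python
import re

# Step-one pattern: a house number, then up to 60 non-sentence-ending
# characters, then something that looks like a 5-digit ZIP code.
LOOSE_ADDRESS_RE = re.compile(r"\d{1,5}[^.\n]{0,60}\b\d{5}\b")

def extract_addresses(text, geocode):
    """Return the candidates that a map database recognizes.

    `geocode` is a hypothetical callable wrapping your geocoding
    service; it returns coordinates for real addresses and None
    otherwise.
    """
    # Step one: grab anything that loosely resembles an address.
    candidates = LOOSE_ADDRESS_RE.findall(text)
    # Step two: keep only the candidates the map database can resolve.
    return [c for c in candidates if geocode(c) is not None]
```

The loose pattern is allowed to over-match, because the geocoder acts as the precise filter; that division of labor is the whole point of the two-step design.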

+5  A: 

A named-entity extraction framework such as GATE has at least tackled the information extraction problem for locations, assisted by a gazetteer of known places to help resolve common issues. Unless the pages were machine generated from a common source, you're going to find regular expressions a bit weak for the job.

John the Statistician
+4  A: 

If you have the markup proper—and not just the text from the page—I second the Beautiful Soup suggestion above. In particular, the address tag should provide the lowest of low-hanging fruit. Also look into the adr microformat. I'd only fall back to regexes if the first two didn't pull in enough info or I didn't have the necessary data to look for the first two.
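A sketch of harvesting the address tag with the standard library's html.parser (the class and method names are mine; with BeautifulSoup, `soup.find_all("address")` does the same in one line, and the adr microformat gets similar treatment by matching on `class="adr"`):

```python
from html.parser import HTMLParser

class AddressTagFinder(HTMLParser):
    """Collect the text inside each <address> element on a page."""

    def __init__(self):
        super().__init__()
        self.addresses = []
        self._inside = 0   # nesting depth inside <address>
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "address":
            self._inside += 1

    def handle_endtag(self, tag):
        if tag == "address" and self._inside:
            self._inside -= 1
            if self._inside == 0:
                self.addresses.append(" ".join(self._buf))
                self._buf = []

    def handle_data(self, data):
        if self._inside and data.strip():
            self._buf.append(data.strip())
```

Anything this finds was explicitly marked up as an address by the page author, so it needs no guessing at all — which is why it's worth checking before any heuristic approach.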

Hank Gay