1330 views · 7 answers

I need to take a web page and extract the address information from the page. Some are easier than others. I'm looking for a Firefox plugin, Windows app, or VB.NET code that will help me get this done.

Ideally I would like to have a web page on our admin site (ASP.NET/VB.NET) where you enter a URL and it scrapes the page and returns a DataSet that I can put in a Grid.

+1  A: 

What type of address information are you referring to?

There are a couple of Firefox plugins, Operator and Tails, that allow you to extract and view microformats from web pages.

Scott Nichols
A: 

From pages like this: http://www.ashnha.com/members.php

Brian Boatright
+1  A: 

Aza Raskin has talked about recognising when selected text is an address in his Firefox Proposal: A Better New Tab Screen. No code yet, but I mention it as there may be code in firefox to do this in the future.

Alternatively, you could look at using the map command in Ubiquity, although you'd have to select the addresses yourself.

Sam Hasler
+1  A: 

If you know the format of the page (for instance, if they're all like that ashnha.com page) then it's fairly easy to write VB.NET code that does this:

  1. Create a System.Net.WebRequest and read the response into a string.
  2. Then create a System.Text.RegularExpressions.Regex and iterate over the collection of Matches between that and the string you just retrieved. For each match, create a new row in a DataTable.
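The two steps above could be sketched roughly like this. The regex pattern and the column names are placeholders you'd tailor to the actual page; this is not a drop-in solution:

```vbnet
Imports System.Net
Imports System.IO
Imports System.Data
Imports System.Text.RegularExpressions

Module Scraper
    Function ScrapeAddresses(ByVal url As String) As DataTable
        ' Step 1: download the page into a string
        Dim request As WebRequest = WebRequest.Create(url)
        Dim html As String
        Using response As WebResponse = request.GetResponse()
            Using reader As New StreamReader(response.GetResponseStream())
                html = reader.ReadToEnd()
            End Using
        End Using

        ' Step 2: iterate over the regex matches, one DataTable row per match
        Dim table As New DataTable("Addresses")
        table.Columns.Add("Street")
        table.Columns.Add("City")
        ' Hypothetical pattern matching e.g. "123 Main St, Anytown" --
        ' you'd need to write one that fits the page's actual HTML
        Dim pattern As String = "(?<street>\d+\s[\w\s\.]+),\s*(?<city>[\w\s]+)"
        For Each m As Match In Regex.Matches(html, pattern)
            Dim row As DataRow = table.NewRow()
            row("Street") = m.Groups("street").Value
            row("City") = m.Groups("city").Value
            table.Rows.Add(row)
        Next
        Return table
    End Function
End Module
```

You could then bind the result straight to your grid (`MyGrid.DataSource = ScrapeAddresses(url)` followed by `MyGrid.DataBind()`).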

The tough bit is writing the regex, which is a bit of a black art. See regexlib.com for loads of tools, books etc about regexes.

If the HTML format isn't well-defined enough for a regex, then you're probably going to have to rely on some amount of user intervention in order to identify which bits are the addresses...

Matt Bishop
+1  A: 

For general HTML screen scraping in VB.NET, check out HTML Agility Pack. Much easier than trying to Regex it (unless you happen to be a Regex ninja already!)

The page you mentioned in your answer would be easy to automate, as the addresses are in a consistent format.
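For what it's worth, the Agility Pack version of the download-and-extract step is only a few lines. The XPath here is a guess — you'd inspect the real page to find the nodes that actually hold the addresses:

```vbnet
Imports HtmlAgilityPack

Module HapExample
    Sub Main()
        Dim web As New HtmlWeb()
        Dim doc As HtmlDocument = web.Load("http://www.ashnha.com/members.php")
        ' "//td" is a placeholder XPath -- replace it with a selector
        ' for whatever elements contain the address text on the page
        For Each node As HtmlNode In doc.DocumentNode.SelectNodes("//td")
            Console.WriteLine(node.InnerText.Trim())
        Next
    End Sub
End Module
```

The big win over a regex is that the Agility Pack tolerates the malformed HTML you find in the wild, and XPath queries are much easier to maintain than a pattern full of escapes.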

But to allow the users to point to any page, that's a much harder job. The data could be in any format at all. You could write something that dumps all the text, guesses how it is divided, tries to recognise bits like country and state names, telephone numbers, etc., and then shows your results in an interface that lets the users complete missing sections, move the dividers, and identify the bits you missed or they didn't want.

It's not simple though, and making an interface that provides a big advantage over simply cutting and pasting into validated form fields would be quite an achievement I think - I'd be interested to know how you get on!

EDIT: Just noticed this other question that might cover quite a bit of what you want to do: http://stackoverflow.com/questions/16413/parse-usable-street-address-city-state-zip-from-a-string

Colin Pickard
A: 

Dapper.net is the best for screen scraping.

A: 

Automation Anywhere is a Windows application that you can use to extract data from the web. You enter the URL you want to extract data from and then scrape all the email addresses you want. You can export the scraped data into any database. As far as I know it can't be used as a browser plug-in, but in terms of performance it's a very nice tool. I also found this example which shows how it extracts people's data, and a free trial is available.

Enjoy

Bob