Getting a list of all churches in a certain state using Python.


Hi, I am pretty good with Python, so pseudo-code will suffice when details are trivial. Please get me started on the task: how do I go about crawling the net for the snail mail addresses of churches in my state? Once I have a one-liner such as "123 Old West Road #3 Old Lyme City MD 01234", I can probably parse it into city, state, street, number, and apartment with enough trial and error. My problem is this: if I use the white pages online, how do I deal with all the HTML junk, HTML tables, ads, etc.? I do not think I need their phone numbers, but they will not hurt; I can always throw them out once parsed. Even if your solution is half-manual (such as save to PDF, then open Acrobat, save as text), I might still be happy with it. Thanks! Heck, I will even accept Perl snippets; I can translate them myself.

+1 A:

Try lynx --dump <url> to download the web pages. All the troublesome HTML tags will be stripped from the output, and all the links from the page will appear together.
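If lynx is not available, much the same effect can be approximated in Python. The following is a minimal sketch using only the standard library's html.parser; it is a stand-in for the lynx output, not a reproduction of it. The class name TextDump, the dump() helper, and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser


class TextDump(HTMLParser):
    """Collect visible text and link targets, roughly like a text-mode dump."""

    SKIP = {"script", "style"}  # content of these tags is never visible text

    def __init__(self):
        super().__init__()
        self.chunks = []      # visible text fragments, in document order
        self.links = []       # href values of every <a> tag seen
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.chunks.append(text)


def dump(html):
    """Return (text, links) for an HTML string."""
    parser = TextDump()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks), parser.links


# Hypothetical listing page; a real white-pages page will look different.
sample = """
<html><head><style>td {color: red}</style></head><body>
<table><tr><td>First Church</td><td>123 Old West Road #3</td></tr></table>
<a href="/page2">Next</a>
</body></html>
"""
text, links = dump(sample)
print(text)
print(links)
```

Note that the table cells come out as plain lines of text: nothing in the output marks which line was the name and which was the address, so this works best when the page layout is simple.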

mobrule 2009-12-14 22:36:49

Without HTML tags, it might be difficult to find the correct content...

Skilldrick 2009-12-14 22:38:21

Huh? If you're scraping different web sites with arbitrary layouts, the HTML is more likely to get in your way.

mobrule 2009-12-14 23:38:05

I would prefer scraping just one website if I can.

Hamish Grubijan 2009-12-15 01:13:53

+2 A:

You could use mechanize. It's a Python library that simulates a browser, so you could crawl through the white pages much as you would by hand.

To deal with the 'HTML junk', Python has a library for that too: BeautifulSoup. It is a lovely way to get the data you want out of HTML (of course, it assumes you know a little bit about HTML, as you will still have to navigate the parse tree).

Update: As to your follow-up question on how to click through multiple pages, mechanize is a library that does just that. Take a closer look at their examples, especially the follow_link method. As I said, it simulates a browser, so 'clicking' can be implemented quickly in Python.
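To make the "click through multiple pages" loop concrete, here is a minimal sketch that uses only the standard library rather than mechanize, so it runs without any installs. It assumes each results page links to the next one with an anchor whose text is exactly "Next"; that label, the names NextLinkFinder and crawl, and the example.test URLs are all hypothetical. The fetch argument is any function that returns the HTML for a URL; a dict lookup stands in for real HTTP here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Remember the href of the first <a> whose text equals the wanted label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.next_href = None
        self._current_href = None  # href of the <a> we are currently inside
        self._text = []            # text collected inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            link_text = "".join(self._text).strip()
            if self.next_href is None and link_text == self.label:
                self.next_href = self._current_href
            self._current_href = None


def crawl(start_url, fetch, label="Next", max_pages=50):
    """Yield (url, html) for each results page, following the 'Next' link."""
    url, seen = start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        yield url, html
        finder = NextLinkFinder(label)
        finder.feed(html)
        finder.close()
        # Resolve a relative href against the page it was found on.
        url = urljoin(url, finder.next_href) if finder.next_href else None


# Stand-in for real HTTP: a hypothetical two-page listing held in a dict.
FAKE_SITE = {
    "http://example.test/churches?page=1":
        '<p>First Church</p><a href="/churches?page=2">Next</a>',
    "http://example.test/churches?page=2":
        "<p>Second Church</p>",
}
pages = list(crawl("http://example.test/churches?page=1", FAKE_SITE.__getitem__))
for page_url, _ in pages:
    print(page_url)
```

Against a live site, fetch could be something like a small wrapper around urllib.request.urlopen. The seen set and the max_pages cap are there so that a page linking back to itself cannot keep the loop running forever.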

Frank 2009-12-14 22:42:54

It was hard to choose the best answer. Thanks, all!

Hamish Grubijan 2009-12-18 14:58:40

+2 A:

What you're trying to do is called scraping, or web scraping.

If you search for Python and scraping, you will find a list of tools that can help.

(I have never used scrapy, but its site looks promising :)

Seth 2009-12-14 22:46:32

+2 A:

Beautiful Soup is a no-brainer. Here's a site you might start with: http://www.churchangel.com/. They have a huge list and the formatting is very regular -- translation: easy to set up BSoup to scrape.

Peter Rowell 2009-12-14 23:17:01

+1 A:

Python scripts might not be the best tool for this job, if you're just looking for addresses of churches in a geographic area.

The US census provides a data set of churches for use with geographic information systems. If finding all the x in a spatial area is a recurring problem, invest in learning a GIS. Then you can bring your Python skills to bear on many geographic tasks.

mmsmatt 2009-12-14 23:34:17

Do you have a link to this census data? Thanks!

Hamish Grubijan 2009-12-15 04:55:57

Sure, the dataset is called TIGER/Line and is available at http://www.census.gov/geo/www/tiger/tgrshp2009/tgrshp2009.html. To start using it, read up on GIS concepts and grab a free GIS like QuantumGIS.

mmsmatt 2009-12-16 02:24:18
