I need to scrape Form 10-K reports (i.e. annual reports of US companies) from the SEC website for a project.

The trouble is that companies do not use the same format when filing this data. For example, the real estate data for two different companies could be displayed as below:

1st company

Property name   State  City     Ownership   Year  Occupancy Total Area
-------------   -----  ------   ---------   ----  --------- ----------
ABC Mall         TX    Dallas   Fee         2007    97%       1,347,377
XYZ Plaza        CA    Ontario  Fee         2008    85%       2,252,117



2nd company

Property          % Ownership  % Occupancy  Rent   Square Feet
---------------   -----------  -----------  -----  -----------
New York City
  ABC Plaza       100.0%        89.0%     38.07    2,249,000 
  123 Stores      100.0%        50.0%     18.00    1,547,000 
Washington DC Office
  12th street     .......
  2001, J Drive   .......

etc.

Likewise, the data layout could be entirely different for other companies.

I would like to know if there are better ways to scrape this type of heterogeneous data than writing complex regex searches.

I am free to use Java, Perl, Python or Groovy for this work.

+2  A: 

I'd be inclined to keep a library of meta files, one describing the layout of each page you want to scrape, and use the matching file when extracting the data.

That way you don't need complex regex commands, and if a site changes its design you simply update a single one of your files.

How you create the meta files is up to you, but pertinent class names or tags might be a good start, along with a description of how to extract the data from each tag.
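To illustrate the idea in Python (one of the languages the asker listed): below is a minimal sketch where a hypothetical `LAYOUTS` dict stands in for the per-company meta files, mapping each field name to its column's character span. In practice each descriptor would live in its own external config file, so a change in a filer's format means editing one descriptor rather than any code. The column spans shown match the first company's sample table above and are illustrative only.

```python
# Hypothetical per-company layout descriptors: field name -> (start, end)
# character span of that column in the fixed-width table. In a real system
# these would be loaded from one meta file per filer.
LAYOUTS = {
    "company1": {
        "property":   (0, 16),
        "state":      (16, 23),
        "city":       (23, 32),
        "ownership":  (32, 44),
        "year":       (44, 50),
        "occupancy":  (50, 60),
        "total_area": (60, 72),
    },
}

def parse_table(text, layout):
    """Slice each data line into named fields using the layout's column spans."""
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and the dashed separator row.
        if not stripped or set(stripped) <= {"-", " "}:
            continue
        # Skip the header row (crude check, good enough for a sketch).
        if stripped.startswith("Property"):
            continue
        rows.append({name: line[a:b].strip() for name, (a, b) in layout.items()})
    return rows

# The first company's table from the question, verbatim.
sample = (
    "Property name   State  City     Ownership   Year  Occupancy Total Area\n"
    "-------------   -----  ------   ---------   ----  --------- ----------\n"
    "ABC Mall         TX    Dallas   Fee         2007    97%       1,347,377\n"
    "XYZ Plaza        CA    Ontario  Fee         2008    85%       2,252,117\n"
)

rows = parse_table(sample, LAYOUTS["company1"])
```

Handling the second company's grouped layout (city headings with indented properties) would need an extra rule in its descriptor, e.g. "a line with no ownership value starts a new group", but the parser itself stays generic.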

I'm not sure if there is a tool out there that does all of that.

The other, nicer, way would be to contact the owners of these sites and see if they provide a feed, such as a web service, that you can use to get the data. That would save a lot of heartache, I should think.

griegs
The "cool, vintage 2007" way would be to ask for an RSS feed. Good luck finding someone at the SEC who understands either "Web Service" or "RSS feed".
Dan
Hehe, yeah, agreed. Maybe they could send a parchment via carrier pigeon once a month.
griegs