views: 8167
answers: 9

The Problem

I use a tool at work that lets me do queries and get back HTML tables of info. I do not have any kind of back-end access to it.

A lot of this info would be much more useful if I could put it into a spreadsheet for sorting, averaging, etc. How can I screen-scrape this data to a CSV file?

My First Idea

Since I know jQuery, I thought I might use it to strip out the table formatting onscreen, insert commas and line breaks, and just copy the whole mess into Notepad and save it as a CSV. Any better ideas?

The Solution

Yes, folks, it really was as easy as copying and pasting. Don't I feel silly.

Specifically, when I pasted into the spreadsheet, I had to select "Paste Special" and choose the format "text." Otherwise it tried to paste everything into a single cell, even if I highlighted the whole spreadsheet.

+5  A: 
  • Select the HTML table in your tool's UI and copy it to the clipboard (if that's possible).
  • Paste it into Excel.
  • Save as a CSV file.

However, this is a manual solution, not an automated one.

mkoeller
Ha! Thanks for the forehead-slapping idea.
Nathan Long
This works with IE, but I don't believe it works with FF; even with Paste Special, I believe it just dumps everything into the first cell.
alexp206
No, I did it with FF3. I selected all in the spreadsheet before doing Paste Special > Text. Maybe it doesn't work if the underlying HTML is formatted a certain way?
Nathan Long
+1  A: 

Quick and dirty:

Copy out of browser into Excel, save as CSV.

Better solution (for long-term use):

Write a bit of code in the language of your choice that will pull the HTML contents down and scrape out the bits that you want. You could probably throw in all of the data operations (sorting, averaging, etc.) on top of the data retrieval. That way, you just have to run your code and you get the actual report that you want.

It all depends on how often you will be performing this particular task.
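A minimal sketch of that approach in Python 2 with BeautifulSoup (matching the other answers here); the URL and output filename are placeholders, and it assumes the first table on the page is the one you want:

import csv
import urllib

from BeautifulSoup import BeautifulSoup

# Hypothetical report URL: substitute whatever query URL your tool produces.
url = "http://example.com/report?query=123"
html = urllib.urlopen(url).read()

soup = BeautifulSoup(html)
table = soup.find("table")  # assumes the first table holds the data

writer = csv.writer(open("report.csv", "wb"))
for row in table.findAll("tr"):
    # Collect header and data cells alike, flattening any nested markup.
    cells = ["".join(cell.findAll(text=True)).strip()
             for cell in row.findAll(["th", "td"])]
    writer.writerow(cells)

From there, the sorting and averaging can happen in the script itself instead of the spreadsheet.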

James Van Huis
A: 

Have you tried opening it with Excel? If you save a spreadsheet in Excel as HTML, you'll see the format Excel uses. From a web app I wrote, I spit out this HTML format so the user can export to Excel.
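A minimal sketch of that trick (the row data and filename are made up): Excel will open a file that contains nothing but an HTML table, so a web app can emit this and let the user open it in Excel.

# Illustrative only: build an HTML table and save it with an .xls extension.
rows = [("Rank", "Artist"), ("1", "Radiohead"), ("2", "Portishead")]

html = "<table>\n"
for row in rows:
    html += "<tr>" + "".join("<td>%s</td>" % cell for cell in row) + "</tr>\n"
html += "</table>\n"

# Newer Excel versions show a format warning on open, but the data comes through.
out = open("export.xls", "w")
out.write(html)
out.close()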

Will Rickards
A: 

If you're screen scraping and the table you're trying to convert has a given ID, you could always do a regex parse of the HTML along with some scripting to generate a CSV.
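A rough sketch of that in Python, assuming the table carries id="results" and fairly simple markup (the filenames and the id are hypothetical; regexes are brittle against messy real-world HTML, so treat this as a starting point):

import re

html = open("report.html").read()  # hypothetical saved copy of the page

# Cut out just the table we care about, then split it into rows.
table = re.search(r'<table[^>]*id="results".*?</table>', html, re.S).group(0)
rows = re.findall(r'<tr.*?</tr>', table, re.S)

out = open("report.csv", "w")
for row in rows:
    cells = re.findall(r'<t[dh][^>]*>(.*?)</t[dh]>', row, re.S)
    cells = [re.sub(r'<[^>]+>', '', cell).strip() for cell in cells]
    out.write(','.join(cells) + '\n')  # naive: doesn't quote embedded commas
out.close()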

andy
+6  A: 

Using Python:

For example, imagine you want to scrape forex quotes in CSV form from some site like fxquotes.

Then...

from BeautifulSoup import BeautifulSoup
import urllib

# Build the query URL for Oanda's historical FX rates (USD/AUD, US date format).
date_s = '&date1=01/01/08'   # start date
date_f = '&date=11/10/08'    # end date
fx_url = 'http://www.oanda.com/convert/fxhistory?date_fmt=us'
fx_url_end = '&lang=en&margin_fixed=0&format=CSV&redirected=1'
cur1, cur2 = 'USD', 'AUD'
fx_url = fx_url + date_f + date_s + '&exch=' + cur1 + '&exch2=' + cur1
fx_url = fx_url + '&expr=' + cur2 + '&expr2=' + cur2 + fx_url_end

# Fetch the page; the site embeds the CSV data inside a <pre> block.
data = urllib.urlopen(fx_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('pre', limit=1))
data = data.replace('[<pre>', '')    # strip the list/tag wrapping around the CSV
data = data.replace('</pre>]', '')

# Write the extracted CSV text to disk.
file_location = '/Users/location_edit_this'
file_name = file_location + 'usd_aus.csv'
f = open(file_name, "w")
f.write(data)
f.close()


Edit: to get values from a table, here's an example from palewire:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

# Fetch the page with mechanize and hand the HTML to BeautifulSoup.
mech = Browser()
url = "http://www.palewire.com/scrape/albums/2007.html"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)

# The target table is identified here by its border attribute.
table = soup.find("table", border=1)

# Skip the header row, then pull the cells out of each data row.
for row in table.findAll('tr')[1:]:
    col = row.findAll('td')

    rank = col[0].string
    artist = col[1].string
    album = col[2].string
    cover_link = col[3].img['src']

    record = (rank, artist, album, cover_link)
    print "|".join(record)
Thorvaldur
Is there an easy way to parse HTML tables into CSV using Beautiful Soup? Your example seems to focus on text enclosed in 'pre' tags.
monkut
With Beautiful Soup you just look for any tag you like that is near the data you want, then findAll('thattag', limit=x)...
Thorvaldur
Also, just look at the docs for Beautiful Soup; there are many options to accomplish a variety of tasks.
Thorvaldur
Nice! I tried to generalize your solution here: http://stackoverflow.com/questions/2611418/scrape-html-tables
dreeves
A: 

Excel can open an HTTP page.

E.g.:

  1. Click File, Open

  2. Under filename, paste the URL, e.g.: http://stackoverflow.com/questions/259091/how-can-i-scrape-an-html-table-to-csv

  3. Click OK

Excel does its best to convert the HTML to a table.

It's not the most elegant solution, but it does work!

Christian Payne
+2  A: 

Even easier (because it saves the query for you for next time) ...

In Excel

Data/Import External Data/New Web Query

will take you to a URL prompt. Enter your URL, and it will delimit the available tables on the page to import. Voila.

le dorfier
A: 

I think you should go with Automation Anywhere, because you can just click to extract HTML table data and get it exported to CSV. There's nothing more you have to do: point-and-click data extraction.

Bob