I am screen scraping with Watir and I download an XLS file. When I open this file in Notepad, I find that it is just a bunch of HTML tables. Is there any function or gem that will convert this page into a bunch of arrays? Any ideas are appreciated.

+1  A: 

In general, it is a simple exercise to walk through an HTML file with a table and extract rows and columns, as long as the table doesn't use colspan or rowspan attributes. Those break the logical flow: you have to detect the gaps they cause and fill them in with the repeated value from the spanning cell. http://stackoverflow.com/questions/2062051/ruby-nokogiri-parsing-html-table-ii might help.
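To make the gap-filling idea concrete, here is a minimal sketch in plain Ruby. It assumes the cells have already been pulled out of the HTML (with Nokogiri or anything else) into hashes of the form `{ text:, colspan:, rowspan: }`. That hash shape and the name `expand_spans` are made up for illustration and are not part of any library.

```ruby
# Expand colspan/rowspan cells into a rectangular grid of strings.
# Input: an array of rows, each row an array of cell hashes
# (hypothetical shape: { text: "A", colspan: 2, rowspan: 1 }).
def expand_spans(rows)
  grid = []
  rows.each_with_index do |cells, r|
    grid[r] ||= []
    col = 0
    cells.each do |cell|
      # Skip slots already filled by a rowspan from an earlier row.
      col += 1 until grid[r][col].nil?
      colspan = cell.fetch(:colspan, 1)
      rowspan = cell.fetch(:rowspan, 1)
      # Repeat the cell's value into every slot it covers.
      rowspan.times do |dr|
        grid[r + dr] ||= []
        colspan.times { |dc| grid[r + dr][col + dc] = cell[:text] }
      end
      col += colspan
    end
  end
  grid
end

rows = [
  [{ text: "A", rowspan: 2 }, { text: "B", colspan: 2 }],
  [{ text: "C" }, { text: "D" }]
]
grid = expand_spans(rows)
```

For the sample input, "A" is repeated down the first column and "B" across the second and third, so every row ends up with the same number of fields.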

From looking at XLS files on my desktop I don't think they're XML or HTML. I'm not sure what you downloaded. I did a quick search and roo (http://roo.rubyforge.org/) appears to be a good starting point.

Greg
+1  A: 
  1. Narrow it down to ...
  2. Clear out the whitespace
  3. Replace the tabs with "
  4. Replace tags with ",
  5. Replace the & & tags with nothing
  6. Replace the tags with |
  7. Split the rows with |
  8. Split the fields with ,

You can simplify it a little bit more, but that's the gist of it.
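
Something in the spirit of the steps above could look like the sketch below. The exact tags in the list did not survive, so this assumes the usual `<table>`, `<tr>` and `<td>`/`<th>` markup, and it uses two control characters as separators instead of `|` and `,` so that real cell text cannot collide with them. The name `table_to_arrays` is only illustrative, and it handles `&nbsp;` but does not decode other entities.

```ruby
ROW_SEP   = "\u001E" # stands in for the "|" row separator
FIELD_SEP = "\u001F" # stands in for the "," field separator

# Turn the first HTML table in a string into an array of row arrays
# using only string replacement and splitting (no HTML parser).
def table_to_arrays(html)
  table = html[/<table\b.*?<\/table>/mi] # narrow down to the table
  return [] unless table

  text = table.gsub(/\s+/, " ")                # collapse whitespace
  text = text.gsub(/<\/t[dh]\s*>/i, FIELD_SEP) # cell close -> field separator
  text = text.gsub(/<\/tr\s*>/i, ROW_SEP)      # row close  -> row separator
  text = text.gsub(/<[^>]*>/, "")              # drop every remaining tag
  text = text.gsub(/&nbsp;/i, " ")

  text.split(ROW_SEP).map { |row|
    # Keep empty cells (-1), then drop whatever follows the last cell close.
    row.split(FIELD_SEP, -1)[0...-1].map(&:strip)
  }.reject(&:empty?)
end

html = <<~HTML
  <html><body><table>
  <tr><th>name</th><th>qty</th></tr>
  <tr><td> apples </td><td>3&nbsp;</td></tr>
  <tr><td>pears</td><td></td></tr>
  </table></body></html>
HTML
result = table_to_arrays(html)
```

This breaks down on nested tables and on colspan/rowspan, so treat it as a quick hack for simple, regular exports.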

Dave McNulla
+1  A: 

XLS is a binary format. If you are seeing HTML tables in the file contents, you probably did not download the file correctly.
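
One quick way to check what you actually got is to look at the first bytes of the file. A real binary XLS is an OLE2 compound document and starts with the signature `D0 CF 11 E0 A1 B1 1A E1`. The sketch below is a rough sniff only: the method names are made up, and the HTML check is just a guess based on common opening tags.

```ruby
# Signature at the start of every OLE2 compound file (binary .xls).
OLE2_MAGIC = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1].pack("C*")

# Classify the leading bytes of a download as :binary_xls, :html or :unknown.
def sniff_bytes(head)
  head = head.to_s.b # compare as raw bytes
  return :binary_xls if head.start_with?(OLE2_MAGIC)
  return :html if head =~ /<(?:!doctype|html|table)\b/i
  :unknown
end

# Convenience wrapper: read the first 512 bytes of a file and sniff them.
def sniff_file(path)
  sniff_bytes(File.binread(path, 512))
end

xls_kind  = sniff_bytes(OLE2_MAGIC + "\x00" * 16)
html_kind = sniff_bytes("<html><body><table><tr><td>1</td></tr></table>")
```

If `sniff_file("report.xls")` (file name hypothetical) comes back as `:html`, the content is HTML saved under an .xls name, and it needs an HTML parser rather than a spreadsheet reader.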

How is the XLS file being downloaded through Watir? Are you having to automate the File Download window, or did you just follow a link to the XLS file and write the contents to a file?

JEH