screen-scraping

What language/tool should I use for HTML parsing?

Hello all, I have a couple of websites that I want to extract data from and, based on previous experience, this isn't as easy as it sounds. Why? Simply because the HTML pages I have to parse aren't properly formatted (missing closing tags, etc.). Considering that I have no constraints regarding the technology, language or tool that I can...

Scrape web page contents

I am developing a project, for which I want to scrape the contents of a website in the background and get some limited content from that scraped website. For example, in my page I have "userid" and "password" fields, by using those I will access my mail and scrape my inbox contents and display it in my page. Please help me to solve the p...

Iconv::IllegalSequence when using WWW::Mechanize

I'm trying to do a little bit of web scraping, but the WWW::Mechanize gem doesn't seem to like the encoding and crashes :-/ The POST request results in a 302 redirect (which Mechanize follows, so far so good) and the resulting page seems to crash it :-/ I googled quite a bit, but so far nothing came up on how to solve this. Any of you got an ...

How do I block web scraping without blocking well-behaved bots?

I'm building an e-commerce website with a large database of products. Of course, it is nice when Google indexes all the products of the website. But what if some competitor wants to scrape the website and get all the images and product descriptions? I was observing some websites with similar lists of products, and they place a CAPTCHA, so "only h...

Super-fast screen scraping techniques?

I often find myself needing to do some simple screen scraping for internal purposes (e.g. a third-party service I use only publishes reports via HTML). I have at least two or three cases of this now. I could use Apache HttpClient and write all the necessary screen scraping code, but it takes a while. Here is my usual process: Open up C...

Parsing an HTML file with selectorgadget.com

How can I use Beautiful Soup and SelectorGadget to scrape a website? For example, I have a website - (a Newegg product) and I would like my script to return all of the specifications of that product (click on SPECIFICATIONS). By this I mean - Intel, Desktop, ......, 2.4GHz, 1066Mhz, ...... , 3 years limited. After using SelectorGadget I ...

How can I protect certain data on my web pages from scraping?

I want to protect only certain numbers that are displayed after each request. There are about 30 such numbers. I was planning to have images generated in place of those numbers, but if the image is not warped as with a CAPTCHA, won't scripts be able to decipher the number anyway? Also, how much of a performance hit would loading image...

Screen-scraping of a proprietary website for academic use

A client of mine, who is a social sciences researcher at a university, is asking if I can write a spider to do statistical data mining from a subscription-only academic database. He would like to use the statistics for his academic research. (For those interested, this would involve downloading thousands of text documents and then doing l...

Reading and responding to matching criteria on the screen

I'm looking to develop something for my Win32 system that can find and respond to particular screen events. For instance, when bitmap range (100,100) to (130,130) of my screen (a 30x30 pixel portion of the screen) matches a provided 30x30 pixel baseline, then do a certain action. Can anyone get me started with this? Perhaps there's...

How do you log in to a webpage and retrieve its content in C#?

How do you log in to a webpage and retrieve its content in C#? ...

Bypass the alert and error that occur while screen scraping

I have created a web page to screen scrape a site. While scraping, some error on the other site causes an error to be thrown (object expected), but in the end I get my result perfectly. It shows that the error occurs in my program. Is it possible to bypass those errors (without showing them on the screen)? I don't wan...

Generating possible URLs from forms

Hi, I am trying to get all the URLs (and then get the data) that are generated by the form on this page - http://www.vodafone.in/_layouts/servicecallertunes.aspx with little success. I have installed HTTP Headers(0.14) addon on Firefox 3.0.5, Ubuntu. But the resultant URL is very weird and pretty long. Eg: POST /_layouts/servicecall...

Screen scraping: regular expressions or XQuery expressions?

I was answering some quiz questions for an interview, and the question was about how I would do screen scraping. That is, picking content out of a web page, assuming you don't have a better structured way to query the information directly (e.g. a web service). My solution was to use an XQuery expression. The expression was fairly long...

How can I download Yahoo Groups?

I want to download some Yahoo Groups (files, photos, messages, memberlist) and I've found these scripts: http://freshmeat.net/projects/grabyahoogroup/ http://sourceforge.net/project/showfiles.php?group_id=62034 I've downloaded ActivePerl and the needed modules from CPAN (nothing fancy; they're very easy to find). I've managed to ins...

Why is Beautiful Soup truncating this page?

I am trying to pull a list of resource/database names and IDs from a listing of resources that my school library has subscriptions to. There are pages listing the different resources, and I can use urllib2 to get the pages, but when I pass the page to BeautifulSoup, it truncates its tree just before the end of the entry for the first r...

Screen scrape a web page that displays data page by page using Mechanize

I am trying to screen scrape a web page (using Mechanize) which displays the records in a grid, page by page. I am able to read the values displayed on the first page but now need to navigate to the next page to read the appropriate values. <tr> <td><span>1</span></td> <td><a href="javascript:__doPostBack('gvw_offices','Page$2')">2</a><...

Screen scraping an ASP.NET web page to retrieve data displayed in the grid view

I am using Ruby to screen scrape a web page (created in ASP.NET) which uses a GridView to display data. I am successfully able to read the data displayed on page 1 of the grid but am unable to figure out how I can move to the next page in the grid to read all the data. The problem is that the page number hyperlinks are not normal hyperlinks (with URL)...

Looking for a recommendation of a good tutorial on best practices for a web scraping project?

I need to do a fairly extensive project involving web scraping and am considering using Hpricot or Beautiful Soup (i.e. Ruby or Python). Has anyone come across a tutorial that they thought was particularly good on this subject that would help me start the project off on the right foot? ...

How can I simulate a web site login in ASP.NET, then scrape some data from a page?

Does anyone have any recommendations for performing the following in ASP.NET code: 1) Log in to a password-protected site with a username and password (target site is not necessarily ASP.NET) 2) Navigate to a specific page and/or perform a search 3) Pull specific data from the page (this is the easy part) Although using an API would...

How can I screen scrape with Perl?

I need to display some values that are stored on a website. For that, I need to scrape the website and fetch the content from the table. Any ideas? ...