screen-scraping

Siblings with dom/xpath

Hi, I have been trying for several days to parse the following HTML code (notice that there is no real hierarchical tree structure). Everything is pretty much on the same level. <p><span class='one'>week number</span></p> <p><span class='two'>day of the week</span></p> <table class='spreadsheet'> table data </table> <p><span class='two'>a...
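
Since everything sits at the same level, one common approach is to select the marker elements and then step sideways with the following-sibling axis. A minimal sketch in Python with lxml (page_source stands for the HTML above, and the class names are taken from the snippet; the same XPath works from any DOM/XPath binding):

from lxml import html

doc = html.fromstring(page_source)   # page_source: the flat markup shown above

# Each <p><span class='two'>...</span></p> marks a day; the spreadsheet table that
# follows it is reached with following-sibling, since there is no nesting to rely on.
for marker in doc.xpath("//p[span/@class='two']"):
    label = marker.text_content().strip()
    table = marker.xpath("following-sibling::table[@class='spreadsheet'][1]")
    if table:
        print(label, "->", table[0].text_content().strip())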

What data/functionality would you want to have available through a web API that isn't available today?

Have you ever looked for a web API for certain data or functionality, only to find that there isn't an API available to meet your needs, or that the APIs that are available are inadequate for some reason? I am really interested in collecting such experiences. Please note that I am not asking about specific sites / web apps (so, for exam...

Get links from webpage to textbox (vb.net + html agility pack)

I'm making a VB.NET app and I'm using HtmlAgilityPack. I need HAP to get the profile links from yellowpages.ca. Here is an example of the HTML: <a href="/bus/Ontario/Brampton/A-Safe-Self-Storage/17142.html?what=af&amp;where=Ontario&amp;le=1238793c7aa%7Ccf8042ceaa%7C2ae32e5a2a" onmousedown="utag.link({link_name:'busname', link_attr1:'in_li...
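
The selection logic itself is just "anchors whose href starts with /bus/". A rough sketch of that step in Python with lxml rather than the asker's VB.NET/HtmlAgilityPack stack, with listing_url standing in for the yellowpages.ca results page being scraped:

import requests
from lxml import html

doc = html.fromstring(requests.get(listing_url).text)   # listing_url: the results page
# Profile links in the snippet all start with /bus/, so filter on that prefix
profile_links = [href for href in doc.xpath("//a/@href") if href.startswith("/bus/")]
print("\n".join(profile_links))

In HtmlAgilityPack the equivalent would be SelectNodes("//a[@href]") followed by the same prefix test.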

Accessing child divs using DOMDocument and XPath

I'm building a basic screen scraper for personal use and learning purposes, so please do not post comments like "You need to ask permission" etc. The data I'm trying to access is structured as follows: <tr> <td> <div class="wrapper"> <div class="randomDiv"> <div class="divContent"> ...
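
The nested divs are easiest to reach with a descendant XPath. A quick sketch in Python with lxml (class names taken from the snippet above; PHP's DOMXPath::query accepts the identical expression):

from lxml import html

doc = html.fromstring(page_source)   # page_source: markup shaped like the snippet above
# Step from the outer wrapper down to the nested divContent, however deep it sits
for content in doc.xpath("//div[@class='wrapper']//div[@class='divContent']"):
    print(content.text_content().strip())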

Parse CDATA content in Ruby on Rails

Hi, I am new to Rails. Could you point me to a good tutorial on how to parse CDATA content in Ruby on Rails? I have learnt to use Feedzirra to parse the content, but I am not able to parse content from the websites which use CDATA. If it is not possible to do it with Feedzirra, could you help me with alternatives? Looking forward f...
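
For what it's worth, most XML parsers hand a CDATA section back as ordinary text on the surrounding element, so once the feed item is reached the content is just a string. A tiny illustration of the idea in Python (not the asker's Rails stack):

import xml.etree.ElementTree as ET

node = ET.fromstring("<description><![CDATA[<p>Some <b>escaped</b> markup</p>]]></description>")
print(node.text)   # -> <p>Some <b>escaped</b> markup</p>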

How can I package a scrapy project using cxfreeze?

I have a scrapy project that I would like to package all together for a customer using Windows, without having to manually install dependencies for them. I came across cxfreeze, but I'm not quite sure how it would work with a scrapy project. I'm thinking I would make some sort of interface and run the scrapy crawler with 'from scrapy.cmd...
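
One way this is commonly wired up is a small entry script that starts the crawler in-process, plus a cx_Freeze setup script that freezes it. A sketch only: the file names, spider name, and package list below are assumptions, and frozen Scrapy builds often need extra packages added to the list by trial and error.

# run_spider.py - hypothetical entry point that runs the crawler in-process
from scrapy.cmdline import execute
execute(["scrapy", "crawl", "my_spider"])

# setup.py - cx_Freeze build script; "python setup.py build" produces a build/ folder to ship
from cx_Freeze import setup, Executable

setup(
    name="my_scraper",
    version="0.1",
    options={"build_exe": {"packages": ["scrapy", "twisted", "lxml"]}},
    executables=[Executable("run_spider.py")],
)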

How to fail gracefully and get notified if screen scraping fails in ruby on rails

I am working on a Rails 3 project that relies heavily on screen scraping to collect data, mainly using Nokogiri. I'm aggregating essentially all the same data, but I'm grabbing it from many different sources, and as time goes on I will be adding more and more. However, I am acutely aware that screen scraping can be notoriously unreliable....
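
Whatever the language, the usual shape of the fix is to treat every scrape as fallible: validate what came back and raise an alert instead of silently storing bad data. A rough Python sketch of the pattern (the project itself is Rails/Nokogiri; notify_admin is a stand-in for whatever mailer or exception-notification hook is available):

def scrape_source(url, parse):
    """Run one scraper and return its rows, or None after notifying someone."""
    try:
        rows = parse(url)
        if not rows:                       # an empty result usually means the page layout changed
            raise ValueError("no rows extracted from " + url)
        return rows
    except Exception as exc:
        notify_admin(url, exc)             # stand-in: log the failure and send an email/alert
        return None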

Screen Scraping - Read Captcha

Hi, I am working on screen scraping and was able to get it working, but some of the websites have a captcha and require the captcha information to be entered to proceed further. Is there any way to read the captcha information and submit those values, or how can we handle this scenario? Thanks ...

Capture virtual printer output on linux

I'm writing a Java screen-scraping application for a 3270 mainframe and rather than scroll through page after page of 80x24 chars I'd like to output all pages to a printer and then capture and parse the printer output. The 3270 client has a print option, so I just need to virtualise a printer device and then somehow capture the output....

Playing with Scrapi in Rails 3.. getting Segmentation Fault error / Abort Trap

What I've done so far: sudo gem install scrapi, then sudo gem install tidy. This didn't work because it didn't have libtidy.dylib, so I did this: sudo port install tidy, then sudo cp libtidy.dylib /Library/Ruby/Gems/1.8/gems/scrapi-1.2.0/lib/tidy/libtidy.dylib. Then I started following the simple Railscast at: http://media.railscasts.co...

Help needed with screen scraping using anemone and nokogiri

I have a starting page of http://www.example.com/startpage which has 1220 listings broken up by pagination in the standard way, e.g. 20 results per page. I have code working that parses the first page of results and follows links that contain "example_guide/paris_shops" in their URL. I then use Nokogiri to pull specific data of that final ...
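
A sketch of the pagination loop in Python (the question itself uses Anemone/Nokogiri; the start parameter and page size are assumptions based on "1220 listings, 20 results per page"):

import requests
from lxml import html

seen = set()
for offset in range(0, 1220, 20):   # 1220 listings, 20 per page
    resp = requests.get("http://www.example.com/startpage", params={"start": offset})
    page = html.fromstring(resp.text)
    for href in page.xpath("//a/@href"):
        if "example_guide/paris_shops" in href and href not in seen:
            seen.add(href)
            # fetch href here and pull the specific data from that detail page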

web scraping groupon

I want to scrape groupon.com. Now, my problem is that such sites, when you load them for the first time, ask you to join their email service, but when you reload the page they show you the content of the page directly. How do I do it? I am using PHP for my scripting. Also, if anyone could suggest a framework or library in PHP which makes scraping easy it ...
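
Interstitials like this are normally keyed on a cookie set during the first visit, so the main thing is to keep cookies between requests; in PHP that means cURL with a cookie jar. The same idea sketched in Python:

import requests

session = requests.Session()                     # reuses cookies across requests
session.get("http://www.groupon.com/")           # first hit may return the sign-up interstitial
page = session.get("http://www.groupon.com/")    # second hit carries the cookie and should get the real content
print(len(page.text))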

heavy iTunes Connect scraping

I'm looking at different options to get the sales reports and other data out of the iTunes Connect website. Since Apple doesn't provide an API, all the solutions I found are based on scraping the page. As I need the information for a product that we offer, I'm not that happy to give all the iTunes accounts to a 3rd party service. This i...

Web page scraping: press javascript button

Hello, I am trying to scrape a web page, and to receive the data I need to press a button. This is the source code for the button: "a class="press-me_btn" href="javascript:void( NewPage['DemoPage'].startDemo() );" id="js_press-me_btn">PRESS ME Is it possible to "press" the button somehow without using a browser? Either by using wget wi...
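
Without a browser there is no JavaScript engine to run startDemo(), so the two usual options are to find the HTTP request that startDemo() makes (via the browser's network tools) and issue it directly, or to drive a real browser. A sketch of the second option in Python with Selenium (the page URL is hypothetical; the element id comes from the snippet above):

from selenium import webdriver

driver = webdriver.Firefox()                            # any WebDriver-backed browser
driver.get("http://www.example.com/demo-page")          # hypothetical URL of the page in question
driver.find_element_by_id("js_press-me_btn").click()    # id taken from the button markup above
data_html = driver.page_source                          # DOM after startDemo() has run
driver.quit()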

WYSIWYG web scraping/crawling setup using Javascript/html5?

Hi folks, my goal is to allow less experienced people to set up the required parameters needed to scrape some information from a website. The idea is that a user enters a URL, after which this URL is loaded in a frame. The user should then be able to select text within this frame, which should give me enough information to scrape this ...

How do I make pQuery work with slightly malformed HTML?

pQuery is a pragmatic port of the jQuery JavaScript framework to Perl which can be used for screen scraping. pQuery is quite sensitive to malformed HTML. Consider the following example: use pQuery; my $html_malformed = "<html><head><title>foo</title></head><body>bar</body></html>>"; my $page = pQuery($html_malformed); my $title = $page->...
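
The usual workaround is to run the markup through a forgiving parser or a tidy pass before querying it (on the Perl side, something like HTML::Tidy in front of pQuery). The same idea sketched in Python with BeautifulSoup, using the malformed string from the question:

from bs4 import BeautifulSoup

html_malformed = "<html><head><title>foo</title></head><body>bar</body></html>>"
soup = BeautifulSoup(html_malformed, "html.parser")   # a lenient parser tolerates the stray trailing '>'
print(soup.title.string)                              # -> foo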

How to capture screen snippets and share it with users

Sometimes you might like a screen snippet in a certain web page; you would ideally want to capture that and probably add some notes to a portion of the user interface. What kind of tools are available to capture this information and share it with other users? ...

java html parser doesnt read all page

Hi everybody, I'm parsing HTML pages to get specific information, but there are some pages where I can't get all the information displayed on the web page; for example, on this page I can't get the reviews information. By the way, if you look at the source code of the page there are a lot of empty lines, and the reviews information doesn't appear...

Options for handling javascript heavy pages while screen scraping

Disclaimer here: I'm really not a programmer. I'm eager to learn, but my experience is pretty much BASIC on a C64 20 years ago and a couple of days of learning Python. I'm just starting out on a fairly large (for me as a beginner) screen scraping project. So far I have been using Python with mechanize+lxml for my browsing/parsing. Now I'm...
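
mechanize fetches pages but never runs their JavaScript, so the two standard options are to find the background request the JavaScript makes and call it directly, or to hand the page to a real browser engine (e.g. Selenium). A sketch of the first option in Python, with a made-up endpoint standing in for whatever the browser's network inspector reveals:

import requests

# JavaScript-heavy pages often just pull their data from a JSON endpoint;
# once the network inspector shows the real URL, it can be fetched directly.
resp = requests.get("http://www.example.com/api/listings", params={"page": 1})   # hypothetical endpoint
for item in resp.json():
    print(item)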

How to perform web scraping to find specific linked pages in Java on Google App Engine?

I need to retrieve text from a remote web site that does not provide an RSS feed. What I know is that the data I need is always on pages linked to from the main page (http://www.example.com/) with a link that contains the text " Invoices Report ". For example: <a href="http://www.example.com/data/invoices/2010/10/invoices-report---tue...
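
The link-matching step is a one-liner with an HTML parser; a rough sketch in Python with lxml (the question targets Java on App Engine, so this only illustrates the matching logic, with the anchor text taken from the question):

import requests
from lxml import html

main_page = html.fromstring(requests.get("http://www.example.com/").text)
# Match anchors whose text contains the marker quoted above
for url in main_page.xpath('//a[contains(text(), "Invoices Report")]/@href'):
    report = html.fromstring(requests.get(url).text)
    print(report.text_content()[:200])   # the needed text lives on these linked pages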