screen-scraping

Applications for data scraping from websites and database creation

I am looking for an application that will help me scrape data from Yellow Pages, Jigsaw, and similar websites. I want to collect information such as contact details, name, designation, and email address. Please advise on software that can do this; it should be affordable or, preferably, free. ...

How can I download information from a website if it returns XML/JSON in its response?

Does Python 3 have a built-in method to do this? Any guidance at all would be great! :) The website in question exposes all of its information and even gives you an API key to use. ...
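For a JSON response, a minimal sketch using only the Python 3 standard library (`urllib.request` and `json`) might look like the following. The `fetch_json` helper, the `X-Api-Key` header name, and the `opener` parameter are assumptions made for illustration; the site in question may expect its key in a different header or in the query string.

```python
import json
import urllib.request


def fetch_json(url, api_key=None, opener=urllib.request.urlopen, timeout=10):
    """Request `url` and decode the response body as JSON."""
    headers = {"Accept": "application/json"}
    if api_key is not None:
        # Hypothetical header name; check the site's API documentation.
        headers["X-Api-Key"] = api_key
    request = urllib.request.Request(url, headers=headers)
    # `opener` defaults to urlopen; it is a parameter so it can be swapped out.
    with opener(request, timeout=timeout) as response:
        # json.load reads the raw bytes and decodes them (UTF-8/16/32).
        return json.load(response)
```

For an XML response, `xml.etree.ElementTree` from the standard library plays the role that `json` plays here.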

Intelligent screen scraping using different proxies and user-agents randomly?

I want to download a few HTML pages from http://abc.com/view_page.aspx?ID= where the ID comes from an array of different numbers. I would like to visit multiple instances of this URL and save each file as [ID].HTML, using different proxy IPs/ports. I want to use different user-agents and I want to randomize the wait times before each...
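The setup described above could be sketched roughly as follows, assuming Python and its standard library. The function names `plan_requests` and `download`, the wait bounds, and the proxy format (`http://host:port`) are assumptions for illustration, not something the question specifies.

```python
import random
import time
import urllib.request

BASE_URL = "http://abc.com/view_page.aspx?ID="  # URL prefix from the question


def plan_requests(ids, proxies, user_agents, min_wait=1.0, max_wait=5.0, rng=None):
    """Pair each ID with a randomly chosen proxy, user-agent, and wait time."""
    rng = rng or random.Random()
    plan = []
    for page_id in ids:
        plan.append({
            "url": BASE_URL + str(page_id),
            "filename": f"{page_id}.HTML",
            "proxy": rng.choice(proxies),
            "user_agent": rng.choice(user_agents),
            "wait_seconds": rng.uniform(min_wait, max_wait),
        })
    return plan


def download(plan):
    """Fetch every planned page through its proxy and save it to disk."""
    for item in plan:
        time.sleep(item["wait_seconds"])  # randomized pause before each request
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": item["proxy"]})
        )
        request = urllib.request.Request(
            item["url"], headers={"User-Agent": item["user_agent"]}
        )
        with opener.open(request, timeout=30) as response:
            with open(item["filename"], "wb") as out:
                out.write(response.read())
```

Separating the planning step from the download step keeps the random choices easy to inspect, or to reproduce by passing a seeded `random.Random`, before any request is sent.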

Following a link using Nokogiri for scraping

Is there a method to follow a link using Nokogiri for scraping? I know I can extract the href and open it, but I thought I saw a method to do this using Hpricot and was wondering if there was something like that in Nokogiri. ...

PHP equivalent of PyQuery or Nokogiri?

Basically, I want to do some HTML screen scraping, but I am trying to figure out whether it is possible in PHP. In Python, I would use PyQuery. In Ruby, I would use Nokogiri. ...

How does Cell Minute Tracker work?

It's been a mystery how Cell Minute Tracker manages to fetch AT&T users' data. Maybe someone here has the long-awaited answer. I'm really curious whether they got confirmation to scrape users' cellular reports, and how they can fire off multiple requests to the AT&T site without being banned. I'm waiting for someone who could shed some lig...

How can I take a screenshot of a website w/ .NET?

I'm looking for ideas on how to take screenshots of websites within a .NET application. This application will be a Windows service. Thanks! ...

Scraping &#151 character (long dash) error in Nokogiri

I am having trouble scraping a certain long dash that is encoded as &#151; on the Time magazine site. It looks like this: —. It works fine when this dash is encoded as &mdash, but when the problem dash is scraped, it is returned as unknown characters. I am using Nokogiri and am wondering if I have to use some sort of special encoding? The p...

Scraping ASP.NET site with Ruby

I would like to scrape the search results of this ASP.NET site using Ruby and preferably just using Hpricot (I cannot open an instance of Firefox): http://www.ngosinfo.gov.pk/SearchResults.aspx?name=&foa=0 However, I am having trouble figuring out how to go through each page of results. Basically, I need to simulate clicking on links l...

Parse livescores from web site

Hi all, I was thinking of parsing live scores from a website via PHP and then using them in an application I am planning to implement. My question is: is it legal to parse information from a website and use it? What if I cite the source of the information? ...

Website content crawling

We have a Business Listings directory hosted on IIS 6 Windows 2003. Our competitors crawl and steal our content and customers. We have tried IP blocking using honeypot URLs and log parsing without much success. Is anyone aware of a network device or a proxy server that I can run in front of my web server to minimize this issue? All su...

How to "scan" a website (or page) for info, and bring it into my program?

Well, I'm pretty much trying to figure out how to pull information from a webpage, and bring it into my program (in Java). For example, if I know the exact page I want info from, for the sake of simplicity a Best Buy item page, how would I get the appropriate info I need off of that page? Like the title, price, description? What woul...

Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"

Is there a way to get around the following? httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt Is the only way around this to contact the site owner (barnesandnoble.com)? I'm building a site that would bring them more sales, so I'm not sure why they would deny access at a certain depth. I'm using mechanize and Beautif...

Screen Scraping

Hi, I'm trying to implement a screen scraping scenario on my website and have the following set up so far. What I'm ultimately trying to do is replace "ResultsDetails.aspx?" with "results-scrape-details/" in all links in the $results variable, then output the result. Can anyone point me in the right direction? <?php $url = "http://mysite:90...

Screen Scraping - how to get AJAX based filtered data

Hi, I am working on screen scraping. It is easy when the filtering is done in the query string, but the problem is AJAX-based filtering. For example, here is a sample URL. When you open this page, enter a hotel name, and click Go, the AJAX filter runs and shows the results accordingly; when you click Next Page, the next records are also loaded via AJAX. Please sugg...

Are there any free .NET OCR libraries that will perform OCR on an application window directly?

I am looking for a free .NET OCR library that will be able to do OCR on a given application window or even an image in memory (I can take a snapshot of the application window myself). I have looked at tessnet2 and MODI, but both require an image located on disk. I need to use OCR because the application I am trying to write a script for ...

How to script a URL screenshot without X?

I'd like to automate a 'screenshot' of arbitrary URLs using a Linux build that doesn't have X installed. There appear to be some (costly) web services to do this, but I specifically want something I can do locally. I tried ImageMagick without much success, though I believe Mozilla used to have a command-line option to do it? ...

Reading Ontology with Jena, feeding it with RDF triples, and producing correct RDF string output.

Hi, I have an ontology, which I read in with Jena to help me scrape some RDFa triples from a website. I don't currently store these triples in a Jena model, but that is fairly straightforward to do; it's next on my to-do list. The area I am struggling with, though, is getting Jena to output correct RDF for the ontology I have. The ontol...

Stubbing tests when using Ruby Mechanize

Hi Everyone, I've been trying to use Mocha to do some stubbing for tests on code using Mechanize. Here is an example method: def lookup_course subject_area = nil, course = nil, quarter = nil, year = nil raise ArgumentError, "Subject Area can not be nil" if (subject_area.nil? || subject_area.empty?) page = get_page FIND_BASIC_...

How to use regular expressions to pull a substring? (screen scraping)

Hey guys, I'm really trying to understand regular expressions while scraping a site. I've been using them in my code enough to pull the following, but am stuck here. I need to quickly grab this: http://www.example.com/online/store/TitleDetail?detail&amp;sku=123456789 from this: ('<a href="javascript:if(handleDoubleClick(this.id)){windo...
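One way to pull an absolute URL out of an attribute like this, sketched in Python under some assumptions: the input string in the question is cut off, so `SAMPLE` below is a hypothetical completion written only for illustration, and `extract_url` is a made-up helper name. The pattern matches `http://` or `https://` and then consumes characters until it reaches whitespace, a quote, an angle bracket, or a closing parenthesis.

```python
import html
import re

# Match a scheme, then everything up to whitespace, a quote, <, > or ).
URL_PATTERN = re.compile(r"""https?://[^\s'"<>)]+""")


def extract_url(markup):
    """Return the first absolute URL found in `markup`, or None."""
    match = URL_PATTERN.search(markup)
    return match.group(0) if match else None


# Hypothetical input: the original string in the question is truncated.
SAMPLE = (
    '<a href="javascript:if(handleDoubleClick(this.id)){'
    "window.location='http://www.example.com/online/store/"
    "TitleDetail?detail&amp;sku=123456789';}\">"
)

raw_url = extract_url(SAMPLE)
# The scraped URL still carries the HTML entity; unescape it before requesting.
clean_url = html.unescape(raw_url)
```

Here `raw_url` keeps the `&amp;` exactly as it appears in the markup, while `clean_url` has it decoded to a plain `&`.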