screen-scraping

How to detect JavaScript pop-up notifications in WatiN?

I'm trying to work through what seems to be a rather common scenario. I have a site that accepts input through two different text fields. If the input is malformed or invalid, I receive a JavaScript pop-up notification. I will not always receive one, but I should in the event of (as I said earlier) malformed data, or when a...

Why Shouldn't I Programmatically Submit Username/Password to Facebook/Twitter/Amazon/etc?

I wish there were a central, fully customizable, open source, universal login system that allowed you to log in and manage all of your online accounts (maybe there is?)... I just found RPXNow today after starting to build a Sinatra app to log in to Google, Facebook, Twitter, Amazon, OpenID, and EventBrite, and it looks like it might save s...

curl scraping problem

I want to scrape email addresses from a page, and I have a script that works on most sites. But some sites load the email addresses with JavaScript, so curl cannot load the page content that contains them. For example: http://www.everynation.org/churches/church-directory/africa/zambia Here they are loading the email addresses with java...

Scraping data from Flash (Games)

I saw this video, and I am really curious how it was performed. Does anyone have any ideas? My intuition is that he scraped pixels from the screen (one per 'box'), and then fed that into some program to determine the next move. Is scraping pixel-by-pixel the way to do this, or is there a better way? I am looking to do something similar ...

Yahoo Web Scrapes: What are the limits?

We are using a web scraper that sleeps for a random interval between requests (so that the time between scrapes is never the same), but we are still getting blocked by Yahoo after 20-30 requests. Does anyone know if there is a limit (e.g., 20 requests per minute, 200 an hour)? Right now our average between ...

Is there a way to programmatically extract the feed of a podcast from the iTunes page?

From an iTunes page, like http://itunes.apple.com/us/podcast/this-week-in-tech-mp3-edition/id73329404, is there a way to extract the corresponding feed address? In this case it would be http://leoville.tv/podcasts/twit.xml. I know that if you open it in iTunes you can extract it manually, but I want to do it programmatically. There's a lin...

How can I prevent my ASP.NET site from being screen scraped?

How can I prevent my ASP.NET 3.5 website from being screen scraped by my competitor? Ideally, I want to ensure that no webbots or screen scrapers can extract data from my website. Is there a way to detect that a webbot or screen scraper is running? ...

How to create a scrap page in ASP.NET with C#?

Hi, how can I create a scrap page or inbox like Orkut's in ASP.NET with C#? Can anyone guide me? I'm stuck in my project! ...

Python: How to extract XML embedded in an HTML file?

I have an HTML file with an XML snippet embedded; the source code is pasted on Pastebin: http://pastebin.com/Hy0QaWk8 My task is to extract the text enclosed in the first textarea, which is an XML snippet, from the HTML, without any change to the original snippet. I'm able to get it by using BeautifulSoup, but it changes all the tag ...
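A minimal sketch of one possible approach, under stated assumptions: since an HTML parser may normalize the markup it hands back, the text between the first opening and closing textarea tags can instead be sliced out of the raw source with a regular expression. The function name `first_textarea_raw` and the sample HTML are hypothetical, and the sketch assumes that the opening tag's attributes contain no `>` character and that the snippet itself contains no literal `</textarea>`.

```python
import re

# Match the first <textarea ...> ... </textarea> pair and capture the body verbatim.
_TEXTAREA_RE = re.compile(
    r"<textarea\b[^>]*>(.*?)</textarea\s*>",
    re.DOTALL | re.IGNORECASE,
)


def first_textarea_raw(html):
    """Return the untouched text inside the first textarea, or None if there is none."""
    match = _TEXTAREA_RE.search(html)
    return match.group(1) if match else None


# Hypothetical sample input, not the content of the Pastebin link.
sample = (
    '<html><body><textarea rows="3"><Root><Item ID="1"/></Root></textarea>'
    "<textarea>second</textarea></body></html>"
)
print(first_textarea_raw(sample))
```

Because the captured group is a plain slice of the input string, tag case, attribute order, and whitespace inside the snippet are returned exactly as they appear in the file.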

Get information from a web page

I want to build an app that gets information from a particular web page and then displays the values from that page to the iPhone user. Detail: the web page on the server has a bus timetable. If the user inputs an origin and a terminus, the app shows the time information (listed on the web page) in a label. That's all...

Why am I getting a new session ID on every page fetch in my Perl WWW::Mechanize script?

So I'm scraping a site that I have access to via HTTPS. I can log in and start the process, but each time I hit a new page (URL) the cookie session ID changes. How do I keep the logged-in cookie session ID? #!/usr/bin/perl -w use strict; use warnings; use WWW::Mechanize; use HTTP::Cookies; use LWP::Debug qw(+); use HTTP::Request; use LWP:...

Can we only get the web page header information and not the body? (Mechanize)

What if I only need to download the page if it has changed since the last download? What is the best way? Can I get the size of the page first, compare it to decide whether it has changed, and then download only if it has? I plan to use (Python) mechanize. ...
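A hedged sketch of one way to approach this using HTTP itself: a `HEAD` request returns only the response headers, and validators such as `ETag` or `Last-Modified` can be compared with the values saved from the previous download. This sketch uses the standard library's `urllib` rather than mechanize, the helper names `fetch_headers` and `has_changed` are hypothetical, and it assumes the server actually sends these headers. `Content-Length` is only a weak fallback, since a page can change without changing size.

```python
import urllib.request


def fetch_headers(url, timeout=10):
    """Issue a HEAD request and return the response headers with lowercased names."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return {name.lower(): value for name, value in response.headers.items()}


def has_changed(previous, current):
    """Compare saved headers with fresh ones.

    Uses the first header present in both: ETag, then Last-Modified,
    then Content-Length. If none is shared, assume the page changed.
    """
    for key in ("etag", "last-modified", "content-length"):
        if key in previous and key in current:
            return previous[key] != current[key]
    return True


# Hypothetical header values standing in for a saved and a fresh HEAD response.
saved = {"etag": '"abc"', "content-length": "100"}
fresh = {"etag": '"abc"', "content-length": "100"}
print(has_changed(saved, fresh))  # False
```

When `has_changed` returns `True`, the page would be downloaded in full and the new headers stored for the next comparison.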

C# Network Programming - HttpWebRequest Scraping

Hi, I am building a web scraping application. It should scrape a complex web site with concurrent HttpWebRequests from a single host to a single target web server. The application should run on Windows Server 2008. A single HttpWebRequest for data could take from 1 minute to 4 minutes to complete (because of long running db operatio...

Python GUI Scraper hanging issues.

I wrote a scraper in Python a while back, and it worked fine on the command line. I have now made a GUI for the application, but I am having trouble with one issue. When I attempt to update text inside the GUI (e.g. 'fetching URL 12/50'), I am unable to, since the function within the scraper is busy grabbing 100+ links. Also when going ...

Is there a good tutorial for figuring out what a website is doing so your program can do the same thing?

Is there a good guide or tutorial for people who need to programmatically interact with dynamic websites? There's been a rash of Perl questions about that lately, and I haven't found a good resource to point people toward. I'm asking not because I need one but because I don't want to waste my time writing it if it already exists. Althoug...

Calling UIGetScreenImage() on manually-spawned thread prints "_NSAutoreleaseNoPool():" message to log

This is the body of the selector that is specified in NSThread +detachNewThreadSelector:(SEL)aSelector toTarget:(id)aTarget withObject:(id)anArgument NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init]; while (doIt) { if (doItForSure) { NSLog(@"checking"); doItForSure = NO; ...

xvfb on a machine with a display, can an application run 'in the background?'

I'm setting up a cron job for web scraping, using xvfb, firefox, and watir on Mac OS X. In testing the script so far, firefox pops up visibly on the local desktop, the watir script executes, and then firefox exits (I quit firefox in my script). I'd like to set the xvfb DISPLAY such that firefox will run, but won't be seen on the loca...

How to extract images from flash viewers?

This deals with the (diverse) flash viewers that let you zoom in on images on websites. I’m trying to extract the large, zoomed-in image rendered by the viewer. In many cases the images seem to be dynamically called by the viewer, or are created only for the part of the image you are zooming on at that point. Ideally, the approach here...

ASP.NET Scraping Grid Pages

I need code in C#. I am trying to post to the search.aspx page, which contains an ASP.NET grid. When the grid is rendered, it loads the first page on the screen, and the grid header lists the remaining pages. I scrape the first page, and now I want to move on to the next page. All this is being done using the following code: HttpWebRequest my...

Facebook fan page photo scraping

Hi, we want to add a photo competition to our Facebook fan page. The idea is that people can upload photos and others can like them. The person with the most likes on their photo wins a prize. Now I was wondering if anyone has a good idea for how to get a snapshot of all the photos at a given moment. So that when we want to s...