screen-scraping

Can a cURL-based HTTP request imitate a browser-based request completely?

Hello Experts, This is a two-part question. Q1: Can a cURL-based request 100% imitate a browser-based request? Q2: If yes, which options should be set? If not, what extra does the browser do that cannot be imitated by cURL? I have a website and I see thousands of requests being made from a single IP in a very short time. These reque...

Rotating Proxies for web scraping

I've got a python web crawler and I want to distribute the download requests among many different proxy servers, probably running squid (though I'm open to alternatives). For example, it could work in a round-robin fashion, where request1 goes to proxy1, request2 to proxy2, and eventually looping back around. Any idea how to set this up?...
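The round-robin idea described above can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not a definitive setup: the proxy addresses are hypothetical placeholders, and the helper names (`next_proxy`, `build_opener_for`, `fetch`) are invented for illustration.

```python
import itertools
import urllib.request

# Hypothetical proxy addresses; replace with real squid (or other) proxies.
PROXIES = [
    "http://proxy1.example.com:3128",
    "http://proxy2.example.com:3128",
    "http://proxy3.example.com:3128",
]

# cycle() yields proxy1, proxy2, proxy3, proxy1, ... indefinitely.
_proxy_pool = itertools.cycle(PROXIES)


def next_proxy():
    """Return the next proxy URL in round-robin order."""
    return next(_proxy_pool)


def build_opener_for(proxy_url):
    """Build a urllib opener that routes HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, timeout=10):
    """Fetch url through the next proxy in the rotation (performs a real request)."""
    opener = build_opener_for(next_proxy())
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Each call to `fetch` takes the next proxy from the pool, so request1 goes through proxy1, request2 through proxy2, and the rotation wraps around after the last proxy.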

Using Google App Engine's Cron service to extract data from a URL

Hi Guys! I need to scrape a simple webpage which has the following text: Value=29 Time=128769 The values change frequently. I want to extract the Value (29 in this case) and store it in a database. I want to scrape this page every 6 hours. I am not interested in displaying the value anywhere, I just am interested in the cron. Hope I ...
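For the extraction step alone, a regular expression is probably enough. The sketch below assumes the page body looks like the sample text in the question; the function name `extract_value` is hypothetical, and fetching, cron scheduling, and database storage are left out.

```python
import re

# Matches the integer that follows "Value=" in the page text.
VALUE_PATTERN = re.compile(r"Value=(\d+)")


def extract_value(page_text):
    """Return the integer after 'Value=', or None if the page has no such field."""
    match = VALUE_PATTERN.search(page_text)
    if match is None:
        return None
    return int(match.group(1))


print(extract_value("Value=29 Time=128769"))  # prints 29
```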

Website scraping using jquery and ajax

Hello, I want to be able to manipulate the HTML of a given URL. Something like HTML scraping. I know this can be done using curl or some scraping library. But I would like to know if it is possible to use jQuery to make a GET request to the URL using AJAX, retrieve the HTML of the URL, and run jQuery code on the HTML returned? Thank...

YQL scrape entire website/domain

Hi, I'm trying to scrape back a set of links and content from a domain. The query in Google would be "site:www.newswebsite.com search_term". I've seen some things that come close to getting this working, but I can't quite get a search working across a whole website and then filter by the search term. Is this possible without a custom d...

Looking for an alternative to Webzinc .NET, screen scraping, web automation library for .NET

I came across this .NET library (http://www.webzinc.com/online/faq.aspx). However, I was wondering if there is a free alternative out there? ...

How to extract pictures from a website that uses a timestamp as the file name

Hello all, I think I know the answer to this question already, but curious as I am, I'll ask it anyway. I'm running a webshop whose products come with a csv file. I can import all the objects without any trouble; the only thing is that image and thumbnail locations are not exported with the database dump. (it's never per...

Get selected text in browser programmatically

Hi, From my Windows application, I want to detect selected text in Internet Explorer, Firefox and any other browser. Do you know what piece of code I should use in order to achieve this? Thanks, The idea is not to search for a text in IE, but instead to "capture the selected text" in IE. By the way, not only IE, but any Windows applica...

C# Scraping of HTML/.asp

Hi, I'm working on a "personal-can-it-work" sort of thing, and I have everything working great except for trying to parse some information from a .asp source file into my program. This is the parsing code I have so far: // parse out the results try { int snr_start = result.IndexOf("SNR"); ...

How to use Ruby to scrape, build a session, and launch a page on a target site

I am wondering how to use Ruby to scrape a website, with the goal of launching a new browser with the destination page loaded. This is needed, because the destination page is not stateless, and requires a number of session parameters. For an example flow, see how Kayak.com does this. 1. Go to Kayak.com, and search for a hotel in Chica...

PHP/AJAX Image Grabbing script similar to functionality Facebook messaging...

Hello, When sending a message on Facebook, if you include a URL it generally grabs a picture from the webpage and adds it at the bottom as a thumbnail. You then have the ability to select through a number of pictures featured on the site. I can see how this could be built, but to save me the hassle I wonder if somebody has already don...

Authenticate on a website and screen scraping with Objective-C

I'm developing an iPhone application where I wish to authenticate (login form) on a site and retrieve some information by doing some screen scraping. Is there an API available to do this, or documentation on how I could do this? Thanks ...

Using curl to get from one webpage to another involving JavaScript

Hi, I have webpage1.html, which has a hyperlink whose href="some/javascript/function/outputLink()". Now, using curl (or any other method in PHP), how do I deduce the hyperlink (of http:// format) from the JavaScript function() so that I can go to the next page? Thanks ...

Web scraping etiquette

I'm considering writing a simple web scraping application to extract information from a website that does not seem to specifically prohibit this. I've checked for other alternatives (eg RSS, web service) to get this information, but there are none available at this stage. Despite this I've also developed/maintained a few websites mys...

How can I grab CData out of BeautifulSoup

I have a website that I'm scraping that has a structure similar to the following. I'd like to be able to grab the info out of the CData block. I'm using BeautifulSoup to pull other info off the page, so if the solution can work with that, it would help keep my learning curve down as I'm a Python novice. Specifically, I want to get at the ...

Parse HTML in Adobe AIR

I am trying to load and parse HTML in Adobe AIR. The main purpose is to extract the title, meta tags and links. I have been trying the HTMLLoader but I get all sorts of errors, mainly JavaScript uncaught exceptions. I also tried to load the HTML content directly (using URLLoader) and push the text into HTMLLoader (using loadString(...)) b...

Is there a better library than urlgrabber for fetching remote URLs in Python?

I'm writing a spider that needs a load_url function that performs the following for me: Retry the URL if there is a temporary error, without leaking exceptions. Not leak memory or file handles. Use HTTP-KeepAlive for speed (optional). URLGrabber looks great on the surface, but it has trouble. The first I hit a problem with too many fil...
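The first two requirements (retrying without leaking exceptions, and not leaking file handles) can be sketched with the standard library alone. This is a rough sketch under stated assumptions, not a replacement for urlgrabber: it treats every `OSError` (which includes `urllib.error.URLError` and socket timeouts) as temporary, the `retries`, `delay` and `fetch` parameters are hypothetical, and HTTP keep-alive is not addressed.

```python
import time
import urllib.request


def fetch_once(url, timeout=10):
    """Fetch url once; the with-block closes the connection even on errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def load_url(url, retries=3, delay=1.0, fetch=fetch_once):
    """Try to fetch url up to `retries` times; return the body, or None on failure.

    Assumption: any OSError is treated as a temporary error worth retrying.
    `fetch` is injectable so the retry logic can be exercised without a network.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:
            # Swallow the error; wait before the next attempt unless this was the last.
            if attempt < retries - 1:
                time.sleep(delay)
    return None
```

Returning `None` instead of raising keeps exceptions from escaping into the spider; a caller that needs to distinguish permanent failures (for example a 404) would have to inspect the exception type before retrying.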

How to extract the data from a website using Java?

Hi, I am familiar with the Java programming language. I would like to extract the data from a website and store it in a database running on my machine. Is that possible in Java? If so, which API should I use? For example, there are a number of schools listed on a website. How can I extract that data and store it in my database using Java? ...

OS X text drawing

As I (semi) understand it, all on-screen text in any Windows application is drawn by the same drawtext functionality. It is possible to hook onto this method and view (or even change) every bit of text being drawn to the display. How does OS X put text on the screen? Is there a similar way to hook into this API and view all text being...

How to remove expired items from database with Scrapy

I am spidering a video site that expires content frequently. I am considering using Scrapy to do my spidering, but am not sure how to delete expired items. Strategies to detect if an item is expired are: Spider the site's "delete.rss". Every few days, try reloading the contents page and making sure it still works. Spider every ...