screen-scraping

How to scrape images from a web site with javascript and servlets

I have a web page that has the following content (I've changed the URL in the src tag for privacy purposes, otherwise viewing the page source is identical): <HTML> <BODY> <script type="text/javascript" src="http://localhost/servlet?publicKey=abcdefg12345678&amp"></script> </BODY> </HTML> The resulting page displays an i...
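
A possible first step for a page like this is to pull the script URL out of the page source so it can be requested separately. The sketch below uses only Python's standard html.parser module. The class name ScriptSrcCollector and the variable names are made up for illustration, and the sample page is a simplified copy of the snippet above with the trailing entity after the public key dropped. It does not fetch or execute the script, so it says nothing about how the servlet actually produces the image.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "SCRIPT" also arrives as "script".
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Simplified copy of the page from the question (an assumption, not the real page).
page = (
    "<HTML> <BODY> "
    '<script type="text/javascript" '
    'src="http://localhost/servlet?publicKey=abcdefg12345678"></script> '
    "</BODY> </HTML>"
)

collector = ScriptSrcCollector()
collector.feed(page)
script_urls = collector.sources
```

Each URL in script_urls could then be downloaded and inspected to see what markup the script writes into the page.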

How to offer a website scraping service like dapper.com, diffbot.com? (especially the visual selection)

Hi, I'm trying to build functionality for the users of my site to "screen scrape" websites of their choice, and I'm looking for a toolkit that allows me to do that OR a third-party site that allows other sites (like mine) to use their interface. Let me try explaining it better. I have customers on my site. These customers are intereste...

Help parsing a page with python

Hi, I would like to parse a webpage to get the URL of a video download. I use Python and Firebug, but I can't get the URL. Example: The URL where I have to get the video link is: hxxp://www.rtve.es/mediateca/videos/20100125/saber-comer---salsa-verde-judiones-25-01-10/676590.shtml The video is hxxp://www.rtve.es/resources/TE...

scrape html generated by javascript with python

I need to scrape a site with Python. I obtain the source HTML with the urllib module, but I also need to scrape some HTML that is generated by a JavaScript function (which is included in the HTML source). What this function does "in" the site is that when you press a button it outputs some HTML code. How can I "press" this butt...

How to display content from other sites on my own page?

I know this isn't a specific programming question, but I really need to know how this can be done. How does a website like this: http://www.dogpile.com/ display search results from Google and other search engines on its own page? The only way I can think of doing something like this is by using iframes, but of course then the conten...

WebBrowsing in C# - Libraries, Tools etc. - Anything like Mechanize in Perl?

Looking for something similar to Mechanize for .NET C#. If you don't know what Mechanize is.. http://search.cpan.org/dist/WWW-Mechanize/ I will maintain a list of suggestions here. Anything for browsing/posting/screen scraping (Other than WebRequest and WebBrowser Control). Parsing HTMLAgilityPack - http://www.codeplex.com/htmlagil...

Can I get my instance of mechanize.Browser to stay on the same page after calling b.form.submit()?

In Python's mechanize.Browser module, when you submit a form the browser instance goes to that page. For this one request, I don't want that; I want it to stay on the page it's currently on and give me the response in another object (for looping purposes). Anyone know a quick way to do this? EDIT: Hmm, so I have this kind of working wi...

Embedding part of a web site

Suppose I want to embed the latest comic strip of one of my favorite webcomics into my site as a kind of promotion for it. The webcomic has the strip inside of a div with an id, so I figured I can just embed the div in my site, except that I couldn't find any code examples for how to do it (they all show how to embed flash or a whole web...

Scraping html with Python or...

One of the arguments I make to my (Microbiology and Genetics) students is that "data" is/are messy, and Python can help with that (of course other languages can too). So here is a practical kind of web-based data-gathering exercise. I notice that there are a few people who answer Python-related questions among the users with the highest re...

C#: Need to render a form/control on an "imaginary" desktop

Okay, here's what I'm trying to do. First I'll explain the end result I'm trying to achieve in case there are other ideas on how to do this. I'm making a screen capture utility that takes a screen shot of only one window... my window (which I have total programmatic control over). However, this window may be much larger than the desktop...

Screen scraping with Python

Does Python have screen scraping libraries that offer JavaScript support? I've been using pycurl for simple HTML requests, and Java's HtmlUnit for more complicated requests requiring JavaScript support. Ideally I would like to be able to do everything from Python, but I haven't come across any libraries that would allow me to do it. Do...

HttpRequest: pass through AuthLogin

I need to make a simple program that logs in to a certain website with given credentials and then navigates to some element (link). Is it even possible (I mean this AuthLogin thing)? EDIT: SORRY - I am on my company machine and I cannot click on "Vote" or "Add comment" - the page says "Done, but with errors on page" (IE..). I do appreci...

ruby mechanize in Facebook

I'm trying to click the Settings button on the home page, but when I do I get this page back: #<WWW::Mechanize::Page {url #<URI::HTTP:0x1023c5fc0 URL:http://www.facebook.com/editaccount.php?ref=mb&drop>} {meta} {title nil} {iframes} {frames} {links} {forms}> which is.. kinda empty! Is there some problem with these iframes ...

Screen scrape a web page that uses JavaScript and frames

Hi, I want to scrape data from www.marktplaats.nl . I want to analyze the scraped description, price, date and views in Excel/Access. I tried to scrape data with Ruby (nokogiri, scrapi) but nothing worked. (on other sites it worked well) The main problem is that for example selectorgadget and the add-on firebug (Firefox) don’t find any ...

Getting HTML from web pages that use AJAX

I want to know how to scrape web pages that use AJAX to fetch the content being rendered. Typically an HTTP GET for such pages will just fetch the HTML page with the JavaScript code embedded in it. But I want to know if it is possible to programmatically (preferably in Java) query for such pages and simulate a web browser kind ...

Can I scrape flash?

I'd like to scrape a website to programmatically collect any external links within any flash elements on the page. I'd also like to collect any other text, if possible, but the links are the important part. Is this possible? A freeware library/service to accomplish this task would be preferable, but if none is, how can I accomplish the t...

What's the most efficient way to scrape data from a website (in PHP)?

I'm trying to scrape data from IMDb, but naturally there are a lot of pages, and doing it in a serial fashion takes way too long, even when I use multi-threaded cURL. Is there a faster way of doing it? Yes, I know IMDb offers text files, but they don't offer everything, in any sane fashion. ...

Submitting POST data to a different site and then extracting the output with PHP

I'd like to use the snoopy class, but I don't have the proper server permissions with my shared hosting to install it. Any easy to use alternatives? I need to submit this POST data: hash = $_POST['hash'] Submit = Submit to this site: http://milw0rm.com/cracker/info.php And extract the output in the -::PASS column from http://milw0...

php using CURL to grab whois record

Example: http://www.whois.net/whois/hotmail.com When opened in a browser, the output is shown. When using a curl call, it shows nothing. What's wrong? I want to return the whole page result, then use a regular expression to retrieve the data on the Expiration Date: 29-Mar-2015 00:00:00 line. $postfields= null; $postfields["noneed"] = ""; $queryurl= "http://...
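
The extraction step described above (pulling the date out of the "Expiration Date:" line once the page text is available) might look roughly like the following. This is a sketch in Python rather than PHP, and it does not address why the curl call returns nothing. The function name find_expiration, the pattern, and the sample text are assumptions for illustration; only the "Expiration Date: 29-Mar-2015 00:00:00" line comes from the question. Parsing the month abbreviation with %b assumes an English or C locale.

```python
import re
from datetime import datetime

# Matches a line such as "Expiration Date: 29-Mar-2015 00:00:00".
EXPIRY_PATTERN = re.compile(
    r"Expiration Date:\s*(\d{1,2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2})"
)


def find_expiration(page_text):
    """Return the expiration date found in page_text, or None if absent."""
    match = EXPIRY_PATTERN.search(page_text)
    if match is None:
        return None
    # %b is locale-dependent; this assumes English month abbreviations.
    return datetime.strptime(match.group(1), "%d-%b-%Y %H:%M:%S")


# Made-up sample text; only the Expiration Date line is from the question.
sample = "Some other record line\nExpiration Date: 29-Mar-2015 00:00:00\n"
expires = find_expiration(sample)
```

Returning None when the line is missing makes it easy to tell an empty response apart from a page whose layout has changed.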

How can I "screen scrape" another Windows program in VB6?

I would like to monitor a process every second until it displays an expected "error" message. How can I monitor something.exe and get notified by "screen scraping" the error message from something.exe, all from my VB6 program? Is it possible to terminate it or click the "okay" button from VB6? Is this sort of thing better suited for...