screen-scraping

Good apps I could use to store a page locally?

Hi folks, I really need to find a reliable way to store a web page locally, with all its dependencies, e.g. HTML, CSS stylesheets, JavaScript, etc. A Python library would be awesome; a CLI would be great too. Also, would this type of app/library have a standardized name? Any suggestions guys? =) ...

How does analytics and usability software do this?

Hi folks, I have been using analytics software for a while, and I've been asking myself how such software can copy a webpage completely, place it in an iframe, and overlay it with images and info. An example: A major problem I encountered is copying the webpage. In particular, copying the webpage the user is currently viewi...

Getting (a) title (b) summary and (c) relevant images of web page, a la Facebook status updates

Did you ever submit a link in your Facebook status? When you do, they do something very nice: they get a title, a summary, and a bunch of relevant images from that page, and you can choose one of them as a thumbnail. I need something like that right now. Is there any open-source piece of code that does this? (It needs to be in Python because ...
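
One possible starting point, as a minimal sketch rather than what Facebook actually does: read the `<title>`, the `<meta name="description">` content, and every `<img src>` with the standard-library `html.parser`. The names `PagePreviewParser` and `extract_preview` and the sample HTML are made up for illustration, and picking "relevant" images (by size, position, etc.) is left out.

```python
from html.parser import HTMLParser


class PagePreviewParser(HTMLParser):
    """Collects the page title, meta description and image URLs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is kept
        if self._in_title:
            self.title += data


def extract_preview(html_text):
    """Return a dict with the title, summary and candidate images."""
    parser = PagePreviewParser()
    parser.feed(html_text)
    parser.close()
    return {
        "title": parser.title.strip(),
        "summary": parser.description.strip(),
        "images": parser.images,
    }


# Hypothetical sample page, not taken from the question
sample_html = """<html><head><title>Example page</title>
<meta name="description" content="A short summary."></head>
<body><img src="/logo.png"><p>Hello</p><img src="photo.jpg"></body></html>"""

preview = extract_preview(sample_html)
print(preview)
```

Image URLs come back exactly as written in the page, so relative ones would still need resolving against the page URL before they can be fetched.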

Python parsing: lxml to get just part of a tag's text

I'm working in Python with HTML that looks like this. I'm parsing with lxml, but could equally happily use pyquery: <p><span class="Title">Name</span>Dave Davies</p> <p><span class="Title">Address</span>123 Greyfriars Road, London</p> Pulling out 'Name' and 'Address' is dead easy, whatever library I use, but how do I get the remainder...
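
A sketch of one way to read the text after each span: in the ElementTree API, the text that follows an element's closing tag is stored in that element's `.tail`, and lxml exposes the same attribute. The example below uses the standard-library `xml.etree.ElementTree` so it runs without extra installs; the wrapping `<div>` and the function name `label_value_pairs` are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# The two paragraphs from the question, wrapped in a <div> so there is one root
snippet = (
    "<div>"
    '<p><span class="Title">Name</span>Dave Davies</p>'
    '<p><span class="Title">Address</span>123 Greyfriars Road, London</p>'
    "</div>"
)


def label_value_pairs(markup):
    """Map each Title span's text to the text that follows the span."""
    root = ET.fromstring(markup)
    pairs = {}
    for span in root.iter("span"):
        if span.get("class") == "Title":
            # .tail is the text between </span> and the next tag
            pairs[span.text] = (span.tail or "").strip()
    return pairs


print(label_value_pairs(snippet))
# {'Name': 'Dave Davies', 'Address': '123 Greyfriars Road, London'}
```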

php scraping HTML - problems with IE only

Hi, I am scraping a website with a PHP script that retrieves a page and removes certain elements so that only a photo gallery is shown. It works flawlessly for every browser BUT any version of IE (typical ;)). We can fix the problem by rewriting the .css file, but we cannot implement it into the head of the PHP page as this will be overwritten by th...

Using C# how do I get a list/array of all script tags (and their contents) on a webpage?

I am using HttpWebRequest to put a remote web page into a String and I want to make a list of all its script tags (and their contents) for parsing. What is the best method to do this? ...

Is there an API that can take a URL and return a tagcloud datastructure?

Is there an API that can take a URL and return a tagcloud datastructure? ...

Login Javascript within PHP

Hi, I have been creating a web scraper for an internal application with PHP, but one of the pages has a JavaScript login. Is there any way of logging in automatically to scrape the data as usual? (I am using curl to log in to the other two sites) ...

Send browser headers via PHP

How can I send headers to a website as if PHP / Apache were a browser? I'm trying to scrape a site, but it looks like they send a 404 error if the request comes from another server... Or do you know any other good ways to scrape content from a site? Also, here is my current code: <?php $curl_handle=curl_init(); curl_setopt($curl_han...
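
The general idea, sketched in Python rather than the question's PHP/curl: attach browser-like request headers (most importantly `User-Agent`) before fetching. The header values, the helper name `build_browser_like_request`, and the `example.com` URL are illustrative assumptions; whether a given site accepts them is not something this sketch can promise.

```python
import urllib.request


def build_browser_like_request(url):
    """Build a request carrying headers similar to those a browser sends."""
    headers = {
        # Illustrative values only; adjust to whatever the target site expects
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/0.1",
        "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
    }
    return urllib.request.Request(url, headers=headers)


request = build_browser_like_request("http://example.com/")
print(request.header_items())

# To actually fetch the page (network access needed):
# html = urllib.request.urlopen(request).read()
```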

Syntax for phpquery scraping

Hello, I need to use a WordPress plugin, WP Web Scraper (http://wordpress.org/extend/plugins/wp-web-scrapper), to extract the link to an audio track on an iTunes web page. Here's the page from which I want to extract the link: http://itunes.apple.com/us/album/guero/id52311104 Here's the link I want to extract from this page: http://a1.pho...

Python: Detecting the actual text paragraphs in a string

The big mission: I am trying to get a few lines of summary of a webpage. i.e. I want to have a function that takes a URL and returns the most informative paragraph from that page. (Which would usually be the first paragraph of actual content text, in contrast to "junk text", like the navigation bar.) So I managed to reduce an HTML page ...
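
One rough heuristic for this, offered as a sketch and not as the asker's method: split the extracted text on blank lines and return the first block that is long enough and ends like a sentence. The function name `first_content_paragraph`, the `min_words` threshold, and the sample text are all assumptions for illustration.

```python
import re


def first_content_paragraph(text, min_words=10):
    """Return the first blank-line-separated block that looks like prose."""
    for block in re.split(r"\n\s*\n", text):
        # Collapse internal whitespace so wrapped lines count as one paragraph
        candidate = " ".join(block.split())
        long_enough = len(candidate.split()) >= min_words
        ends_like_sentence = candidate.endswith((".", "!", "?"))
        if long_enough and ends_like_sentence:
            return candidate
    return None


# Hypothetical text as it might look after stripping the HTML tags
sample_text = (
    "Home | About | Contact\n\n"
    "Latest news\n\n"
    "The city council voted on Tuesday to extend the tram line\n"
    "to the harbour district.\n\n"
    "Copyright 2010"
)

print(first_content_paragraph(sample_text))
```

Navigation bars and footers tend to be short and unpunctuated, which is why the two checks are combined; the threshold would need tuning per site.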

Convert a relative URL to an absolute URL with Simple HTML DOM?

When I'm scraping content from some pages, the script gives a relative URL. Is it possible to get an absolute URL with Simple HTML DOM? ...
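
Independent of the parsing library, the conversion itself is just URL resolution against the address of the page the link was found on. A sketch in Python with the standard-library `urljoin` (the question is about PHP's Simple HTML DOM, so this only illustrates the idea; the base URL and links are made up):

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from (hypothetical)
base = "http://example.com/articles/2010/index.html"

links = ["../images/photo.jpg", "/about", "page2.html", "http://other.example/x"]

# Relative links are resolved against the base; absolute ones pass through
absolute = [urljoin(base, link) for link in links]

for url in absolute:
    print(url)
# http://example.com/articles/images/photo.jpg
# http://example.com/about
# http://example.com/articles/2010/page2.html
# http://other.example/x
```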

How do I turn a web-based calculator into a callable program?

There is a free, online calculator on a web page that I want to access from a C# program. The calculator is very simple -- just an HTML table. There is no JavaScript or Flash. I want to be able to turn this page into a method that I can call. The method would presumably call the web page, enter the appropriate numbers, read the resul...

regular expression help

Possible Duplicate: Programmatically access currency exchange rates Hi. I need help again. I want to load the postbank.bg homepage, grab just the HTML table that holds the exchange rates, and get the rates into an understandable array which I can then process... Any ideas which is the easiest way ...

Use SimpleHtmlDOM + Login?

I am using SimpleHtmlDOM PHP quite successfully to scrape some of my favorite webpages. Some of these pages, however, require me to log in before I can get at the information that I really care about. Does anyone know how (or if it's possible) to get this library to access a page that requires a username and password be entered before y...

Grabbing each frame of an HTML5 canvas

These palette cycle images are breathtaking: http://www.effectgames.com/demos/canvascycle/?sound=0 I'd like to make some (or all) of these into desktop backgrounds. I could use an animated gif version, but I have no idea how to get that from the canvas "animation". Is there anything available yet that can do something along these line...

How to make a "screen region selector" to capture a region of the screen in .NET?

I've used this sample, http://www.vbforums.com/showthread.php?t=385497, to capture the screen and save it to an image. I'd like to know if someone has a sample of a "selector" for selecting the region of the screen to capture, like Camtasia or CamStudio have. I don't want to understand how it works, I just need a code sample, but couldn't f...

BeautifulSoup and ASP.NET/C#

Has anyone integrated BeautifulSoup with ASP.NET/C# (possibly using IronPython or otherwise)? Is there a BeautifulSoup alternative or a port that works nicely with ASP.NET/C#? The plan is to use the library to extract readable text from any random URL. Thanks ...

What's a good & complete PHP/MySQL Screen Scraper project?

Requirements: written in PHP; control over the code (open source would be awesome, purchasing code is an option too). Optional features: listen to robots.txt; automatic rate limiting; scrape based on rules into a data object; admin interface, or configurable back end, to set up new rules; something like CSS selectors to pick our data in th...

Scraping websites with Javascript enabled?

I'm trying to scrape and submit information to websites that rely heavily on JavaScript for most of their actions. The website won't even work when I disable JavaScript in my browser. I've searched for solutions on Google and SO, and someone suggested I should reverse engineer the JavaScript, but I have no idea how to ...