screen-scraping

Programmatic Form Submit

Hi, I want to scrape the contents of a webpage. The contents are produced after a form on that site has been filled in and submitted. I've read about how to scrape the resulting content/webpage, but how do I programmatically submit the form? I'm using Python and have read that I might need to get the original webpage with the form, ...
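One way to approach this, as a minimal sketch using only Python's standard library: parse the form page to collect its input fields (including hidden ones), override the values you want to fill in, and build a POST request from the result. The page HTML, URL, and field names below are hypothetical, and the helper names `FormParser` and `build_form_request` are made up for illustration. The sketch assumes a plain form submitted as `application/x-www-form-urlencoded`; it does not handle `select` or `textarea` elements, cookies, or JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import Request


class FormParser(HTMLParser):
    """Collects the action URL and named <input> fields of the first form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action", "")
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Inputs without a value attribute are submitted as empty strings.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_form_request(page_url, page_html, values):
    """Build a POST request that mimics submitting the page's first form."""
    parser = FormParser()
    parser.feed(page_html)
    fields = {**parser.fields, **values}        # user values override defaults
    body = urlencode(fields).encode("ascii")    # urlencode percent-escapes non-ASCII
    target = urljoin(page_url, parser.action or "")
    return Request(target, data=body, method="POST")


# Hypothetical form page; a real one would be fetched with urllib.request.urlopen.
SAMPLE_PAGE = """
<form action="/search" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="query">
  <input type="submit" value="Go">
</form>
"""

request = build_form_request(
    "http://example.com/form.html", SAMPLE_PAGE, {"query": "python"}
)
```

Passing the returned object to `urllib.request.urlopen` would send the submission; the response body is then the page to scrape.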

Legalities of screen scraping

How does the fair use doctrine apply to websites in terms of screen-scraping? The particular example I am thinking of is extraction of the useful data from a website, and re-presentation of the raw data aggregated with data from other similar websites. For example, suppose one was to extract data from a variety of websites to produce a ...

Protection from screen scraping

Following on from my question on the Legalities of screen scraping, even if it's illegal people will still try, so: What technical mechanisms can be employed to prevent or at least disincentivise screen scraping? Oh and just for grins and to make life difficult, it may well be nice to retain access for search engines. I may well be pla...

Does anyone know of a GUI-less application that can be called from JavaScript to create and save desktop screen shots?

While the subject could sound like I'm looking to do something shifty, I'm not; I maintain an internal web site used by several hundred phone operators, and would like to add the following functionality: I would like to add a control in the header of all of the web pages that would capture an image of the entire desktop and save the im...

Scraping using PHP + SimpleXML... I can grab images but not raw text?

I'm trying to grab a specific bit of raw text from a web site. Using this site and other sources, I learned how to grab specific images using SimpleXML and XPath. However, the same approach doesn't appear to work for grabbing raw text. Here's what's NOT working right now. // first I set the xpath of the div that contains the text...

How to find inbound links to a given URL on the fly?

Technorati's got their Cosmos API, which works fairly well but limits you to noncommercial use and no more than 500 queries a day. Yahoo's got a Site Explorer InLink Data API, but it defines the task very literally, returning links from sidebar widgets in blogs rather than just links from inside blog content. Is there any other alter...

Automated Class timetable optimize crawler?

Overall Plan: Get my class information to automatically optimize and select my uni class timetable.
Overall Algorithm:
1. Log on to the website using its Enterprise Sign On Engine login
2. Find my current semester and its related subjects (pre setup)
3. Navigate to the right page and get the data from each related subject (lecture, practical and ...

Screen Scraping with PHP and XPath

Does anyone know how to maintain text formatting when using XPath to extract data? I am currently extracting all blocks
    <div class="info">
      <h5>title</h5> text <a href="somelink">anchor</a>
    </div>
from a page. The problem is when I access the nodeValue, I can only get plain text. How can I capture the contents including formatting, i...
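The underlying issue is that a node's text value drops child markup, so the usual remedy is to serialize the child nodes rather than read their text. The question concerns PHP; the following is only an illustration of the same idea in Python's `xml.etree.ElementTree`, not the asker's code. It assumes the fragment is well-formed XML, and the helper name `inner_markup` is hypothetical.

```python
import xml.etree.ElementTree as ET


def inner_markup(element):
    """Return the element's contents with child tags preserved."""
    parts = [element.text or ""]
    for child in element:
        # tostring() also emits the child's tail, i.e. the text that
        # follows the child's closing tag inside the parent.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


# The fragment from the question, treated as well-formed XML.
fragment = '<div class="info"> <h5>title</h5> text <a href="somelink">anchor</a> </div>'
div = ET.fromstring(fragment)

plain = "".join(div.itertext())   # text only, markup lost
formatted = inner_markup(div)     # markup kept
```

Here `plain` holds only the concatenated text, while `formatted` keeps the `h5` and `a` tags around it.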

autogenerate HTTP screen scraping Java code

Hi, I need to screen scrape some data from a website, because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself using Apache's HTTP client library to make the relevant HTTP calls to download the data. I figured out the relevant calls I needed to make by clicking through t...

Will providing APIs help deter screen scraping?

I have been thinking quite a bit lately about screen scraping and what a chore it can be, so I pose the following question: would you, as a site developer, expose simple APIs (returning JSON results, for example) to deter users from screen scraping? These results could then be cached, and they are much smaller for traffic than the huge amo...

UnicodeEncodeError with BeautifulSoup 3.1.0.1 and Python 2.5.2

I am using BeautifulSoup 3.1.0.1 with Python 2.5.2 to parse a web page in French. However, as soon as I call findAll, I get the following error:
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1146: ordinal not in range(128)
Below is the code I am currently running:
    import urllib2
    from BeautifulSoup i...

Programmatic data input/output to Citrix application

I have written a script which is a pretty brutal hack using a language called AutoIt. Essentially it screen scrapes and sends keys to mimic a user moving throughout the Citrix app (it's a 25+ year old DOS app). It works relatively well; however, it does need a lot of babysitting. I am planning on rewriting it in C#, however I'm hopin...

Extracting tables from PDF files?

Anyone got any experience with extracting data from PDF files programmatically, in particular embedded tables? What tools did you use? Is this always a one-off process depending on the file, or are there tools which will work against all sorts of different files? ...

Writing a C# program that scans ecommerce websites and extracts product pictures + prices + descriptions from them

Hi, I'm developing an ecommerce search engine that allows you to search for products in a lot of ecommerce websites. How do I approach the matter? I need an application that will be able to scan websites, parse their HTML and determine which of the images in the website are product images, which are product descriptions, which are pro...

Screen scraping technique using PHP

Hi friends, how can I screen scrape a particular website? I need to log in to a website and then scrape the inner information. How could this be done? Please guide me. Duplicate: How to implement a web scraper in PHP? ...

How do I implement a screen scraper in PHP?

I have a user ID and a password to log in to a web site via my program. Once logged in, the URL will change from http://localhost/Test/loginpage.html to http://www.4wtech.com/csp/web/Employee/Login.csp. How can I "screen scrape" the data from the second URL using PHP? ...

Scrape and generate RSS feed

I use Simple HTML DOM to scrape a page for the latest news, and then generate an RSS feed using this PHP class. This is what I have now:
    <?php
    // This is a minimum example of using the class
    include("FeedWriter.php");
    include('simple_html_dom.php');
    $html = file_get_html('http://www.website.com');
    foreach($html->find('td[width="380...

How do I extract my required data from an HTML file?

This is the HTML I have:
    p_tags = '''<p class="foo-body">
    <font class="test-proof">Full name</font> Foobar<br />
    <font class="test-proof">Born</font> July 7, 1923, foo, bar<br />
    <font class="test-proof">Current age</font> 27 years 226 days<br />
    <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan...
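A minimal sketch of one way to pull label/value pairs out of markup shaped like this, using only Python's standard `html.parser`. It assumes each label sits in a `<font class="test-proof">` tag and its value is the text that follows, up to the next `<br />`. The class name `LabelValueParser` is hypothetical, and the sample below uses only the first three fields, since the rest of the question's HTML is truncated.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Maps each <font class="test-proof"> label to the text that follows it."""

    def __init__(self):
        super().__init__()
        self.records = {}
        self._in_label = False
        self._label = None
        self._value = []

    def _flush(self):
        # Store the pending label/value pair, collapsing runs of whitespace.
        if self._label is not None:
            value = " ".join("".join(self._value).split())
            self.records[self._label.strip()] = value
        self._label, self._value = None, []

    def handle_starttag(self, tag, attrs):
        if tag == "font" and dict(attrs).get("class") == "test-proof":
            self._flush()
            self._in_label = True
            self._label = ""
        elif tag == "br":
            self._flush()          # <br /> ends the current value

    def handle_endtag(self, tag):
        if tag == "font":
            self._in_label = False

    def handle_data(self, data):
        if self._in_label:
            self._label += data
        elif self._label is not None:
            self._value.append(data)   # text inside nested tags is kept too

    def close(self):
        super().close()
        self._flush()


p_tags = '''<p class="foo-body">
<font class="test-proof">Full name</font> Foobar<br />
<font class="test-proof">Born</font> July 7, 1923, foo, bar<br />
<font class="test-proof">Current age</font> 27 years 226 days<br />
</p>'''

parser = LabelValueParser()
parser.feed(p_tags)
parser.close()
records = parser.records
```

After this runs, `records` is a dictionary keyed by label, for example `records["Born"]` holds the birth line as plain text.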

I'd like to scrape the iTunes top X RSS feed and insert it into a DB...

Preferably I'd like to do so with some bash shell scripting, maybe some PHP or Perl, and a MySQL DB. Thoughts? ...

Count number of results for a particular word on Twitter

To further a personal project of mine, I have been pondering how to count the number of results for a user-specified word on Twitter. I have used their API extensively, but have not been able to come up with an efficient or even halfway practical way to count the occurrences of a particular word. The actual results are not critical, ju...