Hi,
I want to scrape the contents of a webpage. The contents are produced after a form on that site has been filled in and submitted.
I've read up on how to scrape the resulting content/webpage, but how do I programmatically submit the form?
I'm using Python and have read that I might need to get the original webpage with the form, ...
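For what it's worth, a minimal sketch of the usual approach, assuming the form POSTs to a known URL and that the field names have been read out of the form's HTML (the URL and field names below are placeholders), using Python 2's urllib/urllib2:

import urllib
import urllib2

# Placeholder action URL and field names: take the real ones from the
# <form> on the original page (its action attribute and <input> names).
form_action = 'http://example.com/search'
form_data = urllib.urlencode({
    'q': 'some search term',
    'category': 'books',
})

# Passing a data argument makes urllib2 issue a POST instead of a GET.
response = urllib2.urlopen(form_action, form_data)
result_html = response.read()
print result_html[:500]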
How does the fair use doctrine apply to websites in terms of screen-scraping?
The particular example I am thinking of is extraction of the useful data from a website, and re-presentation of the raw data aggregated with data from other similar websites. For example, suppose one was to extract data from a variety of websites to produce a ...
Following on from my question on the Legalities of screen scraping: even if it's illegal, people will still try, so:
What technical mechanisms can be employed to prevent or at least disincentivise screen scraping?
Oh and just for grins and to make life difficult, it may well be nice to retain access for search engines. I may well be pla...
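One mechanism that comes up repeatedly is per-client rate limiting: it slows bulk scrapers without blocking ordinary visitors, and known search-engine bots can be exempted by user agent or IP range. A rough, framework-agnostic Python sketch (all names and thresholds here are made up):

import time

WINDOW_SECONDS = 60   # size of the sliding window
MAX_REQUESTS = 30     # requests allowed per client per window

_hits = {}  # client IP -> timestamps of recent requests

def allow_request(client_ip, now=None):
    """Return True if this client is still under the rate limit."""
    now = time.time() if now is None else now
    recent = [t for t in _hits.get(client_ip, []) if now - t < WINDOW_SECONDS]
    recent.append(now)
    _hits[client_ip] = recent
    return len(recent) <= MAX_REQUESTS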
While the subject could sound like I'm looking to do something shifty, I'm not; I maintain an internal web site used by several hundred phone operators, and would like to add the following functionality:
I would like to add a control in the header of all of the web pages that would capture an image of the entire desktop and save the im...
I'm trying to grab a specific bit of raw text from a web site. Using this site and other sources, I learned how to grab specific images using SimpleXML and XPath.
However, the same approach doesn't appear to work for grabbing raw text. Here's what's NOT working right now.
// first I set the xpath of the div that contains the text...
Technorati's got their Cosmos API, which works fairly well but limits you to noncommercial use and no more than 500 queries a day.
Yahoo's got a Site Explorer InLink Data API, but it defines the task very literally, returning links from sidebar widgets in blogs rather than just links from inside blog content.
Is there any other alter...
Overall Plan
Get my class information to automatically optimize and select my uni class timetable.
Overall Algorithm
1. Log on to the website using its Enterprise Sign On Engine login.
2. Find my current semester and its related subjects (pre-setup).
3. Navigate to the right page and get the data from each related subject (lecture, practical and ...
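As a rough illustration of steps 1 and 3, here is a Python 2 sketch using urllib2 with a cookie jar so the login session carries over to later requests; the URLs and form field names are invented and would have to come from the real sign-on form:

import urllib
import urllib2
import cookielib

# Keep cookies between requests so the session created at login survives.
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies))

# Hypothetical sign-on form; the real field names come from its HTML.
login_data = urllib.urlencode({'username': 'me', 'password': 'secret'})
opener.open('https://uni.example.edu/sso/login', login_data)

# Once logged in, later pages can be fetched through the same opener.
timetable_html = opener.open('https://uni.example.edu/timetable/semester').read()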
Does anyone know how to maintain text formatting when using XPath to extract data?
I am currently extracting all blocks
<div class="info">
<h5>title</h5>
text <a href="somelink">anchor</a>
</div>
from a page. The problem is when I access the nodeValue, I can only get plain text. How can I capture the contents including formatting, i...
Hi,
I need to screen scrape some data from a website, because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself using Apache's HTTP client library to make the relevant HTTP calls to download the data. I figured out the relevant calls I needed to make by clicking through t...
I have been thinking quite a bit here lately about screen scraping and what a task it can be. So I pose the following question.
Would you, as a site developer, expose simple APIs, such as JSON results, to prevent users from screen scraping?
These results could then implement caching, and they are much smaller for traffic than the huge amo...
I'm using BeautifulSoup 3.1.0.1 with Python 2.5.2 to parse a web page in French. However, as soon as I call findAll, I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1146: ordinal not in range(128)
Below is the code I am currently running:
import urllib2
from BeautifulSoup i...
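A UnicodeEncodeError from the 'ascii' codec generally means unicode text is being forced back through Python's default codec somewhere (printing it is the usual culprit), rather than findAll itself being broken. A minimal sketch of handling the encodings explicitly, assuming the page is actually served as UTF-8 (the URL is a placeholder):

import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x, Python 2

raw = urllib2.urlopen('http://example.com/page-in-french').read()

# Decode the bytes yourself instead of letting the ascii codec guess; check
# the Content-Type header or the <meta> tag for the page's real encoding.
soup = BeautifulSoup(raw.decode('utf-8', 'replace'))

for tag in soup.findAll('p'):
    text = ''.join(tag.findAll(text=True))
    # Encode before printing so an ASCII-only console doesn't raise
    # UnicodeEncodeError on characters like u'\xe9'.
    print text.encode('utf-8')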
I have written a script which is a pretty brutal hack using a language called AutoIt. Essentially it screen scrapes and sends keys to mimic a user moving through the Citrix app (it's a 25+ year old DOS app). It works relatively well; however, it does need a lot of babysitting.
I am planning on rewriting it in C#; however, I'm hopin...
Anyone got any experience with extracting data from PDF files programmatically, in particular embedded tables? What tools did you use? Is this always a one-off process depending on the file, or are there tools which will work against all sorts of different files?
...
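For what it's worth, a sketch of one approach in Python, assuming a text-based (not scanned) PDF and the third-party camelot-py library; the file name is a placeholder:

import camelot  # third-party package camelot-py

tables = camelot.read_pdf('report.pdf', pages='all')
for i, table in enumerate(tables):
    # Each extracted table is exposed as a pandas DataFrame.
    table.df.to_csv('table_%d.csv' % i, index=False)

Whether this works at all still depends on the file: scanned PDFs have no text layer and need OCR first.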
Hi,
I'm developing an ecommerce search engine that allows you to search for products in a lot of ecommerce websites.
How do I approach the matter?
I need an application that will be able to scan websites, parse their HTML and determine which of the images in the website are product images, which are product descriptions, which are pro...
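One naive heuristic, just to show the shape of the problem: fetch a product page, collect the <img> tags, and rank them by their declared dimensions, on the assumption that product shots are usually the largest images on the page. A Python 2 sketch with BeautifulSoup (the URL and size cut-offs are arbitrary):

import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen('http://shop.example.com/product/123').read()  # placeholder URL
soup = BeautifulSoup(html)

candidates = []
for img in soup.findAll('img'):
    attrs = dict(img.attrs)  # BeautifulSoup 3 keeps attributes as (name, value) pairs
    try:
        width = int(attrs.get('width', 0))
        height = int(attrs.get('height', 0))
    except ValueError:
        continue
    # Very rough filter: skip icons, spacers and banners.
    if width >= 200 and height >= 200:
        candidates.append((width * height, attrs.get('src')))

candidates.sort(reverse=True)
print candidates[:3]  # the most likely product images come first

In practice pages rarely declare width/height consistently, so this would need to download the images and check their actual sizes, plus per-site rules for descriptions and prices.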
Hi friends,
How do I screen scrape a particular website? I need to log in to the website first and then scrape the inner information.
How could this be done?
Please guide me.
Duplicate: How to implement a web scraper in PHP?
...
I have a user ID and a password to log in to a web site via my program. Once logged in, the URL will change from http://localhost/Test/loginpage.html to http://www.4wtech.com/csp/web/Employee/Login.csp.
How can I "screen scrape" the data from the second URL using PHP?
...
I use Simple HTML DOM to scrape a page for the latest news, and then generate an RSS feed using this PHP class.
This is what I have now:
<?php
// This is a minimum example of using the class
include("FeedWriter.php");
include('simple_html_dom.php');
$html = file_get_html('http://www.website.com');
foreach($html->find('td[width="380...
This is the HTML I have:
p_tags = '''<p class="foo-body">
<font class="test-proof">Full name</font> Foobar<br />
<font class="test-proof">Born</font> July 7, 1923, foo, bar<br />
<font class="test-proof">Current age</font> 27 years 226 days<br />
<font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan...
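A sketch of one way to turn markup like that into a dict with BeautifulSoup 3 under Python 2, assuming each label sits in a <font class="test-proof"> and its value runs up to the next <br />:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

# Trimmed-down copy of the HTML above, just so the sketch runs on its own.
p_tags = '''<p class="foo-body">
<font class="test-proof">Full name</font> Foobar<br />
<font class="test-proof">Born</font> July 7, 1923, foo, bar<br />
</p>'''

soup = BeautifulSoup(p_tags)
info = {}
for label in soup.findAll('font', {'class': 'test-proof'}):
    # The value is everything between this <font> label and the next <br />.
    parts = []
    node = label.nextSibling
    while node is not None and getattr(node, 'name', None) != 'br':
        if isinstance(node, basestring):
            parts.append(node)  # plain text node
        else:
            parts.append(''.join(node.findAll(text=True)))  # e.g. a <span> wrapper
        node = node.nextSibling
    info[label.string.strip()] = ''.join(parts).strip()

print info  # {u'Full name': u'Foobar', u'Born': u'July 7, 1923, foo, bar'}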
Preferably I'd like to do so with some bash shell scripting, maybe some PHP or Perl, and a MySQL db. Thoughts?
...
To further a personal project of mine, I have been pondering how to count the number of results for a user specified word on Twitter. I have used their API extensively, but have not been able to come up with an efficient or even halfway practical way to count the occurrences of a particular word. The actual results are not critical, ju...
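The usual shape of a solution is to page through search results for the term and keep a running count until the window you care about is exhausted. The Twitter search endpoint, its paging parameters and its rate limits have changed over time, so the fetch step below is only a placeholder, not a working Twitter call:

def fetch_search_page(term, page):
    """Placeholder: call whatever search API you have access to and return
    a list of matching tweet texts for this page (empty list when done)."""
    raise NotImplementedError

def count_occurrences(term, max_pages=10):
    total = 0
    for page in range(1, max_pages + 1):
        tweets = fetch_search_page(term, page)
        if not tweets:
            break
        # Count actual occurrences of the word, not just the number of matching tweets.
        total += sum(tweet.lower().count(term.lower()) for tweet in tweets)
    return total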