screen-scraping

Are accessibility and anti-scrapability mutually exclusive?

I want to make a site that is both difficult to screen-scrape and accessible. Is that an oxymoron? ...

How to write XPath to capture text that is not tagged

I'm trying to scrape customer reviews from a site and ran into an interesting set-up. <div class="Review"> <img class="stars" etc> <b>ReviewerName</b> - yyyy-mm-dd <br/> <p>Review</p> <a>was this helpful links</a> <hr/> <br/> <!-- Repeat above for additional reviews. --> </div> For the life of me I can't come up with ...
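
A minimal sketch of one way to reach the untagged date, assuming the markup can be made well-formed. In full XPath 1.0 the text node after the name could be addressed as `//div[@class="Review"]/b/following-sibling::text()[1]`; Python's standard `xml.etree.ElementTree` does not support `text()` steps, so this sketch reads the element's `.tail` instead. The sample markup and the date in it are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample modelled on the excerpt's structure.
SAMPLE = """
<div class="Review">
  <img class="stars" />
  <b>ReviewerName</b> - 2009-04-01 <br/>
  <p>Review</p>
  <hr/>
</div>
"""

def reviewer_dates(markup):
    """Return (name, date) pairs; the date is the untagged text after each <b>."""
    root = ET.fromstring(markup)
    pairs = []
    for bold in root.iter("b"):
        # .tail holds the text between </b> and the next tag.
        tail = (bold.tail or "").strip()
        # Drop the leading "- " separator that precedes the date.
        pairs.append((bold.text, tail.lstrip("- ").strip()))
    return pairs

print(reviewer_dates(SAMPLE))
```

Real-world pages are often not well-formed, in which case a lenient HTML parser would be needed before a query like this can run.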

What's a good way to protect a link database from automatic scrapers?

I have a large link database that I want to protect from others who would copy it. Is there anything I can do other than force people to enter a CAPTCHA before each link? ...

PHP Scraping Page

I'm trying to scrape a page where the information I'm looking for lies within: <tr class="defRowEven"> <td align="right">label</td> <td>info</td> </tr> I'm trying to get the label and info out of the page. Before, I was doing something like: $hrefs = $xpath->evaluate("/html/body//a"); That is how I'm grabbing the URLs. Is t...
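
The row selection described above can be sketched in Python (used here instead of the question's PHP) with the limited XPath subset in `xml.etree.ElementTree`. The predicate `tr[@class='defRowEven']` is the same idea as an XPath such as `//tr[@class="defRowEven"]/td`. The second row and its `defRowOdd` class are hypothetical, added only to show that non-matching rows are skipped; the sketch also assumes well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical table; only the first row comes from the excerpt.
TABLE = """
<table>
  <tr class="defRowEven"><td align="right">label</td><td>info</td></tr>
  <tr class="defRowOdd"><td align="right">other</td><td>skipped</td></tr>
</table>
"""

def label_info_pairs(markup, row_class="defRowEven"):
    """Return (label, info) tuples from rows whose class equals row_class."""
    root = ET.fromstring(markup)
    pairs = []
    for row in root.findall(f".//tr[@class='{row_class}']"):
        cells = row.findall("td")
        if len(cells) >= 2:
            pairs.append((cells[0].text, cells[1].text))
    return pairs

print(label_info_pairs(TABLE))
```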

Preventing RSS feed scraping?

On a WordPress site, I have both a normal blog that I want Google to detect and an RSS feed for outgoing links to other sites. I don't need or want bots to get at this other RSS feed, nor do I want people to be able to get the link for their own use. I've disabled RSS for the main blog successfully but am not sure how to encrypt/protect/hid...

How to manipulate a Joomla! website for easy screen scraping

I got permission from the owner (who knows nothing about web development) of a Joomla! website to extract the articles from the site (for real!). I got the URLs from the RSS feed, but the feed does not include the full text. Do you know a way to manipulate the index.php parameters to get the article as clean as possible? The url right ...

Web/Screen Scraping with Google App Engine - Code works in python interpreter but not GAE

I want to do some web scraping with GAE (Infinite Campus Student Information Portal, FYI). This service requires you to log in to access the website. I had some code that worked using mechanize in normal Python. When I learned that I couldn't use mechanize in Google App Engine, I ended up using urllib2 + ClientForm. I couldn't get it to l...

Text Extraction from HTML in Java

I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file. I want to extract the information between the paragraph tags, but I can only get one line of the paragraph. My code is as follows: FileReader fileReader = new FileReader(file); BufferedReader buffR...

Python web scraping involving HTML tags with attributes

I'm trying to make a web scraper that will parse a web-page of publications and extract the authors. The skeletal structure of the web-page is the following: <html> <body> <div id="container"> <div id="contents"> <table> <tbody> <tr> <td class="author">####I want whatever is located here ###</td> </tr> </tbody> </table> </div> </div> </...
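
A minimal sketch of one approach, using only the standard library's `html.parser`, which tolerates markup that is not well-formed. The class name `AuthorCollector`, the sample author text, and the extra `title` cell are hypothetical; only the `td class="author"` structure comes from the excerpt.

```python
from html.parser import HTMLParser

class AuthorCollector(HTMLParser):
    """Collect the text of every <td> whose class list contains "author"."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._parts = None  # None while outside an author cell

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "author" in classes:
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._parts is not None:
            self.authors.append("".join(self._parts).strip())
            self._parts = None

def extract_authors(html_text):
    parser = AuthorCollector()
    parser.feed(html_text)
    parser.close()
    return parser.authors

# Hypothetical page following the skeleton in the excerpt.
PAGE = """<html><body><div id="container"><div id="contents"><table><tbody>
<tr><td class="author">A. Author, B. Writer</td></tr>
<tr><td class="title">Some Paper</td></tr>
</tbody></table></div></div></body></html>"""

print(extract_authors(PAGE))
```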

How to extract value from hidden field on form

I have a form (on my own blog/CMS install, which I want to play with a bit) with a hidden value that I want to extract. The problem is that there are two forms on that page, each with that hidden field. The field name is the same on each form; only the hidden value differs. Something like this: <input type="hidden" id="_hiddenname" name="_hidd...
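
One way to tell the two values apart is to record them per form, in document order. The following is a sketch using the standard library's `html.parser`. The field name in the excerpt is cut off, so `token`, the form actions, and the values below are placeholders rather than the real ones.

```python
from html.parser import HTMLParser

class HiddenFieldCollector(HTMLParser):
    """Record the value of one named hidden field for each <form>."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.values = []       # one entry per form, in document order
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._in_form = True
            self.values.append(None)  # stays None if the field is absent
        elif (tag == "input" and self._in_form
              and (attrs.get("type") or "").lower() == "hidden"
              and attrs.get("name") == self.field_name):
            self.values[-1] = attrs.get("value")

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False

def hidden_values(html_text, field_name):
    parser = HiddenFieldCollector(field_name)
    parser.feed(html_text)
    parser.close()
    return parser.values

# Hypothetical page with two forms sharing one hidden field name.
PAGE = """
<form action="/a"><input type="hidden" name="token" value="first"></form>
<form action="/b"><input type="hidden" name="token" value="second" /></form>
"""

print(hidden_values(PAGE, "token"))
```

Indexing the returned list (`values[0]`, `values[1]`) then selects the form of interest.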

When scraping a lot of stats from a webpage, how often should I insert the collected results in my DB?

I'm scraping a website (scripting responsibly by throttling my scraping and with permission) and I'm going to be gathering statistics on 300,000 users. I plan on storing this data in a SQL Database, and I plan on scraping this data once a week. My question is, how often should I be doing inserts on the database as results come in from ...
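
A common middle ground between one insert per result and one giant insert at the end is to buffer rows and write them in batches, one transaction per batch. The sketch below uses SQLite through Python's `sqlite3` module purely for illustration; the `stats` table, its columns, and the batch size of 1000 are assumptions, not details from the question.

```python
import sqlite3

def _flush(conn, buffer):
    # One transaction per batch: commits on success, rolls back on error.
    with conn:
        conn.executemany(
            "INSERT INTO stats (user_id, score) VALUES (?, ?)", buffer
        )

def insert_in_batches(conn, rows, batch_size=1000):
    """Insert rows batch_size at a time; return the number of batches written."""
    batches = 0
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) >= batch_size:
            _flush(conn, buffer)
            batches += 1
            buffer = []
    if buffer:  # write the final partial batch
        _flush(conn, buffer)
        batches += 1
    return batches

# Hypothetical schema and generated data standing in for scraped results.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (user_id INTEGER, score INTEGER)")
batches = insert_in_batches(conn, ((i, i * 2) for i in range(2500)))
total = conn.execute("SELECT COUNT(*) FROM stats").fetchone()[0]
print(batches, total)
```

Batching bounds how much work is lost if the scraper crashes mid-run while avoiding a commit per row; the right batch size depends on the database and would need measuring.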

How can I query rankings for the users in my DB, but only consider the latest entry for each user?

Let's say I have a database table called "Scrape", possibly set up like: UserID (int) UserName (varchar) Wins (int) Losses (int) ScrapeDate (datetime) I'm trying to rank my users based on their Win/Loss ratio. However, each week I'll be scraping for new data on the users and making another entry in the Scrape table...
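
One possible query, sketched against SQLite from Python: join the table to a subquery that finds each user's most recent ScrapeDate, then order by the ratio. The column names come from the excerpt; the sample rows, the ISO-text date storage, and the choice to treat zero losses as one (to avoid division by zero) are assumptions. The two-argument `MAX` is SQLite's scalar function and may need replacing on other databases.

```python
import sqlite3

RANK_QUERY = """
SELECT s.UserName
FROM Scrape AS s
JOIN (SELECT UserID, MAX(ScrapeDate) AS Latest
      FROM Scrape
      GROUP BY UserID) AS m
  ON m.UserID = s.UserID AND m.Latest = s.ScrapeDate
-- Assumption: zero losses are treated as one to avoid dividing by zero.
ORDER BY CAST(s.Wins AS REAL) / MAX(s.Losses, 1) DESC
"""

def rank_users(conn):
    """Return user names ordered by win/loss ratio of their latest scrape."""
    return [row[0] for row in conn.execute(RANK_QUERY)]

# Hypothetical data: two weekly scrapes for ann and bob, one for cy.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Scrape (UserID INTEGER, UserName TEXT, "
    "Wins INTEGER, Losses INTEGER, ScrapeDate TEXT)"
)
conn.executemany(
    "INSERT INTO Scrape VALUES (?, ?, ?, ?, ?)",
    [
        (1, "ann", 5, 5, "2009-04-01"),
        (1, "ann", 9, 3, "2009-04-08"),
        (2, "bob", 8, 4, "2009-04-01"),
        (2, "bob", 8, 8, "2009-04-08"),
        (3, "cy", 2, 1, "2009-04-08"),
    ],
)
ranking = rank_users(conn)
print(ranking)
```

Only the latest row per user counts here: bob's older 8/4 record is ignored in favor of his latest 8/8.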

scraping/simulate browsing help

I want to make a program that will simulate a user browsing a site and clicking on links. Cookies and JavaScript have to be enabled. I've successfully done this in Python, but I want to write it in a compiled language (Python IDEs don't cut it). The links on the site are generated with JavaScript and are dynamic. With Python I used PAMI...

Legality of harvesting public websites in an open-source application

The question is about a free and open-source product that allows users to catalog and organize their personal ebook collection. The book's info could be harvested from various websites (Amazon, B&N, etc.) via ISBN queries. I was wondering if there are legal issues in developing such a product, with regard to violating "terms of use" of...

C# - reading text off of an existing process

We need to read text from an existing VB6 application. We use the FindWindow, GetWindowText, and EnumChildWindows functions from user32 and can enumerate and read the displayed text in this process. We are able to read 90% of the text with our method, but there is a specific control (or box) that we cannot read....

How to enable thumbnail selection for external links? (or: reproduce Facebook's "post to profile" functionality)

When posting a link to your Facebook profile, users are presented with the option to choose a thumbnail to represent the link, as seen in the following example: http://www.everyday.com.my/photo/2009/4/Add-Sushi-King-into-my-Facebook-profile.jpg (New users aren't allowed to embed images) The thumbnails presented to the user are the diff...

State of HTML after onload JavaScript

Many webpages use onload JavaScript to manipulate their DOM. Is there a way I can automate accessing the state of the HTML after these JavaScript operations? A tool like wget is not useful here because it just downloads the original source. Is there perhaps a way to use a web browser rendering engine? Ideally I am after a s...

How can I tell what kind of whitespace is in a string?

I am scraping some information from a 10-year-old website that was built in ASP using FrontPage (originally) and Dreamweaver (lately). I am using PHP. I am getting back strings with whitespace that is not spaces. Using the PHP trim function, some of the whitespace is removed but not all. original string: string(47) " School Calendar" ...
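
To see exactly which characters are present, it helps to print each suspicious character's code point and Unicode name. The sketch below does this in Python (rather than the question's PHP) with the standard `unicodedata` module. The sample string is hypothetical: it assumes a non-breaking space (U+00A0), a frequent culprit in pages from older visual editors, but the actual characters in the scraped string are not known from the excerpt.

```python
import unicodedata

def describe_whitespace(text):
    """Return (index, code point, name) for each space-like or control character."""
    report = []
    for index, char in enumerate(text):
        if char.isspace() or unicodedata.category(char) in ("Zs", "Cc", "Cf"):
            # Control characters such as tab have no Unicode name, hence the default.
            name = unicodedata.name(char, "<unnamed control character>")
            report.append((index, f"U+{ord(char):04X}", name))
    return report

# Hypothetical input: a non-breaking space and a tab before the visible text.
SAMPLE = "\u00a0\tSchool Calendar"

for entry in describe_whitespace(SAMPLE):
    print(entry)
```

Once the offending code points are known, they can be added to the set of characters being stripped.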

Finding the UserID of a Stack Overflow user with their Display Name in C#?

I'm building a small Stack Overflow application, but to collect information from Stack Overflow about a user I need to know their UserID. I would like the user to be able to enter their display name/username and for the application to find their UserID. I understand that usernames are not unique, but would it be possible to find...

Optimal Configuration for Disguising Identity of Scraping

I'm running a bunch of scripts that are scraping data from a website. For reasons I won't bore you with, I can't run them all off the same host; instead I need to set up six different hosts. I want to configure my hosting setup to disguise the fact that all six hosts have the same owner. I have gotten six different shared hosting acco...