questions about screen-scraping | ansaurus

screen-scraping

Are accessibility and anti-scrapability mutually exclusive?

I want to make a site that is both difficult to screen-scrape and accessible. Is that an oxymoron? ...

language-agnostic

screen-scraping

How to write XPath to capture text that is not tagged

I'm trying to scrape customer reviews from a site and ran into an interesting setup: <div class="Review"> <img class="stars" etc> <b>ReviewerName</b> - yyyy-mm-dd <br/> <p>Review</p> <a>was this helpful links</a> <hr/> <br/>  </div> For the life of me I can't come up with ...

screen-scraping
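
One way to think about the untagged date in the question above: it is a text node that follows the closing </b> tag. In a full XPath 1.0 engine, an expression along the lines of //div[@class="Review"]/b/following-sibling::text()[1] should select it. The sketch below swaps in Python's standard-library ElementTree, which does not support text() steps but exposes the same text as the .tail of the <b> element. The sample markup is a hypothetical, well-formed version of the snippet (the src attribute and the date are made up), and reviewer_and_date is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed version of the review markup from the question.
SAMPLE = """
<div class="Review">
  <img class="stars" src="stars.png"/>
  <b>ReviewerName</b> - 2009-04-01 <br/>
  <p>Review</p>
</div>
"""


def reviewer_and_date(div):
    """Return (name, date), where the date is the untagged text after <b>."""
    bold = div.find("b")
    # ElementTree keeps text that follows an element's closing tag in .tail.
    tail = (bold.tail or "").strip()
    # Drop the leading "-" separator and any padding around the date.
    date = tail.lstrip("-").strip()
    return bold.text, date


print(reviewer_and_date(ET.fromstring(SAMPLE)))
```

Real pages are rarely well-formed XML, so an HTML-tolerant parser would probably be needed before this approach applies.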

What's a good way to protect a link database from automatic scrapers?

I have a large link database that I want to protect against others who would want to copy it. Is there anything I can do other than force people to enter a CAPTCHA before each link? ...

screen-scraping

data-protection

PHP Scraping Page

I'm trying to scrape a page where the information I'm looking for lies within: <tr class="defRowEven"> <td align="right">label</td> <td>info</td> </tr> I'm trying to get the label and info out of the page. Before, I was doing something like: $hrefs = $xpath->evaluate("/html/body//a"); That is how I'm grabbing the URLs. Is t...

screen-scraping
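
For the label/info rows in the question above, the same XPath idea used for the links can target the table rows: select each tr with class "defRowEven" and read its two td children. The sketch below shows this in Python with the standard-library ElementTree rather than PHP, so it illustrates the query shape and is not a drop-in answer. The sample table (including the "defRowOdd" row) and the helper name label_info_pairs are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical table built around the row shown in the question.
SAMPLE = """
<table>
  <tr class="defRowEven"><td align="right">label</td><td>info</td></tr>
  <tr class="defRowOdd"><td align="right">other</td><td>skipped</td></tr>
</table>
"""


def label_info_pairs(root, row_class="defRowEven"):
    """Return (label, info) tuples for every row with the given class."""
    pairs = []
    # ElementTree supports attribute predicates such as [@class='...'].
    for row in root.findall(f".//tr[@class='{row_class}']"):
        cells = row.findall("td")
        if len(cells) == 2:
            pairs.append((cells[0].text, cells[1].text))
    return pairs


print(label_info_pairs(ET.fromstring(SAMPLE)))
```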

Preventing RSS feed scraping?

On a WordPress site, I have both a normal blog that I want Google to detect and an RSS feed for outgoing links to other sites. I don't need/want bots to get at this other RSS feed, nor do I want people to be able to get the link for their own use. I've disabled RSS for the main blog successfully but am not sure how to encrypt/protect/hid...

screen-scraping

How to manipulate a Joomla! website for easy screen scraping

I got permission from the owner (who knows nothing about web development) of a Joomla! website to extract the articles from the site (for real!). I got the URLs from the RSS feed, but the feed does not include the full text. Do you know a way to manipulate the index.php parameters to get the article as clean as possible? The url right ...

screen-scraping

Web/Screen Scraping with Google App Engine - Code works in python interpreter but not GAE

I want to do some web scraping with GAE. (Infinite Campus Student Information Portal, FYI.) This service requires you to log in to access the website. I had some code that worked using mechanize in regular Python. When I learned that I couldn't use mechanize in Google App Engine, I ended up using urllib2 + ClientForm. I couldn't get it to l...

google-app-engine

screen-scraping

Text Extraction from HTML Java

Hi. I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file. I want to extract the information that is in between the paragraph tags, but I can only get one line of the paragraph. My code is as follows: FileReader fileReader = new FileReader(file); BufferedReader buffR...

screen-scraping

html-content-extraction

text-extraction
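
The code in the question above is cut off, so the cause of the one-line result is not visible; a plausible guess (an assumption) is that the file is read line by line, so a paragraph that spans several lines is never seen as a whole. The sketch below shows the alternative of feeding the whole document to a parser and buffering everything between <p> and </p>. It is written in Python with the standard-library HTMLParser rather than Java, and ParagraphExtractor, extract_paragraphs, and the sample HTML are illustrative names and data.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element, even when it spans lines."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while the parser is outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Collapse line breaks and runs of spaces into single spaces.
            self.paragraphs.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical page with a paragraph broken across three lines.
SAMPLE = "<html><body><p>first line\nsecond line\nthird line</p><p>another</p></body></html>"
print(extract_paragraphs(SAMPLE))
```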

Python web scraping involving HTML tags with attributes

I'm trying to make a web scraper that will parse a web-page of publications and extract the authors. The skeletal structure of the web-page is the following: <html> <body> <div id="container"> <div id="contents"> <table> <tbody> <tr> <td class="author">####I want whatever is located here ###</td> </tr> </tbody> </table> </div> </div> </...

screen-scraping
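
A minimal sketch for the page skeleton in the question above, using only Python's standard-library HTMLParser: it switches a flag on when it sees a td whose class attribute is exactly "author" and collects the text until that cell closes. The author names in the sample and the names AuthorCellParser and extract_authors are made up for illustration, and cells with several classes or nested tables would need more care.

```python
from html.parser import HTMLParser


class AuthorCellParser(HTMLParser):
    """Collect the text content of every <td class="author"> cell."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._in_author = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == "author":
            self._in_author = True
            self._parts = []

    def handle_data(self, data):
        if self._in_author:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_author:
            self.authors.append("".join(self._parts).strip())
            self._in_author = False


def extract_authors(html):
    parser = AuthorCellParser()
    parser.feed(html)
    parser.close()
    return parser.authors


# Hypothetical page following the skeleton from the question.
SAMPLE = """
<html><body><div id="container"><div id="contents"><table><tbody>
<tr><td class="author">Jane Doe, John Roe</td></tr>
<tr><td class="title">Not an author cell</td></tr>
</tbody></table></div></div></body></html>
"""
print(extract_authors(SAMPLE))
```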

How to extract value from hidden field on form

I have a form (on my own blog/CMS install, which I want to play with a bit) with a hidden value that I want to extract. The problem is that there are 2 forms on that page, each with that hidden field and a value. On each form the field name is the same; only the hidden value differs. Something like this: <input type="hidden" id="_hiddenname" name="_hidd...

screen-scraping
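
When two forms carry a hidden field with the same name, one approach is to record which form each match was found in, so the values can be told apart by position. The sketch below does that with Python's standard-library HTMLParser. Because the field name in the question is truncated, the name "example_token", the form ids, and the values are all hypothetical placeholders, as are HiddenFieldParser and hidden_values.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Map each form's position on the page to its hidden field value."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.values_by_form = {}  # 0-based form position -> hidden value
        self._form_index = -1

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "form":
            self._form_index += 1
        elif (tag == "input" and attr.get("type") == "hidden"
              and attr.get("name") == self.field_name):
            self.values_by_form[self._form_index] = attr.get("value")


def hidden_values(html, field_name):
    parser = HiddenFieldParser(field_name)
    parser.feed(html)
    parser.close()
    return parser.values_by_form


# Hypothetical page: the field name and values are placeholders.
SAMPLE = """
<form id="first"><input type="hidden" name="example_token" value="abc"/></form>
<form id="second"><input type="hidden" name="example_token" value="xyz"/></form>
"""
print(hidden_values(SAMPLE, "example_token"))
```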

When scraping a lot of stats from a webpage, how often should I insert the collected results in my DB?

I'm scraping a website (scripting responsibly by throttling my scraping and with permission) and I'm going to be gathering statistics on 300,000 users. I plan on storing this data in a SQL Database, and I plan on scraping this data once a week. My question is, how often should I be doing inserts on the database as results come in from ...

screen-scraping
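
One common middle ground between inserting every row as it arrives and holding all 300,000 results in memory is to buffer rows and write them in fixed-size batches, one transaction per batch. The sketch below illustrates the idea with Python's built-in sqlite3 module. The stats table, its columns, the batch size of 1000, and the function name insert_in_batches are assumptions made for the example; the right batch size depends on the actual database and workload.

```python
import sqlite3


def insert_in_batches(conn, rows, batch_size=1000):
    """Insert rows in batches, committing once per batch.

    Returns the number of batches written.
    """
    sql = "INSERT INTO stats (user_id, wins, losses) VALUES (?, ?, ?)"
    batch, batches = [], 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(sql, batch)
            conn.commit()
            batches += 1
            batch = []
    if batch:
        # Flush whatever is left after the last full batch.
        conn.executemany(sql, batch)
        conn.commit()
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (user_id INTEGER, wins INTEGER, losses INTEGER)")
# Stand-in for scraped results arriving one at a time.
scraped = ((i, i % 7, i % 5) for i in range(2500))
print(insert_in_batches(conn, scraped, batch_size=1000))
print(conn.execute("SELECT COUNT(*) FROM stats").fetchone()[0])
```

If the scraper crashes partway through, at most one unwritten batch is lost, which is the main trade-off against a single insert at the end.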

How can I query rankings for the users in my DB, but only consider the latest entry for each user?

Let's say I have a database table called "Scrape", possibly set up like: UserID (int) UserName (varchar) Wins (int) Losses (int) ScrapeDate (datetime) I'm trying to rank my users based on their Win/Loss ratio. However, each week I'll be scraping for new data on the users and making another entry in the Scrape table...

sql-server-2005

screen-scraping

greatest-n-per-group
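
A common shape for this kind of greatest-n-per-group query is to join the table to a subquery that finds each user's latest ScrapeDate, then rank only those rows. The sketch below runs that query against the Scrape table from the question using Python's built-in sqlite3 module instead of SQL Server 2005, so the SQL may need small dialect changes. The sample rows and the function name latest_rankings are made up, the ratio is taken to mean Wins divided by Losses (an assumption), and NULLIF guards against dividing by zero.

```python
import sqlite3

QUERY = """
SELECT s.UserName,
       CAST(s.Wins AS REAL) / NULLIF(s.Losses, 0) AS Ratio
FROM Scrape AS s
JOIN (SELECT UserID, MAX(ScrapeDate) AS LatestDate
      FROM Scrape
      GROUP BY UserID) AS latest
  ON latest.UserID = s.UserID
 AND latest.LatestDate = s.ScrapeDate
ORDER BY Ratio DESC
"""


def latest_rankings(conn):
    """Rank users by win/loss ratio using only each user's newest row."""
    return conn.execute(QUERY).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Scrape (UserID INTEGER, UserName TEXT, Wins INTEGER, "
    "Losses INTEGER, ScrapeDate TEXT)"
)
# Hypothetical data: two weekly scrapes for alice and bob, one for carol.
conn.executemany(
    "INSERT INTO Scrape VALUES (?, ?, ?, ?, ?)",
    [
        (1, "alice", 5, 5, "2010-01-01"),
        (1, "alice", 9, 3, "2010-01-08"),
        (2, "bob", 10, 1, "2010-01-01"),
        (2, "bob", 10, 5, "2010-01-08"),
        (3, "carol", 4, 1, "2010-01-08"),
    ],
)
print(latest_rankings(conn))
```

On SQL Server 2005 itself, ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY ScrapeDate DESC) is another way to pick the newest row per user.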

scraping/simulate browsing help

I want to make a program that will simulate a user browsing a site and clicking on links. Cookies and JavaScript have to be enabled. I've successfully done this in Python, but I want to write it in a compiled language (Python IDEs don't cut it). The links on the site are generated with JavaScript and are dynamic. With Python I used PAMI...

webbrowser-control

screen-scraping

Legality of harvesting public websites in an opensource application

The question is about a free and open source product which allows users to catalog and organize their personal ebook collection. The book's info could be harvested from various websites (Amazon, B&N, etc.) via ISBN queries. I was wondering if there are legal issues in developing such a product, with regard to violating "terms of use" of...

screen-scraping

C# - reading text off of an existing process

We have to read text off of an existing VB6 application. So we use the methods FindWindow, GetWindowText, and EnumChildWindows out of user32 and can enumerate and read the displayed text in this process. We are able to read 90% of the text with our method, but there is a specific control (or box) that we cannot read....

screen-scraping

How to enable thumbnail selection for external links? (or: reproduce Facebook's "post to profile" functionality)

When posting a link to their Facebook profile, users are presented with the option to choose a thumbnail to represent the link, as seen in the following example: http://www.everyday.com.my/photo/2009/4/Add-Sushi-King-into-my-Facebook-profile.jpg (New users aren't allowed to embed images) The thumbnails presented to the user are the diff...

screen-scraping

state of HTML after onload javascript

Hi there. Many webpages use onload JavaScript to manipulate their DOM. Is there a way I can automate accessing the state of the HTML after these JavaScript operations? A tool like wget is not useful here because it just downloads the original source. Is there perhaps a way to use a web browser rendering engine? Ideally I am after a s...

screen-scraping

How can I tell what kind of whitespace is in a string?

I am scraping some information from a 10-year-old website that was built in ASP using FrontPage (originally) and Dreamweaver (lately). I am using PHP. I am getting back strings with whitespace that is not spaces. Using the PHP trim function, some of the whitespace is removed but not all. Original string: string(47) " School Calendar" ...

screen-scraping
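
A general way to find out what kind of whitespace a string contains is to list the code point and Unicode name of every space-like or control character in it. The sketch below does this in Python with the standard-library unicodedata module rather than PHP. The sample string is an assumption: it puts a non-breaking space (U+00A0) and a tab in front of the text, since the non-breaking space is a frequent culprit in scraped HTML and is not in the default set of characters that PHP's trim removes. The function name describe_whitespace is illustrative.

```python
import unicodedata


def describe_whitespace(text):
    """Return (index, code point, Unicode name) for each space-like
    or control character in the text."""
    found = []
    for index, char in enumerate(text):
        if char.isspace() or unicodedata.category(char) in ("Zs", "Cc", "Cf"):
            # Control characters such as tab have no Unicode name,
            # so fall back to a placeholder label.
            name = unicodedata.name(char, "<unnamed control character>")
            found.append((index, f"U+{ord(char):04X}", name))
    return found


# Hypothetical input: a non-breaking space and a tab before the text.
SAMPLE = "\u00a0\tSchool Calendar"
for entry in describe_whitespace(SAMPLE):
    print(entry)
```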

Finding the UserID of a Stack Overflow user with their Display Name in C#?

I'm building a small Stack Overflow application, but to collect information from Stack Overflow about a user I need to know their UserID. I would like the user to be able to enter their display name/username and for the application to find their UserID. I understand that usernames are not unique, but would it be possible to find...

screen-scraping

Optimal Configuration for Disguising Identity of Scraping

I'm running a bunch of scripts that are scraping data from a website. For reasons I won't bore you with, I can't run them all off the same host--instead I need to set up six different hosts. I want to configure my hosting setup to disguise the fact that all six hosts have the same owner. I have gotten six different shared hosting acco...

screen-scraping
