screen-scraping

Getting text that will be displayed to the user from HTML

Bit of a random one: I want to have a play with some NLP stuff and I would like to get all the text that will be displayed to the user in a browser from HTML. My ideal output would not have any tags in it and would only have full stops (and any other punctuation used) and new line characters, though I can tolerate a fairly reason...
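
One way to approach this, sketched here with only the Python standard library: walk the HTML with `html.parser`, drop the contents of elements a browser does not render as text, and turn block-level tags into line breaks. The class name `VisibleTextParser`, the function `visible_text`, and the two tag sets are assumptions made for this illustration; they are rough approximations of browser behaviour and ignore CSS (for example `display: none`).

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping elements whose content is not shown as page text."""

    # Assumed sets for illustration; a real browser decides this via CSS as well.
    SKIP = {"script", "style", "head", "title", "noscript", "template"}
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def visible_text(html):
    """Return tag-free text with one line per block and collapsed whitespace."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

Punctuation is left untouched, so the output can be fed to a sentence splitter as-is. Badly malformed markup may need a more tolerant third-party parser.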

Other options for accessing input fields with Ruby Mechanize?

According to the documentation: "Mechanize lets you access form input fields in a few different ways". But I can only see one way using accessors. What other options are there? For example: can you reference form field parts like "Mechanize::Form::Text:0x101698168" instead of having to use the name value. ...

What movie website allows people to scrape it?

I've wanted to make a C# library to scrape movie information and return it to the application, but someone told me that it's against the TOS. RottenTomatoes seems to have no problems with it from what I've read on their licensing page, but I'm not quite sure. Where could I acquire movie information legally and without cost? It's for an ...

Possible to automate a web search?

Is it possible to enter a series of searches into a website search form? I have a list of destinations and would like to see whether, for each destination, the search returns a result or throws an error. ...

Crawling and Scraping iTunes App Store

I noticed that iTunes preview allows you to crawl and scrape pages via the http:// protocol. However, many of the links are trying to be opened in iTunes rather than the browser. For example, when you go to the iBooks page, it immediately tries opening a url with an itms:// protocol. Are there any other methods of crawling the App Store...

Are there any C# libraries for screen scraping?

Hi, there are lots of open source screen scraping libraries for Python and PHP. However, I couldn't find any .NET counterpart. Could you recommend any library for screen scraping, or just HTML parsing, that makes life easier? ...

Using Ruby webdriver (Selenium 2.0) to click on a javascript link

Using ruby, how can I get webdriver to click on a javascript link? The link I'm trying to click on is: <a class="TabOff" href=" javascript:showConfirm('/campustoolshighered/k12_admin_admin_menu.do');">Administration </a> Would I be able to trigger the javascript with a keyPress event? If so, does anyone know the syntax for doing that?...

Scraping sites that require login with Python

I use several ad networks for my sites, and to see how much money I made I need to log in to each daily to add up the values. I was thinking of making a Python script that would do this for me to get a quick total. I know I need to do a POST request to log in, then store the cookies that I get back and then GET request the report page wh...
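
The flow described above (POST the login form, keep the cookies, then GET the report page) can be sketched with the standard library alone. The function names, the form field names, and the URLs passed in are all hypothetical; each ad network will use its own field names and may add CSRF tokens or JavaScript-driven logins that this sketch does not handle.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Build an opener that stores cookies from responses and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def encode_form(fields):
    """Encode a dict of form fields as an application/x-www-form-urlencoded body."""
    return urllib.parse.urlencode(fields).encode("utf-8")


def fetch_report(opener, login_url, report_url, fields):
    """POST the login form, then GET the report page using the same cookies."""
    # Supplying data makes urllib send a POST request.
    with opener.open(login_url, data=encode_form(fields)) as response:
        response.read()
    # The session cookie set during login is sent automatically here.
    with opener.open(report_url) as response:
        return response.read().decode("utf-8", errors="replace")
```

A call might look like `fetch_report(opener, login_url, report_url, {"username": "me", "password": "secret"})`, with the real field names taken from the login form's HTML.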

Which Ruby/Rails or PHP CSS-selector-based scraper toolkit do you recommend?

I'm looking for suggestions regarding scraping toolkits. The solution need not be very tolerant of malformed HTML or able to adapt to many different situations. It doesn't need to be very scalable; it will be run at most once daily. It needs to do one thing and do it well: scrape HTML from a specific site. I would rather use a css select...

What technology for large scale scraping/parsing?

We're designing a large-scale web scraping/parsing project. Basically, the script needs to go through a list of web pages, extract the contents of a particular tag, and store it in a database. What language would you recommend for doing this on a large scale (tens of millions of pages)? We're using MongoDB for the database, so anythi...

Gap.com is redirecting me when I try to Screen Scrape

We are building a site that allows users to collect and store their favorite products from all over the Internet in one spot. We have an algorithm that filters out and finds the correct image by reading the source code. 80% of the sites work correctly, but 2 large companies are blocking us, redirecting us from a product page to their homepa...

Data Scraping from PDF and Excel

I am doing a little data scraping. There are 3 types of file from which I am scraping data: 1. HTML 2. PDF 3. Excel (xls). For HTML I am comfortable; I am using HTML Agility for that. For PDF and Excel I need suggestions from anyone. Thanks in advance. ...

Retrieving a lot of URL addresses

Dear Coding Experts, Edit: Just for clarification, I am using Python and would like to do this within Python. I am in the middle of collecting data for a research project at our university. Basically I need to scrape a lot of information from a website that monitors the European Parliament. Here is an example of how the URL of one site...

How do I prevent site scraping?

I have a fairly large music website with a large artist database. I've been noticing other music sites scraping our site's data (I enter dummy artist names here and there and then do Google searches for them). How can I prevent screen scraping? Is it even possible? ...

Scraping a table using BeautifulSoup

Dear Python Experts, I have a question which I suspect is fairly straightforward. I have the following type of page from which I want to collect the information in the last table (if you scroll all the way down it is the one in the box labelled "Procedure"): http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&re...

How to Store Entire WebPages for Later Parsing?

I've been doing a lot of parsing of webpages lately and my process usually looks something like this:
1. Obtain list of links to Parse
2. Import list into database
3. Download Entire Webpage for each link and store into mysql
4. Add Index for each scraping session
5. Scrape relevant sections (content, metas, whatever)
Steps 4,5 -- Rinse/Repeat -- as ...
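
The store-then-parse idea above can be sketched as a small page store keyed by URL and scraping session, so that parsing can be repeated later without downloading again. This uses `sqlite3` from the Python standard library as a stand-in for the MySQL database mentioned in the question; the table layout, the compression step, and the function names are assumptions for illustration only.

```python
import sqlite3
import zlib


def open_store(path=":memory:"):
    """Open (or create) the page store; one row per URL per scraping session."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               url     TEXT NOT NULL,
               session TEXT NOT NULL,
               body    BLOB NOT NULL,
               PRIMARY KEY (url, session)
           )"""
    )
    return conn


def save_page(conn, url, session, html):
    """Store the raw page, compressed, replacing any earlier copy for this session."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, session, body) VALUES (?, ?, ?)",
        (url, session, zlib.compress(html.encode("utf-8"))),
    )
    conn.commit()


def load_page(conn, url, session):
    """Return the stored HTML for a URL and session, or None if it was never saved."""
    row = conn.execute(
        "SELECT body FROM pages WHERE url = ? AND session = ?", (url, session)
    ).fetchone()
    return None if row is None else zlib.decompress(row[0]).decode("utf-8")
```

Keeping the raw HTML separate from the extracted fields means the scraping step can be rerun with new rules against the same stored session.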

Prevent images from downloading with ScrAPI

I need to scrape some websites, and would like to avoid downloading images from the pages I am scraping - I only need the text. I am hoping this will speed up the process. Any ideas on how to manage this? Thanks, Jon ...

PHP HTTP POST button

Hi there, I'm using PHP to scrape data from another website. However, on certain occasions I need to confirm a variable (because there are two very similar possibilities). The button I'm supposed to click to confirm my variable is: <input type="submit" class="buttonEmphasized confirm_nl" name="start" value="Bevestig" accesskey="s" /> However, ...

Is there a simple way in R to extract only the text elements of an HTML page?

Is there a simple way in R to extract only the text elements of an HTML page? I think this is known as 'screen scraping' but I have no experience of it, I just need a simple way of extracting the text you'd normally see in a browser when visiting a url. ...

Scraping websites in Java

What I am trying to do is take a list of URLs and download each URL's content (for indexing). The biggest problem is that if I encounter a link that is something like a Facebook event that simply redirects to the login page, I need to be able to detect and skip that URL. It seems as though the robots.txt file is there for this purpose....
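
The two checks raised here are separate: robots.txt says which paths a crawler is asked not to fetch, while a redirect to a login page only shows up once the request has been followed. A minimal sketch of both, in Python rather than Java and using only the standard library, might look like the following. The function names and the `markers` heuristic are assumptions; matching on "login" in the final URL path is a guess that will not fit every site.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def looks_like_login_redirect(requested_url, final_url, markers=("login", "signin")):
    """Heuristic: the request ended on a different URL whose path mentions a login page."""
    if requested_url == final_url:
        return False
    path = urlsplit(final_url).path.lower()
    return any(marker in path for marker in markers)
```

The `final_url` would come from whatever HTTP client follows the redirects (for example the response URL it reports after the request completes); URLs that fail either check can be skipped before indexing.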