screen-scraping

How do I find "wide characters" printed by perl?

A Perl script that scrapes static HTML pages from a website and writes them to individual files appears to work, but it also prints many instances of "Wide character in print at ./script.pl line n" to the console: one for each page scraped. However, a brief glance at the generated HTML files does not reveal any obvious mistakes in the scraping. ...

Scraping Data From a Dynamic Website

Background: The page has a table with data in it. There are several hyperlinks that, when clicked, replace the data in the table with new data. Also, the page is an aspx page. Goal: I want to scrape the data in the table for every hyperlink. I have looked at what is going on via Firebug, and when a hyperlink is clicked, it ge...

How to scrape websites such as Hype Machine?

I'm curious about website scraping (i.e. how it's done, etc.); specifically, I'd like to write a script to perform the task for the site Hype Machine. I'm actually a Software Engineering undergraduate (4th year); however, we don't really cover any web programming, so my understanding of JavaScript/RESTful APIs/all things Web is pretty l...

Is there any Python library to scrape search engine results?

I am looking for a Python library to scrape results from search engines (Google, Yahoo, Bing, etc.). I only found one for Google -> http://github.com/kevinw/xgoogle/tree/253db7ddc8603a9dcb038ae42684cf3499a22a4b Does anyone know of one for multiple search engines? ...

Capturing AJAX requests

I want to capture an AJAX HTTP request with all of its headers, cookies, and POST parameters so I can save it and scrape it later. I can't find a good way of doing this with Firefox or Chrome. Firebug truncates long POST parameters, saying "... Firebug request size limit has been reached by Firebug. ... " in the middle of it, which doesn't...

Scraping with Selenium

I would like to scrape some dynamic data off of a website. On the site, there are a couple of links at the top labeled "1", "2", "3", and "next". If a link labeled by a number is pressed, it dynamically loads in some data into a content div. If "next" is pressed, it goes to a page with labels "4", "5", "6", "next" and the data for page ...

Simple PHP Screen Scraping Function

I'm experimenting with autoblogging (i.e., RSS-driven blog posting) using WordPress, and all that's missing is a component to automatically fill in the content of the post with the content that the RSS item's URL links to (RSS is irrelevant to the solution). Using standard PHP 5, how could I create a function called fetchHTML([URL]) that re...

Blocking Web Scrapers

What are ways that websites can block web scrapers? How can you identify if your server is being accessed by a bot? ...

Get data from a Facebook page wall or group wall for use on a personal website

Hi! I want to connect to a public Facebook page or group and list all entries from the wall on a personal website. I will use PHP on my server, so that would be the best solution for me. Or JavaScript. Could anyone explain, or perhaps provide working code, on how to do this? Or just list all the steps necessary for making this? If it's possible to han...

XPath expression to select text not in paragraph

I'm developing web scraping software that relies on XPath to extract information from web pages. One application of the software is to scrape reviews of shows from websites. One page I'm trying to scrape is the Guardian's latest Edinburgh festival reviews: http://www.guardian.co.uk/culture/edinburghfestival+tone/reviews The section I w...

C# RegEx on a StreamReader will not return matches.

I'm writing myself a simple screen scraping application to play around with the HTMLAgilityPack library, and after getting it to work on several different types of HtmlNodes, I figured I'd get fancy and throw in a Regex for Email addresses as well. The only problem is that the application never finds any matches, or maybe it is but not r...

Using Flash to get a verification code

Say I have a Flash movie in an HTML page, and I want to get the value of a tag in the HTML page (for example, the google-site-verification content value). How could this be done? <meta name="google-site-verification" content="12345678978564261321567498789" /> *UPDATE - I want to embed a Flash template into an HTML web page, in this web page...

Screen Scrape a page of a web app - Internal Server Error

I am trying to screen scrape a page of a web app that just contains text and is hosted by a 3rd party. It's not a properly formed HTML page; however, the text that is displayed will tell us if the web app is up or down. When I try to scrape the screen, it returns an error when it tries the WebRequest. The error is "The remote server retu...

How to import Facebook contacts using PHP and cURL

I want to create a contact importer application. How can I import Facebook contacts using PHP and cURL (preferably by screen scraping)? Can anybody please help me? ...

Indian Railway Train Search API

Is there any API provided by Indian Railways to search its train network, timetables, etc.? There are many sites out there which show timetables, etc. I searched Google but couldn't find any info on web services or APIs provided by Indian Railways. Is data scraping the only way? ...

Identifying only the links to press release pages

My task is to find the actual press release links on a given page. Take http://www.apple.com/pr/ for example. My tool has to find only the press release links from the above URL, excluding other advertisement links, tab links (or whatever) that are found on that site. The program below is developed and the result t...

Scrape and convert website into HTML?

I haven't done this in 3 or 4 years, but a client wants to downgrade their dynamic website into static HTML. Are there any free tools out there to crawl a domain and generate working HTML files to make this quick and painless? Edit: it is a ColdFusion website, if that matters. ...

lxml XPath processing of multi-line field

I'm doing some scraping of a page and I'm fine with getting most fields, but having some problems with the address. <address> 56 South Ave <br> Miami, FL 33131 <br> </address> address = myWebPage.xpath("//div[contains(@class,'rightcol')]//address") I can get the first line, 56 South Avenue, using the above code. But I can't...
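The multi-line address problem in the excerpt above comes down to collecting every text chunk inside `<address>`, not only the one before the first `<br>`. Below is a minimal sketch of that idea using Python's standard-library `html.parser` rather than lxml (a deliberate substitution so the example has no third-party dependency). The class name `AddressTextCollector`, the helper `address_lines`, and the sample markup are hypothetical, made up for illustration. In lxml itself, an XPath ending in `//address//text()` is the usual way to ask for all of these text nodes, though the asker's actual page may need a different expression.

```python
from html.parser import HTMLParser


class AddressTextCollector(HTMLParser):
    """Collects the non-empty text chunks found inside <address> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside an <address> element
        self.lines = []   # one entry per <br>-separated line of text

    def handle_starttag(self, tag, attrs):
        if tag == "address":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "address" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only text that sits inside <address> and is not pure whitespace.
        if self.depth and text:
            self.lines.append(text)


def address_lines(html):
    """Return the text lines inside every <address> element of `html`."""
    parser = AddressTextCollector()
    parser.feed(html)
    parser.close()
    return parser.lines


# Hypothetical sample modelled on the markup quoted in the question.
sample = (
    "<div class='rightcol'><address> 56 South Ave <br> "
    "Miami, FL 33131 <br> </address></div>"
)
print(address_lines(sample))
```

Each `<br>` splits the address into separate text chunks, so the helper returns one list entry per line; joining them with `", "` would give a single-line address.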

PHP Lyrics Plugin

Similar to these two threads: www.stackoverflow.com/questions/3458076/how-to-use-javas-built-in-javascript-engine-to-run-script-on-a-web-page www.stackoverflow.com/questions/3443769/how-do-i-get-this-page-programatically I am trying to get the lyrics via PHP. www.lyricsplugin.com/winamp03/plugin/?artist=Linkin%20Park&title=Numb So I ...

Desktop Application - Datagrid data parsing

I have a 3rd party desktop weather application. It has a datagrid with a few columns. I need to read all the non-zero entries of the 3rd column. I started exploring AutoHotkey, but hit roadblocks. Now I'm looking into Microsoft Spy++. It is displaying the control names, buttons, and text on the main control. But it is not displaying the contents o...