webscraping

Login Javascript within PHP

Hi, I have been creating a web scraper for an internal application with PHP, but one of the pages has a JavaScript login. Is there any way of logging in automatically so I can scrape the data as usual? (I am using curl to log in to the other two sites) ...

How can I get the full change history for an article on Wikipedia?

I'd like a way to download the content of every page in the history of a popular article on Wikipedia. In other words, I want to get the full contents of every edit for a single article. How would I go about doing this? Is there a simple way to do this using the Wikipedia API? I looked and didn't find anything that popped out as a si...
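
One possible approach, sketched in Python: the MediaWiki API's `prop=revisions` query returns revisions in batches, and each response carries a `continue` object whose values are merged into the next request. The endpoint (English Wikipedia), the User-Agent string, and the helper names `build_params`, `http_get_json` and `iter_revisions` are assumptions made for this illustration, not something taken from the question.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: English Wikipedia; other wikis expose the same API path.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_params(title, cont=None):
    """Build the query parameters for one batch of revisions of `title`."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment|content",
        "rvslots": "main",
        "rvlimit": "50",    # batch size; lower it if the server complains
        "rvdir": "newer",   # oldest revision first
    }
    if cont:
        # Continuation values (e.g. rvcontinue) from the previous response.
        params.update(cont)
    return params


def http_get_json(params):
    """Perform the GET request and decode the JSON body."""
    req = Request(
        API_URL + "?" + urlencode(params),
        headers={"User-Agent": "history-sketch/0.1 (hypothetical example)"},
    )
    with urlopen(req) as resp:
        return json.load(resp)


def iter_revisions(title, fetch=http_get_json):
    """Yield every revision dict for `title`, following continuation."""
    cont = None
    while True:
        data = fetch(build_params(title, cont))
        for page in data.get("query", {}).get("pages", {}).values():
            yield from page.get("revisions", [])
        cont = data.get("continue")
        if not cont:
            break
```

Usage would be something like `for rev in iter_revisions("Some article"): ...`. The `fetch` parameter exists only so the paging logic can be tried without touching the network; for a very long history it is probably kinder to the servers to sleep between batches.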

What's a good & complete PHP/MySQL Screen Scraper project?

Requirements: written in PHP; control over the code (open source would be awesome, purchasing code is an option too). Optional features: listen to robots.txt; automatic rate limiting; scrape based on rules into a data object; admin interface, or configurable back end, to set up new rules; something like CSS selectors to pick our data in th...

Is there any Python lib to scrape search engine(s) results?

I am looking for a Python library to scrape results from search engines (Google, Yahoo, Bing, etc.). I only found one for Google -> http://github.com/kevinw/xgoogle/tree/253db7ddc8603a9dcb038ae42684cf3499a22a4b Does anyone know of one for multiple search engines? ...

Is there any language which is just "perfect" for web scraping?

I have used 3 languages for web scraping (Ruby, PHP and Python) and honestly none of them seems to be perfect for the task. Ruby has excellent Mechanize and XML parsing libraries, but the spreadsheet support is very poor. PHP has excellent spreadsheet and HTML parsing libraries, but it does not have an equivalent of WWW::Mechanize. Python ...

Scrape and convert website into HTML?

I haven't done this in 3 or 4 years, but a client wants to downgrade their dynamic website into static HTML. Are there any free tools out there to crawl a domain and generate working HTML files to make this quick and painless? Edit: it is a ColdFusion website, if that matters. ...

How to use Java to navigate a Web Search

I need to scrape French court cases for a project, but I can't figure out how to get Java to navigate the Court's search engine. Here's the search page I need to manipulate. I want to start scraping the results page, but I can't get to that page from Java with just the URL. I need some way to have Java order the server to execute a se...

How to scrape "table like" data from stackexchange homepage? (in R)

Hello all, I wish to scrape the home page of one of the new stackexchange websites: http://webapps.stackexchange.com/ (just once, and for only several pages, nothing that should bother the servers). If I had wanted it from stackoverflow, I know there is a database dump, but for the new stackexchange, they don't exist yet. Here is wha...

Parse html with ajax json inside

Hi, I have files like this to parse (from scraping) with Python: some HTML and JS here... SomeValue = { 'calendar': [ { 's0Date': new Date(2010, 9, 12), 'values': [ { 's1Date': new Date(2010, 9, 17), 'price': 9900 }, { 's1Date': new Date(2010, 9, 18), 'price': 9900 }, ...
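
A minimal sketch of one way to handle a snippet like this in Python, assuming the embedded object uses only single-quoted strings, numbers, and `new Date(y, m, d)` calls as in the visible part of the excerpt. The `SAMPLE` text is a hypothetical completion of the truncated data, and `extract_assignment` and `js_literal_to_python` are illustrative names. JavaScript months are zero-based, so `new Date(2010, 9, 12)` is 12 October 2010.

```python
import json
import re

# Matches new Date(year, month, day) with a zero-based month.
DATE_RE = re.compile(r"new Date\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)")


def extract_assignment(html, name):
    """Return the {...} literal assigned to `name`, found by brace matching.

    Limitation: braces inside string values would confuse the counter.
    """
    start = html.index("{", html.index(name + " ="))
    depth = 0
    for i in range(start, len(html)):
        if html[i] == "{":
            depth += 1
        elif html[i] == "}":
            depth -= 1
            if depth == 0:
                return html[start:i + 1]
    raise ValueError("unbalanced braces")


def js_literal_to_python(js_text):
    """Turn the JS object literal into JSON and parse it."""
    def date_to_iso(match):
        year, month, day = (int(g) for g in match.groups())
        # +1 converts the zero-based JS month to a calendar month.
        return '"%04d-%02d-%02d"' % (year, month + 1, day)

    text = DATE_RE.sub(date_to_iso, js_text)
    # Assumes no apostrophes or double quotes inside the string values.
    text = text.replace("'", '"')
    return json.loads(text)


# Hypothetical sample modelled on the excerpt; the real file is longer.
SAMPLE = """<script>
SomeValue = { 'calendar': [ { 's0Date': new Date(2010, 9, 12), 'values': [
  { 's1Date': new Date(2010, 9, 17), 'price': 9900 },
  { 's1Date': new Date(2010, 9, 18), 'price': 9900 } ] } ] };
</script>"""

data = js_literal_to_python(extract_assignment(SAMPLE, "SomeValue"))
print(data["calendar"][0]["s0Date"])  # 2010-10-12
```

If the real data contains unquoted keys, trailing commas, or quotes inside values, this text-rewriting approach would likely break, and a proper JavaScript parser or interpreter would be the safer route.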

What's wrong with this string formatting?

Hello, I was wondering if anyone knows what's up with this html string code: <object height=\\\"38\" + \"5\\\" width=\\\"64\" + \"0\\\" classid=\\\"clsid:D27CDB6E- AE6D-11cf-96B8-444553540000\\\" id=\\\"movie_player\\\" ><param name=\\\"movie\\\" value=\\\"http:\\/\\/s.ytimg.com\\/yt\\/swf\\/watch_as3-vfl186120.swf\\\"><param nam...

Data Scraping Problem

Hi, I am scraping data from a Facebook page for the wall posts. Here is the URL: http://www.facebook.com/GMHTheBook?v=wall&ref=ts#!/GMHTheBook?v=wall&ref=ts I successfully scraped all the visible wall posts using cURL. Problem: at the end of the visible wall posts, there is an "Older Posts" link which shows more wall posts once you clic...

html scraping and css queries

Hello, what are the advantages and disadvantages of the following libraries: PHP Simple HTML DOM Parser, QP, phpQuery? Of these I've used QP, which failed to parse invalid HTML, and simpleDomParser, which does a good job but kinda leaks memory because of the object model. But you may keep that under control by calling $object->...

Scraping a Google search page for the top 10 search links for a keyword

I want to scrape the top 10 search links from a Google results page for a keyword. I am using Web-Harvest and planning to scrape the href links and filter out the top 10 using some attribute pattern. Is this the right way? It's not working at the moment. Is there any other simple way to do it? :( ...

Scraping content from webpage

I need to scrape a remote HTML page looking for images and links. I need to find an image that is "most likely" the product image on the page and links that are "near" that image. I currently do this with a JavaScript bookmarklet so that I am able to get the rendered x/y coordinates of images and links to help me determine if those are...

How to use Scrubyt properly to grab URLs from the XML output

I am by no means a master with Ruby and am quite new to Scrubyt. I was just trying out some examples found on their wiki page. The example I was working on was getting the search results returned by Google when you search for 'ruby', and I had the idea of grabbing the URL of each result so I could go ahead and fetch that page as well. The...

HtmlUnit and XPath: DOMNode.getByXPath only works on HtmlPage?

I'm trying to parse a page with links to articles whose important content looks like this: <div class="article"> <h1 style="float: none;"><a href="performing-arts">Performing Arts</a></h1> <a href="/performing-arts/EIF-theatre-review-Sin-Sangre.6517348.jp"> <span class="mth3"> <span id="wctlMiniTemplate1_ctl00_ctl00_ctl01_...

Find Tags in website HTML's

I'm using Perl. I have a tag, for example "XYZ_PKM_HTML". I would like to be able to provide a base URL, for example www.example.com, and then get the HTML page (not necessarily the main page, that's easy) where this tag appears. Is it possible? Any ideas? (Or ready-made modules? I looked on CPAN and there was some interesting stuff, bu...

Automatically download sales reports from iTunes Connect

I had a nice and hacky Perl script to automatically scrape and download sales report files from iTunes Connect. As of today, Apple overhauled the sales report site. It looks a lot nicer, but it uses a lot of JavaScript and simple scraping isn't going to work any more. So, does anybody know of a way to scrape this new site effectively?...

Read HTML Table in R - Troubleshooting

Hi All, I have seen a number of posts here that describe how to parse HTML tables using the XML package. That said, I have got my code to work, except that my first data row gets read in as my column names. My code is taken from the answer at this link. How can I get around this? Many thanks, Brock ...

Trying to scrape the entire content of a div.

I have this project I'm working on and I'd like to add a really small list of nearby places using Facebook Places in an iframe featured from touch.facebook.com. I can easily just use touch.facebook.com/#/places_friends.php but then that loads the headers and the other navigation bars for messages, events, etc., and I just want t...