Hi,
I have been creating a web scraper for an internal application with PHP, but one of the pages has a JavaScript login. Is there any way of logging in automatically so I can scrape the data as usual?
(I am using curl to log in to the other two sites)
...
I'd like a way to download the content of every page in the history of a popular article on Wikipedia. In other words I want to get the full contents of every edit for a single article. How would I go about doing this?
Is there a simple way to do this using the Wikipedia API? I looked and didn't find anything that popped out as a si...
Requirements
Written in PHP
Control over the code (open source would be awesome, purchasing code is an option too)
Optional features
Listen to robots.txt
Automatic rate limiting
Scrape based on rules into a data object
Admin interface, or configurable back end, to set up new rules
Something like CSS selectors to pick our data in th...
I am looking for a Python library to scrape results from search engines (Google, Yahoo, Bing, etc.).
I only found one for Google: http://github.com/kevinw/xgoogle/tree/253db7ddc8603a9dcb038ae42684cf3499a22a4b
Does anyone know of one for multiple search engines?
...
I have used three languages for web scraping (Ruby, PHP and Python), and honestly none of them seems perfect for the task.
Ruby has excellent Mechanize and XML parsing libraries, but its spreadsheet support is very poor.
PHP has excellent spreadsheet and HTML parsing libraries, but it does not have an equivalent of WWW::Mechanize.
Python ...
I haven't done this in 3 or 4 years, but a client wants to downgrade their dynamic website into static HTML.
Are there any free tools out there to crawl a domain and generate working HTML files to make this quick and painless?
Edit: it is a ColdFusion website, if that matters.
...
I need to scrape French court cases for a project, but I can't figure out how to get Java to navigate the Court's search engine.
Here's the search page I need to manipulate. I want to start scraping the results page, but I can't get to that page from Java with just the URL. I need some way to have Java order the server to execute a se...
Hello all,
I wish to scrape the home page of one of the new Stack Exchange websites: http://webapps.stackexchange.com/ (just once, and only a few pages, nothing that should bother the servers). If I wanted it from Stack Overflow, I know there is a database dump, but for the new Stack Exchange sites, dumps don't exist yet.
Here is wha...
Hi
I have files like this to parse (from scraping) with Python:
some HTML and JS here...
SomeValue =
{
'calendar': [
{ 's0Date': new Date(2010, 9, 12),
'values': [
{ 's1Date': new Date(2010, 9, 17), 'price': 9900 },
{ 's1Date': new Date(2010, 9, 18), 'price': 9900 },
...
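One possible way to approach the excerpt above is a regular expression that pulls each 's1Date'/'price' pair out of the embedded JavaScript. This is only a minimal sketch: the SAMPLE text, the ENTRY pattern and the extract_prices helper are hypothetical names made up for illustration, and the pattern assumes every entry is written exactly like the two visible lines. The rest of the real file is not shown, so it may need adjusting.

```python
import re
from datetime import date

# Hypothetical sample that mirrors the visible part of the excerpt;
# the closing brackets are an assumption, since the original is cut off.
SAMPLE = """
SomeValue =
{
'calendar': [
{ 's0Date': new Date(2010, 9, 12),
'values': [
{ 's1Date': new Date(2010, 9, 17), 'price': 9900 },
{ 's1Date': new Date(2010, 9, 18), 'price': 9900 },
]}]}
"""

# Matches one "{ 's1Date': new Date(Y, M, D), 'price': N }" entry.
ENTRY = re.compile(
    r"'s1Date':\s*new Date\((\d+),\s*(\d+),\s*(\d+)\),\s*'price':\s*(\d+)"
)


def extract_prices(text):
    """Return (date, price) pairs for every s1Date entry found in text."""
    pairs = []
    for year, month, day, price in ENTRY.findall(text):
        # JavaScript Date months are zero-based, so 9 means October.
        pairs.append((date(int(year), int(month) + 1, int(day)), int(price)))
    return pairs


if __name__ == "__main__":
    for day, price in extract_prices(SAMPLE):
        print(day.isoformat(), price)
```

A regex keeps the sketch dependency-free, but it is brittle. If the real data nests more deeply or varies in layout, a proper JavaScript parser would likely be a safer choice.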
Hello,
I was wondering if anyone knows what's up with this HTML string:
<object height=\\\"38\" + \"5\\\" width=\\\"64\" + \"0\\\" classid=\\\"clsid:D27CDB6E-
AE6D-11cf-96B8-444553540000\\\" id=\\\"movie_player\\\" ><param name=\\\"movie\\\"
value=\\\"http:\\/\\/s.ytimg.com\\/yt\\/swf\\/watch_as3-vfl186120.swf\\\"><param
nam...
Hi,
I am scraping the wall posts from a Facebook page. Here is the URL:
http://www.facebook.com/GMHTheBook?v=wall&ref=ts#!/GMHTheBook?v=wall&ref=ts
I successfully scraped all the visible wall posts using cURL.
Problem:
At the end of the visible wall posts, there is an Older Posts link which shows more wall posts once you clic...
Hello,
What are the advantages and disadvantages of the following libraries?
PHP Simple HTML DOM Parser
QP
phpQuery
Of the above, I've used QP, which failed to parse invalid HTML, and Simple HTML DOM Parser, which does a good job but leaks memory because of its object model. You can keep that under control by calling $object->...
I want to scrape the top 10 result links from a Google results page for a keyword search.
I am using Web-Harvest. I am planning to scrape the href links and filter out the top 10 using some
attribute pattern. Is this the right way? It's not working at the moment. Is there any other simple way to do it? :(
...
I need to scrape a remote HTML page looking for images and links. I need to find an image that is "most likely" the product image on the page and links that are "near" that image. I currently do this with a JavaScript bookmarklet so that I am able to get the rendered x/y coordinates of images and links to help me determine if those are...
I am by no means a master of Ruby and am quite new to Scrubyt. I was just trying out some examples found on their wiki page. The example I was working on gets the search results returned by Google when you search for 'ruby', and I had the idea of grabbing the URL of each result so I could go ahead and fetch that page as well. The...
I'm trying to parse a page with links to articles whose important content looks like this:
<div class="article">
<h1 style="float: none;"><a href="performing-arts">Performing Arts</a></h1>
<a href="/performing-arts/EIF-theatre-review-Sin-Sangre.6517348.jp">
<span class="mth3">
<span id="wctlMiniTemplate1_ctl00_ctl00_ctl01_...
I'm using Perl.
I have a tag, for example: "XYZ_PKM_HTML"
I would like to be able to provide a base URL, for example: www.example.com
and then get the HTML page (not necessarily the main page, that's easy) where this tag appears.
Is it possible? Any ideas? (Or ready-made modules? I looked on CPAN and there was some interesting stuff, bu...
I had a nice and hacky Perl script to automatically scrape and download sales report files from iTunes Connect. As of today, Apple overhauled the sales report site. It looks a lot nicer, but it uses a lot of JavaScript and simple scraping isn't going to work any more.
So, does anybody know of a way to scrape this new site effectively?...
Hi All,
I have seen a number of posts here that describe how to parse HTML tables using the XML package. That said, I have got my code to work except that my first data row gets read in as my column names.
My code is taken from the answer at this link
How can I get around this?
Many thanks,
Brock
...
I have a project I'm working on, and I'd like to add a really small list of nearby places using Facebook Places in an iframe from touch.facebook.com. I can easily just use touch.facebook.com/#/places_friends.php, but that loads the headers and the other navigation bars (messages, events, etc.), and I just want t...