I'm looking for a library that has functionality similar to Perl's WWW::Mechanize, but for PHP. Basically, it should allow me to submit HTTP GET and POST requests with a simple syntax, and then parse the resulting page and return in a simple format all forms and their fields, along with all links on the page.
I know about CURL, but it's...
What is the best method to scrape a dynamic website where most of the content is generated by what appear to be Ajax requests? I have previous experience with a Mechanize, BeautifulSoup, and Python combo, but I am up for something new.
--Edit--
For more detail: I'm trying to scrape the CNN primary database. There is a wealth of infor...
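Often the content that appears via Ajax comes from a JSON or XML endpoint that can be fetched directly, skipping the rendered page entirely: watch the browser's network traffic, note the URL the XHR calls hit, and request that. A minimal sketch of the idea in Python; the endpoint URL and field names below are hypothetical, not the real CNN ones.

import json
import urllib.request

# Hypothetical endpoint spotted in the browser's network traffic.
ENDPOINT = "http://example.com/primaries/results.json?state=IA"

with urllib.request.urlopen(ENDPOINT) as resp:
    data = json.load(resp)

# Assumed JSON shape, purely for illustration.
for row in data.get("results", []):
    print(row.get("candidate"), row.get("votes"))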
I am writing a scraper that downloads all the image files from an HTML page and saves them to a specific folder. All the images are part of the HTML page.
...
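A minimal sketch of one way to do this in Python with the standard library; the page URL and output folder are placeholders.

import os
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImgCollector(HTMLParser):
    # Collects the src attribute of every <img> tag on the page.
    def __init__(self):
        super().__init__()
        self.srcs = []
    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)

page_url = "http://example.com/gallery.html"   # placeholder
out_dir = "images"                              # placeholder
os.makedirs(out_dir, exist_ok=True)

html = urllib.request.urlopen(page_url).read().decode("utf-8", "replace")
parser = ImgCollector()
parser.feed(html)

for src in parser.srcs:
    img_url = urljoin(page_url, src)            # resolve relative paths
    name = os.path.basename(img_url.split("?")[0]) or "image"
    with open(os.path.join(out_dir, name), "wb") as f:
        f.write(urllib.request.urlopen(img_url).read())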
The Problem
I use a tool at work that lets me do queries and get back HTML tables of info. I do not have any kind of back-end access to it.
A lot of this info would be much more useful if I could put it into a spreadsheet for sorting, averaging, etc. How can I screen-scrape this data to a CSV file?
My First Idea
Since I know jQuery, ...
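If the jQuery route turns out to be awkward for actually writing a file, the same job is a few lines of Python: save the result page, parse its table cells, and write them out with the csv module. A sketch, with placeholder file names; it only handles a single, non-nested table.

import csv
from html.parser import HTMLParser

class TableToRows(HTMLParser):
    # Accumulates <td>/<th> text into rows, one list per <tr>.
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell = [], [], None
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.cell = ""
    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            self.row.append(" ".join(self.cell.split()))
            self.cell = None
        elif tag == "tr" and self.row:
            self.rows.append(self.row)
    def handle_data(self, data):
        if self.cell is not None:
            self.cell += data

parser = TableToRows()
parser.feed(open("query_result.html", encoding="utf-8").read())  # placeholder file

with open("query_result.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(parser.rows)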
How do you screen-scrape Ajax pages?
...
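One approach is to drive a real browser so the JavaScript has already run before you read the DOM. A sketch assuming Selenium WebDriver and Firefox are installed; the URL is a placeholder.

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com/ajax-heavy-page")   # placeholder URL
driver.implicitly_wait(10)                         # give the XHR calls time to finish

html = driver.page_source                          # the rendered DOM, not the raw source
print(len(html))
driver.quit()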
Is there a PHP class/library that would allow me to query an XHTML document with CSS selectors? I need to scrape some pages for data that is very easily accessible if I could somehow use CSS selectors (jQuery has spoiled me!). Any ideas?
...
Is there a unified way to do this? Browsers usually don't respond as expected to user32's GetWindowText and SendMessage, which you can use to scrape the text out of most win32 applications.
I'd like to get the equivalent of "View Source" on the open web page.
Currently, I'm using the API for screen readers to scrape from IE, but tha...
I am building a web app where I need to get all the images and any Flash videos that are embedded (e.g. YouTube) on a given URL. I'm using Python.
I've googled, but have not found any good information about this (probably because I don't know what this is called when searching for it). Does anyone have any experience with this and know how it ...
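This usually comes down to walking the markup for <img> tags plus the <object>/<embed> markup that Flash players (including YouTube's embed code) use. A minimal sketch in Python with a placeholder URL:

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class MediaCollector(HTMLParser):
    # Gathers image sources and likely Flash movie URLs.
    def __init__(self):
        super().__init__()
        self.images, self.flash = [], []
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "img" and a.get("src"):
            self.images.append(a["src"])
        elif tag == "embed" and a.get("src"):
            self.flash.append(a["src"])
        elif tag == "object" and a.get("data"):
            self.flash.append(a["data"])
        elif tag == "param" and a.get("name", "").lower() == "movie" and a.get("value"):
            self.flash.append(a["value"])

url = "http://example.com/page-with-media.html"    # placeholder
html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
c = MediaCollector()
c.feed(html)
print([urljoin(url, s) for s in c.images])
print([urljoin(url, s) for s in c.flash])           # YouTube embeds show up here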
I am working on an algorithm that will try to pick out, given an HTML file, what it thinks is the parent element that most likely contains the majority of the page's content text.
For example, it would pick the div "content" in the following HTML:
<html>
<body>
<div id="header">This is the header we don't care about</div>
...
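One simple heuristic: descend from the root into whichever child holds the majority of the page's text, and stop when no single child dominates. A sketch of that idea, assuming well-formed XHTML (real-world pages usually need a forgiving HTML parser first):

import xml.etree.ElementTree as ET

def text_len(el):
    # total length of all text inside this element and its descendants
    return len("".join(el.itertext()).strip())

def main_content(root, threshold=0.5):
    # Return the deepest element still holding at least `threshold` of the text.
    total = text_len(root) or 1
    best = root
    changed = True
    while changed:
        changed = False
        for child in list(best):
            if text_len(child) / total >= threshold:
                best = child          # descend into the dominant child
                changed = True
                break
    return best

html = """<html><body>
<div id="header">This is the header we don't care about</div>
<div id="content">Lots and lots of article text lives here, much more
than anywhere else on the page, so this div should win.</div>
</body></html>"""
el = main_content(ET.fromstring(html))
print(el.tag, el.attrib.get("id"))        # -> div content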
I'm writing a script that pulls XML data from wowarmory.com, using PHP 5 and cURL:
$url = "http://www.wowarmory.com";
$userAgent = 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$url);
$str ...
I am working on an app for doing screen scraping of small portions of external web pages (not an entire page, just a small subset of it).
So I have the code working perfectly for scraping the HTML, but my problem is that I want to scrape not just the raw HTML, but also the CSS styles used to format the section of the page I am extractin...
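Computing the styles that actually apply to an element is the hard part (the browser resolves the cascade; raw HTML does not), but a common first step is to pull down the page's inline <style> blocks and linked stylesheets alongside the HTML. A sketch in Python, with a placeholder URL:

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class CssCollector(HTMLParser):
    # Records inline <style> text and <link rel="stylesheet"> hrefs.
    def __init__(self):
        super().__init__()
        self.in_style = False
        self.inline_css, self.sheet_urls = [], []
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "style":
            self.in_style = True
        elif tag == "link" and a.get("rel", "").lower() == "stylesheet" and a.get("href"):
            self.sheet_urls.append(a["href"])
    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False
    def handle_data(self, data):
        if self.in_style:
            self.inline_css.append(data)

url = "http://example.com/some-page.html"          # placeholder
html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
c = CssCollector()
c.feed(html)
stylesheets = [urllib.request.urlopen(urljoin(url, h)).read() for h in c.sheet_urls]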
I am following a VB tutorial to do some HTML manipulation using LINQ.
It has the following construct:
Imports <xmlns="http://www.w3.org/1999/xhtml">
How do I do the same in C#?
There appears to be something called an XmlNamespaceManager that may hold the solution, but I am too foolish to understand how to work it, and I am not sur...
I want to create a REST API for TV listings in my country. While online aggregations of TV listings do exist, they're too tied to the presentation to be of any use to software developers.
In order to get hold of this information I'm thinking of going to each source and scraping the relevant information. While I've obtained similar i...
I essentially want to spider my local site and create a list of all the titles and URLs as in:
http://localhost/mySite/Default.aspx My Home Page
http://localhost/mySite/Preferences.aspx My Preferences
http://localhost/mySite/Messages.aspx Messages
I'm running Windows. I'm open to anything that works--a C# console app, Powe...
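Since anything that works is acceptable, here is a minimal Python sketch of the crawl; the start URL is a placeholder and only links under the same root are followed.

import re
import urllib.request
from urllib.parse import urljoin, urldefrag

start = "http://localhost/mySite/Default.aspx"      # placeholder
root = "http://localhost/mySite/"
seen, queue = set(), [start]

while queue:
    url = queue.pop()
    if url in seen:
        continue
    seen.add(url)
    try:
        html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    except Exception:
        continue
    m = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    print(url, (m.group(1).strip() if m else ""), sep="\t")
    for href in re.findall(r'href\s*=\s*["\']([^"\']+)', html, re.I):
        link = urldefrag(urljoin(url, href))[0]     # resolve and drop #fragments
        if link.startswith(root) and link not in seen:
            queue.append(link)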
I'm writing an application that will need to open up browser windows (probably can stick to IE) to websites that use Forms Authentication. The trick is that they need to be authenticated already, in order to save time due to the sheer number of sites we need to get into. (Eventually I'll be screen scraping them and processing the data....
What are the best algorithms for recognizing structured data on an HTML page?
For example, Google will recognize a home or company address in an email and offer a map to that address.
...
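The usual baseline, before anything statistical, is plain pattern matching over the page text. A toy sketch that flags US-style phone numbers and street addresses; the regexes are deliberately simplified and miss plenty of real formats.

import re

PHONE = re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}")
ADDRESS = re.compile(
    r"\d{1,5}\s+\w+(\s\w+)*\s(Street|St\.?|Avenue|Ave\.?|Road|Rd\.?|Blvd\.?)",
    re.IGNORECASE,
)

text = "Visit us at 123 Main Street, Springfield, or call (650) 253-0000."
print(PHONE.findall(text))                           # phone numbers
print([m.group(0) for m in ADDRESS.finditer(text)])  # street addresses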
Are all these types of sites just illegally scraping Google or another search engine?
As far as I can tell there is no 'legal' way to get this data for a commercial site. The Yahoo! API ( http://developer.yahoo.com/search/siteexplorer/V1/inlinkData.html ) is only for noncommercial use, Yahoo! Boss does not allow automated queries, etc.
An...
How would you solve this problem?
You're scraping the HTML of blogs. Some of the HTML of a blog is blog posts, some of it is formatting, sidebars, etc. You want to be able to tell which text in the HTML belongs to which post (i.e. which permalink), if any.
I know what you're thinking: You could just look at the RSS and ignore the HTML altogether...
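Even then the feed is useful as a key: take each item's title and link from the RSS, locate the title in the page, and treat the surrounding block as that post. A rough Python sketch with placeholder URLs; real markup needs more care than a crude tag strip.

import re
import urllib.request
import xml.etree.ElementTree as ET

feed_url = "http://example.com/blog/rss.xml"         # placeholder
feed = ET.fromstring(urllib.request.urlopen(feed_url).read())
items = [(i.findtext("title"), i.findtext("link")) for i in feed.iter("item")]

page = urllib.request.urlopen("http://example.com/blog/").read().decode("utf-8", "replace")
plain = re.sub(r"<[^>]+>", " ", page)                 # crude tag strip

for title, link in items:
    pos = plain.find(title) if title else -1
    # if the title appears in the page text, nearby text belongs to that post
    print(link, "found at offset", pos)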
I need to scrape data from a Windows application to run a query in another program. Does anyone know of a good starting point for me to do this in .NET?
...
Situation:
Site with content protected by username/password (not all accounts are controlled, since they can be trial/test users)
a normal search engine can't get at it because of the username/password restrictions
a malicious user can still log in and pass the session cookie to a "wget -r" or something else.
The question would be what is the best solu...
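One mitigation worth sketching regardless of the exact question: bind the session to more than the cookie, e.g. a fingerprint of the client's IP and User-Agent captured at login, and re-check it on every request so a copied cookie fails from wget. A framework-agnostic sketch in Python; a real app would use its own session backend instead of a dict.

import hashlib

sessions = {}   # session_id -> fingerprint recorded at login

def fingerprint(ip, user_agent):
    return hashlib.sha1(f"{ip}|{user_agent}".encode()).hexdigest()

def login(session_id, ip, user_agent):
    sessions[session_id] = fingerprint(ip, user_agent)

def is_valid(session_id, ip, user_agent):
    return sessions.get(session_id) == fingerprint(ip, user_agent)

login("abc123", "203.0.113.7", "Mozilla/5.0 ...")
print(is_valid("abc123", "203.0.113.7", "Mozilla/5.0 ..."))   # True: same client
print(is_valid("abc123", "198.51.100.9", "Wget/1.11"))        # False: copied cookie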