screen-scraping

Is there a PHP equivalent of Perl's WWW::Mechanize?

I'm looking for a library that has functionality similar to Perl's WWW::Mechanize, but for PHP. Basically, it should allow me to submit HTTP GET and POST requests with a simple syntax, and then parse the resulting page and return in a simple format all forms and their fields, along with all links on the page. I know about CURL, but it's...

Scrape a dynamic website

What is the best method to scrape a dynamic website where most of the content is generated by what appear to be Ajax requests? I have previous experience with a Mechanize, BeautifulSoup, and Python combo, but I am up for something new. --Edit-- For more detail: I'm trying to scrape the CNN primary database. There is a wealth of infor...

Download image file from the HTML page source using python?

I am writing a scraper that downloads all the image files from an HTML page and saves them to a specific folder. All the images are part of the HTML page. ...
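One possible starting point for the question above, as a minimal sketch using only the Python standard library: collect the `src` of every `<img>` tag and resolve it against the page URL. The names `ImageCollector` and `image_urls` are made up for this illustration, and the sketch assumes the images are referenced by plain `<img src>` attributes rather than CSS or scripts.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Gather absolute URLs for every <img src=...> seen while parsing."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL of the page, used to resolve relative paths
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.urls.append(urljoin(self.base_url, src))


def image_urls(html, base_url):
    """Return the absolute image URLs found in an HTML string, in page order."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    return collector.urls
```

Each returned URL could then be saved to a folder with something like `urllib.request.urlretrieve(url, path)`; choosing file names and handling failed downloads is left out here.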

How can I scrape an HTML table to CSV?

The Problem: I use a tool at work that lets me do queries and get back HTML tables of info. I do not have any kind of back-end access to it. A lot of this info would be much more useful if I could put it into a spreadsheet for sorting, averaging, etc. How can I screen-scrape this data to a CSV file? My First Idea: Since I know jQuery, ...
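As a rough illustration of what the question is asking for (not the asker's jQuery idea, which is cut off above), here is a minimal Python sketch that turns the rows of an HTML table into CSV text using only the standard library. `TableParser` and `table_to_csv` are hypothetical names, and the sketch assumes simple, non-nested tables with no `rowspan` or `colspan`.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


def table_to_csv(html):
    """Return the table rows found in an HTML string as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

The `csv` module handles quoting, so cells containing commas or quotes still load correctly in a spreadsheet.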

How do you screen scrape ajax pages?

How do you screen scrape ajax pages? ...

PHP CSS Selector Library?

Is there a PHP class/library that would allow me to query an XHTML document with CSS selectors? I need to scrape some pages for data that is very easily accessible if I could somehow use CSS selectors (jQuery has spoiled me!). Any ideas? ...

Unified way to scrape HTML from any type of browser process?

Is there a unified way to do this? Browsers usually don't respond as expected to user32's GetWindowText and SendMessage, which you can use to scrape the text out of most win32 applications. I'd like to get the equivalent of "View Source" on the open web page. Currently, I'm using the API for screen readers to scrape from IE, but tha...

How to scan a webpage and get images and youtube embeds?

I am building a web app where I need to get all the images and any Flash videos that are embedded (e.g. YouTube) on a given URL. I'm using Python. I've googled, but have not found any good information about this (probably because I don't know what this is called to search for). Does anyone have any experience with this and know how it ...

Scraping largest block of text from HTML document

I am working on an algorithm that will try to pick out, given an HTML file, what it thinks is the parent element that most likely contains the majority of the page's content text. For example, it would pick the div "content" in the following HTML: <html> <body> <div id="header">This is the header we don't care about</div> ...

Can't access website via cURL from localhost, but can from hosted server.

I'm writing a script that pulls XML data from wowarmory.com, using PHP 5 and cURL: $url = "http://www.wowarmory.com"; $userAgent = 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12'; $ch = curl_init(); curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); curl_setopt($ch, CURLOPT_URL,$url); $str ...

Screen scraping pages that use CSS for layout and formatting...how to scrape the CSS applicable to the html?

I am working on an app for doing screen scraping of small portions of external web pages (not an entire page, just a small subset of it). So I have the code working perfectly for scraping the html, but my problem is that I want to scrape not just the raw html, but also the CSS styles used to format the section of the page I am extractin...

c# XML manipulation VB code conversion query... and more!

I am following a VB tutorial to do some HTML manipulation using LINQ. It has the following construct: Imports <xmlns="http://www.w3.org/1999/xhtml">. How do I do the same in C#? There appears to be something called an XMLNamespaceManager that may hold the solution, but I am too foolish to understand how to work it, and I am not sur...

Python Screenscraping

I want to create a REST API for TV listings in my country. While online aggregations of TV listings do exist, they're too tied to the presentation to be of any use to software developers. In order to get hold of this information I'm thinking of going to each source and scraping the relevant information. While I've obtained similar i...

Quickest way to get list of <title> values from all pages on localhost website

I essentially want to spider my local site and create a list of all the titles and URLs, as in:
http://localhost/mySite/Default.aspx My Home Page
http://localhost/mySite/Preferences.aspx My Preferences
http://localhost/mySite/Messages.aspx Messages
I'm running Windows. I'm open to anything that works--a C# console app, Powe...

How can I pass cookies into an external web browser?

I'm writing an application that will need to open up browser windows (probably can stick to IE) to websites that use Forms Authentication. The trick is that they need to be authenticated already, in order to save time due to the sheer number of sites we need to get into. (Eventually I'll be screen scraping them and processing the data....

Algorithms recognizing physical address on a webpage

What are the best algorithms for recognizing structured data on an HTML page? For example Google will recognize the address of home/company in an email, and offers a map to this address. ...

How do sites like Hubspot track inbound links?

Are all these types of sites just illegally scraping Google or another search engine? As far as I can tell there is no 'legal' way to get this data for a commercial site. The Yahoo! API ( http://developer.yahoo.com/search/siteexplorer/V1/inlinkData.html ) is only for noncommercial use, Yahoo! BOSS does not allow automated queries etc. An...

Map RSS entries to HTML body w. non-exact search

How would you solve this problem? You're scraping HTML of blogs. Some of the HTML of a blog is blog posts, some of it is formatting, sidebars, etc. You want to be able to tell what text in the HTML belongs to which post (i.e. a permalink) if any. I know what you're thinking: You could just look at the RSS and ignore the HTML altogether...

Screen-scraping a windows application in c#

I need to scrape data from a windows application to run a query in another program. Does anyone know of a good starting point for me to do this in .NET? ...

How to protect/monitor your site from crawling by malicious user

Situation: a site with content protected by username/password (not all controlled, since they can be trial/test users); a normal search engine can't get at it because of the username/password restrictions; a malicious user can still log in and pass the session cookie to a "wget -r" or something else. The question would be what is the best solu...