screen-scraping

Options for HTML scraping?

I'm thinking of trying Beautiful Soup, a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement; I'm interested in hearing about other languages as well. The story so far: Python: Beautiful Soup, lxml; Ruby: Hpricot, scrAPI, scRUBYt!; .NET: Html Agility ...

How to implement a web scraper in PHP?

What built-in PHP functions are useful for web scraping? What are some good resources (web or print) for getting up to speed on web scraping with PHP? ...

How to fetch HTML in Java

Without the use of any external library, what is the simplest way to fetch a website's HTML content into a String? ...

HTML Scraping in PHP

I've been doing some HTML scraping in PHP using regular expressions. This works, but the result is finicky and fragile. Has anyone used any packages that provide a more robust solution? A config-driven solution would be ideal, but I'm not picky. ...

Extract Address Information from a Web Page

I need to take a web page and extract the address information from the page. Some are easier than others. I'm looking for a Firefox plugin, Windows app, or VB.NET code that will help me get this done. Ideally I would like to have a web page on our admin (ASP.NET/VB.NET) where you enter a URL and it scrapes the page and returns a Dataset ...

How to use WebClient on a secure site?

I need to automate a process involving a website that is using a login form. I need to capture some data in the pages following the login page. I know how to screen-scrape normal pages, but not those behind a secure site. Can this be done with the .NET WebClient class? How would I log in automatically? How would I stay logged in for t...

Python regular expression for HTML parsing (BeautifulSoup)

I want to grab the value of a hidden input field in HTML. <input type="hidden" name="fooId" value="12-3456789-1111111111" /> I want to write a regular expression in Python that will return the value of fooId, given that I know the line in the HTML follows the format <input type="hidden" name="fooId" value="**[id is here]**" /> Can ...
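
For a line in exactly the format quoted above, a regular expression with one capture group is enough, while a parser-based lookup is more tolerant of attribute order and spacing. The following is a minimal sketch using only Python's standard library, not a definitive answer; the helper names `value_by_regex` and `value_by_parser` are made up for illustration.

```python
import re
from html.parser import HTMLParser

# The example line from the question.
SAMPLE = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'


def value_by_regex(html, name="fooId"):
    """Return the value, assuming the exact attribute order and spacing shown above."""
    pattern = r'<input type="hidden" name="%s" value="([^"]*)" />' % re.escape(name)
    match = re.search(pattern, html)
    return match.group(1) if match else None


class HiddenInputParser(HTMLParser):
    """Collect the value of every <input type="hidden">, keyed by its name."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.values[attr["name"]] = attr.get("value")


def value_by_parser(html, name="fooId"):
    """Return the value regardless of attribute order; None if the field is absent."""
    parser = HiddenInputParser()
    parser.feed(html)
    parser.close()
    return parser.values.get(name)


print(value_by_regex(SAMPLE))
print(value_by_parser(SAMPLE))
```

The regular expression breaks as soon as the page reorders the attributes or drops the trailing slash; the parser version keeps working in those cases.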

Export ASPX to HTML

We're building a CMS. The site will be built and managed by the users in ASPX pages, but we would like to create a static site of HTML files. The way we're doing it now is with code I found here that overrides the Render method in the ASPX Page and writes the HTML string to a file. This works fine for a single page, but the thing with our C...

Getting HTML from a page behind a login

This question is a follow-up to my previous question about getting the HTML from an ASPX page. I decided to try using the WebClient object, but the problem is that I get the login page's HTML because login is required. I tried "logging in" using the WebClient object: WebClient ww = new WebClient(); ww.DownloadString("Login.aspx?UserNa...

Saving HTML tables to a Database

I am trying to scrape an HTML table and save its data in a database. What strategies/solutions have you found to be helpful in approaching this problem? I'm most comfortable with Java and PHP, but really a solution in any language would be helpful. EDIT: For more detail, the UTA (Salt Lake's bus system) provides bus schedules on its web...
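
One common approach is to parse the table into rows of cell text first and only then write those rows to the database. The sketch below uses Python's standard library only; the sample markup, the `schedule` table, and its `stop` and `departure` columns are invented for illustration and do not come from the UTA site.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical markup standing in for a real schedule page.
SAMPLE_HTML = """
<table>
  <tr><th>Stop</th><th>Departure</th></tr>
  <tr><td>Main St</td><td>08:15</td></tr>
  <tr><td>State St</td><td>08:22</td></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table(html):
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def save_rows(conn, rows):
    """Insert two-column rows into the hypothetical schedule table."""
    conn.execute("CREATE TABLE IF NOT EXISTS schedule (stop TEXT, departure TEXT)")
    conn.executemany("INSERT INTO schedule (stop, departure) VALUES (?, ?)", rows)
    conn.commit()


rows = parse_table(SAMPLE_HTML)
conn = sqlite3.connect(":memory:")
save_rows(conn, rows[1:])  # skip the header row
stored = conn.execute("SELECT stop, departure FROM schedule ORDER BY departure").fetchall()
print(stored)
```

Keeping parsing and storage in separate functions means the database code does not change if the page layout does.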

How to save a public HTML page with all media and preserve structure

Looking for a Linux application (or Firefox extension) that will allow me to scrape an HTML mockup and keep the page's integrity. Firefox does an almost perfect job but doesn't grab images referenced in the CSS. The ScrapBook extension for Firefox gets everything, but flattens the directory structure. I wouldn't terribly mind if all f...

Screen scraping a command window using .NET managed code

I am writing a program in .NET that will execute scripts and command-line programs using the Framework 2.0's Process object. I want to be able to access the screen buffers of the process in my program. I've investigated this and it appears that I need to access console stdout and stderr buffers. Anyone know how this is accomplished us...

Is there another way to do screen scraping apart from regular expressions?

I'm doing a personal, just-for-fun project that uses screen scraping to give me a System Tray notification whenever a row in an HTML table is added, modified or deleted. Having done this before, I thought: well, let's go with the regular expression thing and that's it. But, being a curious person, I wondered whether there could ...
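
However the rows are extracted, the notification step comes down to comparing two snapshots of the table. A minimal sketch, assuming each row has already been parsed into a list of cell strings and that the first cell uniquely identifies the row (both are assumptions, as is the function name `diff_rows`):

```python
def diff_rows(old_rows, new_rows):
    """Compare two table snapshots and return (added, deleted, modified) rows.

    Rows are matched on their first cell, which is assumed to be unique.
    """
    old = {row[0]: tuple(row) for row in old_rows}
    new = {row[0]: tuple(row) for row in new_rows}
    added = [new[key] for key in new if key not in old]
    deleted = [old[key] for key in old if key not in new]
    modified = [new[key] for key in new if key in old and old[key] != new[key]]
    return added, deleted, modified


# Hypothetical snapshots taken on two successive polls of the page.
before = [["1", "open"], ["2", "open"], ["3", "closed"]]
after = [["1", "open"], ["2", "closed"], ["4", "open"]]

added, deleted, modified = diff_rows(before, after)
print("added:", added)
print("deleted:", deleted)
print("modified:", modified)
```

Any non-empty list would be the trigger for the notification.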

Add RSS to any website?

Is there any website/service which will enable me to add an RSS subscription to any website? This is for the company I work for. We have a website which displays company-related news. The news is supplied by an external agency and is added to our database automatically. Our website picks up random/new news items and displays them. We are ...

What's a good tool to screen-scrape with JavaScript support?

Is there a good test suite or tool set that can automate website navigation -- with JavaScript support -- and collect the HTML from the pages? Of course I can scrape straight HTML with BeautifulSoup. But this does me no good for sites that require JavaScript. :) ...

Reading and posting to web pages using C#

I have a project at work that requires me to be able to enter information into a web page, read the next page I get redirected to and then take further action. A simplified real-world example would be something like going to google.com, entering "Coding tricks" as search criteria, and reading the resulting page. Small coding examples li...

What is the best way to parse a web page in Ruby?

I have been looking at XML and HTML libraries on RubyForge for a simple way to pull data out of a web page. For example, if I want to parse a user page on Stack Overflow, how can I get the data into a usable format? Say I want to parse my own user page for my current reputation score and badge listing. I tried to convert the source retri...

What are some good methods to hinder screen scrapers from grabbing specific pieces of content off my site?

Pretty sure this question counts as blasphemy to most web 2.0 proponents, but I do think there are times when you could possibly not want pieces of your site being easily ripped off into someone else's arbitrary web aggregator. At least enough so they'd need to be arsed to do it by hand if they really wanted it. My idea was to make a s...

How do screen scrapers work?

I hear about people writing these programs all the time and I know what they do, but how do they actually do it? I'm looking for general concepts. ...

Perl: HTML Scraping from an Authenticated website

While HTML scraping is pretty well-documented from what I can see, and I understand the concept and implementation of it, what is the best method for scraping content that is tucked away behind authentication forms? I mean scraping content that I legitimately have access to, so a method for automatically submitting login da...