screen-scraper

C# Screen Scraper - Handle long URIs

Hi! I'm building an HTML screen scraper that parses URLs and then compares them with a set of other URLs. The comparison is done with Uri.AbsoluteUri or Uri.Host. My problem is that when I create a new Uri (new Uri(url)), a UriFormatException is thrown when the URL is too long or contains too many slashes. Since my predefined s...
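One way to approach the comparison, sketched in Python rather than C# purely for illustration: parse each URL defensively and treat anything that fails to parse as "no host" instead of letting the exception escape. The helper names `safe_host` and `matches_known_host` are hypothetical and not part of the question; `urlsplit` is from Python's standard library. This is a sketch of the idea, not a statement about how .NET's Uri class behaves.

```python
from urllib.parse import urlsplit


def safe_host(url):
    """Return the lowercased host of `url`, or None if it cannot be parsed.

    Hypothetical helper: malformed input yields None instead of an exception.
    """
    try:
        return urlsplit(url).hostname
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs
        # (for example an unterminated IPv6 bracket).
        return None


def matches_known_host(url, known_hosts):
    """True if the URL's host is in the predefined set of hosts."""
    host = safe_host(url)
    return host is not None and host in known_hosts
```

With this shape, a very long URL or one with many slashes is compared by host only, and unparseable input simply does not match.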

How Many Java HttpURLConnections Should I Be Able to Open Concurrently?

I'm writing a multi-threaded Java web crawler. From what I understand of the web, when a user loads a web page the browser requests the first document (e.g., index.html) and, as it receives the HTML, it finds other resources that need to be included (images, CSS, JS) and requests those resources concurrently. My crawler is only requestin...
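Whatever the right number turns out to be, a common way to enforce a ceiling on concurrent requests is a fixed-size worker pool. The following is a minimal Python sketch of that idea (not Java, and not the asker's code); `crawl`, `fetch`, and the default of 4 connections are assumptions chosen for illustration, and `fetch` stands in for whatever function actually downloads a page.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl(urls, fetch, max_connections=4):
    """Fetch every URL with at most `max_connections` requests in flight.

    `fetch` is a caller-supplied function (hypothetical here) that takes a
    URL and returns its content. Returns a dict mapping URL to result.
    """
    urls = list(urls)  # allow any iterable, and iterate it twice safely
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        # pool.map preserves input order, so results line up with urls.
        return dict(zip(urls, pool.map(fetch, urls)))
```

The pool size is then the single knob to tune against the target server's tolerance.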

How to get specific content with PHP and DOM Document?

Howdy, I have a URL I want to grab. I only want a short piece of content from it. The content in question is in a div that has an ID of sample: <div id="sample"> Content </div> I can grab the file like so: $url = file_get_contents('http://www.example.com/'); But how do I select just that sample div? Any ideas? ...
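The underlying technique is to parse the page and keep only the text inside the element whose id matches. Here is a minimal sketch in Python using only the standard library's `html.parser` (the question itself is about PHP's DOMDocument, which this does not use). The names `DivByIdExtractor` and `extract_div_text` are hypothetical, and the sketch assumes the div's `<div>` tags are properly closed.

```python
from html.parser import HTMLParser


class DivByIdExtractor(HTMLParser):
    """Collect the text inside the first-level <div> with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while we are inside the target div
        self.chunks = []    # text fragments found inside the target div

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1  # nested div inside the target
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_div_text(html, target_id):
    """Return the stripped text content of the div with id `target_id`."""
    parser = DivByIdExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()
```

Tracking nesting depth is what lets the extractor stop at the matching closing tag rather than at the first `</div>` it sees.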

Screen scraper application (not HTML)

Hello. I need a screen scraper application that recognizes text on the screen (without using the WinAPI to do this, so the source could be an image file). I found a lot of commercial solutions, but I need something open source or free. I plan to include it in my C# project, so there should be some SDK available. Thanks. ...

C# library similar to HtmlUnit

Hello. I need to write a standalone application which will "browse" an external resource. Is there a library in C# which automatically handles cookies and supports JavaScript (though JS is not required, I believe)? The main goal is to keep the session alive and submit forms so I could pass a multistep registration process or "browse" a web site after ...

Java Framework - Using screen scraping to mesh heterogeneous server environments

OK. So I have a CMS written in Java that satisfies the needs of several hundred clients. But periodically, a client will need a specialized application: for example, a class registration database application. So let's say that I don't feel like writing it or I'm too busy. So I outsource it to someone else but I don't want his/her code ...

Using Ruby And Ubuntu With Optical Character Recognition

I am a university student and it's time to buy textbooks again. This quarter there are over 20 books I need for classes. Normally this wouldn't be such a big deal, as I would just copy and paste the ISBNs into Amazon. The ISBNs, however, are converted into an image on my school's book site. All I want to do is get the ISBNs into a string...

How can I get Java's parser to be forgiving with badly formed HTML?

I'm attempting to do some screen scraping; however, the HTML being returned is causing an error as there is no header (I think). Below is the code: public class xpath { private Document doc = null; public xpath() { HttpClient httpclient = new DefaultHttpClient(); HttpGet httpget = new HttpGet("http://blah.com/blah.php?param1...

Convert a (nested) HTML unordered list of links to a PHP array of links

Hi, I have a regular, nested HTML unordered list of links, and I'd like to scrape it with PHP and convert it to an array. The original list looks something like this: <ul> <li><a href="http://someurl.com">First item</a> <ul> <li><a href="http://someotherurl.com/">Child of First Item</a></li> <li><a href="http://som...
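The general shape of such a conversion is a stack with one entry per open `<ul>`, so that each link lands in the child list of the item that encloses it. Below is a minimal Python sketch of that approach using the standard library's `html.parser` (not PHP). The names `LinkListParser` and `parse_link_list`, and the `url`/`text`/`children` keys, are assumptions made for illustration; the question does not specify the desired array layout.

```python
from html.parser import HTMLParser


class LinkListParser(HTMLParser):
    """Turn nested <ul> lists of links into nested Python lists of dicts."""

    def __init__(self):
        super().__init__()
        self.root = []       # top-level items
        self.stack = []      # one target list per currently open <ul>
        self.current = None  # item whose link text is being collected

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            if not self.stack:
                self.stack.append(self.root)
            elif self.stack[-1]:
                # Nested list: attach to the most recent item one level up.
                self.stack.append(self.stack[-1][-1]["children"])
            else:
                # Nested list with no preceding link: stay on the same level.
                self.stack.append(self.stack[-1])
        elif tag == "a" and self.stack:
            self.current = {"url": dict(attrs).get("href"),
                            "text": "", "children": []}
            self.stack[-1].append(self.current)

    def handle_endtag(self, tag):
        if tag == "ul" and self.stack:
            self.stack.pop()
        elif tag == "a":
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data


def parse_link_list(html):
    """Return the nested link structure found in `html`."""
    parser = LinkListParser()
    parser.feed(html)
    parser.close()
    return parser.root
```

Each item carries its own `children` list, so arbitrarily deep nesting needs no special handling.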

Screen scrape a website that blocks ips

Hello, I want to screen scrape a site like Yelp to get phone numbers of Italian restaurants. I created a simple program to do just what I wanted, but they blocked my server's IP. I am using PHP to do it. How can I get past the IP block? I've heard about programs like screen-scraper, but I haven't used it yet. What is the best way to...

How can I write an automated script to log into, navigate, and save a website from a headless server?

Currently, I have a script that calls Firefox and runs a macro, but this is very buggy and rarely works the way I want it to. ...