Hi!

I need a library (hopefully in C#!) that works as a web crawler and can fetch files over both HTTP and FTP. To start with, I'm happy with reading just HTML, but I want to extend it later to PDF, Word, etc.

I'd be happy with open-source software to start from, or at least some pointers to documentation.

Best regards, David

+1  A: 

Check out the NCrawler project:

Simple and very efficient multithreaded web crawler with pipeline-based processing, written in C#. Contains HTML, Text, PDF, and IFilter document processors and language detection (Google). Easy to add pipeline steps to extract, use, and alter information.
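To give a rough idea of what "pipeline-based processing" means, here is a minimal sketch of the general pattern. It does no networking and is not NCrawler's actual API: `CrawledPage`, `IPipelineStep`, `Pipeline`, the two step classes, `Demo`, and the example URL are all hypothetical names made up for illustration. Check the project's own documentation for the real types.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical types for illustration only; these are NOT NCrawler's API.
public sealed class CrawledPage
{
    public CrawledPage(Uri uri, string html) { Uri = uri; Html = html; }
    public Uri Uri { get; }
    public string Html { get; }
    public string Text { get; set; } = "";
    public List<Uri> Links { get; } = new List<Uri>();
}

// Each step reads and/or enriches the page, then the next step runs.
public interface IPipelineStep
{
    void Process(CrawledPage page);
}

// Strips tags with a crude regex; a real crawler would use an HTML parser.
public sealed class TextExtractionStep : IPipelineStep
{
    public void Process(CrawledPage page)
    {
        string noTags = Regex.Replace(page.Html, "<[^>]+>", " ");
        page.Text = Regex.Replace(noTags, @"\s+", " ").Trim();
    }
}

// Resolves double-quoted href values against the page URI and keeps
// only http, https, and ftp links.
public sealed class LinkExtractionStep : IPipelineStep
{
    private static readonly Regex Href =
        new Regex("href\\s*=\\s*\"([^\"]+)\"", RegexOptions.IgnoreCase);

    public void Process(CrawledPage page)
    {
        foreach (Match m in Href.Matches(page.Html))
        {
            if (Uri.TryCreate(page.Uri, m.Groups[1].Value, out var link)
                && (link.Scheme == Uri.UriSchemeHttp
                    || link.Scheme == Uri.UriSchemeHttps
                    || link.Scheme == Uri.UriSchemeFtp))
            {
                page.Links.Add(link);
            }
        }
    }
}

// Runs the registered steps in the order they were added.
public sealed class Pipeline
{
    private readonly List<IPipelineStep> _steps = new List<IPipelineStep>();

    public Pipeline Add(IPipelineStep step) { _steps.Add(step); return this; }

    public void Run(CrawledPage page)
    {
        foreach (var step in _steps) step.Process(page);
    }
}

public static class Demo
{
    // Processes one in-memory page; a real crawler would download it first.
    public static CrawledPage Run()
    {
        var page = new CrawledPage(
            new Uri("http://example.com/docs/"),
            "<p>Hello <a href=\"guide.pdf\">guide</a></p>");
        new Pipeline()
            .Add(new TextExtractionStep())
            .Add(new LinkExtractionStep())
            .Run(page);
        return page;
    }
}
```

Supporting a new document type (PDF, Word, ...) then means adding one more class that implements the step interface, without touching the existing steps.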

Nick Martyshchenko
+1: Very good suggestion, I'll give it some testing to see if it can help me. At first glance it seems so.
David Conde