views:

62

answers:

2

Hi, I am currently in dire need of news articles to test an LSI implementation (it targets a foreign language, so the usual ready-made packs of files aren't available).

So I need a crawler that, given a starting URL (let's say http://news.bbc.co.uk/), follows all the contained links and saves their content into .txt files. If I could specify the output encoding to be UTF-8, I would be in heaven.

I have zero expertise in this area, so I beg you for some suggestions on which crawler to use for this task.

A: 

What you are looking for is a "scraper", and you will have to write one. Furthermore, you may be in violation of the BBC's Terms of Use (not that anyone cares).
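
For what it's worth, a minimal sketch of what such a scraper could look like in Python, using only the standard library. The start URL, output directory, and page limit below are just placeholders to adjust; it saves the raw HTML of each same-host page as a numbered .txt file, and converting that HTML to plain text is a separate step.

    # Minimal single-site scraper sketch (standard library only).
    # Follows links from a start URL within the same host and saves each
    # page as a numbered file; START_URL, OUT_DIR and MAX_PAGES are
    # placeholders, not fixed values.
    import os
    import urllib.request
    import urllib.parse
    from html.parser import HTMLParser

    START_URL = "http://news.bbc.co.uk/"   # placeholder start page
    OUT_DIR = "pages"                      # placeholder output directory
    MAX_PAGES = 50                         # stop after this many pages

    class LinkCollector(HTMLParser):
        """Collects href attributes from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl():
        os.makedirs(OUT_DIR, exist_ok=True)
        host = urllib.parse.urlparse(START_URL).netloc
        to_visit = [START_URL]
        seen = set()
        count = 0
        while to_visit and count < MAX_PAGES:
            url = to_visit.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue  # skip pages that fail to download or decode
            # Save the raw page; HTML-to-text conversion is a separate step.
            path = os.path.join(OUT_DIR, "page_%04d.txt" % count)
            with open(path, "w", encoding="utf-8") as f:
                f.write(html)
            count += 1
            # Queue links that stay on the same host for later visits.
            parser = LinkCollector()
            parser.feed(html)
            for link in parser.links:
                absolute = urllib.parse.urljoin(url, link)
                if urllib.parse.urlparse(absolute).netloc == host:
                    to_visit.append(absolute)

    if __name__ == "__main__":
        crawl()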

Rook
I just mentioned the BBC since everyone knows of it... like I said, I need texts in a foreign language, so if I find a program that does this I won't be using it on the BBC.
brokencoding
In general, scraping can be in violation of a ToS; I don't know if this is against the BBC's ToS. Also, Stack Overflow is only for programming questions. If you are looking for a program to do everything for you, then you should have posted this on SuperUser.
Rook
A: 

You can grab the site with wget, then run it all through some HTML renderer to convert the HTML to plain text (the Lynx text browser does the job adequately with its -dump option). You will need to write the script that calls Lynx on each downloaded file yourself, but that should be easy enough.
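
A rough sketch of that script in Python, assuming the site was already mirrored locally with something like wget --recursive --level=2 --convert-links http://news.bbc.co.uk/ and that lynx is on the PATH; the directory names below are placeholders:

    # Sketch: convert a wget mirror into UTF-8 .txt files using lynx.
    # MIRROR_DIR and TEXT_DIR are placeholder names to adjust.
    import os
    import subprocess

    MIRROR_DIR = "news.bbc.co.uk"   # directory created by wget --recursive
    TEXT_DIR = "text"               # where the UTF-8 .txt files go

    os.makedirs(TEXT_DIR, exist_ok=True)
    count = 0
    for root, _dirs, files in os.walk(MIRROR_DIR):
        for name in files:
            if not name.endswith((".html", ".htm")):
                continue
            src = os.path.join(root, name)
            # -dump renders the page as plain text, -nolist drops the trailing
            # list of link URLs, -display_charset asks Lynx to emit UTF-8.
            result = subprocess.run(
                ["lynx", "-dump", "-nolist", "-display_charset=utf-8", src],
                capture_output=True,
            )
            text = result.stdout.decode("utf-8", errors="replace")
            dst = os.path.join(TEXT_DIR, "article_%05d.txt" % count)
            with open(dst, "w", encoding="utf-8") as f:
                f.write(text)
            count += 1

Adjust the file-extension filter and directory names to whatever wget actually produced for the site you mirror.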

SF.