I need a program to fetch all the web pages under a website. The website is in Chinese, and I want to pull out all the English words. Then I can extract the information I need. Any ideas for this? Is there existing software for this purpose?

If not, I would like to write one myself. Any suggestions?

Thanks much.

+1  A: 

Not a PHP solution, but you can use the Lynx text-only Web browser with the -crawl and -dump options to visit all the pages on a site and dump them as text files. You can then use a script to extract the information you want from them.
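A rough sketch of that approach (assuming Lynx is installed; the URLs are placeholders):

    # -traversal follows links within the starting site; with -crawl,
    # each page visited is dumped as plain text to a numbered .dat file.
    lynx -crawl -traversal http://example.com/

    # For a single page, -dump writes the rendered text to stdout:
    lynx -dump http://example.com/page.html > page.txt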

Ken Keenan
+2  A: 

You're pretty much describing a web crawler (something that takes a page, looks for all the links, follows them, and so on). There are crawler implementations already out there, tools that act like crawlers (such as wget), and questions relating to them here on Stack Overflow. For example...

http://stackoverflow.com/questions/102631/how-to-write-a-crawler

Once you have something that can visit each page, you then need code that parses the page and looks for the text you're interested in.
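If you do end up writing the crawl loop yourself, the core is just a queue of URLs and a visited set. A minimal sketch in Java (the start URL, the page cap, and the regex-based link extraction are all illustrative; a real crawler should use a proper HTML parser):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MiniCrawler {
        // Naive href extraction; fine for a sketch, fragile on real HTML.
        private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

        public static void main(String[] args) throws Exception {
            Deque<String> queue = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();
            queue.add("http://example.com/"); // placeholder start URL

            while (!queue.isEmpty() && seen.size() < 100) { // hard cap for safety
                String url = queue.poll();
                if (!seen.add(url)) continue; // skip already-visited pages

                StringBuilder page = new StringBuilder();
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(new URL(url).openStream(), "UTF-8"))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        page.append(line).append('\n');
                    }
                } catch (Exception e) {
                    continue; // unreachable or non-HTML page, move on
                }

                // Hand page.toString() to whatever extracts your text here.

                Matcher m = LINK.matcher(page);
                while (m.find()) {
                    queue.add(m.group(1)); // enqueue discovered links
                }
            }
        }
    }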

Martin Peck
+9  A: 

Use e.g. wget -r http://site.to.copy.com to recursively retrieve all the web pages to your local machine (hopefully it's not too big...); then you can search or do whatever you want with the files afterwards.

Erlend Leganger
What I was going to suggest. Why bother building yet another mousetrap?
Carl Smotricz
You may want to consider using the "--convert-links" flag as well, so that you can browse the copy locally...
AJ
Depending on how many pages you're intending to download, you might also make sure to specify the --limit-rate option to avoid overloading the server.
David
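Putting the answer and the comments together, an invocation might look like this (the URL is copied from the answer as a placeholder, and the rate limit is illustrative):

    # --convert-links rewrites links so the local copy is browsable offline;
    # --limit-rate throttles the download to avoid overloading the server.
    wget --recursive --convert-links --limit-rate=200k http://site.to.copy.com/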
A: 

So you want a web crawler (web robot). I would start by looking around for existing third-party software that does this; googling the aforementioned keywords should deliver enough results.

If you insist on writing one yourself, please respect the robots.txt of the site in question, otherwise you're likely to get blacklisted. Anyway, the Java way would be to use java.net.URLConnection and a small HTML parser like JTidy to extract the information. Finally, to separate all the Latin text from the Chinese, write a small parser that checks the code points, as sketched below.
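For that last step, a minimal sketch of such a code-point check (the input string is a placeholder; in practice it would be the text pulled out of a fetched page):

    import java.lang.Character.UnicodeScript;

    public class LatinExtractor {
        public static void main(String[] args) {
            // Placeholder input; in practice this would be the text the
            // HTML parser extracted from a fetched page.
            String text = "这是一个 example of mixed 中英文 text";

            StringBuilder out = new StringBuilder();
            // Walk the string by Unicode code point and keep only
            // Latin-script characters, replacing everything else with spaces.
            text.codePoints().forEach(cp -> {
                if (UnicodeScript.of(cp) == UnicodeScript.LATIN) {
                    out.appendCodePoint(cp);
                } else {
                    out.append(' ');
                }
            });
            // Collapse the gaps left by the removed characters.
            System.out.println(out.toString().trim().replaceAll("\\s+", " "));
            // Prints: example of mixed text
        }
    }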

BalusC
+2  A: 

wget (see its man page) can also serve well as a crawler; look at its --recursive option.

Wim