tags:

views:

61

answers:

1

I can successfully run the crawl command via Cygwin on Windows XP, and I can also search the crawled pages via Tomcat.

However, I also want to save the parsed pages during the crawl.

So when I start crawling like this:

bin/nutch crawl urls -dir crawled -depth 3

I also want to save the parsed HTML pages as text files.

That is, during the crawl started with the command above, whenever Nutch fetches a page it should automatically save the parsed content of that page (text only) to a text file.

The file names could be the fetched URLs.

I really need help with this.

This will be used in my university language-detection project.

Thanks.

A: 

The crawled pages are stored in the segments. You can access them by dumping the segment content:

nutch readseg -dump crawl/segments/20100104113507/ dump

You will have to do this for each segment.
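Since a crawl with `-depth 3` produces one segment per generate/fetch round, you can loop over them instead of dumping each by hand. Here is a minimal sketch, assuming the default layout `crawled/segments/<timestamp>` created by the question's `bin/nutch crawl urls -dir crawled` command; the `dump_segments` helper and the `dump-<timestamp>` output naming are my own, and the function only prints the commands so you can review them before piping the output to `sh`:

```shell
# dump_segments: print one "nutch readseg -dump" command per segment
# under the given crawl directory (hypothetical helper; layout assumed
# to be <crawl_dir>/segments/<timestamp> as in a default Nutch crawl)
dump_segments() {
  crawl_dir=$1
  for segment in "$crawl_dir"/segments/*; do
    [ -d "$segment" ] || continue   # skip if no segments exist yet
    # dump each segment into its own directory, e.g. dump-20100104113507
    echo bin/nutch readseg -dump "$segment" "dump-$(basename "$segment")"
  done
}

# Example: review the commands, then run them:
#   dump_segments crawled
#   dump_segments crawled | sh
```

Each resulting dump directory contains a plain-text file with the segment's records, which you can then post-process for the language-detection work.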

Pascal Dimassimo