views:

958

answers:

3

In Groovy, how do I grab a web page and remove HTML tags, etc., leaving only the document's text? I'd like the results dumped into a collection so I can build a word frequency counter.

Finally, let me mention again that I'd like to do this in Groovy.

A: 

You can use the Lynx web browser to dump the document's text and save it.

Do you want to do this automatically? Do you want a separate application that does this? Or do you want help coding it into your application? What platforms (Windows desktop, web server, etc.) will it run on?

moogs
+2  A: 

Assuming you want to do this with Groovy (guessing based on the groovy tag), your approaches are likely to be either heavily shell-script oriented or based on Java libraries. For shell scripting, I agree with moogs: using Lynx or ELinks is probably the easiest way to go about it. Otherwise, have a look at HTMLParser, and see Processing Every Word in a File (scroll down to find the relevant code snippet).

You're probably stuck with using Java libraries from Groovy for the HTML parsing, as there don't appear to be any Groovy-native libraries for it. If you're not using Groovy, please post the desired language, since there are a multitude of HTML-to-text tools out there, depending on the language you're working in.
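As one concrete option that needs no third-party library at all, the JDK itself ships an HTML parser in `javax.swing.text.html`. Here's a minimal plain-Java sketch (callable as-is from Groovy) that strips the tags and returns the words; the class and method names are my own:

```java
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class HtmlText {
    // Returns the words of an HTML document, ignoring all tags.
    // The parser calls handleText() once per run of character data.
    public static List<String> words(String html) throws Exception {
        List<String> words = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                for (String w : new String(data).split("\\s+")) {
                    if (!w.isEmpty()) words.add(w);
                }
            }
        };
        // true = ignore the document's charset declaration
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return words;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(words("<html><body><p>Hello <b>brave</b> world</p></body></html>"));
    }
}
```

Unlike the XmlSlurper approach below, this parser tolerates ordinary tag-soup HTML, not just well-formed XHTML.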

Jay
+1  A: 

If you want a collection of tokenized words from the HTML, can't you just parse it like XML (it needs to be well-formed XML, e.g. XHTML) and grab all of the text between the tags? How about something like this:

// the HTML must be well-formed XML (e.g. XHTML) for XmlSlurper to accept it
def records = new XmlSlurper().parseText(YOURHTMLSTRING)
// text() on a node returns all character data beneath it, so calling it
// once on the root (rather than on every node of a depthFirst traversal,
// which would count each word once per enclosing tag) gives the document text
def list = records.text().tokenize()
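From there, the word frequency counter the question asks for is a small step. A minimal plain-Java sketch (usable directly from Groovy; the class name `WordFreq` is my own) over the resulting word list:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordFreq {
    // Counts occurrences of each word, case-insensitively,
    // preserving first-seen order of the words.
    public static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String w : words) {
            // merge() inserts 1 on first sight, otherwise adds 1 to the count
            freq.merge(w.toLowerCase(), 1, Integer::sum);
        }
        return freq;
    }
}
```

In Groovy you could pass the `list` built above straight in: `WordFreq.count(list)`.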
mbrevoort