views:

958

answers:

3

In Groovy, how do I grab a web page and remove HTML tags, etc., leaving only the document's text? I'd like the results dumped into a collection so I can build a word frequency counter.

Finally, let me mention again that I'd like to do this in Groovy.

A: 

You can use the Lynx web browser to dump the document's text and save it.

Do you want to do this automatically? Do you want a separate application that does this? Or do you want help coding it into your application? What platforms (Windows desktop, web server, etc.) will it run on?

moogs
+2  A: 

Assuming you want to do this with Groovy (guessing based on the groovy tag), your approaches are likely to be either heavily shell-script oriented or based on Java libraries. For shell scripting, I agree with moogs: using Lynx or ELinks is probably the easiest way to go about it. Otherwise, have a look at HTMLParser, and see Processing Every Word in a File (scroll down to find the relevant code snippet).

You're probably stuck with using Java libraries from Groovy for the HTML parsing, as there don't appear to be any Groovy-native libraries for it. If you're not using Groovy, please post the desired language, since there are a multitude of HTML-to-text tools out there, depending on the language you're working in.
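As one concrete option that needs no third-party library at all, the JDK itself ships an HTML parser in `javax.swing.text.html`. Here's a minimal plain-Java sketch (callable as-is from Groovy) that strips the tags and returns the words; the class and method names are my own:

```java
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class HtmlText {
    // Returns the words of an HTML document, ignoring all tags.
    // The parser calls handleText() once per run of character data.
    public static List<String> words(String html) throws Exception {
        List<String> words = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                for (String w : new String(data).split("\\s+")) {
                    if (!w.isEmpty()) words.add(w);
                }
            }
        };
        // true = ignore the document's charset declaration
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return words;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(words("<html><body><p>Hello <b>brave</b> world</p></body></html>"));
    }
}
```

Unlike the XmlSlurper approach below, this parser tolerates ordinary tag-soup HTML, not just well-formed XHTML.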

Jay
+1  A: 

If you want a collection of tokenized words from the HTML, can't you just parse it like XML (it needs to be well-formed XML, e.g. XHTML) and grab all of the text between the tags? How about something like this:

// the HTML must be well-formed XML (e.g. XHTML) for XmlSlurper to accept it
def records = new XmlSlurper().parseText(YOURHTMLSTRING)
// text() on a node returns all character data beneath it, so calling it
// once on the root (rather than on every node of a depthFirst traversal,
// which would count each word once per enclosing tag) gives the document text
def list = records.text().tokenize()
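From there, the word frequency counter the question asks for is a small step. A minimal plain-Java sketch (usable directly from Groovy; the class name `WordFreq` is my own) over the resulting word list:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordFreq {
    // Counts occurrences of each word, case-insensitively,
    // preserving first-seen order of the words.
    public static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String w : words) {
            // merge() inserts 1 on first sight, otherwise adds 1 to the count
            freq.merge(w.toLowerCase(), 1, Integer::sum);
        }
        return freq;
    }
}
```

In Groovy you could pass the `list` built above straight in: `WordFreq.count(list)`.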
mbrevoort