ansaurus

Question

Scraping largest block of text from HTML document

Answer 1

+1 A:

You could write an app that looks for the largest contiguous block of text, disregarding formatting tags (if required). You could do this with a DOM parser, walking the tree and keeping track of the immediate parent of each block (because that parent is your output).

Start from the parent nodes and traverse the tree. For each node that is just formatting, continue the 'count' within that sub-block, counting the characters of its content.

Once you find the block with the most content, traverse back up the tree to its parent to get your answer.
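
The walk described above could be sketched roughly like this. It is only a minimal sketch under some assumptions: the INLINE and SKIP tag lists are illustrative and incomplete, largestTextBlock is a made-up name, and the nodes are assumed to expose the standard DOM properties nodeType, nodeName, childNodes and nodeValue. Text inside formatting tags is credited to the nearest non-formatting ancestor, so wrappers such as 'body' only count text they hold directly.

```javascript
// Tag names treated as inline formatting (an assumed, incomplete list).
const INLINE = new Set(['A', 'B', 'BR', 'I', 'U', 'S', 'EM', 'STRONG', 'SPAN',
                        'CODE', 'INS', 'DEL', 'IMG', 'SMALL', 'SUB', 'SUP']);
// Elements whose text is never page content (also an assumption).
const SKIP = new Set(['SCRIPT', 'STYLE']);

const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE
const TEXT_NODE = 3;    // same value as Node.TEXT_NODE

// Walk the tree under `root` and return the block element that directly owns
// the most text, as { node, length }. `node` is null if no text was found.
function largestTextBlock(root) {
  let best = { node: null, length: 0 };

  // Returns the number of characters `node` passes up to its parent block.
  function visit(node) {
    if (node.nodeType === TEXT_NODE) {
      return node.nodeValue.trim().length; // ignore surrounding whitespace
    }
    if (node.nodeType !== ELEMENT_NODE) {
      return 0; // comments and other node types carry no content
    }
    const name = node.nodeName.toUpperCase();
    if (SKIP.has(name)) {
      return 0;
    }
    let count = 0;
    for (const child of Array.from(node.childNodes)) {
      count += visit(child);
    }
    if (INLINE.has(name)) {
      return count; // formatting tag: the count continues in the parent
    }
    // Block element: it owns `count` and passes nothing further up.
    if (count > best.length) {
      best = { node: node, length: count };
    }
    return 0;
  }

  visit(root);
  return best;
}
```

In a browser, largestTextBlock(document.body).node would be the candidate element. The same walk should carry over to other DOM-style parsers, as long as they expose equivalent properties.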

I think your solution relies on how you traverse the DOM and keep track of the nodes that you are scanning.

What language are you using? Any other details for your project? There may be language specific or package specific tools you could use as well.

Klathzazt 2008-11-14 08:13:04

I would be using Python/BeautifulSoup. I like this idea. I'll try an implementation where I filter out all the small formatting tags and then process the text.

Max 2008-11-14 17:12:44

Answer 2

+1 A:

You will also have to decide at which level you want to select the node. In your example, the 'body' node contains an even larger amount of text. So you have to define what exactly a 'parent element' is.

Michiel Overeem 2008-11-14 08:18:28

Wasn't it about 'leaves' only, or am I getting something wrong?

tharkun 2008-11-14 08:24:07

Answer 3

+2 A:

Here's roughly how I would approach this:

// get array of all elements (body is used as parent here but you could use whatever)
var elms = document.body.getElementsByTagName('*');
var nodes = Array.prototype.slice.call( elms, 0 );

// get inline elements out of the way (incomplete list)
nodes = nodes.filter(function (elm) {
  return !/^(a|br?|hr|code|i(ns|mg)?|u|del|em|s(trong|pan))$/i.test( elm.nodeName );
});

// sort elements by text length, longest first
nodes.sort(function (a, b) {
  return b.textContent.length - a.textContent.length;
});

Using ancestry functions like a.compareDocumentPosition(b), you can also sink elements during sorting (or after), depending on how complex this thing needs to be.
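
One possible reading of "sinking", sketched under assumptions: after the sort, every candidate that contains another candidate (body, wrapper divs) is moved to the end, so the leaf-most blocks come first. The helper names isAncestorOf and sinkAncestors are made up for illustration. The check relies on the DOCUMENT_POSITION_CONTAINED_BY bit (value 16) that compareDocumentPosition sets when its argument is a descendant.

```javascript
const CONTAINED_BY = 16; // value of Node.DOCUMENT_POSITION_CONTAINED_BY

// True when `b` is a descendant of `a`.
function isAncestorOf(a, b) {
  return (a.compareDocumentPosition(b) & CONTAINED_BY) !== 0;
}

// Takes candidates already sorted by text length (longest first) and returns
// a new array in which every element that contains another candidate has
// been moved to the end. Relative order inside each group is preserved.
function sinkAncestors(sorted) {
  const leaves = [];
  const containers = [];
  for (const elm of sorted) {
    const wraps = sorted.some(function (other) {
      return other !== elm && isAncestorOf(elm, other);
    });
    (wraps ? containers : leaves).push(elm);
  }
  return leaves.concat(containers);
}
```

With the sorted array from above, sinkAncestors(nodes)[0] would be the longest block that wraps no other candidate. The pairwise check is quadratic in the number of candidates, which is probably fine for a single page but worth keeping in mind for very large documents.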

Borgar 2008-11-14 14:00:55

Thank you, Borgar. It seems this solution rests on getting rid of the small inline formatting tags first, as you and Klathzazt are saying.

Max 2008-11-14 17:13:59
