I am working on an algorithm that will try to pick out, given an HTML file, what it thinks is the parent element that most likely contains the majority of the page's content text. For example, it would pick the div "content" in the following HTML:

<html>
   <body>
      <div id="header">This is the header we don't care about</div>
      <div id="content">This is the <b>Main Page</b> content.  it is the
      longest block of text in this document and should be chosen as
      most likely being the important page content.</div>
   </body>
</html>

I have come up with a few ideas, such as traversing the HTML document tree to its leaves, adding up the length of the text, and looking at whatever other text a parent has only if the parent gives us more content than its children do.

Has anyone ever tried something like this, or does anyone know of an algorithm that can be applied? It doesn't have to be rock solid; as long as it can guess a container that holds most of the page content text (for articles or blog posts, for example), that would be awesome.

+1  A: 

You could create an app that looks for contiguous blocks of text, disregarding formatting tags (if required). You could do this by using a DOM parser and walking the tree, keeping track of the immediate parent (because that is your output).

Start from the parent nodes and traverse the tree. For each node that is just formatting, continue the 'count' within that sub-block, counting the characters of the content.

Once you find the block with the most content, traverse back up the tree to its parent to get your answer.
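As a rough illustration of the walk described above, here is a minimal Python sketch. It uses the standard library's `html.parser` rather than a full DOM parser, and the `INLINE_TAGS` list, the `BlockTextCounter` class and the `largest_text_block` function are hypothetical names made up for this example, not part of any library. Text inside a formatting tag is credited to the nearest enclosing non-formatting element.

```python
from html.parser import HTMLParser

# Assumption: an incomplete list of "just formatting" tags; adjust to taste.
INLINE_TAGS = {"a", "b", "i", "u", "em", "strong", "span", "code", "ins", "del"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
# Text inside these elements is not page content.
SKIP_TAGS = {"script", "style"}


class BlockTextCounter(HTMLParser):
    """Credits every run of text to its nearest non-formatting ancestor."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # one dict per block element seen, in document order
        self.stack = []   # indices into self.blocks for the currently open blocks

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS or tag in VOID_TAGS:
            return
        self.blocks.append({"tag": tag, "id": dict(attrs).get("id"), "chars": 0})
        self.stack.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag in INLINE_TAGS or tag in VOID_TAGS:
            return
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.blocks[self.stack[i]]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text or not self.stack:
            return
        block = self.blocks[self.stack[-1]]
        if block["tag"] not in SKIP_TAGS:
            block["chars"] += len(text)


def largest_text_block(html):
    """Return the block element that directly holds the most text, or None."""
    parser = BlockTextCounter()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return None
    return max(parser.blocks, key=lambda block: block["chars"])
```

Calling `largest_text_block(html)` on the sample page from the question should return the entry for the div with id "content", since the text inside `<b>` is counted toward that div.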

I think your solution relies on how you traverse the DOM and keep track of the nodes that you are scanning.

What language are you using? Any other details for your project? There may be language specific or package specific tools you could use as well.

Klathzazt
I would be using Python/BeautifulSoup. I like this idea. I'll try an implementation where I filter out all the small formatting tags and then process the text.
Max
+1  A: 

You will also have to formulate a level on which you want to select the node. In your example, the 'body' node has an even larger amount of text in it. So you have to formulate what a 'parent element' exactly is.
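To make the point concrete, here is a small Python sketch that counts all descendant text for every element, so `html` and `body` always come out on top. It then applies one possible (assumed, not the only) definition of the level: pick the deepest element that still holds at least `min_share` of the page's text. The names `TreeTextCounter`, `pick_container` and `min_share` are hypothetical and exist only for this example; parsing uses the standard library's `html.parser`.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TreeTextCounter(HTMLParser):
    """Counts, for every element, the text of all of its descendants."""

    def __init__(self):
        super().__init__()
        self.nodes = []  # one dict per element, in document order
        self.stack = []  # indices into self.nodes for the currently open elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.nodes.append({
            "tag": tag,
            "id": dict(attrs).get("id"),
            "depth": len(self.stack),
            "chars": 0,
        })
        self.stack.append(len(self.nodes) - 1)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.nodes[self.stack[i]]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        length = len(" ".join(data.split()))  # collapse runs of whitespace
        for index in self.stack:  # every open ancestor contains this text
            self.nodes[index]["chars"] += length


def pick_container(html, min_share=0.5):
    """Return the deepest element holding at least min_share of all text."""
    parser = TreeTextCounter()
    parser.feed(html)
    parser.close()
    if not parser.nodes:
        return None
    total = max(node["chars"] for node in parser.nodes)
    candidates = [n for n in parser.nodes if n["chars"] >= min_share * total]
    return max(candidates, key=lambda node: node["depth"])
```

With `min_share=1.0` the sample page from the question yields `body`, which is exactly the problem described above; a lower threshold such as the default 0.5 lets the choice move down to the "content" div. The right threshold is a judgment call and would need tuning on real pages.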

Michiel Overeem
Wasn't it about leaves only, or am I getting something wrong?
tharkun
+2  A: 

Here's roughly how I would approach this:

// get array of all elements (body is used as parent here but you could use whatever)
var elms = document.body.getElementsByTagName('*');
var nodes = Array.prototype.slice.call( elms, 0 );

// get inline elements out of the way (incomplete list)
nodes = nodes.filter(function (elm) {
  return !/^(a|br?|hr|code|i(ns|mg)?|u|del|em|s(trong|pan))$/i.test( elm.nodeName );
});

// sort elements by most text first (descending textContent length)
nodes.sort(function (a, b) {
  return b.textContent.length - a.textContent.length;
});

Using ancestry functions like a.compareDocumentPosition(b), you can also sink elements during sorting (or after), depending on how complex this thing needs to be.

Borgar
Thank you, Borgar. It seems this solution rests on getting rid of the small inline formatting tags first, as you and Klathzazt are saying.
Max