Hi Folks,

The goal is to find the largest piece of contiguous text in a document. The problem is that the largest piece does not necessarily lie under a single element: a blog post, for example, is split across several <p> tags, so iterating over the nodes and comparing their innerHTML is not going to work. And if I compare innerText instead, the root node always contains the most text. So how should one accomplish that?

Thanks

+3  A: 

Your problem is harder than it looks. If a div contains 2 words of its own plus a <p> with 200 words in it, do you count the div as having 202 words, or do you count the p with its 200 words as the biggest piece?

If the p has a border on all 4 sides, then it makes sense to say the answer is the p with 200 words. If there is no border, then it makes sense to say it is the div with 202 words.

You can try writing a function that traverses down from a node and, whenever it meets a block element with 4 borders, leaves that element's words out of the count.
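A minimal sketch of that traversal, written against the plain DOM node interface (nodeType, childNodes, data). The names countWords, isBoxed and hasFourBorders are made up for illustration, and the border check is only one possible reading of "has 4 borders"; it does not verify that the element is actually a block element.

```javascript
// Counts the words under `node`, leaving out any descendant element for which
// `isBoxed(element)` returns true. `isBoxed` is a hypothetical predicate
// supplied by the caller.
function countWords(node, isBoxed) {
  if (node.nodeType === 3) { // text node
    const words = node.data.trim().split(/\s+/);
    return words[0] === "" ? 0 : words.length; // whitespace-only text has no words
  }
  if (node.nodeType !== 1) return 0; // ignore comments and other node types
  let total = 0;
  for (const child of Array.from(node.childNodes)) {
    if (child.nodeType === 1 && isBoxed(child)) continue; // skip bordered boxes
    total += countWords(child, isBoxed);
  }
  return total;
}

// Browser-only example predicate (an assumption, not the only possible test):
// true when all four sides have a visible border style and a non-zero width.
function hasFourBorders(el) {
  const style = getComputedStyle(el);
  return ["Top", "Right", "Bottom", "Left"].every(
    (side) =>
      style["border" + side + "Style"] !== "none" &&
      parseFloat(style["border" + side + "Width"]) > 0
  );
}
```

In a browser this could be called as countWords(someDiv, hasFourBorders). It does not handle the same-color-border case.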

Things get more complicated if there are floated divs that are set to display:inline to work around an IE 6 bug, or if there are borders whose color is the same as the background color of the containing div.

If you don't care whether the inner elements have borders, then one approach is to look only at the immediate children of body and find out how many characters each one contains (the sum of the text under all of its descendants, probably using innerText, or innerHTML with all the tags stripped).
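Something like the following could do the "immediate children" comparison. It is only a sketch: largestChildByText is a made-up name, and it uses textContent rather than innerText, which is an assumption on my part (textContent is available on every element but also includes the text of script and style elements).

```javascript
// Returns the direct child element of `parent` that holds the most text
// (summed over all of its descendants), or null if there are no children.
function largestChildByText(parent) {
  let best = null;
  let bestLength = -1;
  for (const child of Array.from(parent.children)) {
    const length = (child.textContent || "").trim().length;
    if (length > bestLength) {
      best = child;
      bestLength = length;
    }
  }
  return best;
}
```

In a browser the call would be largestChildByText(document.body).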

You might also look for the element with the biggest area (width x height) if what you are after is the content section. This can fail when there is a long, narrow sidebar or ad section to the left or right and the content area is wide but really short.
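A rough sketch of the area comparison, using the standard offsetWidth and offsetHeight properties. The function name largestByArea and the minWidth parameter are assumptions added for illustration; minWidth is one crude way to keep a tall, narrow sidebar from winning.

```javascript
// Returns the element with the largest rendered area (width x height).
// `minWidth` is a hypothetical cut-off in pixels: narrower elements are ignored.
function largestByArea(elements, minWidth = 0) {
  let best = null;
  let bestArea = 0;
  for (const el of Array.from(elements)) {
    if (el.offsetWidth < minWidth) continue; // skip narrow columns
    const area = el.offsetWidth * el.offsetHeight;
    if (area > bestArea) {
      best = el;
      bestArea = area;
    }
  }
  return best;
}
```

In a browser you might try largestByArea(document.body.children, 300); the 300 is an arbitrary example value, not a recommendation.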

動靜能量
A: 

The most time-effective tactic in screen scraping is always to define a template for each source you are scraping. Since most pages these days have a "content" container, all you have to do is record the name of the "content" div for each of your sources. If you are scraping blogs it becomes even easier, because you can create rules for the most popular blogging systems: they usually use the same content container across installations. So you can try the defaults first and, if they come up empty, log the URL and identify the container manually.
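The "try defaults first, log the misses" idea could look roughly like this. The selector list is purely illustrative (guesses at common container names, not a verified list for any blogging system), and findContentContainer and logMiss are made-up names.

```javascript
// Hypothetical default selectors; replace them with the containers your sources use.
const DEFAULT_SELECTORS = ["#content", ".entry-content", ".post-content", "article"];

// Tries each selector in order and returns the first non-empty match.
// If nothing matches, reports the URL through `logMiss` for manual review.
function findContentContainer(doc, url, selectors, logMiss) {
  for (const selector of selectors) {
    const el = doc.querySelector(selector);
    if (el && el.textContent.trim() !== "") return el;
  }
  logMiss(url);
  return null;
}
```

In a browser: findContentContainer(document, location.href, DEFAULT_SELECTORS, console.log).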

If you really want to automate this you will probably (and I am guessing here) need to compare the sizes and types of the sibling nodes at each level of the DOM tree and only follow the largest branch. When you hit a level where all the siblings are text nodes, the container of those nodes is most likely your "main content" container. You can accomplish this using jQuery for node iteration or just "normal" JavaScript DOM functions.
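Here is one way the "follow the largest branch" walk might be sketched with plain DOM properties. Followed literally, it ends up inside a single paragraph of a multi-paragraph post, so this sketch adds a minShare parameter, which is my own assumption: stop descending once the largest child element holds less than that share of the current element's text. Pass 0 to get the literal behavior. Both function names are made up.

```javascript
// Total length of all text under `node`, ignoring leading/trailing whitespace
// in each text node.
function textLength(node) {
  if (node.nodeType === 3) return node.data.trim().length;
  if (node.nodeType !== 1) return 0;
  let total = 0;
  for (const child of Array.from(node.childNodes)) total += textLength(child);
  return total;
}

// Walks down from `root`, always stepping into the child element with the most
// text, until there are no element children left or (assumption) the largest
// child holds less than `minShare` of the current element's text.
function followLargestBranch(root, minShare = 0.5) {
  let current = root;
  for (;;) {
    const elements = Array.from(current.childNodes).filter((n) => n.nodeType === 1);
    if (elements.length === 0) return current; // only text nodes at this level
    let largest = elements[0];
    let largestLength = textLength(largest);
    for (const el of elements.slice(1)) {
      const length = textLength(el);
      if (length > largestLength) {
        largest = el;
        largestLength = length;
      }
    }
    if (largestLength < minShare * textLength(current)) return current;
    current = largest;
  }
}
```

In a browser the starting point would be followLargestBranch(document.body). It recomputes text lengths at every level, so it is not fast on big pages.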

Michal
A: 

When I started typing this answer, I was going to write that it is pretty simple. I was thinking about cloneNode(false). Then I thought about text nodes, then the normalize function, and then the case where the text nodes aren't adjacent.

Apart from recursing through the entire DOM, you will have to do the following for each element node (nodeType == 1):

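// Count only the text held directly by this element's own text-node children.
var elLength = 0; // an element's nodeValue is null, so start from zero
var children = thisEl.childNodes;
for (var i = 0; i < children.length; i++) {
    if (children[i].nodeType == 3) { // text node
        elLength += children[i].data.length;
    }
}

Then you'll have to remember the largest elLength and the corresponding element.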
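Putting the per-element count and the "remember the largest" step together might look like the sketch below. The function names are made up, and like the snippet above it counts every character in the direct text nodes, including whitespace between tags.

```javascript
// Length of the text held directly by `el`'s own text-node children.
function ownTextLength(el) {
  let length = 0;
  for (const child of Array.from(el.childNodes)) {
    if (child.nodeType === 3) length += child.data.length; // text node
  }
  return length;
}

// Visits every element under `root` and returns the one with the most
// direct text, together with that length.
function findLargestOwnText(root) {
  let best = { element: null, length: -1 };
  (function visit(el) {
    const length = ownTextLength(el);
    if (length > best.length) best = { element: el, length: length };
    for (const child of Array.from(el.childNodes)) {
      if (child.nodeType === 1) visit(child); // recurse into element nodes only
    }
  })(root);
  return best;
}
```

In a browser: findLargestOwnText(document.body).element.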

It's gonna be slow if your DOM is huge.

Code hasn't been tested... I wrote it just to give an example

Ravindra Sane