I'm looking for an algorithm (or some other technique) to read the actual content of news articles on websites and ignore anything else on the page. In a nutshell, I'm reading an RSS feed programmatically from Google News. I'm interested in scraping the actual content of the underlying articles. On my first attempt I took the URLs from the RSS feed, followed them, and scraped the HTML from each page. This very clearly resulted in a lot of "noise", whether it be HTML tags, headers, navigation, etc.: basically all the information that is unrelated to the actual content of the article.

Now, I understand this is an extremely difficult problem to solve; in theory it would involve writing a parser for every website out there. What I'm interested in is an algorithm (I'd even settle for an idea) on how to maximize the actual content I see when I download the article and minimize the amount of noise.

A couple of additional notes:

  • Scraping the HTML is simply the first attempt I tried. I'm not sold that this is the best way to do things.
  • I don't want to write a parser for every website I come across; I need to be able to handle whatever unpredictable mix of sites Google serves up through the RSS feed.
  • I know whatever algorithm I end up with is not going to be perfect, but I'm interested in the best possible solution.

Any ideas?

+3  A: 

As long as you've accepted the fact that whatever you try is going to be very sketchy based on your requirements, I'd recommend you look into Bayesian filtering. This technique has proven to be very effective in filtering spam out of email.

Bill the Lizard
Thank you, I actually considered this technique at one point, but I stumbled over how exactly I would train this. It's not that articles are good or bad, it's that certain text in the article is good or bad. If I went this route, I would have to classify acceptability on a word-by-word basis.
The Matt
You could train it based upon the content of the text inside the HTML elements. First extract all the paragraphs of text, then train it on whether each paragraph is part of the article or not. I'm not sure how well this would work -- you may find that articles contain certain sorts of words while the noise contains different sorts, or you may find they are too indistinguishable. You can also try to be a bit smart about stop words, whether the paragraph contains words related to the headline, the length of the text, and so on. (A sketch of this idea follows these comments.)
ICR
@The Matt: Thanks for clarifying. I wasn't sure from your question if you wanted to completely eliminate an article, or just certain paragraphs within an article. ICR's comment is right on. The only thing that I would add is that you may be able to eliminate poorly-placed ads with a "which of these things is not like the others" scan of the paragraphs within an article. Assuming the ads have dissimilar text to the content, you could de-highlight text that you suspect to be advertising. (This will stop being effective as ad-content of an article approaches 50%.)
Bill the Lizard
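To make the paragraph-level idea from these comments concrete, here is a minimal hand-rolled naive Bayes sketch (not a real library); the tokenisation, smoothing and the "article"/"noise" labels are all simplifications, and you would need a reasonable set of hand-labelled paragraphs to train it:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Bare-bones naive Bayes over hand-labelled paragraphs ("article" vs. "noise").
    class ParagraphClassifier
    {
        private readonly Dictionary<string, Dictionary<string, int>> wordCounts =
            new Dictionary<string, Dictionary<string, int>>();
        private readonly Dictionary<string, int> classCounts =
            new Dictionary<string, int>();

        private static IEnumerable<string> Tokenize(string text)
        {
            return text.ToLowerInvariant()
                       .Split(new[] { ' ', '\t', '\n', '.', ',', ';', ':' },
                              StringSplitOptions.RemoveEmptyEntries)
                       .Where(w => w.Length > 2);
        }

        public void Train(string paragraph, string label)
        {
            if (!classCounts.ContainsKey(label))
            {
                classCounts[label] = 0;
                wordCounts[label] = new Dictionary<string, int>();
            }
            classCounts[label]++;
            foreach (string w in Tokenize(paragraph))
            {
                int n;
                wordCounts[label].TryGetValue(w, out n);
                wordCounts[label][w] = n + 1;
            }
        }

        public string Classify(string paragraph)
        {
            double totalDocs = classCounts.Values.Sum();
            string best = null;
            double bestScore = double.NegativeInfinity;
            foreach (string label in classCounts.Keys)
            {
                double totalWords = wordCounts[label].Values.Sum();
                // log prior + log likelihoods with add-one smoothing
                double score = Math.Log(classCounts[label] / totalDocs);
                foreach (string w in Tokenize(paragraph))
                {
                    int n;
                    wordCounts[label].TryGetValue(w, out n);
                    score += Math.Log((n + 1) / (totalWords + wordCounts[label].Count));
                }
                if (score > bestScore) { bestScore = score; best = label; }
            }
            return best;
        }
    }

Train it on paragraphs you've labelled by hand from a handful of representative articles, then call Classify on each block of text you extract and keep only the ones classified as "article".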
+2  A: 

Take a look at templatemaker (Google code homepage). The basic idea is that you request a few different pages from the same site, then mark down what elements are common across the set of pages. From there you can figure out where the dynamic content is.

Try running diff on two pages from the same site to get an idea of how it works. The parts of the page that are different are the places where there is dynamic (interesting) content.
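As a very rough illustration of the same idea (this is not templatemaker itself, just a line-level comparison of two pages), you could throw away anything that appears verbatim in both pages and keep what's left:

    using System.Collections.Generic;
    using System.Linq;

    // Lines that occur in both pages are treated as template/boilerplate;
    // lines unique to one page are treated as the dynamic (interesting) content.
    static IEnumerable<string> DynamicContent(string pageA, string pageB)
    {
        var common = new HashSet<string>(pageB.Split('\n').Select(l => l.Trim()));
        return pageA.Split('\n')
                    .Select(l => l.Trim())
                    .Where(l => l.Length > 0 && !common.Contains(l));
    }

templatemaker is considerably smarter (it aligns at the character level and can learn from more than two pages), but the principle is the same.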

lost-theory
A: 

Here's my (probably naive) plan of how to approach this:

Assuming the RSS feed contains the opening words of the article, you could use these to locate the start of the article in the DOM. Walk back up the DOM a little (first parent DIV? first non-inline container element?) and snip. That should be the article.
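For example, with HtmlAgilityPack you might look for the text node containing the feed's opening words and take its nearest <div> ancestor; the helper name and the "nearest div" heuristic are just placeholders:

    using System.Linq;
    using HtmlAgilityPack;

    // Locate the element that probably wraps the article, given the opening
    // words taken from the RSS summary.
    static HtmlNode FindArticleContainer(string html, string openingWords)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // First text node that contains the feed's opening words.
        var hit = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .FirstOrDefault(n => n.InnerText.Contains(openingWords));

        // Walk back up to the first <div>, or fall back to the direct parent.
        return hit == null
            ? null
            : hit.Ancestors("div").FirstOrDefault() ?? hit.ParentNode;
    }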

Assuming you can get the document as XML (HtmlAgilityPack can help here), you could (for instance) grab all descendant text from <p> elements with the following Linq2Xml:

    // Concatenate the text of every <p> element in the (XHTML) document,
    // separating paragraphs with blank lines.
    var text = document
        // all <p> elements in the XHTML namespace
        .Descendants(XName.Get("p", "http://www.w3.org/1999/xhtml"))
        // the raw text nodes inside each paragraph
        .Select(p => p.DescendantNodes()
                      .Where(n => n.NodeType == XmlNodeType.Text)
                      .Select(t => t.ToString()))
        // drop paragraphs that contain no text at all
        .Where(c => c.Any())
        // stitch each paragraph's text nodes back together
        .Select(c => c.Aggregate((a, b) => a + b))
        // join the paragraphs with blank lines between them
        .Aggregate((a, b) => a + "\r\n\r\n" + b);

We successfully used this formula for scraping, but it seems like the terrain you have to cross is considerably more inhospitable.

spender
Ha, well put; "inhospitable" is pretty accurate. Your technique would work for predefined sites which I knew would actually obey the structure of <p>'s or <div>'s, but I need the flexibility to accommodate a much larger pool of page structures.
The Matt
Well, there will surely be noise <p> elements... They could even be at the top of the article, so not very good...
ilya n.
Sure, but coupled with a text match from the RSS summary, this could be applied to a specific part of the document identified from the match. YMMV
spender
+2  A: 

When reading news outside of my RSS reader, I often use Readability to filter out everything but the meat of the article. It is JavaScript-based, so the technique would not directly apply to your problem, but the algorithm has a high success rate in my experience and is worth a look. Hope this helps.

Chris Ballance
+1  A: 

Here's what I would do, after checking the robots.txt file to make sure it's fine to scrape the article and parsing the document as an XML tree:

  1. Make sure the article is not broken into many pages. If it is, 'print view', 'single page' or 'mobile view' links may help to bring it onto a single page. Of course, don't bother if you only want the beginning of the article.

  2. Find the main content frame. To do that, I would count the amount of information in every tag. What we're looking for is a node that is big but consists of many small subnodes (a rough sketch of this scoring follows the list).

  3. Now I would try to filter out any noise inside the content frame. Well, the websites I read don't put any crap there, only useful images, but you do need to kill anything that has inline javascript and any external links.

  4. Optionally, flatten that into plain text (that is, go into the tree and open all elements; block elements create a new paragraph).

  5. Guess the header. It's usually something with h1, h2 or at least big font size, but you can simplify life by assuming that it somehow resembles the page title.

  6. Finally, find the authors (something with names and email), the copyright notice (try metadata or the word copyright) and the site name. Assemble these somewhere together with the link to the original and state clearly that it's probably fair use (or whatever legal doctrine you feel applies to you).
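For step 2, here's a rough scoring sketch over the parsed XML tree; the length thresholds are arbitrary placeholders, not tuned values:

    using System.Linq;
    using System.Xml.Linq;

    // Guess the "main content frame": the element whose direct children carry
    // the most text spread over several medium-sized blocks.
    static XElement GuessContentFrame(XDocument document)
    {
        return document.Descendants()
            .OrderByDescending(e => e.Elements()
                .Select(child => child.Value.Length)       // text under each child
                .Where(len => len > 100 && len < 2000)     // "many small subnodes"
                .Sum())
            .FirstOrDefault();
    }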

ilya n.
A: 

First note that it could be very hard to do as well as some indexing systems, because sites may present web pages with only the content to be indexed when a crawler is detected. For example, in Google News the query "Scientists discover a fossil" inurl:dinosaur.discovered source:cnn does not return this article, even though I would expect it to, since the STORY HIGHLIGHTS of an article are relevant enough to be indexed. However, using just Google Search, the query "Scientists discover a fossil" inurl:dinosaur.discovered site:www.cnn.com does return the article. Hence, I would surmise that when the CNN web site detects the Google News crawler, it returns only the content CNN wants indexed.

Idea 1: On most sites I would expect advertising and navigational elements to be in branches of the DOM separate from the main content. So my first attempt would be to compute the total number of words (or even sentences) at each leaf in the DOM tree. Next, I would pass these weights up the tree until they all merge or exceed a threshold, and then treat all the text below that node (or nodes) as the article content (sketched after Idea 2 below). The drawback to this method is that advertising inserted into the flow of the content would be included.

Idea 2: Start by computing the weights of the leaves of the DOM as in Idea 1. Next, pass the weights only one or two levels up the tree. Compute the average weight of these nodes and choose only those that exceed a threshold.
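A rough sketch of Idea 1, assuming the page is already parsed into an XDocument; the 70% threshold is an arbitrary placeholder. Idea 2 would differ only in how far the weights are propagated and in keeping every node above the threshold rather than a single root:

    using System.Linq;
    using System.Xml.Linq;

    // Idea 1, roughly: weight every element by the number of words beneath it,
    // then descend from the root while a single child still holds most of the
    // weight. Where the weight spreads out, treat that element as the article.
    static XElement FindContentRoot(XDocument document, double threshold)
    {
        XElement node = document.Root;
        while (true)
        {
            int total = WordCount(node);
            var dominant = node.Elements()
                .FirstOrDefault(child => WordCount(child) > threshold * total);
            if (total == 0 || dominant == null)
                return node;        // the weight has spread out: stop here
            node = dominant;        // otherwise keep following the heavy branch
        }
    }

    static int WordCount(XElement e)
    {
        return e.Value.Split((char[])null, System.StringSplitOptions.RemoveEmptyEntries).Length;
    }

Calling FindContentRoot(doc, 0.7) descends as long as a single child holds more than 70% of the words and returns the element where that stops.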

Would love to know what methods you implement that work and/or don't work!

Jeff Kubina
A: 

Obviously not a whole solution, but instead of trying to find the relevant content, it might be easier to disqualify non-relevant content. You could classify certain types of noise and work on coming up with smaller solutions that eliminate each of them. You could have advertisement filters, navigation filters, etc.
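For instance, a simple "disqualifying" filter keyed off class and id names might look like the following; the keyword list is only a guess at common naming conventions and uses naive substring matching:

    using System.Linq;
    using HtmlAgilityPack;

    // Strip nodes whose class or id looks like navigation, advertising, comments, etc.
    // The keyword list is a guess; extend it as you meet new kinds of noise, and note
    // that plain substring matching will produce some false positives.
    static void StripNoise(HtmlDocument doc)
    {
        var noiseWords = new[] { "nav", "menu", "footer", "header", "sidebar",
                                 "advert", "sponsor", "promo", "comment" };
        var noisy = doc.DocumentNode.Descendants()
            .Where(n => noiseWords.Any(w =>
                (n.GetAttributeValue("class", "") + " " + n.GetAttributeValue("id", ""))
                    .ToLowerInvariant().Contains(w)))
            .ToList();
        foreach (var n in noisy)
            n.Remove();
    }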

I think the larger question is: do you need one solution to work on a wide range of content, or are you willing to create a framework that you can extend and implement on a site-by-site basis? On top of that, how often do you expect the underlying data sources to change (i.e., how volatile are they)?

joseph.ferris
Ideally I want one solution to work across the wide spectrum of sources. I want any change to underlying data sources to be accommodated gracefully by whatever method I come up with. While I don't expect frequent formatting changes to "existing" underlying data sources, I do need to take into account new and unexpected formats which could happen quite often. Since I'm using Google News as the aggregation method here, the possibilities for new formats could be quite high.
The Matt
A: 

You might want to look at Latent Dirichlet Allocation, an IR technique for generating topics from the text data you have. This should help you reduce noise and get some precise information on what the page is about.