Has anyone integrated BeautifulSoup with ASP.NET/C# (possibly using IronPython or otherwise)? Is there a BeautifulSoup alternative or a port that works nicely with ASP.NET/C#?

I plan to use the library to extract readable text from arbitrary URLs.

Thanks

A: 

Html Agility Pack (HAP) is a similar project, but for C# and .NET.


EDIT:

To extract all readable text:

doc.DocumentNode.InnerText

Note that this will also return the text content of <script> and <style> tags.

To fix that, you can remove all of the <script> and <style> tags, like this:

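// ToArray() copies the matches first so nodes can be removed while looping (needs using System.Linq).
foreach (var script in doc.DocumentNode.Descendants("script").ToArray())
    script.Remove();
foreach (var style in doc.DocumentNode.Descendants("style").ToArray())
    style.Remove();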

(Credit: SLaks)
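
Putting the pieces above together, here is a minimal, self-contained sketch. The class name ReadableText and the method Extract are hypothetical helpers made up for illustration, not part of HAP. Stripping comment nodes, decoding entities, and collapsing whitespace are extra steps I'm assuming you'd want. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

public static class ReadableText
{
    // Hypothetical helper (not a HAP API): returns the text of an HTML string
    // with <script>, <style> and comment nodes removed.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ToArray() snapshots the matches so that removing nodes
        // does not disturb the enumeration.
        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script"
                     || n.Name == "style"
                     || n.NodeType == HtmlNodeType.Comment)
            .ToArray();
        foreach (var node in unwanted)
            node.Remove();

        // Decode entities such as &amp; and then collapse runs of whitespace.
        var text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
        var words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }

    public static void Main()
    {
        const string html =
            "<html><head><style>p { color: red; }</style></head>" +
            "<body><p>Hello &amp; welcome</p><!-- hidden -->" +
            "<script>var x = 1;</script></body></html>";
        Console.WriteLine(Extract(html));
    }
}
```

To start from a URL instead of a string, new HtmlWeb().Load(url) returns an HtmlDocument that can be cleaned the same way. Like the snippet above, this only sees the markup as downloaded, not whatever scripts or CSS change afterwards.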

Colin Pickard
How would I use HAP to scrape readable text from an HTML page? In BeautifulSoup, it's very easy to do this.
I've updated my answer
Colin Pickard
Does DocumentNode.InnerText get all the text within the <body> tags? My worry is that I need to support this for URLs that do not follow any standard. There might be gunk all over. Is HAP smart enough to distinguish between readable text and irrelevant HTML tags, comments, and client scripts?
HAP is pretty smart at detecting what text will be output by a browser, but of course many sites these days make a lot of changes to the text visible in the final render with CSS, JavaScript and images. So really the only true way to determine what a person could read when the page is rendered by a browser would be to render it in a browser...
Colin Pickard