views: 479
answers: 4

I need an HTML scraper or a DOM editor. I know the question has been asked many times, and the answer is always the HTML Agility Pack, but it doesn't look good to me. I tried to remove a simple form element, but it removed only the <form> tag, left all the other tags inside it, and also left the </form> tag behind. I have used the PHP Simple HTML DOM Parser, and it works brilliantly; the only problem is that I need something on the client side. Are there any other options in C#, given that the HTML Agility Pack has essentially no documentation? I don't want an outdated library that is no longer maintained.

    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//form"))
        link.Remove();

Are there any other options? For example, how could I use Python's Beautiful Soup from a C# client? I am also looking for something with a simple jQuery-like syntax rather than XPath, like the PHP Simple HTML DOM Parser.

+2  A: 

If you don't trust HTML agility pack, then you can use HTML agility pack to "cast" your HTML into XML, and then attack the problem with LINQ to XML.

http://web.archive.org/web/20080719181517/http%3A//vijay.screamingpens.com/archive/2008/05/26/linq-amp-lambda-part-3-html-agility-pack-to-linq.aspx

From there, it should be easy to complete your requirements.

Here's an extension method to do just that, reproduced from the archived page above:

    using System.IO;
    using System.Xml.Linq;
    using HtmlAgilityPack;

    public static class HtmlDocumentExtensions
    {
        // Serializes the HtmlDocument as XML and re-parses it into an XDocument.
        public static XDocument ToXDocument(this HtmlDocument document)
        {
            using (StringWriter sw = new StringWriter())
            {
                document.OptionOutputAsXml = true;
                document.Save(sw);
                return XDocument.Parse(sw.GetStringBuilder().ToString());
            }
        }
    }
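
Putting it together, here is a minimal usage sketch (not from the original answer; the URL and class name are placeholders for illustration):

    using System.Linq;
    using System.Xml.Linq;
    using HtmlAgilityPack;

    class FormStripper
    {
        static void Main()
        {
            // Load the page with the Agility Pack, then hand it over to LINQ to XML.
            HtmlDocument doc = new HtmlWeb().Load("http://example.com/page.html");
            XDocument xml = doc.ToXDocument();

            // Removes every <form> element together with everything nested inside it.
            xml.Descendants("form").Remove();

            System.Console.WriteLine(xml);
        }
    }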
spender
If I don't trust the HTML Agility Pack, then I wouldn't use it to convert to XML either; I would use Tidy.NET instead. Parsing XML can always be done. What I am looking for is a simpler solution: a DOM parser that is efficient, reliable, has some sort of documentation, and does not force you to learn XPath.
Priyank Bolia
I'd suggest that your mistrust of the HTML Agility Pack is misplaced. Once you convert to XML, you're using standard .NET XML manipulation, which doesn't require a knowledge of XPath, is well documented, efficient, reliable, and about as simple as it gets. Using the technique above means you need to know very little about the HTML Agility Pack. It works. We use this technique to good effect.
spender
+1; see also http://stackoverflow.com/questions/1512562/parsing-html-page-with-htmlagilitypack/1512629#1512629
Ruben Bartelink
And http://stackoverflow.com/questions/1512562/parsing-html-page-with-htmlagilitypack/1512619#1512619
Ruben Bartelink
A: 

Html Agility Pack handles FORM in a special way; this is entirely by design, to comply as much as possible with HTML 3.2. In plain old HTML, overlapping tags are perfectly correct, but not in XML. We chose a compromise, but the behavior can be tuned using the ElementsFlags (look in the C# code).

Please consult the forums for help on this; for example: http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=53782
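
As an illustration (a minimal sketch, not from the original answer, assuming a standard Html Agility Pack build where HtmlNode.ElementsFlags is public): removing the "form" entry before parsing makes <form> keep its children, so Remove() then takes out the whole element:

    using HtmlAgilityPack;

    class FormFlagsExample
    {
        static void Main()
        {
            // By default "form" is flagged so that its children end up as siblings.
            // Dropping the entry makes it an ordinary container element.
            // This must run before the document is parsed.
            HtmlNode.ElementsFlags.Remove("form");

            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml("<div><form action='/x'><input name='q'/></form></div>");

            var forms = doc.DocumentNode.SelectNodes("//form");
            if (forms != null)
                foreach (HtmlNode form in forms)
                    form.Remove();   // now removes the element and everything inside it

            System.Console.WriteLine(doc.DocumentNode.OuterHtml);   // "<div></div>"
        }
    }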

I don't buy your argument. The first principle of any user-facing library is 'don't make me think': I expect things to work in the standard way, as other libraries do. Secondly, if I call link.Remove(), what good is HTML that is left with a stray </form> tag? The whole valid HTML is corrupted, just for the sake of compliance with old HTML behavior. Is this the design you are talking about?
Priyank Bolia
I had already given up on the HTML Agility Pack, so there is no point in returning to it. The question is about alternatives and about combining Beautiful Soup or other libraries with C#. I am seriously looking at writing Python scripts, executing them, and capturing the output, rather than fixing broken code.
Priyank Bolia
+1  A: 

The best option is to use BeautifulSoup and run the Python scripts via a shell execute. You might also use IronPython, etc., but shelling out to the Python scripts looks best to me. I was able to write a scraper in Python using BeautifulSoup, with no prior knowledge of Python, in less than an hour, which means it is easy, very efficient, and bug-free compared to the HTML Agility Pack.

The only problem is the 14 MB of additional requirements, but I prefer correctness and simplicity over download bandwidth.
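
A minimal sketch of that approach (the script name scrape.py, its argument, and python being on PATH are assumptions for illustration):

    using System.Diagnostics;

    class ScraperRunner
    {
        static void Main()
        {
            // Shell out to a BeautifulSoup-based script and capture whatever it prints.
            ProcessStartInfo psi = new ProcessStartInfo
            {
                FileName = "python",
                Arguments = "scrape.py http://example.com",
                RedirectStandardOutput = true,
                UseShellExecute = false,
                CreateNoWindow = true
            };

            using (Process process = Process.Start(psi))
            {
                string output = process.StandardOutput.ReadToEnd();
                process.WaitForExit();
                System.Console.WriteLine(output);
            }
        }
    }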

Priyank Bolia
A: 

The best way to scrape web pages is Scrapemark, and it's faster than BeautifulSoup.

minder