Can I use Html Agility Pack to make the output nicely indented, with unnecessary white space stripped?

+1  A: 

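A variation of this question was answered recently: http://stackoverflow.com/questions/2490765/which-is-the-best-html-tidy-pack-is-there-any-option-in-html-agility-pack-to-mak/2507673#2507673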

Basically, the outcome was that you can use HtmlAgilityPack to clean the markup up a bit with its fix-nested-tags option, but that will not indent or reformat the output.
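A minimal sketch of that Html Agility Pack clean-up, assuming the HtmlAgilityPack assembly is referenced. `HapCleanup` and `FixNesting` are hypothetical names used only for illustration; `OptionFixNestedTags` is the library's switch for partially repairing mis-nested tags. Note that this only re-serializes the markup; it does not indent it.

```csharp
using System;
using HtmlAgilityPack;

internal static class HapCleanup
{
    // Hypothetical helper: parses the HTML with nested-tag fixing enabled
    // and returns the re-serialized markup (not pretty-printed).
    public static string FixNesting(string html)
    {
        var doc = new HtmlDocument();

        // Set before loading so the parser applies it while reading the input.
        doc.OptionFixNestedTags = true;

        doc.LoadHtml(html);
        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling `HapCleanup.FixNesting(html)` returns the repaired markup as a string, still on whatever lines the input used.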

The best solution is to use Tidy, an application that was originally written by Dave Raggett at the W3C and then made open source. The W3C validator also offers it as a markup clean-up option.

This article covers how to use it, but you have to sign up (free) to view it: http://www.devx.com/dotnet/Article/20505/1763/

It seems like a legit article, but it's funny that nobody else seems to have covered this topic in the last six years...

rtpHarry
I read the post above (before asking, actually) and tried it out. While this fixes nested tags (and in my case it also broke things: the input was valid, but the output wasn't), it does not make the code look nice (and thus easy to debug). It seems that with Html Agility Pack my only option is to read the HTML in and then reconstruct the entire document, which seems a bit heavyweight to me. I'll have a look at the Tidy article once I can get myself to register at devx (why the hell do they require this? This is *so* 90s!)
Jan Limpens
Yeah, registration is strange. I initially passed on it, but before posting my answer I decided it was no good saying "this article probably works, I didn't sign up". Sign-up was quick and easy, though!
rtpHarry
+1  A: 

HAP is not going to give you the results you are after.

Try using a .NET wrapper for HTML Tidy, such as the Mark.Tidy library (linked in the code comment below):

using System;
using System.IO;
using System.Net;
using Mark.Tidy;

namespace CleanupHtml
{
    /// <summary>
    /// http://markbeaton.com/SoftwareInfo.aspx?ID=81a0ecd0-c41c-48da-8a39-f10c8aa3f931
    /// </summary>
    internal class Program
    {
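        private static void Main(string[] args)
        {
            string html;

            // Download the page to tidy; dispose the client when done.
            using (WebClient client = new WebClient())
            {
                html = client.DownloadString(
                    "http://stackoverflow.com/questions/2593147/html-agility-pack-make-code-look-neat/2610903#2610903");
            }

            using (Document doc = new Document(html))
            {
                // Suppress Tidy's diagnostics.
                doc.ShowWarnings = false;
                doc.Quiet = true;

                // Emit well-formed XHTML.
                doc.OutputXhtml = true;
                doc.OutputXml = true;

                // Indentation and wrapping settings: block elements and CDATA on,
                // attributes and extra vertical space off, lines wrapped at 120 columns.
                doc.IndentBlockElements = AutoBool.Yes;
                doc.IndentAttributes = false;
                doc.IndentCdata = true;
                doc.AddVerticalSpace = false;
                doc.WrapAt = 120;

                // Parse and repair the markup, then serialize it.
                doc.CleanAndRepair();

                string output = doc.Save();
                Console.WriteLine(output);
                File.WriteAllText("output.htm", output);
            }
        }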
    }
}

Results:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="generator" content="HTML Tidy for Windows (vers 14 October 2008), see www.w3.org" />
    <title>
      Html Agility Pack: make code look neat - Stack Overflow
    </title>
    <link rel="stylesheet" href="http://sstatic.net/so/all.css?v=6638" type="text/css" />
    <link rel="shortcut icon" href="http://sstatic.net/so/favicon.ico" />
    <link rel="apple-touch-icon" href="http://sstatic.net/so/apple-touch-icon.png" />
    <link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href=
    "http://sstatic.net/so/opensearch.xml" />
    <script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js">
</script>
    <script type="text/javascript" src="http://sstatic.net/so/js/master.js?v=6523">
</script>
    <script type="text/javascript">
//<![CDATA[
    var imagePath='http://sstatic.net/so/img/';

    //]]>
    </script>
    <link rel="canonical" href="http://stackoverflow.com/questions/2593147/html-agility-pack-make-code-look-neat" />
    <link rel="alternate" type="application/atom+xml" title=
    "Feed for question 'Html Agility Pack: make code look neat'" href="/feeds/question/2593147" />
    <script src="http://sstatic.net/so/js/question.js?v=6714" type="text/javascript">
</script>
    <script type="text/javascript">
//<![CDATA[
        var fkey = "b00609a1a5f2966a687eca3f84e4dd64";



        $(function() {

            vote.init(2593147);
            comments.init();
            styleCode();

        });
    //]]>
    </script>
  </head>
  <body>
    <noscript>
    <div id="noscript-padding"></div></noscript>
    <div id="notify-container"></div><script type="text/javascript">
//<![CDATA[
        $(function() { notify.showFirstTime(); });
    //]]>
    </script>
    <div class="container">
      <div id="header">
        <div id="topbar">
          <div id="hlinks">
            <a href=
            "/users/login?returnurl=%2fquestions%2f2593147%2fhtml-agility-pack-make-code-look-neat%2f2610903">login</a>
            <span class="lsep">|</span> <a href="http://careers.stackoverflow.com/">careers</a> <span class=
            "lsep">|</span> <a href="/about">about</a> <span class="lsep">|</span> <a href="/faq">faq</a>
          </div>
          <div id="hsearch">
            <form id="search" action="/search" method="get" name="search">
              <div>
                <input name="q" class="textbox" tabindex="1" onfocus="if (this.value=='search') this.value = ''" type=
                "text" maxlength="80" size="28" value="search" />
              </div>
            </form>
          </div>
        </div><br class="cbt" />
        <div id="hlogo">
          <a href="/"><img src="http://sstatic.net/so/img/logo.png" width="250" height="61" alt="Stack Overflow" /></a>
        </div>
        <div id="hmenus">
          <div class="nav">
            <ul>
              <li class="youarehere">
                <a href="/questions">Questions</a>
              </li>
              <li>
                <a href="/tags">Tags</a>
              </li>
              <li>
                <a href="/users">Users</a>
              </li>
              <li>
                <a href="/badges">Badges</a>
              </li>
              <li>
                <a href="/unanswered">Unanswered</a>
              </li>
            </ul>
          </div>
          <div class="nav" style="float:right">
            <ul>
              <li style="margin-right:0px">
                <a href="/questions/ask">Ask Question</a>
              </li>
            </ul>
          </div>
        </div>
      </div>
      <div id="content">
        <div id="question-header">
          <h2>
            <a href="/questions/2593147/html-agility-pack-make-code-look-neat" class="question-hyperlink">Html Agility
            Pack: make code look neat</a>
          </h2>
        </div>
        <div id="mainbar">
          <div id="question" class="">
            <div class="everyonelovesstackoverflow">
              <script type="text/javascript">
//<![CDATA[
              document.write('<s'+'cript lang' + 'uage="jav' + 'ascript" src="http://ads.stackoverflow.com/a.aspx?ZoneID=3&amp;amp;Task=Get&amp;amp;IFR=False&amp;amp;PageID=52405&amp;amp;SiteID=1&amp;amp;Random=' + (+new Date()) + '&amp;Keywords=htmlagilitypack">'); 
              document.write('</'+'scr'+'ipt>');
              //]]>
              </script> <noscript>
              <div>
                <a href=
                "http://ads.stackoverflow.com/a.aspx?ZoneID=3&amp;amp;Task=Click&amp;amp;Mode=HTML&amp;amp;SiteID=1&amp;amp;PageID=52405">
                <img src=
                "http://ads.stackoverflow.com/a.aspx?ZoneID=3&amp;amp;Task=Get&amp;amp;Mode=HTML&amp;amp;SiteID=1&amp;amp;PageID=52405"
                alt="" /></a>
              </div></noscript>
            </div>
            <table>
              <tr>
                <td class="votecell">
                  <div class="vote">
                    <input type="hidden" value="2593147" /> <img class="vote-up" src=
                    "http://sstatic.net/so/img/vote-arrow-up.png" width="40" height="25" alt="vote up" title=
                    "This question is useful and clear (click again to undo)" /> <span class="vote-count-post">1</span>
                    <img class="vote-down" src="http://sstatic.net/so/img/vote-arrow-down.png" width="40" height="25"
                    alt="vote down" title="This question is unclear or not useful (click again to undo)" /> <img class=
                    "vote-favorite" src="http://sstatic.net/so/img/vote-favorite-off.png" width="32" height="31" alt=
                    "star" title="This is a favorite question (click again to undo)" />
                    <div class="favoritecount"></div>
                  </div>
                </td>
                <td>
                  <div>
                    <div class="post-text">
                      <p>
                        Can I use Html Agility Pack to make the output look nicely indented, unnecessary white space
                        stripped?
                      </p>
                    </div>
                    <div class="post-taglist">
                      <a href="/questions/tagged/htmlagilitypack" class="post-tag" title=
                      "show questions tagged 'htmlagilitypack'" rel="tag">htmlagilitypack</a>
                    </div>
                    <table class="fw">
                      <tr>
                        <td class="vt">
                          <div class="post-menu">
                            <a id="flag-post-2593147" title="flag this post for serious problems" name=
                            "flag-post-2593147">flag</a>
                          </div>
                        </td>
                        <td class="post-signature owner">
                          <div class="user-info">
                            <div class="user-action-time">
                              asked <span title="2010-04-07 14:13:47Z" class="relativetime">2 days ago</span>
                            </div>
                            <div class="user-gravatar32">
                              <a href="/users/51795/illdev"><img src=
                              "http://www.gravatar.com/avatar/52dc0db2cdacc6e9769d074a37466317?s=32&amp;amp;d=identicon&amp;amp;r=PG"
                              height="32" width="32" alt="" /></a>
                            </div>
                            <div class="user-details">
                              <a href="/users/51795/illdev">illdev</a><br />
                              <span class="reputation-score" title="reputation score">53</span><span title=
                              "5 bronze badges"><span class="badge3">&#9679;</span><span class=
                              "badgecount">5</span></span>
                            </div>
                          </div><br class="cbt" />
                          <div class="accept-rate cool" title=
                          "this user has accepted an answer for 2 of 4 eligible questions">
                            50% accept rate
                          </div>
                        </td>
                      </tr>
                    </table>
                  </div>
                </td>
              </tr>
              <tr>
                <td class="votecell"></td>
                <td>
                  <div id="comments-2593147" class="comments">
                    <table>
                      <tbody>
                        <tr id="comment-2600849" class="comment">
                          <td></td>
                          <td class="comment-text">
                            <div>
                              what output? From where? some more details perhaps? &ndash;&nbsp;<a href=
                              "/users/97614/sam-holder" title="1868" class="comment-user">Sam Holder</a> <span class=
                              "comment-date"><span title="2010-04-07 14:16:41Z">2 days ago</span></span>
                            </div>
                          </td>
                        </tr>
                        <tr id="comment-2600851" class="comment">
                          <td></td>
                          <td class="comment-text">
                            <div>
                              <i>(reference)</i> <a href="http://htmlagilitypack.codeplex.com/Wikipage" rel=
                              "nofollow">htmlagilitypack.codeplex.com/Wikipage</a> &ndash;&nbsp;<a href=
                              "/users/208809/gordon" title="16497" class="comment-user">Gordon</a> <span class=
                              "comment-date"><span title="2010-04-07 14:16:55Z">2 days ago</span></span>
                            </div>
                          </td>
                        </tr>
                        <tr id="comment-2624419" class="comment">
                          <td></td>
                          <td class="comment-text">
                            <div>
                              output = html code output &ndash;&nbsp;<a href="/users/51795/illdev" title="53" class=
                              "comment-user owner">illdev</a> <span class="comment-date"><span title=
                              "2010-04-10 13:14:42Z">12 secs ago</span></span>
                            </div>
                          </td>
                        </tr>
                      </tbody>
                    </table>
                  </div>
                </td>
              </tr>
            </table>
          </div>
          <div id="answers">
            <a name="tab-top" id="tab-top"></a>
            <div id="answers-header">
              <div id="subheader">
                <h2>
                  2 Answers
                </h2>
                <div id="tabs">
                  <a href="/questions/2593147?tab=oldest#tab-top" title=
                  "Answers in the order they were given">oldest</a> <a href="/questions/2593147?tab=newest#tab-top"
                  title="Most recent answers first">newest</a> <a class="youarehere" href=
                  "/questions/2593147?tab=votes#tab-top" title="Answers with the most votes first">votes</a>
                </div>
              </div>
            </div><a name="2610845"></a>
            <div id="answer-2610845" class="answer">
              <table>
                <tr>
                  <td class="votecell">
                    <div class="vote">
                      <input type="hidden" value="2610845" /> <img class="vote-up" src=
                      "http://sstatic.net/so/img/vote-arrow-up.png" width="40" height="25" alt="vote up" title=
                      "This answer is useful (click again to undo)" /> <span class="vote-count-post">0</span>
                      <img class="vote-down" src="http://sstatic.net/so/img/vote-arrow-down.png" width="40" height="25"
                      alt="vote down" title="This answer is not useful (click again to undo)" />
                    </div>
                  </td>
                  <td>
                    <div class="post-text">
                      <p>
                        A variation of this question has been answered recently
                      </p>
                      <ul>
                        <li>
                          <a href=
                          "http://stackoverflow.com/questions/2490765/which-is-the-best-html-tidy-pack-is-there-any-option-in-html-agility-pack-to-mak/2507673#2507673">
                          http://stackoverflow.com/questions/2490765/which-is-the-best-html-tidy-pack-is-there-any-option-in-html-agility-pack-to-mak/2507673#2507673</a>
                        </li>
                      </ul>
                      <p>
                        Basically the outcome of this was that while you <strong>can</strong> use HtmlAgilityPack to
                        clean it up a bit by using the fix nested tags.
                      </p>
                      <p>
                        The best solution is to use something called Tidy which is an application that was originally
                        created by some developers at w3c and then made open source. Its the engine that powers the w3c
                        validator as well.
                      </p>
                      <p>
                        This article covers how to use it but you had to sign up (free) to view it:
                      </p>
                      <ul>
                        <li>
                          <a href="http://www.devx.com/dotnet/Article/20505/1763/" rel=
                          "nofollow">http://www.devx.com/dotnet/Article/20505/1763/</a>
                        </li>
                      </ul>
                      <p>
                        It seems like a legit article but its funny because nobody else seems to have covered this
                        topic in the last six years...
                      </p>
                    </div>
                    <table class="fw">
                      <tr>
                        <td class="vt">
                          <div class="post-menu">
                            <a href="/questions/2593147/html-agility-pack-make-code-look-neat/2610845#2610845" title=
                            "permalink to this answer">link</a><span class="lsep">|</span><a id="flag-post-2610845"
                            title="flag this post for serious problems" name="flag-post-2610845">flag</a>
                          </div>
                        </td>
                        <td align="right" class="post-signature">
                          <div class="user-info">
                            <div class="user-action-time">
                              answered <span title="2010-04-09 20:55:18Z" class="relativetime">16 hours ago</span>
                            </div>
                            <div class="user-gravatar32">
                              <a href="/users/156388/rtpharry"><img src=
                              "http://www.gravatar.com/avatar/6811db2b37e824fdf6c5c4fcdddd4146?s=32&amp;amp;d=identicon&amp;amp;r=PG"
                              height="32" width="32" alt="" /></a>
                            </div>
                            <div class="user-details">
                              <a href="/users/156388/rtpharry">rtpHarry</a><br />
                              <span class="reputation-score" title="reputation score">88</span><span title=
                              "6 bronze badges"><span class="badge3">&#9679;</span><span class=
                              "badgecount">6</span></span>
                            </div>
                          </div>
                        </td>
                      </tr>
                    </table>
                  </td>
                </tr>
                <tr>
                  <td class="votecell"></td>
                  <td>
                    <div id="comments-2610845" class="comments dno">
                      <table>
                        <tbody>
                          <tr>
                            <td></td>
                            <td></td>
                          </tr>
                        </tbody>
                      </table>
                    </div>
                  </td>
                </tr>
              </table>
            </div>
            <div class="everyonelovesstackoverflow">
              <script type="text/javascript">
//<![CDATA[
              document.write('<s'+'cript lang' + 'uage="jav' + 'ascript" src="http://ads.stackoverflow.com/a.aspx?ZoneID=14&amp;amp;Task=Get&amp;amp;IFR=False&amp;amp;PageID=52405&amp;amp;SiteID=1&amp;amp;Random=' + (+new Date()) + '&amp;Keywords=htmlagilitypack">'); 
              document.write('</'+'scr'+'ipt>');
              //]]>
              </script> <noscript>
              <div>
                <a href=
                "http://ads.stackoverflow.com/a.aspx?ZoneID=14&amp;amp;Task=Click&amp;amp;Mode=HTML&amp;amp;SiteID=1&amp;amp;PageID=52405">
                <img src=
                "http://ads.stackoverflow.com/a.aspx?ZoneID=14&amp;amp;Task=Get&amp;amp;Mode=HTML&amp;amp;SiteID=1&amp;amp;PageID=52405"
                alt="" /></a>
              </div></noscript>
            </div><a name="2610903"></a>
            <div id="answer-2610903" class="answer">
              <table>
                <tr>
                  <td class="votecell">
                    <div class="vote">
                      <input type="hidden" value="2610903" /> <img class="vote-up" src=
                      "http://sstatic.net/so/img/vote-arrow-up.png" width="40" height="25" alt="vote up" title=
                      "This answer is useful (click again to undo)" /> <span class="vote-count-post">0</span>
                      <img class="vote-down" src="http://sstatic.net/so/img/vote-arrow-down.png" width="40" height="25"
                      alt="vote down" title="This answer is not useful (click again to undo)" />
                    </div>
                  </td>
                  <td>
                    <div class="post-text">
                      <p>
                        Output as XHTML and run that through an <a href=
                        "http://msdn.microsoft.com/en-us/library/system.xml.xmltextwriter.indentation.aspx" rel=
                        "nofollow">XmlTextWriter</a>
                      </p>
                    </div>
                    <table class="fw">
                      <tr>
                        <td class="vt">
                          <div class="post-menu">
                            <a href="/questions/2593147/html-agility-pack-make-code-look-neat/2610903#2610903" title=
                            "permalink to this answer">link</a><span class="lsep">|</span><a id="flag-post-2610903"
                            title="flag this post for serious problems" name="flag-post-2610903">flag</a>
                          </div>
                        </td>
                        <td align="right" class="post-signature">
                          <div class="user-info">
                            <div class="user-action-time">
                              answered <span title="2010-04-09 21:02:34Z" class="relativetime">16 hours ago</span>
                            </div>
                            <div class="user-gravatar32">
                              <a href="/users/242897/sky-sanders"><img src=
                              "http://www.gravatar.com/avatar/df4a7fbd8a054fd6193ca0ee62952f1f?s=32&amp;amp;d=identicon&amp;amp;r=PG"
                              height="32" width="32" alt="" /></a>
                            </div>
                            <div class="user-details">
                              <a href="/users/242897/sky-sanders">Sky Sanders</a><br />
                              <span class="reputation-score" title="reputation score">4,014</span><span title=
                              "2 silver badges"><span class="badge2">&#9679;</span><span class=
                              "badgecount">2</span></span><span title="14 bronze badges"><span class=
                              "badge3">&#9679;</span><span class="badgecount">14</span></span>
                            </div>
                          </div>
                        </td>
                      </tr>
                    </table>
                  </td>
                </tr>
                <tr>
                  <td class="votecell"></td>
                  <td>
                    <div id="comments-2610903" class="comments dno">
                      <table>
                        <tbody>
                          <tr>
                            <td></td>
                            <td></td>
                          </tr>
                        </tbody>
                      </table>
                    </div>
                  </td>
                </tr>
              </table>
            </div>
            <form id="post-form" action="/questions/2593147/answer/submit" method="post" name="post-form">
              <h2 class="space">
                Your Answer
              </h2><script src="http://sstatic.net/so/Js/wmd.js?v=6016" type="text/javascript">
</script> <script type="text/javascript">
//<![CDATA[
              $(function() { 
              editorReady(1, heartbeat.answers);
              });   
              //]]>
              </script>
              <div id="post-editor">
                <div id="wmd-container">
                  <div id="wmd-button-bar"></div>
                  <textarea id="wmd-input" name="post-text" cols="92" rows="15" tabindex="101">
</textarea>
                </div>
                <div class="community-option">
                  <input id="communitymode" name="communitymode" type="checkbox" /> <label for="communitymode" title=
                  "community owned posts do not generate any reputation for the owner, have a lower reputation barrier for collaborative editing, and show only a revision history instead of a signature block">
                  community wiki</label>
                </div>
                <div id="wmd-preview"></div>
                <div id="edit-block">
                  <input id="fkey" name="fkey" type="hidden" value="b00609a1a5f2966a687eca3f84e4dd64" /> <input id=
                  "author" name="author" type="text" />
                </div>
              </div>
              <div class="form-item">
                <table>
                  <tr>
                    <td class="vm">
                      <label for="openid_identifier">OpenID Login</label> <input id="openid_identifier" name=
                      "openid_identifier" class="openid-identifer" type="text" size="40" maxlength="200" value=""
                      tabindex="104" />
                      <div class="form-item-info">
                        Get an <a href="http://openid.net/get/" target="_blank">OpenID</a>
                      </div>
                    </td>
                    <td class="orcell">
                      <div class="orword">
                        or
                      </div>
                      <div class="orline"></div>
                    </td>
                    <td class="vm">
                      <div>
                        <label for="display-name">Name</label> <input id="display-name" name="display-name" type="text"
                        size="30" maxlength="30" value="" tabindex="105" />
                      </div>
                      <div>
                        <label for="m-address">Email</label> <input id="m-address" name="m-address" type="text" size=
                        "40" maxlength="100" value="" tabindex="106" /> <span class="edit-field-overlay" style=
                        "color:#999; font-weight:normal">never shown</span>
                      </div>
                      <div>
                        <label for="home-page">Home Page</label> <input id="home-page" name="home-page" type="text"
                        size="40" maxlength="200" value="" tabindex="107" />
                      </div>
                    </td>
                  </tr>
                </table>
              </div>
              <div class="form-submit cbt">
                <input id="submit-button" type="submit" value="Post Your Answer" tabindex="110" />
              </div>
            </form>
            <h2 class="space">
              Not the answer you're looking for? Browse other questions tagged <a href=
              "/questions/tagged/htmlagilitypack" class="post-tag" title="show questions tagged 'htmlagilitypack'" rel=
              "tag">htmlagilitypack</a> or <a href="/questions/ask">ask your own question</a>.
            </h2>
          </div><img src="/posts/2593147/ivc/1707" class="dno" alt="" />
        </div>
Sky Sanders
I cannot be sure that the passed-in HTML is 100% valid XML.
Jan Limpens
@illdev - That's OK, I edited my answer.
Sky Sanders
Looks cool! Is that from the devx article? Otherwise, what is the Document class?
Jan Limpens
@illdev - No, not that I know of; it is just a lib that I have used in the past. Document is part of the Mark.Tidy lib, which is the Tidy wrapper.
Sky Sanders
Didn't know that one, looks nice! Did you use it somewhere more extensively - is it heavy?
Jan Limpens
I have used Raggett's HTML Tidy for many years, which is a C lib; this is just a managed wrapper. All in all I would say that it is fast, lean and will probably get you where you need to go.
Sky Sanders
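For input that is already well-formed XHTML, the XmlTextWriter suggestion quoted in the output above can be sketched roughly as follows. This is only an illustration using the standard System.Xml classes; `XhtmlIndenter` and `Indent` are hypothetical names, and the sketch assumes a fragment without a DOCTYPE or named HTML entities such as `&nbsp;`. As noted in the comments, it throws on markup that is not valid XML, which is why Tidy is the safer choice for arbitrary HTML.

```csharp
using System;
using System.IO;
using System.Xml;

internal static class XhtmlIndenter
{
    // Hypothetical helper: re-serializes well-formed XHTML with two-space
    // indentation. Throws XmlException if the input is not well-formed XML.
    public static string Indent(string xhtml)
    {
        var doc = new XmlDocument();
        doc.XmlResolver = null;          // do not fetch external DTDs
        doc.LoadXml(xhtml);

        using (var buffer = new StringWriter())
        {
            using (var writer = new XmlTextWriter(buffer))
            {
                writer.Formatting = Formatting.Indented;
                writer.Indentation = 2;
                doc.WriteTo(writer);     // write the nodes through the indenting writer
            }
            return buffer.ToString();
        }
    }

    private static void Main()
    {
        Console.WriteLine(Indent("<div><p>Hello</p><p>World</p></div>"));

        try
        {
            // An unclosed <br> is fine in HTML but not in XML.
            Indent("<p>unclosed<br></p>");
        }
        catch (XmlException ex)
        {
            Console.WriteLine("Not well-formed: " + ex.Message);
        }
    }
}
```

The second call shows the limitation raised in the comments: anything that is not strictly well-formed is rejected before any formatting happens.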