I have HTML content that users enter via a rich-text editor, so it can be almost anything (except elements that don't belong inside the body tag; no need to worry about "head", the doctype, etc.). An example of this content:

<h1>Header 1</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right><a href="x">A link here</a></div><hr />
<h1>Header 2</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right><a href="x">A link here</a></div><hr />

The trick is that I need to extract only the first 100 characters of the text (with HTML tags stripped). I also need to retain the line breaks and avoid breaking any word.

So the output for the above will be something like:

Header 1
Some text here

Some more text here

A link here

Header 2
Some text here

Some

It has 98 characters and the line breaks are retained. What I have achieved so far is stripping all the HTML tags using Regex:

Regex.Replace(htmlStr, "<[^>]*>", "")

Then trim the length using Regex as well with:

Regex.Match(textStr, @"^.{1,100}\b").Value

My problem is how to retain the line breaks. I get output like:

Header 1
Some text hereSome more text here
A link here
Header 2
Some text hereSome more text

Notice the joined sentences? Perhaps someone can show me some other ways of solving this problem. Thanks!

Additional info: My purpose is to generate a plain-text synopsis from a bunch of HTML content. I guess this will help clarify the problem.

+1  A: 

For info, stripping HTML with a regex is... full of subtle problems. The HTML Agility Pack may be more robust, but it still suffers from the words bleeding together:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.InnerText;
Marc Gravell
I've tried the Agility Pack. I'm not too worried about stripping the HTML tags, as the content and layout are not too fancy. As you said, the words still bleed together.
o.k.w
A: 

One way could be to strip html in three steps:

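// Regex.Replace returns a new string, so each step must work on the previous result
string text = Regex.Replace(htmlStr, "<[^/>]*>", ""); // strip opening tags, keep closing tags (</...>)
text = Regex.Replace(text, "</p>", "\r\n");           // replace every paragraph end with a new line
text = Regex.Replace(text, "<[^>]*>", "");            // strip the remaining tags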
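
The same idea can be stretched to cover other block elements and to avoid doubled breaks. What follows is only a minimal sketch: the class name HtmlText, the method name StripKeepingBreaks, and the list of block tags are assumptions made for illustration, and it collapses every run of breaks into a single one, which may or may not be the spacing you want.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Assumed list of block-level tags; extend it to match your content.
    private const string BlockTags = "p|div|h[1-6]";

    public static string StripKeepingBreaks(string html)
    {
        // Drop line breaks already present in the markup so they are not doubled.
        string text = Regex.Replace(html, @"\r?\n", "");

        // Turn closing block tags and <hr>/<br> into a line break.
        text = Regex.Replace(
            text,
            @"</(" + BlockTags + @")\s*>|<(hr|br)\b[^>]*>",
            "\n",
            RegexOptions.IgnoreCase);

        // Strip every remaining tag.
        text = Regex.Replace(text, "<[^>]*>", "");

        // Collapse runs of line breaks into one and trim the ends.
        text = Regex.Replace(text, @"\n{2,}", "\n");
        return text.Trim();
    }
}
```

Calling HtmlText.StripKeepingBreaks on one header/paragraph/div group from the question should give one line per block element, with the two paragraphs no longer joined.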
Arun Mahapatra
If the paragraph tag has a trailing line break, I'll have to make sure no additional break is introduced. I'll also have to take care of any block elements like DIV, HR, etc. The list goes on and on.
o.k.w
+2  A: 

I think how I would solve this is to look at it as though it were a simple browser. Create a base Tag class, make it abstract with maybe an InnerHTML property and a virtual method PrintElement.

Next, create classes for each HTML tag that you care about and inherit from your base class. Judging from your example, the tags you care most about are h1, p, a, and hr. Implement the PrintElement method such that it returns a string that prints out the element properly based on the InnerHTML (such as the p class' PrintElement would return "\n[InnerHTML]\n").

Next, build a parser that will parse through your HTML and determine which object to create and then add those objects to a queue (a tree would be better, but doesn't look like it's necessary for your purposes).

Finally, go through your queue calling the PrintElement method for each element.

It may be more work than you had planned, but it's a far more robust solution than simply using regex, and should you decide to change your mind in the future and want to show simple styling, it's just a matter of going back and modifying your PrintElement methods.
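
A rough sketch of that design might look like the following. Everything beyond Tag, InnerHTML and PrintElement is an assumption made for illustration: the concrete class names, the SimpleParser helper, and above all the "parser", which is just a naive regex that only recognises non-nested h1, p, div and hr elements, treats a nested "a" as inline text, and silently drops any text outside those elements. A real implementation would need a proper tokenizer.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Base class: each tag knows how to print itself as plain text.
public abstract class Tag
{
    public string InnerHTML { get; set; } = "";

    // Inner markup with nested inline tags (such as <a>) stripped.
    protected string InnerText => Regex.Replace(InnerHTML, "<[^>]*>", "");

    public virtual string PrintElement() => InnerText;
}

public class HeaderTag : Tag
{
    public override string PrintElement() => InnerText + "\n";
}

public class ParagraphTag : Tag
{
    public override string PrintElement() => "\n" + InnerText + "\n";
}

public class DivTag : Tag
{
    public override string PrintElement() => InnerText + "\n";
}

public class RuleTag : Tag
{
    public override string PrintElement() => "\n";
}

public static class SimpleParser
{
    // Naive stand-in for a parser: non-nested <h1>, <p>, <div> elements and <hr>.
    private static readonly Regex Element = new Regex(
        @"<(?<name>h1|p|div)\b[^>]*>(?<inner>.*?)</\k<name>\s*>|<(?<hr>hr)\b[^>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static Queue<Tag> Parse(string html)
    {
        var queue = new Queue<Tag>();
        foreach (Match m in Element.Matches(html))
        {
            Tag tag;
            if (m.Groups["hr"].Success)
            {
                tag = new RuleTag();
            }
            else
            {
                switch (m.Groups["name"].Value.ToLowerInvariant())
                {
                    case "h1": tag = new HeaderTag(); break;
                    case "p": tag = new ParagraphTag(); break;
                    default: tag = new DivTag(); break;
                }
            }
            tag.InnerHTML = m.Groups["inner"].Value;
            queue.Enqueue(tag);
        }
        return queue;
    }

    // Walk the queue and concatenate what each element prints.
    public static string Render(string html)
    {
        var sb = new StringBuilder();
        foreach (Tag tag in Parse(html))
        {
            sb.Append(tag.PrintElement());
        }
        return sb.ToString();
    }
}
```

The spacing each class emits is a per-tag decision, so matching what a browser would render is a matter of adjusting the individual PrintElement overrides rather than the parsing code.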

Phairoh
That's probably a better solution - if you treat p and div tags as they should be (block level elements), then replacing with new lines should work quite nicely.
Zhaph - Ben Duguid
Wow, definitely much more work than I would have allocated. As mentioned earlier, my ultimate purpose is to extract the top X number of characters and display them as plain text without breaking any word, with line breaks matching how the HTML content would have been rendered in the browser. But thanks, Phairoh, for coming up with something I wouldn't have thought of. +1 : )
o.k.w
A: 

Well, I need to close this even though I don't have the ideal solution. Since the HTML tags used in my app are very common ones (no tables, lists, etc.) with little or no nesting, what I did was preformat the HTML fragments before saving them after user input:

  • Remove all line breaks
  • Add a line break prefix to all block tags (e.g. div, p, hr, h1/2/3/4 etc)

Before I extract them to be displayed as plain text, I use regex to remove the HTML tags and retain the line breaks. Hardly rocket science, but it works for me.
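
For anyone wanting to try the same thing, the two steps could be sketched roughly as below. The names Synopsis, Preformat and Extract are made up for this example, and the block-tag list is an assumption. One detail worth noting: the truncating regex needs RegexOptions.Singleline, because otherwise "." does not match a line break and the match stops at the end of the first line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Synopsis
{
    // Step 1 (before saving): remove existing line breaks,
    // then prefix every block tag with a line break.
    public static string Preformat(string html)
    {
        string flat = Regex.Replace(html, @"\r?\n", "");
        string marked = Regex.Replace(
            flat, @"<(p|div|hr|h[1-6])\b", "\n<$1", RegexOptions.IgnoreCase);
        return marked.TrimStart('\n');
    }

    // Step 2 (before display): strip the tags, keep the line breaks,
    // and cut at a word boundary within maxLength characters.
    public static string Extract(string preformatted, int maxLength)
    {
        string text = Regex.Replace(preformatted, "<[^>]*>", "").Trim();
        if (text.Length <= maxLength)
        {
            return text;
        }

        // Singleline lets '.' match '\n' as well, so the match can span lines.
        string cut = Regex.Match(
            text, @"^.{1," + maxLength + @"}\b", RegexOptions.Singleline).Value;
        return cut.TrimEnd();
    }
}
```

For example, Synopsis.Extract(Synopsis.Preformat(html), 100) would produce the synopsis; this sketch emits a single break per block element, so the blank lines shown in the question's desired output would need extra handling.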

o.k.w