I have an HTML content which is entered by user via a richtext editor so it can be almost anything (less those not supposed to be outside the body tag, no worries about "head" or doctype etc). An example of this content:
<h1>Header 1</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right><a href="x">A link here</a></div><hr />
<h1>Header 2</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right><a href="x">A link here</a></div><hr />
The trick is, I need to extract first 100 characters of the text only (HTML tags stripped). I also need to retain the line breaks and not break any word.
So the output for the above will be something like:
Header 1 Some text here Some more text here A link here Header 2 Some text here Some
It has 98 characters and line breaks are retained. What I can achieve so far is to strip the all HTML tags using Regex:
Regex.Replace(htmlStr, "<[^>]*>", "")
Then trim the length using Regex as well with:
Regex.Match(textStr, @"^.{1,100}\b").Value
My problem is, how to retaining the line break?. I get an output like:
Header 1 Some text hereSome more text here A link here Header 2 Some text hereSome more text
Notice the joining sentences? Perhaps someone can show me some other ways of solving this problem. Thanks!
Additional Info: My purpose is to generate plain text synopsis from a bunch of HTML content. Guess this will help clarify the this problem.