ansaurus

Question

Getting a substring of text containing HTML tags

Answer 1

A:

You could loop over the HTML string, detect the angle brackets, and build up a list of tags, recording whether each one has a matching closing tag. The problem is that HTML allows tags that are never closed, such as img, br and meta, so you would need to know about those. You would also need rules to check the order of closing, because simply matching each opening tag with a closing tag doesn't make valid HTML: if you open a div, then a p, and then close the div before the p, the result isn't valid.
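As a rough illustration of the idea above, here is a minimal Python sketch that tracks open tags on a stack. The function name, the regular expression and the short list of void tags are all assumptions made for this example. It ignores comments, attribute values that contain ">", and tags whose closing tag is optional, so treat it as a starting point rather than a validator.

```python
import re

# Void elements never take a closing tag (partial list, for illustration only).
VOID_TAGS = {"br", "img", "meta", "hr", "input", "link"}

# Matches an opening or closing tag and captures the optional slash and the tag name.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def is_properly_nested(html):
    """Return True if every non-void tag is closed in the reverse order it was opened."""
    stack = []
    for match in TAG_RE.finditer(html):
        closing, name = match.group(1), match.group(2).lower()
        if name in VOID_TAGS:
            continue  # nothing to match up for void elements
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False  # closed a tag that was not the most recently opened one
    return not stack  # anything left on the stack was never closed
```

With this sketch, a div opened before a p and closed before it is rejected, because the tag on top of the stack is p when the closing div arrives.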

Sohnee 2009-04-17 07:26:05

Can you please give me some sample code?

2009-04-17 07:29:06

Answer 2

+1 A:

Your requirement is very unclear so most of this is guesswork. Also, you have provided no code which would help to clarify what it is you want to do.

One solution could be:

a. Find the text between the <p> and the </p> tags. You can use the following regex for this, or use a simple string search:

\<p\>(.*?)\</p\>

b. In the found text, apply a Substring() to extract the required text.

c. Put the extracted text back between the <p> and the </p> tags.
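Steps a to c could be sketched in Python roughly as follows. The function name and the choice to handle only the first paragraph are assumptions for this example, and the pattern is the one above without the backslashes, which Python's re module does not need.

```python
import re


def truncate_paragraph(html, length):
    """Shorten the text inside the first <p>...</p> pair to `length` characters."""

    def shorten(match):
        # Steps b and c: take the substring, then put it back between the tags.
        return "<p>" + match.group(1)[:length] + "</p>"

    # Step a: find the text between the tags. count=1 limits the change to the
    # first paragraph, and DOTALL lets "." match across line breaks.
    return re.sub(r"<p>(.*?)</p>", shorten, html, count=1, flags=re.DOTALL)
```

Note that this only behaves well when the paragraph holds plain text. If the paragraph contains other tags, the slice can cut through one of them or leave it unclosed.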

Cerebrus 2009-04-17 07:34:18

But I think he has just given the p tag as an example. He might have to pull a substring out of any type of tag.

rahul 2009-04-17 07:37:41

Yes, I have now modified the question to make it clearer.

2009-04-17 07:44:04

@phoenix: Your intuition is quite possibly true.

Cerebrus 2009-04-17 08:08:01

Answer 3

+2 A:

You need to teach your code how to understand that your string is actually HTML or XML. Just treating it like a string won't allow you to work with it the way you want to. This means first transforming it to the correct format and then working with that format.

Use an XSL stylesheet

If your HTML is well-formed XML, load it into an XmlDocument and run it through an XSL stylesheet that does something like the following:

<xsl:template match="p">
  <!-- XPath string positions start at 1, so this selects the first 10 characters -->
  <xsl:value-of select="substring(text(), 1, 10)" />
</xsl:template>

Use an HTML parser

If it's not well-formed XML (as in your example, where a stray tag appears in the middle), you'll need to use an HTML parser of some kind, such as HTML Agility Pack (see this question about C# HTML parsers).

Don't use regular expressions, since HTML is too complex to parse using regex.
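To show what the parser approach can look like, here is a minimal sketch that uses Python's standard html.parser module as a stand-in for a library such as HTML Agility Pack. The class name, function name and void-tag list are assumptions for this example. It counts only text characters towards the limit, copies tags through unchanged, and closes whatever is still open when the limit is reached. Mismatched closing tags are simply dropped, and comments and doctype declarations are not preserved.

```python
from html import escape
from html.parser import HTMLParser

# Void elements never take a closing tag (partial list, for illustration only).
VOID_TAGS = {"br", "img", "meta", "hr", "input", "link"}


class TextTruncator(HTMLParser):
    """Copy markup through until `limit` text characters have been emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit  # text characters still allowed
        self.open_tags = []     # tags opened but not yet closed
        self.parts = []         # output fragments

    def handle_starttag(self, tag, attrs):
        if self.remaining <= 0:
            return  # limit reached: ignore any further markup
        self.parts.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Only emit a closing tag that matches the most recently opened tag.
        if self.remaining > 0 and self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.remaining <= 0:
            return
        piece = data[:self.remaining]
        self.remaining -= len(piece)
        # Character references were decoded by the parser, so re-escape the text.
        self.parts.append(escape(piece, quote=False))

    def result(self):
        # Close anything still open, innermost tag first.
        closers = "".join(f"</{tag}>" for tag in reversed(self.open_tags))
        return "".join(self.parts) + closers


def truncate_html(html, limit):
    parser = TextTruncator(limit)
    parser.feed(html)
    parser.close()  # flush any text the parser is still buffering
    return parser.result()
```

For example, truncate_html("<p>Hello <b>brave new</b> world</p>", 11) returns "<p>Hello <b>brave</b></p>": eleven text characters are kept and both open tags are closed in the right order.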

Rahul 2009-04-17 08:15:30
