I have a simple requirement: split an HTML string into parts. Suppose the HTML is

<h1>hello</h1> ... <img moduleType="calendar" /> ...<h2>bye</h2> 

I want to split it into three parts:

<h1>hello</h1> 
<img moduleType="calendar" />
<h2>bye</h2> 

The aim is to sort the markup into two categories: plain HTML, and special tags of the form <img moduleType="calendar" />.

A: 

It depends on the language and context you are using. I do something similar in my CMS. My approach is to find the tags first and then their attributes.

First, get the tags:

"<img (.*?)/>"

Then I search through each result for specific attributes:

'title="(.*?)"'

If you want to find all attributes, you could replace the explicit title with a pattern for the attribute name, such as [a-z]+ or a run of non-whitespace characters, and then loop through those results as well.
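
The two steps above can be sketched roughly as follows. This is a minimal Python sketch, not the code from my CMS: the names IMG_TAG, ATTRIBUTE and img_attributes are made up for illustration, and it assumes self-closing <img ... /> tags with double-quoted attribute values. Anything messier (single quotes, a "/>" inside a value, unclosed tags) will not be handled.

```python
import re

# Step 1: capture everything between "<img" and the closing "/>".
IMG_TAG = re.compile(r"<img\b(.*?)/>", re.IGNORECASE | re.DOTALL)
# Step 2: capture name="value" pairs inside one tag (double quotes only).
ATTRIBUTE = re.compile(r'([^\s=]+)\s*=\s*"(.*?)"', re.DOTALL)


def img_attributes(html):
    """Return one {name: value} dict per self-closing <img ... /> tag."""
    results = []
    for tag in IMG_TAG.finditer(html):
        # group(1) holds the raw attribute text of this tag.
        results.append(dict(ATTRIBUTE.findall(tag.group(1))))
    return results


sample = '<h1>hello</h1> <img moduleType="calendar" title="May" /> <h2>bye</h2>'
print(img_attributes(sample))
# prints [{'moduleType': 'calendar', 'title': 'May'}]
```

Looping over every attribute, rather than searching for title alone, corresponds to swapping the explicit name for a generic name pattern.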

Owen Allen
Fighting against the downvotes you'll get -- Welcome to SO ;-) Include known problems/limitations in your answer. HTML parsing with regular expressions is almost always stomped on.
pst
A: 

Don't do that; HTML can be broken in many beautiful ways. Use an HTML parser such as Beautiful Soup instead.
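
As a rough illustration of the parser route, here is a sketch in Python. To keep it runnable without extra installs it uses the standard library's html.parser rather than Beautiful Soup. That swap, the class name ModuleSplitter, and the function split_modules are assumptions of this sketch. It also assumes that any <img> carrying a moduleType attribute counts as a special tag.

```python
from html.parser import HTMLParser


class ModuleSplitter(HTMLParser):
    """Split markup into ("html", text) and ("module", tag_text) parts."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _add(self, kind, text):
        # Merge consecutive plain-HTML pieces into a single part.
        if kind == "html" and self.parts and self.parts[-1][0] == "html":
            self.parts[-1] = ("html", self.parts[-1][1] + text)
        else:
            self.parts.append((kind, text))

    def _tag(self, tag, attrs):
        # html.parser lowercases attribute names: moduleType -> "moduletype".
        is_module = tag == "img" and "moduletype" in dict(attrs)
        # get_starttag_text() returns the tag exactly as it was written.
        self._add("module" if is_module else "html", self.get_starttag_text())

    def handle_starttag(self, tag, attrs):
        self._tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._tag(tag, attrs)

    def handle_endtag(self, tag):
        # The parser reports end tags in lowercase, so the original case is lost.
        self._add("html", f"</{tag}>")

    def handle_data(self, data):
        self._add("html", data)

    def handle_entityref(self, name):
        self._add("html", f"&{name};")

    def handle_charref(self, name):
        self._add("html", f"&#{name};")

    def handle_comment(self, data):
        self._add("html", f"<!--{data}-->")

    def handle_decl(self, decl):
        self._add("html", f"<!{decl}>")


def split_modules(html):
    parser = ModuleSplitter()
    parser.feed(html)
    parser.close()
    return parser.parts


sample = '<h1>hello</h1> <img moduleType="calendar" /> <h2>bye</h2>'
for kind, text in split_modules(sample):
    print(kind, repr(text))
```

Because the parser tokenizes the markup, quoting styles and attribute order stop mattering, which is where regular expressions usually break down.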

florin
A: 

I am actually trying to do something similar to what the ASP.NET compiler does when it compiles markup into a server control tree; the ASP.NET compiler itself relies heavily on regular expressions. I have a temporary solution that is not pretty but seems to work.

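using System;
using System.Text.RegularExpressions;

// Alternative input with no <img> tag:
//string source = "<h1>hello</h1>";
string source = "<h1>hello<img moduleType=\"calendar\" /></h1> <p> <img moduleType=\"calendar\" /> </p> <h2>bye</h2> <img moduleType=\"calendar\" /> <p>sss</p>";

// Group 1: markup before the next <img ... /> (may be empty, so use .*? rather than .+?,
// otherwise an <img> at the very start of the input is swallowed into group 1).
// Group 2: the <img ... /> tag itself. Singleline lets '.' match newlines.
Regex exImg = new Regex("(.*?)(<img.*?/>)", RegexOptions.Singleline);

Match match = exImg.Match(source);
int lastEnd = 0;
while (match.Success)
{
    Console.WriteLine(match.Groups[1].Value); // plain HTML chunk
    Console.WriteLine(match.Groups[2].Value); // special <img> tag
    lastEnd = match.Index + match.Length;
    match = match.NextMatch();
}
// Whatever follows the last <img> tag is plain HTML.
Console.WriteLine(source.Substring(lastEnd));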


Fred Yang