How would you write a regular expression to convert Markdown into HTML? For example, you would type in the following:

This would be *italicized* text and this would be **bold** text

This would then need to be converted to:

This would be <em>italicized</em> text and this would be <strong>bold</strong> text

This is very similar to the Markdown edit control used by Stack Overflow.

Clarification

For what it is worth, I am using C#. Also, these are the only real tags/markdown that I want to allow. The amount of text being converted would be less than 300 characters or so.

+5  A: 

A single regex won't do. Every text markup will have its own HTML translator. It is better to look into how the existing converters are implemented to get an idea of how they work.

http://en.wikipedia.org/wiki/Markdown#Converters

jop
I just came across the following article: http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html
mattruma
It might be a good idea to add this link to your original post.
jop
+2  A: 

The best way is to find a version of the Markdown library ported to whatever language you are using (you did not specify in your question).


Now that you have clarified that you only want STRONG and EM to be processed, and that you are using C#, I recommend you take a look at Markdown.NET to see how those tags are implemented. As you can see, it is in fact two expressions. Here is the code:

private string DoItalicsAndBold (string text)
{
    // <strong> must go first:
    text = Regex.Replace (text, @"(\*\*|__) (?=\S) (.+?[*_]*) (?<=\S) \1", 
                          new MatchEvaluator (BoldEvaluator),
                          RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

    // Then <em>:
    text = Regex.Replace (text, @"(\*|_) (?=\S) (.+?) (?<=\S) \1",
                          new MatchEvaluator (ItalicsEvaluator),
                          RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);
    return text;
}

private string ItalicsEvaluator (Match match)
{
    return string.Format ("<em>{0}</em>", match.Groups[2].Value);
}

private string BoldEvaluator (Match match)
{
    return string.Format ("<strong>{0}</strong>", match.Groups[2].Value);
}
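
For anyone who wants to try this outside the library, here is a minimal, self-contained sketch that wraps the same two patterns in a console program. The class name MarkdownSketch is hypothetical, and the sketch uses "$2" replacement strings in place of the MatchEvaluator delegates shown above, which should be equivalent here because the evaluators only wrap group 2 in a tag.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class; the name MarkdownSketch is not part of Markdown.NET.
public static class MarkdownSketch
{
    private const RegexOptions Options =
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline;

    public static string DoItalicsAndBold(string text)
    {
        // <strong> must go first; otherwise the single-delimiter pattern
        // below would match part of the "**" markers.
        text = Regex.Replace(text, @"(\*\*|__) (?=\S) (.+?[*_]*) (?<=\S) \1",
                             "<strong>$2</strong>", Options);

        // Then <em>, for the remaining single "*" or "_" delimiters.
        text = Regex.Replace(text, @"(\*|_) (?=\S) (.+?) (?<=\S) \1",
                             "<em>$2</em>", Options);
        return text;
    }

    public static void Main()
    {
        // Prints: This would be <em>italicized</em> text and this would be <strong>bold</strong> text
        Console.WriteLine(DoItalicsAndBold(
            "This would be *italicized* text and this would be **bold** text"));
    }
}
```

Note that this sketch does not HTML-encode the input, so it should only be run on text that has already been encoded.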
apathetic
It really shouldn't matter what language ... there should just be a simple regular expression to handle the condition.
mattruma
I added some clarification to the question.
mattruma
+1  A: 

I don't know about C# specifically, but in Perl it would be:

s/\*\*(.*?)\*\*/<strong>$1<\/strong>/g;
s/\*(.*?)\*/<em>$1<\/em>/g;
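
A rough C# equivalent of the two Perl substitutions above might look like the sketch below. The class and method names (SimpleMarkdown, ToHtml) are made up for illustration, and the output tags are <strong> and <em> as the question asks. It assumes asterisks are the only delimiters and that the input has no unmatched asterisks.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the class and method names are illustrative only.
public static class SimpleMarkdown
{
    public static string ToHtml(string text)
    {
        // Bold first, so "**" is not consumed by the single-asterisk pattern.
        text = Regex.Replace(text, @"\*\*(.*?)\*\*", "<strong>$1</strong>");

        // Then italics on whatever single asterisks remain.
        text = Regex.Replace(text, @"\*(.*?)\*", "<em>$1</em>");
        return text;
    }

    public static void Main()
    {
        // Prints: this is <em>italicized</em> and this is <strong>bold</strong>
        Console.WriteLine(ToHtml("this is *italicized* and this is **bold**"));
    }
}
```

Regex.Replace substitutes every match by default, so no counterpart to Perl's /g flag is needed.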

tloach
A: 

I came across the following post, which recommends against doing this: http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html. In my case I am looking to keep it simple, but I thought I would post this per jop's recommendation in case someone else wants to do this.

mattruma