Hi guys, I've got this wiki formatting algorithm which I am using at Stacked to create HTML out of "wiki syntax". I am not really sure whether the current one is good enough, optimal, or free of bugs, since I am not really a "Regex Guru". Here is what I am currently using:

// Body is wiki content...
string tmp = Body.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;");
// Sanitizing carriage returns...
tmp = tmp.Replace("\r\n", "\n");

// Replacing dummy links...
tmp = Regex.Replace(
" " + tmp,
"(?<spaceChar>\\s+)(?<linkType>http://|https://)(?<link>\\S+)",
"${spaceChar}<a href=\"${linkType}${link}\"" + nofollow + ">${link}</a>",
RegexOptions.Compiled).Trim();

// Replacing wiki links
tmp = Regex.Replace(tmp,
"(?<begin>\\[{1})(?<linkType>http://|https://)(?<link>\\S+)\\s+(?<content>[^\\]]+)(?<end>[\\]]{1})",
"<a href=\"${linkType}${link}\"" + nofollow + ">${content}</a>",
RegexOptions.Compiled);

// Replacing bolds
tmp = Regex.Replace(tmp,
"(?<begin>\\*{1})(?<content>.+?)(?<end>\\*{1})",
"<strong>${content}</strong>",
RegexOptions.Compiled);

// Replacing italics
tmp = Regex.Replace(tmp,
"(?<begin>_{1})(?<content>.+?)(?<end>_{1})",
"<em>${content}</em>",
RegexOptions.Compiled);

// Replacing lists
tmp = Regex.Replace(tmp,
"(?<begin>\\*{1}[ ]{1})(?<content>.+)(?<end>[^*])",
"<li>${content}</li>",
RegexOptions.Compiled);
tmp = Regex.Replace(tmp,
"(?<content>\\<li\\>{1}.+\\<\\/li\\>)",
"<ul>${content}</ul>",
RegexOptions.Compiled);

// Quoting
tmp = Regex.Replace(tmp,
"(?<content>^&gt;.+$)",
"<blockquote>${content}</blockquote>",
RegexOptions.Compiled | RegexOptions.Multiline).Replace("</blockquote>\n<blockquote>", "\n");

// Paragraphs
tmp = Regex.Replace(tmp,
"(?<content>)\\n{2}",
"${content}</p><p>",
RegexOptions.Compiled);

// Breaks
tmp = Regex.Replace(tmp,
"(?<content>)\\n{1}",
"${content}<br />",
RegexOptions.Compiled);

// Code
tmp = Regex.Replace(tmp,
"(?<begin>\\[code\\])(?<content>[^$]+)(?<end>\\[/code\\])",
"<pre class=\"code\">${content}</pre>",
RegexOptions.Compiled);

// Now hopefully tmp will contain perfect HTML

For those who think it's difficult to read the code here, you can also check it out here...

Here is the complete "wiki syntax":

Syntax here:

Link: [http://x.com text]

*bold* (asterisk on both sides)

_italic_ (underscores on both sides)

* Listitem 1
* Listitem 2
* Listitem 3
(the above are asterisks, but so.com also creates lists from them)

2 x Carriage Return opens a new paragraph

1 x Carriage Return is a line break (br)

[code]
if( YouDoThis )
  YouCanWriteCode();
[/code]


> quote (greater-than sign)

If there are some "Regex gurus" who would like to review this regex logic, I'd appreciate it a lot :)

+4  A: 

Don't use regular expressions for this task. It is dangerous and will not make you happy. User input can be broken (deliberately or accidentally) in ways beyond imagination, and no regex will be able to cover all conceivable cases.
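
To make that concrete, here is a small sketch that runs only the bold and italics patterns from the question over two inputs. The class name `RegexPitfallDemo`, the helper `Format` and the URL path `a_b_c` are made up for illustration. The first input is what the auto-link pass would already have produced for a URL containing underscores; the second is plain arithmetic.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; only the two patterns are taken from the question.
public static class RegexPitfallDemo
{
    private const string Bold = "(?<begin>\\*{1})(?<content>.+?)(?<end>\\*{1})";
    private const string Italics = "(?<begin>_{1})(?<content>.+?)(?<end>_{1})";

    // Applies the bold pass, then the italics pass, in the same order as the question.
    public static string Format(string text)
    {
        text = Regex.Replace(text, Bold, "<strong>${content}</strong>");
        return Regex.Replace(text, Italics, "<em>${content}</em>");
    }

    public static void Main()
    {
        // A link generated by an earlier pass: the italics pass rewrites the href itself.
        Console.WriteLine(Format("<a href=\"http://x.com/a_b_c\">http://x.com/a_b_c</a>"));
        // prints: <a href="http://x.com/a<em>b</em>c">http://x.com/a<em>b</em>c</a>

        // Ordinary multiplication is mistaken for bold markup.
        Console.WriteLine(Format("2 * 3 * 4 = 24"));
        // prints: 2 <strong> 3 </strong> 4 = 24
    }
}
```

Each pass sees the output of the previous one as plain text, so it cannot tell markup it generated from text the user typed.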

A parser that has some notion of context and nesting is much better here.
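
As a rough illustration of what "context" means here, the following is a minimal sketch of a line-based block parser for the syntax in the question. It is not a complete implementation: the names `WikiBlockParser`, `ToHtml` and `Switch` are invented for this example, inline markup (bold, italics, links) is deliberately left out, and it assumes `System.Net.WebUtility` is available for HTML encoding.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical sketch: tracks which block is open so that [code] content is never reinterpreted.
public static class WikiBlockParser
{
    private enum Block { None, Paragraph, List, Quote, Code }

    public static string ToHtml(string body)
    {
        var html = new StringBuilder();
        Block state = Block.None;

        foreach (string raw in body.Replace("\r\n", "\n").Split('\n'))
        {
            string trimmed = raw.Trim();

            if (state == Block.Code)
            {
                // Inside [code] nothing is interpreted; only [/code] ends the block.
                if (trimmed == "[/code]") state = Switch(html, state, Block.None);
                else html.Append(WebUtility.HtmlEncode(raw)).Append('\n');
            }
            else if (trimmed == "[code]")
            {
                state = Switch(html, state, Block.Code);
            }
            else if (trimmed.Length == 0)
            {
                // A blank line closes whatever block is currently open.
                state = Switch(html, state, Block.None);
            }
            else if (raw.StartsWith("* ", StringComparison.Ordinal))
            {
                state = Switch(html, state, Block.List);
                html.Append("<li>").Append(WebUtility.HtmlEncode(raw.Substring(2))).Append("</li>");
            }
            else if (raw.StartsWith(">", StringComparison.Ordinal))
            {
                bool continuing = state == Block.Quote;
                state = Switch(html, state, Block.Quote);
                if (continuing) html.Append("<br />");
                html.Append(WebUtility.HtmlEncode(raw.Substring(1).Trim()));
            }
            else
            {
                bool continuing = state == Block.Paragraph;
                state = Switch(html, state, Block.Paragraph);
                if (continuing) html.Append("<br />");
                html.Append(WebUtility.HtmlEncode(raw));
            }
        }

        Switch(html, state, Block.None); // close a block left open at end of input
        return html.ToString();
    }

    // Emits the closing tag of the old block and the opening tag of the new one.
    private static Block Switch(StringBuilder html, Block from, Block to)
    {
        if (from != to) html.Append(CloseTag(from)).Append(OpenTag(to));
        return to;
    }

    private static string OpenTag(Block block)
    {
        switch (block)
        {
            case Block.Paragraph: return "<p>";
            case Block.List: return "<ul>";
            case Block.Quote: return "<blockquote>";
            case Block.Code: return "<pre class=\"code\">";
            default: return "";
        }
    }

    private static string CloseTag(Block block)
    {
        switch (block)
        {
            case Block.Paragraph: return "</p>";
            case Block.List: return "</ul>";
            case Block.Quote: return "</blockquote>";
            case Block.Code: return "</pre>";
            default: return "";
        }
    }

    public static void Main()
    {
        string sample = "* Listitem 1\n* Listitem 2\n\n[code]\nx = a * b * c;\n[/code]";
        Console.WriteLine(ToHtml(sample));
    }
}
```

Because the parser knows it is inside a `[code]` block, the asterisks in `x = a * b * c;` stay literal, and every opened tag is closed even if the input ends abruptly. Inline formatting could be layered on top by tokenizing the text of each non-code line in the same state-driven way.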

Can you post a complete sample of your allowed syntax so people can start giving you an idea how to parse it?


EDIT: You could look into the possibility of using a (potentially modified) Markdown parser for this. There is an open source variant for .NET available: Markdown.NET. At the very least, looking at its source code might be worthwhile. Maybe modifying it to suit your needs is not too hard.

Tomalak