Hello. I am working with ASP.NET and need to process a string typed by the user in order to extract some information. The user enters ordinary text (words and numbers), but may sometimes type a mathematical expression in MathML. These expressions are always an XML string enclosed by the <math> tag. I want to extract every math segment from the typed text. For example, suppose the user typed this text:

string input = "My name is Dorry and here is a math expression: <math>---some math1---</math> ah, there is another expression: <math>---some math2---</math> and do not forget this too <math>---some math3---</math>.";

Well, the first regex solution I came up with is this:

string pattern1 = @"\<math(.+)\<\/math\>";

To get the matches I obviously use:

Regex r = new Regex(pattern1, RegexOptions.IgnoreCase);
MatchCollection res = r.Matches(input); // Matches returns a MatchCollection, not a string[]

It seemed to work, but it does not: instead of giving me three matches from Regex.Matches ("---some math1---", "---some math2---", "---some math3---"), it gives me a single one: "---some math1--- ah, there is another expression: ---some math2--- and do not forget this too ---some math3---". Can you see? It takes the first <math> and the last </math> and merges everything in between, WITHOUT CARING about any other </math> or <math> elements in the way!

Well, I suppose this is a well-known issue with regular expressions. Is there a solution? How can I tell the regex engine to be a little more... aware?

Thank you very much in advance.

A: 

If you're using the .NET BCL Regex class, you should be able to use balancing groups to achieve what you need:

http://blog.stevenlevithan.com/archives/balancing-groups

Lucero
A: 

Hi,

You can use the regex <math>[\s\S]*?</math>. It worked fine with the example string you provided and gave me 3 matches:

<math>---some math1---</math>

<math>---some math2---</math>

<math>---some math3---</math>

I hope this is what you want to get.
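
If only the text between the tags is wanted, a capture group around the lazy part can pull it out. The following is a minimal sketch, not a definitive implementation: the class name MathExtractor and the method name Extract are made up for illustration, and it assumes the opening tag is a plain <math> with no attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the names are not from the question.
public static class MathExtractor
{
    // [\s\S]*? is lazy, so each match stops at the first </math> it meets.
    // The parentheses capture only the text between the tags.
    private static readonly Regex MathTag =
        new Regex(@"<math>([\s\S]*?)</math>", RegexOptions.IgnoreCase);

    // Returns the inner text of every <math>...</math> pair, in order.
    public static List<string> Extract(string input)
    {
        var results = new List<string>();
        foreach (Match m in MathTag.Matches(input))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }

    public static void Main()
    {
        string input = "My name is Dorry and here is a math expression: <math>---some math1---</math> ah, there is another expression: <math>---some math2---</math> and do not forget this too <math>---some math3---</math>.";
        foreach (string segment in Extract(input))
        {
            Console.WriteLine(segment);
        }
    }
}
```

Groups[0] would give the whole match including the tags; Groups[1] gives just the captured content.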

Shekhar
Yeah it matches... thank you very much
Andry
A: 
  1. Using regular expressions for matching XML-/HTML-like tags is usually a bad idea and very error-prone. I don't know whether the balancing groups that .NET regexes provide solve this, so just be warned.

  2. Your problem has bitten many, many others before: regexes are greedy by default. .+ can match almost anything (including </math>; by default only a newline stops it), so it first consumes the rest of the input. Then, because the rest of the pattern has not matched yet, the engine starts backtracking until it can. That is why the </math> subpattern matches only the last closing tag. To make the quantifier non-greedy, add a ? after the + (or after the *, for that matter).
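
The difference between the greedy and the non-greedy quantifier can be seen by counting matches on a small input. This is only a sketch: the class name GreedyVsLazy and the method CountMatches are hypothetical, and the input is a shortened stand-in for the string in the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are chosen for illustration only.
public static class GreedyVsLazy
{
    // Counts how many times the pattern matches in the input.
    public static int CountMatches(string pattern, string input)
    {
        return Regex.Matches(input, pattern, RegexOptions.IgnoreCase).Count;
    }

    public static void Main()
    {
        string input = "x <math>1</math> y <math>2</math> z <math>3</math>";

        // Greedy: .+ runs to the end, then backtracks to the LAST </math>.
        Console.WriteLine(CountMatches(@"<math>(.+)</math>", input));  // prints 1

        // Lazy: .+? grows one character at a time and stops at the FIRST </math>.
        Console.WriteLine(CountMatches(@"<math>(.+?)</math>", input)); // prints 3
    }
}
```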

delnan
Well, I found a correct pattern... As for what you said, I'll keep it in mind and research more to better understand where regex is considered a good solution and good practice. Thank you for the information
Andry
A: 

Give this a go..

string pattern1 = @"\<math[\s\S]*?<\/math\>";
Regex r = new Regex(pattern1, RegexOptions.IgnoreCase);
MatchCollection res = r.Matches(input);

Nick

Nicholas Mayne
Thank you Nick, it runs correctly... thanks again
Andry
A: 

This is the regex you need:

  <math>.*?</math>

It matches every pair of math tags.

If the opening tag might contain attributes, use this regex instead:

  <math\b[^><]*>.*?</math>
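
One thing to watch in .NET: by default . does not match a newline, so .*? only crosses line breaks when RegexOptions.Singleline is set. A minimal sketch of the attribute-aware pattern with that option follows; the class name MathWithAttributes, the method Find, and the sample input are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the attribute-aware pattern above.
public static class MathWithAttributes
{
    public static MatchCollection Find(string input)
    {
        // Singleline lets . match '\n', so multi-line MathML blocks are found too.
        return Regex.Matches(
            input,
            @"<math\b[^><]*>.*?</math>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        string input =
            "<math xmlns=\"http://www.w3.org/1998/Math/MathML\">\n<mi>x</mi>\n</math> and <math>y</math>";
        foreach (Match m in Find(input))
        {
            Console.WriteLine(m.Value);
        }
    }
}
```

Without Singleline, the first block in this sample would be skipped because its content spans several lines.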
Vantomex
OK, thanks, that's good too :)
Andry