In Visual Basic .NET:

  Dim result As Match = Regex.Match(aStr, aMatchStr)
  If result.Success Then
      Dim result0 As String = result.Groups(0).Value
      Dim result1 As String = result.Groups(1).Value
  End If

with aStr equal to (the whitespace is ordinary spaces, and there are 7 spaces between "n" and "("):

"AMEVDIEERPK + 7 Oxidation       (M)"

I don't understand why result1 becomes an empty string for aMatchStr equal to

"\s*(\d*).*?Oxidation\s+\(M\)"

but becomes "7" for aMatchStr equal to

"\s*(\d*)\s*Oxidation\s+\(M\)"

(result0 becomes equal to "AMEVDIEERPK + 7 Oxidation       (M)")

(This is from MSQuant, <http://msquant.sourceforge.net/>, MascotResultParser.vb, function modificationParseMatch() - <http://shrinkster.com/1352>).

A: 

".*?" in this example matches as few characters as possible, since "*?" asks for the shortest possible match. But "\d*" can also match 0 digits: at the start of the string there is no digit, so "\d*" matches nothing and ".*?" then grows just far enough to reach "Oxidation".

http://msdn.microsoft.com/en-us/library/3206d374.aspx

Brian
+2  A: 

I think it's because the matching starts at the first character and moves on from there...

For your first RE:

Does "AMEVDIEERPK + 7 Oxidation (M)" match "\s*(\d*).*?Oxidation\s+\(M\)"?  Yes... stop matching.

For your second:

Does "AMEVDIEERPK + 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"?  No...
Does "MEVDIEERPK + 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"?  No...
Does "EVDIEERPK + 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"?  No...
...
Does " 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"?  Yes

If for the first RE you'd used \d+ instead of \d* you'd have got a better result.

Edit: this is not exactly how REs work, but you get the idea

Greg
+3  A: 

\s* Zero or more whitespace

(\d*) Zero or more digits (captured)

.*? Any characters (non-greedy, so only up to the next match)

Oxidation Matches the word Oxidation

\s+\(M\) Matches one or more whitespace characters, then (M)

The problem here is that you are matching 0 or more of any character prior to the word Oxidation, including any possible digits, eating the digits which might match the previous \d*.

\s*(\d*)\s*Oxidation\s+\(M\)

The difference here is that you are specifying whitespace only before the Oxidation. Not eating the digits.

Change the \d* to \d+ to catch the numbers

Xetius
Why does the .*? match any digits? Surely \d* is greedy, leaving nothing to match. The real problem is that .*? matches the prefix AMEVDIEERPK
MSalters
MSalters: \s* matches empty [0-0] (no whitespace at beginning), \d* matches empty [0-0] (no digit at beginning), .*? matches anything ungreedy ("AMEVDIEERPK + 7 ") [0-16], Oxidation matches [16-25], \s+ matches whitespace [25-32], \(M\) matches (M) [32-35]
Piskvor
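
The trace in the comment above can be checked directly. This is only a minimal sketch, assuming a plain console program (the module name is made up for illustration); it prints where each match and its first group start for the two patterns from the question.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LazyVersusWhitespace
    Sub Main()
        ' Input from the question: 7 spaces between "Oxidation" and "(M)".
        Dim aStr As String = "AMEVDIEERPK + 7 Oxidation" & New String(" "c, 7) & "(M)"

        Dim patterns As String() = {
            "\s*(\d*).*?Oxidation\s+\(M\)",
            "\s*(\d*)\s*Oxidation\s+\(M\)"
        }

        For Each pattern As String In patterns
            Dim result As Match = Regex.Match(aStr, pattern)
            If result.Success Then
                ' Index is where a span starts; Length is how many characters it covers.
                Console.WriteLine(pattern)
                Console.WriteLine("  match starts at {0}, length {1}", result.Index, result.Length)
                Console.WriteLine("  group 1 = ""{0}"" at index {1}", result.Groups(1).Value, result.Groups(1).Index)
            End If
        Next
    End Sub
End Module
```

With the first pattern the match should start at index 0 with an empty group 1, because \s* and \d* both succeed on nothing there and .*? absorbs the prefix. With the second pattern the engine has to move forward to index 13 before anything fits, and group 1 is "7".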
A: 

Thanks for the quick responses!

The numbers in the input are left out if there is only one (peptide) modification instead of 7 as in the previous example, e.g.:

"AMEVDIEERPK + Oxidation (M)"

and there would be no match if "\d+" was used. But maybe I should use two regular expressions, one for each of these two cases. This would increase the complexity of the program somewhat (as I want to avoid memory garbage from constructing a regular expression for each string to be matched), but is acceptable.

What I really wanted to do was to let the user specify a match rule without requiring the rule to match from the beginning of the (peptide) modification (that's why I tried to introduce the non-greedy match).

Right now the user's rule is prepended with "\s*(\d*)\s*", so the user must specify "Oxidation\s+\(M\)" to match. Specifying e.g. "dation\s+\(M\)" will not work.
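
The two-expression idea could look roughly like the sketch below. It is an illustration only, not the MSQuant code: the module name, field names and the CountOf helper are hypothetical, and treating a missing number as 1 is taken from the remark above that the number is left out when there is only one modification. Both Regex objects are built once and reused, so nothing is constructed per input string.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TwoRuleSketch
    ' Built once and reused for every input string.
    Private ReadOnly WithCount As New Regex("(\d+)\s+Oxidation\s+\(M\)")
    Private ReadOnly WithoutCount As New Regex("Oxidation\s+\(M\)")

    ' Hypothetical helper: number of Oxidation (M) modifications, 0 if none.
    Function CountOf(sample As String) As Integer
        Dim m As Match = WithCount.Match(sample)
        Dim n As Integer
        If m.Success AndAlso Integer.TryParse(m.Groups(1).Value, n) Then Return n
        ' No explicit number in front: assume a single modification.
        If WithoutCount.IsMatch(sample) Then Return 1
        Return 0
    End Function

    Sub Main()
        Console.WriteLine(CountOf("AMEVDIEERPK + 7 Oxidation       (M)"))
        Console.WriteLine(CountOf("AMEVDIEERPK + Oxidation (M)"))
    End Sub
End Module
```

The first call should print 7 and the second 1.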

A: 

To answer your second message, you (or your user) can specify \w*dation\s+\(M\) to match either Oxidation (M) or Gradation (M) or dation (M).

PhiLho
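
To see what the prepended prefix does with the different user rules, here is a small sketch. The prefix is the one quoted in the question; the module name and the CapturedCount helper are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module UserRuleSketch
    ' Prefix that, according to the question, is prepended to every user rule.
    Private Const CountPrefix As String = "\s*(\d*)\s*"

    ' Hypothetical helper: the text captured by group 1, or Nothing when there is no match.
    Function CapturedCount(sample As String, userRule As String) As String
        Dim m As Match = Regex.Match(sample, CountPrefix & userRule)
        If m.Success Then Return m.Groups(1).Value
        Return Nothing
    End Function

    Sub Main()
        Dim sample As String = "AMEVDIEERPK + 7 Oxidation (M)"
        Dim rules As String() = {"Oxidation\s+\(M\)", "dation\s+\(M\)", "\w*dation\s+\(M\)"}
        For Each rule As String In rules
            Console.WriteLine("{0} -> ""{1}""", rule, CapturedCount(sample, rule))
        Next
    End Sub
End Module
```

The first and third rules should capture "7". The second rule still matches, but only from the "d" inside "Oxidation", so the prefix matches nothing and group 1 comes back empty.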
A: 

With the syntax update, it seems we don't need to worry about the difference between \d+ and \d*. There's always a + sign present, even if there are no digits. Matching this + constrains the regex to the point that it works as expected:

"\s*    // whitespace before +
 \+     // The + sign itself
 \s*    // whitespace after +
 (\d*)  // optional digits
 .*?    // any non-digit between the last digit and Oxidation (M)
 Oxidation\s+\(M\)"

Since the + must be matched first, and must be matched precisely once, the AMEVDIEERPK prefix cannot be matched by .*?.

MSalters
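
A quick check of this pattern against both forms of the input, as a sketch only (the module name is made up, and the pattern is written on one line without the comments):

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PlusAnchorSketch
    Sub Main()
        ' Same pattern as above, without the explanatory comments.
        Dim pattern As String = "\s*\+\s*(\d*).*?Oxidation\s+\(M\)"

        Dim samples As String() = {
            "AMEVDIEERPK + 7 Oxidation       (M)",
            "AMEVDIEERPK + Oxidation (M)"
        }

        For Each sample As String In samples
            Dim m As Match = Regex.Match(sample, pattern)
            Console.WriteLine("success = {0}, group 1 = ""{1}""", m.Success, m.Groups(1).Value)
        Next
    End Sub
End Module
```

Both inputs should match; group 1 is "7" for the first and empty for the second.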
A: 

I settled on using "\w*" for now. The user will be required to specify matching for any white space, but it covers the majority of cases for this particular application and how it is commonly used.

So for the example the regular expression is then:

\s*(\d*)\s*\w*Oxidation\s+\(M\)

A: 

I am sorry, there is more to the syntax...

The plus sign cannot be relied on. It separates the (peptide) sequence and the (peptide) modifications. There can be more than one modification for each sequence. Sample with two modifications (there are 7 spaces between "2" and "L"):

"KLIDLTQFPAFVTPMGK + Oxidation (M); 2 Lysine-13C615N2 (K-full)"

The user could specify "\S+\s+\(K-full\)" for the second modification and "2" should be extracted.

Here are some more sample lines (after the plus sign):

" Phospho (ST); 2 Dimethyl (K); Dimethyl (N-term)"

" Phospho (ST); 2 Dimethyl:2H(4) (K); Dimethyl:2H(4) (N-term)"

" N-Acetyl (Protein)"

" 2 Dimethyl:2H(4) (K); Dimethyl:2H(4) (N-term)"

" N-Acetyl (Protein); 2 Lysine-13C615N2 (K-full)"

" Oxidation (M); N-Acetyl (Protein)"

" Oxidation (M); N-Acetyl (Protein); Lysine-13C615N2 (K-full)"

" N-Acetyl (Protein); Lysine-13C615N2 (K-full)"

" Oxidation (M); Lysine-13C615N2 (K-full)"

" Oxidation (M)"

" 2 Oxidation (M); Lysine-13C615N2 (K-full)"

A sample file with user-defined rules can be found at (packed in 7-zip format):

<http://www.pil.sdu.dk/1/MSQuant/CEBIquantModes,2008-11-10.7z>
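
For what it's worth, the existing "\s*(\d*)\s*" prefix seems to cope with lines like these when the user rule is "\S+\s+\(K-full\)". The sketch below is only an illustration under that assumption: the module name and the ModificationCount helper are hypothetical, and an empty capture is read as a count of 1.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MultiModificationSketch
    Private Const CountPrefix As String = "\s*(\d*)\s*"

    ' Hypothetical helper: 0 when the rule does not match, 1 when it matches
    ' without a number in front, otherwise the number that was captured.
    Function ModificationCount(sample As String, userRule As String) As Integer
        Dim m As Match = Regex.Match(sample, CountPrefix & userRule)
        If Not m.Success Then Return 0
        Dim n As Integer
        If Integer.TryParse(m.Groups(1).Value, n) Then Return n
        Return 1
    End Function

    Sub Main()
        Dim rule As String = "\S+\s+\(K-full\)"
        Dim samples As String() = {
            "KLIDLTQFPAFVTPMGK + Oxidation (M); 2 Lysine-13C615N2 (K-full)",
            " N-Acetyl (Protein); Lysine-13C615N2 (K-full)",
            " Oxidation (M)"
        }
        For Each sample As String In samples
            Console.WriteLine("{0} -> {1}", sample, ModificationCount(sample, rule))
        Next
    End Sub
End Module
```

This should print 2, 1 and 0 for the three lines. The earlier start positions all fail because the token in front of "(K-full)" has to be followed directly by whitespace and "(K-full)", so the number is only picked up when it sits right in front of the matching modification.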