I'd like to test whether a regex matches part of a string at a specific index (and only starting at that index). For example, given the string "one two 3 4 five", I'd like to know that, at index 8, the regular expression [0-9]+ will match "3". Regex.IsMatch and Regex.Match both take a starting index, but both will search the entire rest of the string for a match if necessary.

string text="one two 3 4 five";
Regex num=new Regex("[0-9]+");

//unfortunately num.IsMatch(text,0) also finds a match and returns true
Console.WriteLine("{0} {1}",num.IsMatch(text, 8),num.IsMatch(text,0));

Obviously, I could check if the resulting match starts at the index I am interested in, but I will be doing this a large number of times on large strings, so I don't want to waste time searching for matches later on in the string. Also, I won't know in advance what regular expressions I will actually be testing against the string.

I don't want to:

  1. split the string on some boundary like whitespace because in my situation I won't know in advance what a suitable boundary would be
  2. have to modify the input string in any way (like getting the substring at index 8 and then using ^ in the regex)
  3. search the rest of the string for a match or do anything else that wouldn't be performant for a large number of tests against a large string.

I would like to parse a potentially large user-supplied body of text using an arbitrary user-supplied grammar. The grammar will be defined in a BNF- or PEG-like syntax, and the terminals will be either string literals or regular expressions. Thus I will need to check whether the next part of the string matches any of the potential terminals, as driven by the grammar.

+5  A: 

How about using Regex.IsMatch(string, int) using a regular expression starting with "\G" ("start of last match")?

That appears to work:

using System;
using System.Text.RegularExpressions;

class Test
{
    static void Main()
    {
        string text="one two 3 4 five";
        Regex num=new Regex(@"\G[0-9]+");

        Console.WriteLine("{0} {1}",
                          num.IsMatch(text, 8), // True
                          num.IsMatch(text, 0)); // False
    }
}
Jon Skeet
Interesting. If there's a way to artificially set the last match position then this might work out. Otherwise I don't think it will help, because I will be jumping between different regular expressions and different locations.
Clusterflock
I had a chance to try this out and it seems to do exactly what I want. It treats the passed start index as the "start of last match" regardless of where the last match actually was. Perfect, thanks!
Clusterflock
And just to add a bit of info for anyone else who has this problem, http://www.regular-expressions.info/continue.html describes the \G anchor. It appears to mean either "start of last match" or "start of match attempt" depending on the implementation. In some implementations it probably won't solve this problem, but it appears to be "start of match attempt" in C# and works nicely for matching at a specific location.
Clusterflock
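Since the patterns here are user supplied, the \G prefix has to be added programmatically. The following is a minimal sketch of one way to do that, not a definitive implementation: the names AnchoredMatch, Anchor and MatchLengthAt are hypothetical, and it assumes the default left-to-right matching options. The non-capturing group matters because without it a pattern such as two|[0-9]+ would apply \G only to its first alternative.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchoredMatch
{
    // Hypothetical helper: wrap a user-supplied pattern so it can only match
    // at the start index passed to Match/IsMatch. The (?: ... ) group keeps
    // \G applied to every alternative of the wrapped pattern.
    public static Regex Anchor(string pattern)
    {
        return new Regex(@"\G(?:" + pattern + ")");
    }

    // Returns the length of the match that starts exactly at index,
    // or -1 if the pattern does not match there.
    public static int MatchLengthAt(Regex anchored, string text, int index)
    {
        Match m = anchored.Match(text, index);
        return m.Success ? m.Length : -1;
    }
}

class Demo
{
    static void Main()
    {
        string text = "one two 3 4 five";
        Regex terminal = AnchoredMatch.Anchor("two|[0-9]+");

        Console.WriteLine(AnchoredMatch.MatchLengthAt(terminal, text, 4)); // 3  ("two")
        Console.WriteLine(AnchoredMatch.MatchLengthAt(terminal, text, 8)); // 1  ("3")
        Console.WriteLine(AnchoredMatch.MatchLengthAt(terminal, text, 0)); // -1 (no match at 0)
    }
}
```

Each anchored Regex can be built once per terminal and reused at any position, so neither the input string nor the pattern needs to change between tests.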
+2  A: 

If you only want to search a substring of the text, grab that substring before the regex.

myRegex.Match(myString.Substring(8, 10));
Rob Elliott
See point 2 in the question.
Jon Skeet
Doesn't look like this modifies the input string, so +1. If point 2 isn't just about changing the input string, it needs to be edited.
ojrac
Well, it's modifying the input *to the regular expression*. Given the "doing this a large number of times on large strings" I wouldn't have thought a substring was an ideal solution.
Jon Skeet
It seems like he wants to match against a specific series of characters in a string. Why doesn't substring make sense?
Rob Elliott
@Rob: Because it will involve copying large amounts of data repeatedly - and unnecessarily, given that you can tell the regex engine where to start looking for a match.
Jon Skeet
This would be too slow because I don't have any max length other than the size of the string, which could be tens to hundreds of megabytes.
Clusterflock
+2  A: 

I'm not sure I fully understand the question, but it seems to me that you can simply make the position part of the regular expression, e.g.

^.{8}[\d]

which will match if there are 8 characters between the start of the string and a digit.

Robert Rossney
This isn't ideal, because it would involve modifying the regex for each position I want to test against. It would also depend on the regex being smart enough to optimize ^.{8} into something that jumps immediately to position 8.
Clusterflock
A: 

If you know the maximum length of a potential match, you can pass it to Match along with the start index, which limits how much of the string is scanned.

If you're only checking for numbers this is probably easier than if you check for arbitrary expressions. The nature of Regex is to scan until the end in order to find a match. If you want to prevent scanning you need to include a length, or use something other than Regex.

string text = "one two 3 4 five";
Regex num = new Regex("[0-9]+");
int indexToCheck = 8;
int maxMatchLength = ...;
Match m = num.Match(text, indexToCheck, maxMatchLength);

Do you know anything about what types of expressions might be run against the strings, and will scanning the entire string be too much of an overhead?

num.Match will return the first hit if it exists, and then stop scanning. If you want more matches you would call m.NextMatch() to continue the scanning of matches.

Mikael Svenson
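A small sketch of how the length argument behaves, using the question's sample string; the window sizes 1, 8 and 9 are chosen purely for illustration. It suggests that Match(text, start, length) bounds the search but does not by itself anchor the match to the start index.

```csharp
using System;
using System.Text.RegularExpressions;

class BoundedMatchDemo
{
    static void Main()
    {
        string text = "one two 3 4 five";
        Regex num = new Regex("[0-9]+");

        // Window of 1 character starting at index 8: only "3" is examined.
        Match atEight = num.Match(text, 8, 1);
        Console.WriteLine(atEight.Success); // True

        // Window "one two " (indices 0 to 7) contains no digit,
        // so the search ends without looking further.
        Match early = num.Match(text, 0, 8);
        Console.WriteLine(early.Success); // False

        // Window of 9 characters from index 0 includes the "3": the match
        // succeeds even though it does not begin at the window start.
        Match unanchored = num.Match(text, 0, 9);
        Console.WriteLine(unanchored.Success + " " + unanchored.Value); // True 3
    }
}
```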
Unfortunately I don't know what the regular expressions will be beforehand and cannot provide a max length other than the rest of the string.
Clusterflock