views:

255

answers:

2

I have some strings, entered by users, that may look like this:

  1. ++7
  2. 7++
  3. 1++7
  4. 1+7
  5. 1++7+10++15+20+30++

Those are to mean:

  1. Anything up to and including 7
  2. Anything from 7 and up
  3. 1 and 7 and anything inbetween
  4. 1 and 7 only
  5. 1 to 7, 10 to 15, 20 and 30 and above

I need to parse those strings into actual ranges. That is I need to create a list of objects of type Range which have a start and an end. For single items I just set the start and end to the same, and for those that are above or below, I set start or end to null. For example for the first one I would get one range which had start set to null and end set to 7.

I currently have a kind of messy method using a regular expression to do this splitting and parsing and I want to simplify it. My problem is that I need to split on + first, and then on ++. But if I split on + first, then the ++ instances are ruined and I end up with a mess.

Looking at those strings it should be really easy to parse them, I just can't come up with a smart way to do it. It just have to be an easier (cleaner, easier to read) way. Probably involving some easy concept I just haven't heard about before :P


The regular expression looks like this:

private readonly Regex Pattern = new Regex(@"  ( [+]{2,} )?
          ([^+]+)
          (?:
            (?: [+]{2,} [^+]* )*
            [+]{2,} ([^+]+)
          )?
        ( [+]{2,} )?   ", RegexOptions.IgnorePatternWhitespace);

That is then used like this:

public IEnumerable<Range<T>> Parse(string subject, TryParseDelegate<string, T> itemParser)
{
    if (string.IsNullOrEmpty(subject))
        yield break;


    for (var item = RangeStringConstants.Items.Match(subject); item.Success; item = item.NextMatch())
    {
        var startIsOpen = item.Groups[1].Success;
        var endIsOpen = item.Groups[4].Success;
        var startItem = item.Groups[2].Value;
        var endItem = item.Groups[3].Value;

        if (endItem == string.Empty)
            endItem = startItem;

        T start, end;

        if (!itemParser(startItem, out start) || !itemParser(endItem, out end))
            continue;

        yield return Range.Create(startIsOpen ? default(T) : start,
                                  endIsOpen ? default(T) : end);
    }
}

It works, but I don't think it is particularly readable or maintainable. For example changing the '+' and '++' into ',' and '-' would not be that trivial to do.

+2  A: 

My problem is that I need to split on + first, and then on ++. But if I split on + first, then the ++ instances are ruined and I end up with a mess.

You could split on this regex first:

(?<!\+)\+(?!\+)

That way, only the 'single' +'s are being split on, leaving you to parse the ++'s. Note that I am assuming that there cannot be three successive +'s.

The regex above simple says: "split on the '+' only if there's no '+' ahead or behind it".

Edit:

After reading that there can be more than 2 successive +'s, I recommend writing a small grammar and letting a parser-generator create a lexer+parser for your little language. ANTLR can generate C# source code as well.

Edit 2:

But before implementing any solution (parser or regex) you'd first have to define what is and what isn't valid input. If you're going to let more than two successive +'s be valid, ie. 1+++++5, which is [1++, +, ++5], I'd write a little grammar. See this tutorial how that works: http://www.antlr.org/wiki/display/ANTLR3/Quick+Starter+on+Parser+Grammars+-+No+Past+Experience+Required

And if you're going to reject input of more than 2 successive +'s, you can use either Lasse's or my (first) regex-suggestion.

Bart Kiers
I have exactly the same regex on my clipboard... +1.
Kobi
How does a lexer+parser work?
Svish
See my 2nd edit.
Bart Kiers
Nice. Will try to read up on these things :) Thanks!
Svish
+2  A: 

Here's some code that uses regular expressions.

Note that the issue raised by Bart in the comments to your question, ie. "How do you handle 1+++5", is not handled at all.

To fix that, unless your code is already out in the wild and not subject to change of behaviour, I would suggest you change your syntax to the following:

  • use .. to denote ranges
  • allow both + and - for numbers, for positive and negative numbers
  • use comma and/or semicolon to separate distinct numbers or ranges
  • allow whitespace

Look at the difference between the two following strings:

  • 1++7+10++15+20+30++
  • 1..7, 10..15, 20, 30..

The second string is much easier to parse, and much easier to read.

It would also remove all ambiguity:

  • 1+++5 = 1++ + 5 = 1.., 5
  • 1+++5 = 1 + ++5 = 1, ..5

There's no way to parse wrong the second syntax.


Anyway, here's my code. Basically it works by adding four regex patterns for the four types of patterns:

  • num
  • num++
  • ++num
  • num++num

For "num", it will handle negative numbers with a leading minus sign, and one or more digits. It does not, for obvious reasons, handle the plus sign as part of the number.

I've interpreted "and up" to mean "up to Int32.MaxValue" and same for down to Int32.MinValue.

public class Range
{
    public readonly Int32 From;
    public readonly Int32 To;

    public Range(Int32 from, Int32 to)
    {
        From = from;
        To = to;
    }

    public override string ToString()
    {
        if (From == To)
            return From.ToString();
        else if (From == Int32.MinValue)
            return String.Format("++{0}", To);
        else if (To == Int32.MaxValue)
            return String.Format("{0}++", From);
        else
            return String.Format("{0}++{1}", From, To);
    }
}

public static class RangeSplitter
{
    public static Range[] Split(String s)
    {
        if (s == null)
            throw new ArgumentNullException("s");

        String[] parts = new Regex(@"(?<!\+)\+(?!\+)").Split(s);
        List<Range> result = new List<Range>();

        var patterns = new Dictionary<Regex, Action<Int32[]>>();

        patterns.Add(new Regex(@"^(-?\d+)$"),
            values => result.Add(new Range(values[0], values[0])));
        patterns.Add(new Regex(@"^(-?\d+)\+\+$"),
            values => result.Add(new Range(values[0], Int32.MaxValue)));
        patterns.Add(new Regex(@"^\+\+(-?\d+)$"),
            values => result.Add(new Range(Int32.MinValue, values[0])));
        patterns.Add(new Regex(@"^(-?\d+)\+\+(-?\d+)$"),
            values => result.Add(new Range(values[0], values[1])));

        foreach (String part in parts)
        {
            foreach (var kvp in patterns)
            {
                Match ma = kvp.Key.Match(part);
                if (ma.Success)
                {
                    Int32[] values = ma.Groups
                        .OfType<Group>()
                        .Skip(1) // group 0 is the entire match
                        .Select(g => Int32.Parse(g.Value))
                        .ToArray();
                    kvp.Value(values);
                }
            }
        }

        return result.ToArray();
    }
}

Unit-tests:

[TestFixture]
public class RangeSplitterTests
{
    [Test]
    public void Split_NullString_ThrowsArgumentNullException()
    {
        Assert.Throws<ArgumentNullException>(() =>
        {
            var result = RangeSplitter.Split(null);
        });
    }

    [Test]
    public void Split_EmptyString_ReturnsEmptyArray()
    {
        Range[] result = RangeSplitter.Split(String.Empty);
        Assert.That(result.Length, Is.EqualTo(0));
    }

    [TestCase(01, "++7", Int32.MinValue, 7)]
    [TestCase(02, "7", 7, 7)]
    [TestCase(03, "7++", 7, Int32.MaxValue)]
    [TestCase(04, "1++7", 1, 7)]
    public void Split_SinglePatterns_ProducesExpectedRangeBounds(
        Int32 testIndex, String input, Int32 expectedLower,
        Int32 expectedUpper)
    {
        Range[] result = RangeSplitter.Split(input);
        Assert.That(result.Length, Is.EqualTo(1));
        Assert.That(result[0].From, Is.EqualTo(expectedLower));
        Assert.That(result[0].To, Is.EqualTo(expectedUpper));
    }

    [TestCase(01, "++7")]
    [TestCase(02, "7++")]
    [TestCase(03, "1++7")]
    [TestCase(04, "1+7")]
    [TestCase(05, "1++7+10++15+20+30++")]
    public void Split_ExamplesFromQuestion_ProducesCorrectResults(
        Int32 testIndex, String input)
    {
        Range[] ranges = RangeSplitter.Split(input);
        String rangesAsString = String.Join("+",
            ranges.Select(r => r.ToString()).ToArray());

        Assert.That(rangesAsString, Is.EqualTo(input));
    }

    [TestCase(01, 10, 10, "10")]
    [TestCase(02, 1, 10, "1++10")]
    [TestCase(03, Int32.MinValue, 10, "++10")]
    [TestCase(04, 10, Int32.MaxValue, "10++")]
    public void RangeToString_Patterns_ProducesCorrectResults(
        Int32 testIndex, Int32 lower, Int32 upper, String expected)
    {
        Range range = new Range(lower, upper);
        Assert.That(range.ToString(), Is.EqualTo(expected));
    }
}
Lasse V. Karlsen
First, very impressive. But - I've already suggested replacing the characters (in a way), and Svish said there could be dates and a mysterious "Anything", not just numbers.
Kobi
Woah. Will have to go through this, hehe. Some interesting things. Not sure if I quite catch what is happening with those patterns and the dictionary... but I will have a look :p(Interesting way of writing those tests too. Maybe I will rewrite mine a bit. Didn't think of using TestCase like that. What is the testIndex for? Never seen that before...)
Svish
And, oh, yeah, that mysterious "Anything" isn't really that mysterious. But it can pretty much contain anything. Examples are numbers, dates, names, product ids, et cetera. No clue if they have a rule somewhere preventing people from entering a + into any of the names for example. But yeah... Not something I decided. I just have to work with it. It's kind of one of these "It's always been like that, so we can't change it"-things. I would love to make it more restrictive and better defined though...
Svish
The test index is just something I use internally, just used to add them. I have a UnitTestHelper with a static Context property that I set, this allows me to set breakpoints anywhere in my class library or code and just add a condition of "UnitTestHelper.Context == 4" to only break for a particular unit test.
Lasse V. Karlsen
As for "anything", I suggest you define the grammar first, and describe how to interpret the edge cases, *before* you write more code.
Lasse V. Karlsen