ansaurus

Question

Answer 1

+3 A:

Well, there will probably be an answer with a clever RegEx, but I'll give it a try with my favorite string.Split() function.

As a first step you can split the input string on ';'

string[] datasets = inputString.Split(';');

As far as your last point goes, it seems that a comma ',' does more or less the same. You can merge the two with Split(';', ',') or keep them separate with

string[] parts = datasets[i].Split(',');

A part is then one of three cases: a single number, a range, or a stepped range.

You can probe that with string.IndexOf() and/or

string[] rangeParts = parts[j].Split('-');
string[] steppedParts = parts[j].Split(':');

The results should have Length 2 and 3 respectively.

The resulting strings should then be checked with TryParse(), and because the input uses punctuation characters, you'd better pin the culture:

bool valid = double.TryParse(parts[k], 
  System.Globalization.NumberStyles.AllowDecimalPoint, 
  System.Globalization.CultureInfo.InvariantCulture, out value);

Those are the parts, some assembly required.
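As a rough sketch of how those parts might be assembled, the classification step could look like the following. The names PartClassifier, TryNumber and Classify are made up for illustration and are not from the answer above; negative numbers and embedded whitespace are deliberately not handled.

```csharp
using System;
using System.Globalization;

static class PartClassifier
{
    // Hypothetical helper: accepts digits with an optional decimal point,
    // always using '.' as the separator regardless of the current culture.
    static bool TryNumber(string text, out double value)
    {
        return double.TryParse(text, NumberStyles.AllowDecimalPoint,
            CultureInfo.InvariantCulture, out value);
    }

    // Classifies one comma-separated part as "single", "range", "stepped" or "invalid".
    public static string Classify(string part)
    {
        double ignored;

        // A stepped range has exactly three ':' separated numbers.
        string[] steppedParts = part.Split(':');
        if (steppedParts.Length == 3)
        {
            bool ok = TryNumber(steppedParts[0], out ignored)
                   && TryNumber(steppedParts[1], out ignored)
                   && TryNumber(steppedParts[2], out ignored);
            return ok ? "stepped" : "invalid";
        }

        // A plain range has exactly two '-' separated numbers.
        string[] rangeParts = part.Split('-');
        if (rangeParts.Length == 2)
        {
            bool ok = TryNumber(rangeParts[0], out ignored)
                   && TryNumber(rangeParts[1], out ignored);
            return ok ? "range" : "invalid";
        }

        // Everything else must be a single number ("1:2" or "1-2-3" end up here and fail).
        return TryNumber(part, out ignored) ? "single" : "invalid";
    }
}
```

With this sketch, Classify("1-7") yields "range", Classify("1:.5:7") yields "stepped", and something like "1:2" is reported as "invalid" instead of causing an exception later.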

Henk Holterman 2009-04-01 22:03:39

Answer 2

A:

I'm not sure I completely understand your question, but it sounds like you're looking for String.Split()

Not Sure 2009-04-01 22:03:48

Answer 3

+1 A:

There's no convention in C# for parsing ranges, so you're free to do whatever makes the most sense.

However you may wish to derive your notation from Interval Notation in math.

[2,4] - numbers between 2 and 4, including both endpoints
(0,7] - numbers between 0 and 7, including 7 but not 0

Robert Paulson 2009-04-01 22:10:56

Answer 4

+2 A:

I would suggest using regular expressions. First I would split the input into sections with the following expression.

^((?<section>[^;]+)(;|$))+

Then split each section into subsections.

^((?<subsection>[^,]+)(,|$))+

Now match the three possible subsection types.

(?<value>^[0-9]+$)|
(?<range>^[0-9]+-[0-9]+$)|
(?<rangewithstep>^[0-9]+:[0-9]*\.?[0-9]+:[0-9]+$)

Finally, you must analyze the range-type subsections.

^(?<start>[0-9]+)-(?<end>[0-9]+)$

^(?<start>[0-9]+):(?<step>[0-9]*\.?[0-9]+):(?<end>[0-9]+)$

Now it is a matter of parsing the extracted strings into numbers and adding them into arrays.

I put everything together into a small console application that does the job. It is far from perfect: no error handling, nothing, just parsing a demo input. I merged some of the expressions mentioned above to make the code more compact and probably better.

using System;
using System.Text.RegularExpressions;
using System.Globalization;

namespace RangeParser
{
    class Program
    {
        static void Main(string[] args)
        {
            String input = "1-7,9,16:2:20;1-7; 3:.75 : 10;1,5,9;4-7";

            Match sections = (new Regex(@"^((?<section>[^;]+)(;|$))+")).Match(input.Replace(" ", ""));

            foreach (Capture section in sections.Groups["section"].Captures)
            {
                Console.Write("Section ");

                Match subsections = (new Regex(@"^((?<subsection>[^,]+)(,|$))+")).Match(section.Value);

                foreach (Capture subsection in subsections.Groups["subsection"].Captures)
                {
                    Match subsectionparts = (new Regex(@"^(?<start>[0-9]*\.?[0-9]+)(((:(?<step>[0-9]*\.?[0-9]+):)|-)(?<end>[0-9]*\.?[0-9]+))?$")).Match(subsection.Value);

                    if (subsectionparts.Groups["start"].Length > 0)
                    {
                        Decimal start = Decimal.Parse(subsectionparts.Groups["start"].Value, CultureInfo.InvariantCulture);
                        Decimal end = start;
                        Decimal step = 1;

                        if (subsectionparts.Groups["end"].Length > 0)
                        {
                            end = Decimal.Parse(subsectionparts.Groups["end"].Value, CultureInfo.InvariantCulture);

                            if (subsectionparts.Groups["step"].Length > 0)
                            {
                                step = Decimal.Parse(subsectionparts.Groups["step"].Value, CultureInfo.InvariantCulture);
                            }
                        }

                        Decimal current = start;

                        // A step of 0 would never reach 'end', so such ranges are skipped.
                        while (step > 0 && current <= end)
                        {
                            Console.Write(String.Format("{0} ", current));

                            current += step;
                        }
                    }
                }

                Console.WriteLine();
            }

            Console.ReadLine();
        }
    }
}

UPDATE

Modified to allow things like '1.5:0.2:3.6'.

UPDATE

Why use decimal instead of single or double?

The numbers in the input are decimal numbers, and most decimal fractions cannot be represented exactly by single or double, because those types use a base 2 representation. For example, 0.1 is represented by the single value 0.100000001490116119384765625.

Single x = 0.0F;

for (int i = 0; i < 8; i++)
{
   x += 0.1F;
}

Console.WriteLine(x);

This program will print 0.8000001 after only 8 iterations. After 1000 iterations the error grows to 0.00095, displaying 99.99905 instead of 100.0, and after one million iterations the result is 100,958.3 instead of 100,000.

There are no such errors for decimal, because decimal uses a base 10 representation and is able to exactly represent decimal numbers like 0.1.
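For comparison, here is a minimal sketch of the same loop using decimal. The names DecimalSum and AddTenths are only illustrative.

```csharp
using System;

static class DecimalSum
{
    // Adds 0.1 'count' times using decimal arithmetic.
    public static decimal AddTenths(int count)
    {
        decimal x = 0m;

        for (int i = 0; i < count; i++)
        {
            x += 0.1m; // 0.1m is stored exactly, so no rounding error accumulates
        }

        return x;
    }
}
```

In this sketch AddTenths(8) compares equal to 0.8m and AddTenths(1000) compares equal to 100m, because every intermediate sum is exactly representable as a decimal.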

Daniel Brückner 2009-04-01 22:11:05

Isn't this a situation where regex is a little overly ugly? A straight code solution would be simpler and more understandable, I think.

C. Ross 2009-04-02 19:48:22

Maybe. I decided on regex because it covers all cases (if the regex is correct). With String.Split() and friends you get a simpler solution as long as the input is valid, but catching all invalid inputs using string methods might really become a horror.

Daniel Brückner 2009-04-02 21:02:42

When I start parsing a number into a decimal, I already know that it is a valid number and that parsing will not fail. Or think of inputs like '1-2-3;,-1:.:': you will split them, but will almost certainly crash later or return a meaningless result.

Daniel Brückner 2009-04-02 21:10:08

In David's code, accessing parts[2] will be out of bounds for the input '1:2', parsing may fail, and there are almost certainly a few more uncaught errors. Catching them all will probably make the code less readable than the regex code.

Daniel Brückner 2009-04-02 21:12:58

Answer 5

A:

OK, so when I split the string using Henk's first two steps (assuming there is a ',' in the datasets), I can then go in and fill up an array with the remaining information.

For the '-' separated ones, I would take the value before the '-' and run a for loop from there to the value after it.

For the ':' separated ones, I do almost the same thing, except that instead of an i++ increment in the for loop update, I do i += (middle value).

To parse out the values before and after the '-' or ':' characters, I can just split again and know which indices in the array correspond to what.

Thank you,

I will update this tomorrow with my final solution.

If Henk Holterman wants to update his solution with what I said above (description of parsing out the other parts), I will upvote on my home account. For some reason they block openID here.

Side note: I don't get why they won't let me accept the solution even as a guest, I should be able to if I provide my proper email address right?

2009-04-01 22:13:14

Answer 6

A:

First split the string on semicolon to get the separate sets. Then split each set by comma to get the separate numbers or ranges in the sets.

The strings that you have now can either be:

A single number, like 42
A range of numbers, like 1-7
A step range, like 1:.5:7

You can identify the second and third by checking if the string contains a hyphen or a colon. You would then split those strings and do some looping to add the numbers to the set.

By handling the numbers and ranges on the same level like this, they can be mixed exactly as you wanted.

Some tips:

Use double.TryParse to parse the numbers. Use CultureInfo.InvariantCulture as the format provider; it uses a period as the decimal separator.

You can use a List<double> to hold the numbers for each set. The final result can either be an array of lists, or you can use the ToArray method to create an array from the list if you want an array of arrays.
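One possible way to put these tips together is sketched below. The names SetParser, TryNumber and TryParseSets are hypothetical, and computing each value as start + i * step instead of repeatedly adding the step is an extra choice made here to limit accumulated rounding error; it is not part of the answer above. Very large ranges are not guarded against.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class SetParser
{
    // Hypothetical helper: invariant-culture parse that also rejects NaN and infinity.
    static bool TryNumber(string text, out double value)
    {
        return double.TryParse(text.Trim(), NumberStyles.AllowDecimalPoint,
                   CultureInfo.InvariantCulture, out value)
               && !double.IsNaN(value) && !double.IsInfinity(value);
    }

    // Returns false (instead of throwing) if any piece of the input is malformed.
    public static bool TryParseSets(string input, out List<List<double>> sets)
    {
        sets = new List<List<double>>();
        if (input == null) return false;

        foreach (string setText in input.Split(';'))
        {
            var numbers = new List<double>();

            foreach (string part in setText.Split(','))
            {
                double start = 0, step = 1, end = 0;
                bool hasColon = part.Contains(":");
                string[] pieces = part.Split(hasColon ? ':' : '-');

                if (pieces.Length == 1)
                {
                    // A single number, like 42.
                    if (!TryNumber(pieces[0], out start)) return false;
                    end = start;
                }
                else if (pieces.Length == 2 && !hasColon)
                {
                    // A range, like 1-7.
                    if (!TryNumber(pieces[0], out start) || !TryNumber(pieces[1], out end)) return false;
                }
                else if (pieces.Length == 3 && hasColon)
                {
                    // A step range, like 1:.5:7.
                    if (!TryNumber(pieces[0], out start) || !TryNumber(pieces[1], out step)
                        || !TryNumber(pieces[2], out end)) return false;
                }
                else
                {
                    return false;
                }

                // A zero step would never terminate.
                if (step <= 0) return false;

                for (int i = 0; start + i * step <= end; i++)
                {
                    numbers.Add(start + i * step);
                }
            }

            sets.Add(numbers);
        }

        return true;
    }
}
```

If an array of arrays is wanted, each inner list can be converted with ToArray() after a successful call.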

Guffa 2009-04-01 22:18:37

Answer 7

+2 A:

Here's some C# code that should do what you want:

var results = ParseExpression("1-7;3:.25:10;1,5,9;4-7");

private static List<List<float>> ParseExpression(string expression)
{
    // "x-y" is the same as "x:1:y" so simplify the expression...
    expression = expression.Replace("-", ":1:");

    var results = new List<List<float>>();
    foreach (var part in expression.Split(';'))
        results.Add(ParseSubExpression(part));

    return results;
}

private static List<float> ParseSubExpression(string part)
{
    var results = new List<float>();

    // If this is a set of numbers...
    if (part.IndexOf(',') != -1)
    {
        // Then add each member of the set...
        foreach (string a in part.Split(','))
            results.AddRange(ParseSubExpression(a));
    }
    // If this is a range that needs to be computed...
    else if (part.IndexOf(":") != -1)
    {
        // Parse out the range parameters...
        var parts = part.Split(':');
        var start = float.Parse(parts[0]);
        var increment = float.Parse(parts[1]);
        var end = float.Parse(parts[2]);

        // Evaluate the range...
        for (var i = start; i <= end; i += increment)
            results.Add(i);
    }
    else
    {
        results.Add(float.Parse(part));
    }

    return results;
}

David 2009-04-01 22:21:15

I would suggest not using float, because i += increment introduces a growing numeric error with every iteration.

Daniel Brückner 2009-04-02 18:54:08

Answer 8

+1 A:

The following comment on my regex solution prompted me to perform an analysis.

Isn't this a situation where regex is a little overly ugly? A straight code solution would be simpler and more understandable, I think. – C. Ross

My response was the following.

Maybe. I decided on regex because it covers all cases (if the regex is correct). With String.Split() and friends you get a simpler solution as long as the input is valid, but catching all invalid inputs using string methods might really become a horror. When I start parsing a number into a decimal, I already know that it is a valid number and that parsing will not fail. Or think of inputs like '1-2-3;,-1:.:': you will split them, but will almost certainly crash later or return a meaningless result. In David's code, accessing parts[2] will be out of bounds for the input '1:2', parsing may fail, and there are almost certainly a few more uncaught errors. Catching them all will probably make the code less readable than the regex code. – danbruc

So I decided to use Microsoft's awesome tool PEX to analyse my regex approach and David's string operation approach. I left David's code unmodified and replaced the console output in my solution with statements that build the result as a List<List<Decimal>>, just like David does.

To make a fairly complete analysis feasible, I constrained PEX to generate only inputs shorter than 45 characters and to use only the following 9 characters.

019.;,-:!

There is no need to use all digits, because they (should) all behave the same. I included 9 to make it easy to discover the overflow, but 0 and 1 should also be sufficient; PEX would probably find 1000 instead of 999. I included 0 and 1 to discover an error with very tiny numbers like 0.000[...]001, but nothing appeared. I assume very small numbers are silently rounded to zero, but I did not investigate this further. Or maybe 44 characters (44 because of the 28 to 29 digit precision of decimal plus some room for other characters) were just too short to generate a small enough number. The other characters are included because they are the other valid characters in the input. Finally, I included the exclamation mark as a surrogate for invalid characters.

The result of the analysis proved me right. PEX found two bugs in my code. I do not check for null input (I skipped that intentionally to concentrate on the important part), which causes the well-known NullReferenceException, and PEX discovered that the input "999999999999999999999999999999" causes Decimal.Parse() to fail with an OverflowException. PEX also reports some false positives. For example, "!;9,;.0;990:!!:,900:09" was reported as an input causing a FormatException, but rerunning the generated test yields no exception. It turns out that ".0" caused the test to fail during exploration. Looking at other failed tests reveals that Decimal.Parse() fails for (all) inputs starting with a decimal point during the exploration. But they are valid numbers and do not fail during normal execution. I am unable to explain these false positives.

And here is the result of one run of PEX against the string operation solution. Both implementations share the missing null check and the overflow exception. But the simple string operation solution is unable to handle many malformed inputs. Almost all of them result in a FormatException, but PEX also discovered the IndexOutOfRangeException I predicted.

FormatException:           "!,"
FormatException:           ","
FormatException:           "1,"
FormatException:           "!"
FormatException:           ";9"
FormatException:           "::"
FormatException:           "!.999009"
FormatException:           "!.0!99!9"
FormatException:           "0,9.90:!!,,,!,,,,,,!,,,0!!!9,!"
FormatException:           ""
FormatException:           "-99,9"
FormatException:           "1,9,,,!,,,,,,9,,,9,1,!9,,,,!,!"
FormatException:           "!:,"
FormatException:           "!9!:.!!,!!!."
FormatException:           "!:"
IndexOutOfRangeException:  "1:9"
FormatException:           "09..::!"
FormatException:           "9,0..:!.!,,,!,,,,,,!,,,!!-,!,!"
OverflowException:         "99999999999999999999999999999999999999999999"
FormatException:           "!."
FormatException:           "999909!!"
FormatException:           "-"
FormatException:           "9,9:9:999,,,9,,,,,,!,,,!9!!!,!"
FormatException:           "!9,"
FormatException:           "!.09!!0!"
FormatException:           "9-;"
FormatException:           ":"
FormatException:           "!.!9!9!!"
NullReferenceException:    null
FormatException:           ":,"
FormatException:           "!!"
FormatException:           "9;"

The question now is how hard it would be to handle all these cases. The simple solution would be to guard the parsing instructions with try/catch clauses. I am not sure whether this is sufficient to guarantee correct operation on the well-formed part of the input. But maybe this is not required and a malformed input should simply cause an empty result, which would make the solution easy to fix.
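A minimal sketch of such a try/catch guard might look like this. SafeParsing and its TryParse method are hypothetical names, the parser is passed in as a delegate so the sketch does not depend on either implementation, and only the three exception types from the list above are caught. As noted, this treats any malformed input as a complete failure and says nothing about the well-formed part.

```csharp
using System;

static class SafeParsing
{
    // Runs 'parser' on 'input' and turns the exceptions seen in the
    // PEX run into a 'false' return value instead of letting them escape.
    public static bool TryParse<T>(Func<string, T> parser, string input, out T result)
    {
        result = default(T);

        // Covers the missing null check in both implementations.
        if (input == null) return false;

        try
        {
            result = parser(input);
            return true;
        }
        catch (FormatException) { return false; }           // e.g. "!" or ","
        catch (OverflowException) { return false; }         // e.g. a very long run of 9s
        catch (IndexOutOfRangeException) { return false; }  // e.g. "1:9"
    }
}
```

Assuming David's ParseExpression is accessible from the calling code, it could be wrapped as SafeParsing.TryParse(ParseExpression, input, out results), with an empty result substituted whenever the call returns false.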

Finally, here are the code coverage results achieved. Note that I analysed the regex solution using both decimal and single, because PEX was unable to instrument one method used inside Decimal.Parse().

ParseExpression(string)            100,00%  10/10 blocks
ParseSubExpression(string)          96,15%  25/26 blocks

ParseExpressionRegex(string)        95,06%  77/81 blocks
ParseExpressionRegexSingle(string)  94,87%  74/78 blocks

Conclusion for me: a regex solution should really be preferred. Regular expressions are somewhat harder to design and understand, but they handle malformed input much more robustly than a simple implementation based on string operations. And not to forget: I did not check at all whether the returned results are correct. That is another matter.

Daniel Brückner 2009-04-03 01:06:41

tags: C# string convention parsing