I need to split a path-like string that is delimited by dots. The tricky part is that each subentry may itself contain dots, which are escaped by another dot. Each entry may otherwise contain basically anything (including special characters such as space or :;/\|(), etc.).

Two examples:

"Root.Subpath.last/entry:with special;chars" -> [0] Root [1] Subpath [2] last/entry:with special;chars

"Root.Subpath..with..dots.Username" -> [0] Root [1] Subpath.with.dots [2] Username

Currently I am not using a regular expression for this. Instead, I replace every ".." with a placeholder before running a split, and put the dots back after the split. This works fine, but it's not super clean. Mostly I am curious (or maybe annoyed?) about how to write a regex for Regex.Split that does the same thing, as this was my first idea. I include my current solution to show the output I expect.

Split(path.Replace("..", REP_STR), ".") _
  .Select(Function(s as string) s.Replace(REP_STR, ".")).ToArray

I am using VB.NET.

+1  A: 

I don't know if VB.NET supports non-capturing groups, but in Java I would use this regular expression to split your string:

(?<=[^\.])\.(?=[^\.]|$)
Superfilin
Yes, .NET supports non-capturing groups
Kamarey
Those are look-around assertions rather than non-capturing groups. See http://www.regular-expressions.info/lookaround.html and http://www.regular-expressions.info/brackets.html
Gumbo
A look-ahead assertion is a special case of a non-capturing group.
Superfilin
Generally speaking, regex groups are matched strings that are stored in memory so that they can be referenced later. When talking about non-capturing groups, that's usually the `(?:...)` style. While any `(?...)` construct can be considered a group, I would rather call lookaround groups "assertions", and even more specifically, "zero-width assertions", as they assert whether the match should continue while not matching characters (i.e. they have zero width.)
Blixt
In short: "non-capturing group" does not have to be zero width, therefore lookaround assertions should not be called "a special case of non-capturing groups", they should be called "zero-width assertions".
Blixt
You didn't convince me :). Neither "non-capturing groups" nor "zero-width assertions" capture anything for later reference as $1, $2, $3... That's why they are called non-capturing. All look-around assertions are non-capturing, which is why they are a special case of non-capturing groups.
Superfilin
However you classify lookarounds, the wording of this answer is misleading. The fact that a regex flavor supports non-capturing groups does not mean it also supports lookarounds. Nor does a flavor that supports lookaheads automatically support lookbehinds as well. They're three completely independent features that happen to look similar. (It's worth noting that in Perl 6, capturing groups, non-capturing groups, lookaheads and lookbehinds look very different from each other, reflecting their disparate natures.)
Alan Moore
A: 

I can't test VB.NET at home, so this code is not tested, but I think it should work.


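' Requires: Imports System.Text.RegularExpressions
' Text is the input path. Temp is a placeholder that is assumed never to occur in the input.
Dim Temp = ChrW(1).ToString()
' Lookarounds instead of ([^\.])\.([^\.]) so that one-character entries such as "a.b.c" are handled.
Dim aTempMaker As New Regex("(?<!\.)\.(?!\.)")
Dim aDeEscaper As New Regex("\.\.")
Dim aSpliter   As New Regex(Regex.Escape(Temp))

' Mark every lone dot with the placeholder, then split on the placeholder.
Dim aStrs = aSpliter.Split(aTempMaker.Replace(Text, Temp))
Dim aResults(aStrs.Length - 1) As String

Dim i = 0
For Each aStr In aStrs
    ' Collapse each escaped ".." back into a single dot.
    aResults(i) = aDeEscaper.Replace(aStr, ".")
    i += 1
Next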
NawaMan
+1  A: 

I would not use a regular expression for matching items and returning them. Even if you make the perfect regular expression, you'll still need to replace the double dots with single dots afterwards.

You could use a regex such as (?<!\.)\.(?!\.) for splitting, but I would probably just stick with your current method as it is more efficient. Alternatively, write your own splitting function that will do the "de-dotting" at the same time.

Here's a custom function that might look long, but is probably still more efficient than replacing, splitting then replacing again (and more efficient than a regex too):

And yes, it's C#, because I don't know VB.NET, but for the most part the two languages are interchangeable.

// Requires: using System.Collections.Generic;
public static string[] SplitPath(string path)
{
    List<string> pieces = new List<string>();

    int index = -1, last = 0;
    // Keep looping as long as there are dots.
    while ((index = path.IndexOf('.', index + 1)) >= 0)
    {
        // Don't do more checking on last character.
        if (index == path.Length - 1) break;

        // If next character is also a dot, skip.
        if (path[index + 1] == '.')
        {
            index++;
            continue;
        }

        // Add current piece.
        pieces.Add(path.Substring(last, index - last).Replace("..", "."));

        // Store start of next piece.
        last = index + 1;
    }

    // Add final piece, unless it is empty.
    // (Comparing against path.Length keeps a one-character final piece, e.g. the "b" in "a.b".)
    if (last < path.Length) pieces.Add(path.Substring(last).Replace("..", "."));

    return pieces.ToArray();
}
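
Since the question is about VB.NET, here is a rough, untested port of the function above. It is only a sketch: the module name `PathSplitter` is made up for illustration, Option Infer is assumed to be on, and the final check uses `last < path.Length` so that a one-character last entry is kept.

```vbnet
Imports System
Imports System.Collections.Generic

Module PathSplitter
    ' Hypothetical VB.NET port of the C# SplitPath function.
    Function SplitPath(path As String) As String()
        Dim pieces As New List(Of String)()
        Dim last = 0
        Dim index = path.IndexOf("."c)

        ' Keep looping as long as there are dots; a dot in the last position is not checked.
        While index >= 0 AndAlso index < path.Length - 1
            If path(index + 1) = "."c Then
                ' Escaped dot: skip over both characters.
                index += 1
            Else
                ' Lone dot: close the current piece and unescape it.
                pieces.Add(path.Substring(last, index - last).Replace("..", "."))
                last = index + 1
            End If
            index = path.IndexOf("."c, index + 1)
        End While

        ' Add the final piece, unless it is empty.
        If last < path.Length Then pieces.Add(path.Substring(last).Replace("..", "."))

        Return pieces.ToArray()
    End Function

    Sub Main()
        Console.WriteLine(String.Join(" | ", SplitPath("Root.Subpath..with..dots.Username")))
    End Sub
End Module
```

For the sample path from the question this should print `Root | Subpath.with.dots | Username`.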
Blixt
Nice improvement on my snippet. I guess you can only do so much with a one-liner :P
Tewr
+2  A: 

The regex

(?<!\.)\.(?!\.)

will match a dot only if it is neither preceded nor followed by another dot.
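
A minimal VB.NET sketch of splitting with this pattern, assuming Option Infer is on. The module name `SplitDemo` and the helper `SplitPath` are made up for illustration. The split leaves the escaped `..` pairs in place, so they are collapsed in a second step.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module SplitDemo
    ' Hypothetical helper: split on lone dots, then collapse each escaped ".." to ".".
    Function SplitPath(path As String) As String()
        Return Regex.Split(path, "(?<!\.)\.(?!\.)") _
            .Select(Function(s) s.Replace("..", ".")) _
            .ToArray()
    End Function

    Sub Main()
        For Each entry In SplitPath("Root.Subpath..with..dots.Username")
            Console.WriteLine(entry)
        Next
    End Sub
End Module
```

For the sample input this prints `Root`, `Subpath.with.dots` and `Username` on separate lines. A run of three or more dots contains no lone dot, so this pattern never splits inside such a run.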

Tim Pietzcker
+2  A: 

Here’s another regular expression that is a little more efficient since the look-behind assertion is only tested if a dot has already been found:

\.(?<!\.\.)(?!\.)
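
When the look-behind runs here, the dot has already been consumed, so `(?<!\.\.)` looks at the matched dot plus the character before it. The following VB.NET sketch (module name `CompareDemo` is made up, Option Infer assumed) only checks that this pattern and `(?<!\.)\.(?!\.)` split one sample input the same way. It does not measure speed.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CompareDemo
    Sub Main()
        Dim input = "Root.Subpath..with..dots.Username"

        ' Look-behind tested before the dot is matched.
        Dim lookbehindFirst = Regex.Split(input, "(?<!\.)\.(?!\.)")
        ' Dot matched first, look-behind tested afterwards.
        Dim dotFirst = Regex.Split(input, "\.(?<!\.\.)(?!\.)")

        ' Prints True when both patterns produce the same pieces for this input.
        Console.WriteLine(String.Join("|", lookbehindFirst) = String.Join("|", dotFirst))
    End Sub
End Module
```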
Gumbo
A: 

This will match the first dot in an odd-length sequence of dots.

\.(?<!\.\.)(?=(\.\.)*[^.])

An example of splitting at this pattern:

// input
'Foo.Bar..Baz...Bop....Quux'

// becomes
0 => 'Foo'
1 => 'Bar..Baz'
2 => '..Bop....Quux'

Slightly confusing, but it works. It should also be possible to split at the last dot in the sequence using a variable-width lookbehind; however, those are not widely supported in regular expression libraries.
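
A VB.NET sketch of both variants, assuming Option Infer is on; the module name `OddDotsDemo` and the `lastDot` pattern are my own illustration, not something tested in PCRE. Two details are specific to .NET: `Regex.Split` adds text captured by capturing groups to the result array, so the group in the lookahead is written as `(?:...)`, and the second pattern depends on variable-width lookbehind.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module OddDotsDemo
    Sub Main()
        Dim input = "Foo.Bar..Baz...Bop....Quux"

        ' First dot of an odd-length run (non-capturing group inside the lookahead).
        Dim firstDot = "\.(?<!\.\.)(?=(?:\.\.)*[^.])"
        ' Last dot of an odd-length run: an even number of dots before it,
        ' no dot before those, and no dot after it.
        Dim lastDot = "(?<=(?<!\.)(?:\.\.)*)\.(?!\.)"

        Console.WriteLine(String.Join(" | ", Regex.Split(input, firstDot)))
        Console.WriteLine(String.Join(" | ", Regex.Split(input, lastDot)))
    End Sub
End Module
```

The first line should read `Foo | Bar..Baz | ..Bop....Quux` and the second `Foo | Bar..Baz.. | Bop....Quux`. Splitting at the last dot keeps the escaped pairs attached to the entry they belong to, so a plain replace of `..` with `.` afterwards gives `Bar.Baz.` and `Bop..Quux`.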

Alex Barrett
I know `System.Text.RegularExpressions` in .NET supports variable-width lookbehind, but I was testing this in PCRE :)
Alex Barrett