ansaurus

Question

Regular expression for splitting by char which might be escaped

Answer 1

+1 A:

I don't know if VB.NET supports non-capturing groups, but in Java I would use this regular expression to split your string:

(?<=[^\.])\.(?=[^\.]|$)
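As a rough illustration only, here is how that pattern behaves when passed to a split call. The sketch is in Python rather than Java or VB.NET, and the helper name `split_unescaped` and the sample string are made up for the example.

```python
import re

# Pattern from the answer: a dot preceded by a non-dot and followed by
# either a non-dot or the end of the string.
SEPARATOR = re.compile(r"(?<=[^\.])\.(?=[^\.]|$)")

def split_unescaped(text):
    """Split on single dots; doubled dots stay in the pieces untouched."""
    return SEPARATOR.split(text)

print(split_unescaped("Foo.Bar..Baz.Qux"))  # ['Foo', 'Bar..Baz', 'Qux']
```

The pieces still contain the doubled dots, so a separate un-escaping step is needed afterwards.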

Superfilin 2009-09-23 07:43:08

Yes, .NET supports non-capturing groups

Kamarey 2009-09-23 07:57:58

Those are look-around assertions rather than non-capturing groups. See http://www.regular-expressions.info/lookaround.html and http://www.regular-expressions.info/brackets.html

Gumbo 2009-09-23 08:08:50

A look-ahead assertion is a special case of a non-capturing group.

Superfilin 2009-09-23 08:13:44

Generally speaking, regex groups are matched strings that are stored in memory so that they can be referenced later. When talking about non-capturing groups, that's usually the `(?:...)` style. While any `(?...)` construct can be considered a group, I would rather call lookaround groups "assertions", and even more specifically, "zero-width assertions", as they assert whether the match should continue while not matching characters (i.e. they have zero width.)

Blixt 2009-09-23 08:19:28

In short: "non-capturing group" does not have to be zero width, therefore lookaround assertions should not be called "a special case of non-capturing groups", they should be called "zero-width assertions".
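A small sketch of the distinction being discussed, written in Python purely for illustration (the sample string is made up): a capturing group and a non-capturing group both consume characters, while a lookahead consumes none.

```python
import re

text = "foobar"

capturing = re.match(r"foo(bar)", text)        # consumes "bar" and stores it
non_capturing = re.match(r"foo(?:bar)", text)  # consumes "bar", stores nothing
lookahead = re.match(r"foo(?=bar)", text)      # only asserts "bar" follows

print(capturing.group(0), capturing.groups())          # foobar ('bar',)
print(non_capturing.group(0), non_capturing.groups())  # foobar ()
print(lookahead.group(0), lookahead.groups())          # foo ()
```

Only the lookahead leaves "bar" out of the overall match, which is what "zero width" refers to.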

Blixt 2009-09-23 08:23:03

You didn't convince me :). Neither "non-capturing groups" nor "zero-width assertions" capture anything for later reference as $1, $2, $3... That's why they are called non-capturing. All look-around assertions are non-capturing. That's why they are a special case of non-capturing groups.

Superfilin 2009-09-23 09:21:58

However you classify lookarounds, the wording of this answer is misleading. The fact that a regex flavor supports non-capturing groups does not mean it also supports lookarounds. Nor does a flavor that supports lookaheads automatically support lookbehinds as well. They're three completely independent features that happen to look similar. (It's worth noting that in Perl 6, capturing groups, non-capturing groups, lookaheads and lookbehinds look very different from each other, reflecting their disparate natures.)

Alan Moore 2009-09-23 15:17:04

Answer 2

A:

I can't test VB.NET at home, so this code is untested, but I think it should work.


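' Requires: Imports System.Text.RegularExpressions
' Text is the input string to split.

' Placeholder that must never occur in the input. The value in the original
' post was lost; ChrW(0) is an assumption.
Dim Temp As String = ChrW(0)
' A lookahead is used for the following character so that it is not consumed;
' otherwise adjacent separators such as those in "a.b.c" would be missed.
Dim aTempMaker As New RegEx("([^\.])\.(?=[^\.])")
Dim aDeEscaper As New RegEx("\.\.")
Dim aSpliter   As New RegEx(RegEx.Escape(Temp))

Dim aStrs    = aSpliter.Split(aTempMaker.Replace(Text, "$1" + Temp))
Dim aResults(aStrs.Length - 1) As String

Dim i = 0
For Each aStr In aStrs
    ' Turn each escaped ".." back into a single dot.
    aResults(i) = aDeEscaper.Replace(aStr, ".")
    i += 1
Next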
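The three steps in the snippet above (mark the separators with a placeholder, split on the placeholder, un-escape the doubled dots) can be sketched in Python as follows. This is not the answer's code: the NUL placeholder is an assumption, the function name is invented, and lookarounds are swapped in for the capture groups to keep the replacement simple.

```python
import re

PLACEHOLDER = "\x00"  # assumption: this character never occurs in the input

def split_with_placeholder(text):
    # 1. Replace every lone dot (no dot on either side) with the placeholder.
    marked = re.sub(r"(?<!\.)\.(?!\.)", PLACEHOLDER, text)
    # 2. Split on the placeholder.
    # 3. Collapse each escaped ".." back to a single ".".
    return [piece.replace("..", ".") for piece in marked.split(PLACEHOLDER)]

print(split_with_placeholder("Foo.Bar..Baz.Qux"))  # ['Foo', 'Bar.Baz', 'Qux']
```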

NawaMan 2009-09-23 07:46:09

Answer 3

+1 A:

I would not use a regular expression for matching items and returning them. Even if you make the perfect regular expression, you'll still need to replace the double dots with single dots afterwards.

You could use a regex such as (?<!\.)\.(?!\.) for splitting, but I would probably just stick with your current method as it is more efficient. Alternatively, write your own splitting function that will do the "de-dotting" at the same time.

Here's a custom function, written in C# because I don't know VB.NET (for the most part the two languages are interchangeable). It might look long, but it is probably still more efficient than replacing, splitting, then replacing again (and more efficient than a regex too):

// Requires: using System.Collections.Generic;
public static string[] SplitPath(string path)
{
    List<string> pieces = new List<string>();

    int index = -1, last = 0;
    // Keep looping as long as there are dots.
    while ((index = path.IndexOf('.', index + 1)) >= 0)
    {
        // Don't do more checking on last character.
        if (index == path.Length - 1) break;

        // If next character is also a dot, skip.
        if (path[index + 1] == '.')
        {
            index++;
            continue;
        }

        // Add current piece.
        pieces.Add(path.Substring(last, index - last).Replace("..", "."));

        // Store start of next piece.
        last = index + 1;
    }

    // Add the final piece. Comparing against path.Length (not Length - 1)
    // keeps a one-character final piece such as the "b" in "a.b".
    if (last < path.Length) pieces.Add(path.Substring(last).Replace("..", "."));

    return pieces.ToArray();
}
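For comparison, the same single-pass idea (split and un-escape in one scan, no regex) can be sketched in Python. This is an independent sketch, not a port of the function above: the name `split_path` is illustrative, and a trailing separator is assumed to produce an empty final piece.

```python
def split_path(path):
    """Split on single dots and turn each '..' into a literal dot, in one pass."""
    pieces = []
    current = []
    i = 0
    while i < len(path):
        if path[i] == ".":
            if i + 1 < len(path) and path[i + 1] == ".":
                # Escaped dot: keep one dot and skip both characters.
                current.append(".")
                i += 2
                continue
            # Lone dot: close the current piece.
            pieces.append("".join(current))
            current = []
        else:
            current.append(path[i])
        i += 1
    pieces.append("".join(current))
    return pieces

print(split_path("Foo.Bar..Baz.Qux"))  # ['Foo', 'Bar.Baz', 'Qux']
```

Reading left to right, a run of three dots is treated as an escaped dot followed by a separator, so `split_path("a...b")` gives `['a.', 'b']`.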

Blixt 2009-09-23 07:53:43

Nice improvement on my snippet. I guess you can only do so much with a one-liner :P

Tewr 2009-09-23 15:40:44

Answer 4

+2 A:

The regex

(?<!\.)\.(?!\.)

will match a dot only if it is neither preceded nor followed by another dot.
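To see which dots qualify, a quick check in Python (used here only for illustration, with a made-up sample string) lists the match positions:

```python
import re

LONE_DOT = re.compile(r"(?<!\.)\.(?!\.)")

sample = "a.b..c...d"
positions = [m.start() for m in LONE_DOT.finditer(sample)]

print(positions)               # [1]
print(LONE_DOT.split(sample))  # ['a', 'b..c...d']
```

Only the dot at index 1 matches. Every dot in the runs of two and three has another dot next to it, so none of them is treated as a separator.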

Tim Pietzcker 2009-09-23 07:57:23

Answer 5

+2 A:

Here’s another regular expression that is a little more efficient since the look-behind assertion is only tested if a dot has already been found:

\.(?<!\.\.)(?!\.)
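A small Python check (illustrative only, with made-up sample strings) suggests this pattern splits the same way as the lookbehind-first form `(?<!\.)\.(?!\.)`. It says nothing about the relative speed, which depends on the regex engine.

```python
import re

lookbehind_first = re.compile(r"(?<!\.)\.(?!\.)")
dot_first = re.compile(r"\.(?<!\.\.)(?!\.)")

samples = ["a.b", "a..b", "a...b", ".a.", "a.b..c.d", ""]

# Collect the inputs on which the two patterns disagree.
mismatches = [s for s in samples
              if lookbehind_first.split(s) != dot_first.split(s)]

print(mismatches)  # []
```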

Gumbo 2009-09-23 08:12:48

Answer 6

A:

This will match the first dot in an odd-length sequence of dots.

\.(?<!\.\.)(?=(\.\.)*[^.])

An example of splitting at this pattern:

// input
'Foo.Bar..Baz...Bop....Quux'

// becomes
0 => 'Foo'
1 => 'Bar..Baz'
2 => '..Bop....Quux'

Slightly confusing, but it works. It should also be possible to split at the last dot in the sequence using a variable-width lookbehind, but those are not widely supported in regular expression libraries.
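A Python sketch of the same pattern on a different, made-up input shows why the leftover dots are harmless: once each piece is un-escaped, the remaining pairs collapse to literal dots. The inner group is made non-capturing here, because Python's `re.split` would otherwise add captured text to the result. The helper name is illustrative.

```python
import re

# The answer's pattern, with (\.\.)* rewritten as (?:\.\.)* for re.split.
ODD_RUN_START = re.compile(r"\.(?<!\.\.)(?=(?:\.\.)*[^.])")

def split_odd_runs(text):
    """Split at the first dot of each odd-length run, then un-escape '..'."""
    raw = ODD_RUN_START.split(text)
    return [piece.replace("..", ".") for piece in raw]

print(ODD_RUN_START.split("a.b..c...d"))  # ['a', 'b..c', '..d']
print(split_odd_runs("a.b..c...d"))       # ['a', 'b.c', '.d']
```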

Alex Barrett 2009-09-23 10:35:08

I know `System.Text.RegularExpressions` in .NET supports variable-width lookbehind, but I was testing this in PCRE :)

Alex Barrett 2009-09-23 10:36:51
