Hi

I need to match a colon (':') in a string, but not when it's enclosed by quotes - either a " or ' character.

So the following should have 2 matches

something:'firstValue':'secondValue'    
something:"firstValue":'secondValue'

but this should only have 1 match

something:'no:match'
A: 

Oops, I missed the point; disregard the patterns below. This is quite hard to do because regular expressions are not good at counting balanced characters (the .NET implementation, for example, has an extension that can do it, but it's a bit complicated).

You can use negated character classes to do this.

[^'"]:[^'"]

You can further wrap the character classes in non-capturing groups.

(?:[^'"]):(?:[^'"])

Or you can use lookaround assertions.

(?<!['"]):(?!['"])
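A quick way to see how the lookaround pattern behaves on the question's samples is to run it. This is only a sketch in Python's `re` module (the choice of Python and the helper name `colon_hits` are assumptions for illustration, not part of the answer). It suggests why the answer opens with a retraction: colons that sit next to a quote are skipped, while the colon inside the quotes is still accepted.

```python
import re

# Third pattern from the answer: a colon with no quote directly before or after it.
LOOKAROUND = re.compile(r"""(?<!['"]):(?!['"])""")

def colon_hits(pattern, text):
    """Return the start index of every colon the pattern matches (hypothetical helper)."""
    return [m.start() for m in pattern.finditer(text)]

if __name__ == "__main__":
    # Expected 2 matches, but both colons touch a quote, so nothing matches.
    print(colon_hits(LOOKAROUND, "something:'firstValue':'secondValue'"))  # []
    # Expected the colon at index 9, but only the quoted colon at index 13 matches.
    print(colon_hits(LOOKAROUND, "something:'no:match'"))  # [13]
```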
Daniel Brückner
+1  A: 

If the regular expression implementation supports look-around assertions, try this:

:(?:(?<=["']:)|(?=["']))

This will match any colon that is either preceded or followed by a double or single quote, so it only considers constructs like the ones you mentioned; something:firstValue would not be matched.
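As a rough check, the pattern above can be tried against the question's samples. The snippet below is a sketch using Python's `re` module; the helper name `near_quote_colons` is made up for illustration. The lookbehind here is fixed-width (two characters), so it should be accepted by engines that only allow bounded lookbehind.

```python
import re

# Pattern from the answer: a colon directly after a quote or directly before one.
NEAR_QUOTE = re.compile(r""":(?:(?<=["']:)|(?=["']))""")

def near_quote_colons(text):
    """Return the index of every colon the pattern matches (hypothetical helper)."""
    return [m.start() for m in NEAR_QUOTE.finditer(text)]

if __name__ == "__main__":
    print(near_quote_colons("something:'firstValue':'secondValue'"))  # [9, 22]
    print(near_quote_colons("something:'no:match'"))                  # [9]
    print(near_quote_colons("something:firstValue"))                  # []
```

Because the pattern only looks at the neighbouring character, a quoted colon that happens to sit right next to a quote would still be matched.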

It would be better to build a little parser that reads the input byte-by-byte and remembers whether a quotation is open.
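A minimal sketch of such a parser, in Python, might look like the following. The function name `unquoted_colons` is hypothetical, and the sketch assumes there are no escape sequences inside quotes and that an unterminated quote simply swallows the rest of the string. It walks the text character by character rather than byte by byte.

```python
def unquoted_colons(text):
    """Return the indices of colons that lie outside '...' and "..." runs."""
    positions = []
    open_quote = None  # the quote character currently open, or None
    for i, ch in enumerate(text):
        if open_quote is not None:
            # Inside quotes: only the matching quote character closes the run.
            if ch == open_quote:
                open_quote = None
        elif ch in "'\"":
            open_quote = ch
        elif ch == ":":
            positions.append(i)
    return positions

if __name__ == "__main__":
    print(unquoted_colons("something:'firstValue':'secondValue'"))  # [9, 22]
    print(unquoted_colons("something:'no:match'"))                  # [9]
```

Tracking which quote character opened the run is what lets a double quote appear inside single quotes and vice versa.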

Gumbo
This works quite well, but fails in the degenerate case something:'no match:'
Daniel Brückner
I agree with Gumbo - it's better to build a little parser
Jaco Pretorius
A: 

I've come up with the following slightly worrying construction:

(?<=^('[^']*')*("[^"]*")*[^'"]*):

It uses a lookbehind assertion to make sure you match an even number of quotes from the beginning of the line to the current colon. It allows for embedding a single quote inside double quotes and vice versa. As in:

'a":b':c::"':" (matches at positions 6, 8 and 9)

EDIT

Gumbo is right: using * within a lookbehind assertion is generally not allowed.
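Whether the lookbehind is accepted depends on the engine. As an illustration, Python's `re` module rejects it, while .NET is one engine that does accept unbounded lookbehind. The sketch below checks that, and then shows a different technique that avoids lookbehind altogether: let the regex consume quoted runs first and capture only the colons that are left. This is a swapped-in alternative, not the pattern from this answer, and the names `compiles`, `SKIP_QUOTED` and `unquoted_colon_positions` are made up for the example.

```python
import re

# The pattern from the answer, with * inside the lookbehind.
LOOKBEHIND = r"""(?<=^('[^']*')*("[^"]*")*[^'"]*):"""

def compiles(pattern):
    """Report whether Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Alternative: quoted runs are matched (and ignored) by the first two
# branches; only a colon reached by the third branch is captured.
SKIP_QUOTED = re.compile(r"""'[^']*'|"[^"]*"|(:)""")

def unquoted_colon_positions(text):
    """Return indices of colons that were not consumed as part of a quoted run."""
    return [m.start(1) for m in SKIP_QUOTED.finditer(text)
            if m.group(1) is not None]

if __name__ == "__main__":
    print(compiles(LOOKBEHIND))                               # False
    print(unquoted_colon_positions("'a\":b':c::\"':\""))      # [6, 8, 9]
```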

Peter van der Heijden
This expression will only match if the string starts with a single quote because of the assertion (?<=^('[^...
Daniel Brückner
@Daniel - ('[^']*')* matches zero or more instances of something between single quotes, so it does not have to start with a quote. Having said that, mine is broken too; see my edit
Peter van der Heijden
In general, look-behind assertions don’t allow infinite quantifiers such as `*`.
Gumbo
A: 

Give a programmer a Regex, and he'll have one working application.

Give a programmer the ability to create Regex, and he'll have multiple working applications...

RegEx Buddy
Expresso

Daniel May
A: 

Regular expressions are stateless. Tracking whether you are inside of quotes or not is state information. It is, therefore, impossible to handle this correctly using only a single regular expression. (Note that some "regular expression" implementations add extensions which may make this possible; I'm talking solely about "true" regular expressions here.)

Doing it with two regular expressions is possible, though, provided that you're willing to modify the original string or to work with a copy of it. In Perl:

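# Remove every quoted run from the string (this modifies $string in place).
$string =~ s/['"][^'"]*['"]//g;
# "= () =" forces list context, so we get the number of matches rather than just true/false.
my $match_count = () = $string =~ /:/g;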

The first will find every sequence consisting of a quote, followed by any number of non-quote characters, and terminated by a second quote, and remove all such sequences from the string. This will eliminate any colons which are within quotes. (something:"firstValue":'secondValue' becomes something:: and something:'no:match' becomes something:)

The second does a simple count of the remaining colons, which will be those that weren't within quotes to start with.
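For readers not using Perl, the same two-step idea can be sketched in Python (the function name `count_unquoted_colons` is an assumption for illustration). It works on a copy, so the original string is left untouched.

```python
import re

# Same pattern as the Perl substitution: a quote, non-quote characters, a quote.
QUOTED_RUN = re.compile(r"""['"][^'"]*['"]""")

def count_unquoted_colons(text):
    """Strip quoted runs from a copy of the text, then count the remaining colons."""
    stripped = QUOTED_RUN.sub("", text)
    return stripped.count(":")

if __name__ == "__main__":
    print(count_unquoted_colons("something:\"firstValue\":'secondValue'"))  # 2
    print(count_unquoted_colons("something:'no:match'"))                    # 1
```

Note that the pattern treats either quote character as a terminator, so a string that embeds one kind of quote inside the other may be miscounted.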

Just counting the non-quoted colons doesn't seem like a particularly useful thing to do in most cases, though, so I suspect that your real goal is to split the string up into fields with colons as the field delimiter, in which case this regex-based solution is unsuitable, as it will destroy any data in quoted fields. In that case, you need to use a real parser (most CSV parsers allow you to specify the delimiter and would be ideal for this) or, in the worst case, walk through the string character-by-character and split it manually.
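As one illustration of the CSV-parser route, Python's standard `csv` module lets you set both the delimiter and the quote character. The wrapper name `split_fields` below is hypothetical. One limitation to be aware of: `csv` accepts only a single quote character per reader, so this sketch handles single-quoted fields and would leave double quotes in the data as ordinary characters.

```python
import csv

def split_fields(line, quotechar="'"):
    """Split one line on colons, keeping colons inside quoted fields intact."""
    reader = csv.reader([line], delimiter=":", quotechar=quotechar)
    return next(reader)

if __name__ == "__main__":
    print(split_fields("something:'firstValue':'secondValue'"))
    # ['something', 'firstValue', 'secondValue']
    print(split_fields("something:'no:match'"))
    # ['something', 'no:match']
```

If both quote styles must be supported in the same input, a hand-written character-by-character splitter is probably the safer choice.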

If you tell us the language you're using, I'm sure somebody could suggest a good parser library for that language.

Dave Sherohman
I'm using C#, but I thought that I could do it with a Regex (which is language independent)... I think it's better to just parse it without Regex though
Jaco Pretorius
That's the trouble; a regex isn't language/library independent; the parts that are can't do this.
reinierpost