I have the following input string:

key1 = "test string1" ; key2 = "test string 2"

I need to convert it to the following, without tokenizing:

key1="test string1";key2="test string 2"
+2  A: 

Using ERE (extended regular expressions, which are clearer than basic REs in cases like this), and assuming there are no escaped quotes inside the strings, you can do it with a global substitution (the g flag replaces all occurrences):

s/ *([^ "]*) *("[^"]*")?/\1\2/g

sed:

$ echo 'key1 = "test string1" ; key2 = "test string 2"' | sed -r 's/ *([^ "]*) *("[^"]*")?/\1\2/g'

C# code:

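using System;
using System.Text.RegularExpressions;

// Optional spaces, a run of non-space/non-quote characters, optional spaces,
// then an optional quoted string; only the two captured groups are kept.
Regex regex = new Regex(" *([^ \"]*) *(\"[^\"]*\")?");
string input = "key1 = \"test string1\" ; key2 = \"test string 2\"";
string output = regex.Replace(input, "$1$2");
Console.WriteLine(output);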

Output:

key1="test string1";key2="test string 2"

Escape-aware version

On second thought, leaving out an escape-aware version of the regexp could be misleading, so here it is:

s/ *([^ "]*) *("([^\\"]|\\.)*")?/\1\2/g

which in C# looks like:

Regex regex = new Regex(" *([^ \"]*) *(\"(?:[^\\\\\"]|\\\\.)*\")?");
String output = regex.Replace(input, "$1$2");

Please do not go blind from those backslashes!

Example

Input:  key1 = "test \\ " " string1" ; key2 = "test \" string 2"
Output: key1="test \\ "" string1";key2="test \" string 2"
przemoc
@przemoc: What you're using are *Perl-derived* regexes (often called "Perl-compatible" or "PCRE"); ERE is something else entirely. And you can alleviate the backslash-itis by using C#'s verbatim string literals: `@" *([^ ""]*) *(""(?:[^\\""]|\\.)*"")?"`
Alan Moore
@Alan Moore: Yes and no. The RE parts of my `sed`-like expressions are EREs, see [POSIX @regular-expressions.info](http://www.regular-expressions.info/posix.html). But they are also Perl REs, because Perl REs are almost a superset of EREs (some parts of EREs are not available in Perl REs, but those features are practically never used), see [Regular Expression Flavor Comparison @regular-expressions.info](http://www.regular-expressions.info/refflavors.html). The same goes for .NET REs (and yes, those are Perl-derived REs). Please do not make hasty judgements. Thanks for the `@`-tip!
przemoc
Point taken. I looked at your sed code and saw Perl instead.
Alan Moore
+5  A: 

You'd be far better off NOT using a regular expression.

What you should be doing is parsing the string. The problem you've described is a mini-language, since each point in that string has a state (e.g. "in a quoted string", "in the key part", "assignment").

For example, what happens when you decide you want to escape characters?

key1="this is a \"quoted\" string"

Move along the string character by character, maintaining and changing state as you go. Depending on the state, you can either emit or omit the character you've just read.

As a bonus, you'll get the ability to detect syntax errors.
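
The character-by-character approach described above could look roughly like the following C# sketch. It is a simplified illustration, not a full parser: it tracks only two pieces of state ("inside a quoted string" and "previous character was a backslash") rather than separate key and assignment states. The names `KeyValueNormalizer` and `StripOuterSpaces` are hypothetical, and treating a backslash inside quotes as an escape, as well as throwing `FormatException` for an unterminated string, are assumptions about the desired behaviour.

```csharp
using System;
using System.Text;

static class KeyValueNormalizer // hypothetical name, not from the question
{
    // Removes spaces outside double-quoted strings; quoted text is copied verbatim.
    // Assumption: a backslash inside quotes escapes the next character.
    public static string StripOuterSpaces(string input)
    {
        var output = new StringBuilder(input.Length);
        bool inQuotes = false; // state: currently inside a quoted string
        bool escaped = false;  // state: previous character was an unescaped backslash

        foreach (char c in input)
        {
            if (inQuotes)
            {
                output.Append(c);                     // emit everything inside quotes
                if (escaped) escaped = false;         // an escaped character never closes the string
                else if (c == '\\') escaped = true;
                else if (c == '"') inQuotes = false;  // closing quote
            }
            else if (c == '"')
            {
                inQuotes = true;                      // opening quote
                output.Append(c);
            }
            else if (c != ' ')
            {
                output.Append(c);                     // outside quotes, omit spaces only
            }
        }

        // Syntax error detection: the input ended while still inside a string.
        if (inQuotes)
            throw new FormatException("Unterminated quoted string.");
        return output.ToString();
    }

    static void Main()
    {
        Console.WriteLine(StripOuterSpaces("key1 = \"test string1\" ; key2 = \"test string 2\""));
        // prints: key1="test string1";key2="test string 2"
    }
}
```

Because the escape handling is an explicit flag rather than part of a pattern, changing the escaping rules or adding further states (for example, rejecting a key with no `=` after it) would be a local change to the loop.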

stusmith