tags:

views:

112

answers:

3

Hi *,

I’m learning regular expressions and I’m playing a little with them. I proposed myself an exercise in which I have a method that removes the columns aliases in a SQL Select statement. This should work like this:

  • The method can remove aliases in a SQL select statement with the AS keyword: “select ColumnA AS A”
  • The method can remove aliases in a SQL select statement without the AS keyword: “select ColumnB B”
  • The method can remove aliases in a SQL select statement that contains “operation characters” (like the concatenation operation character): “select ‘Hello ‘ || ‘world!’ AS HelloWorld”

So far I have created two methods that only work on specific cases. The following code provides a summary of what I’ve done and about the problems that I’m facing.

static void Main(string[] args)
{
    string cols1 = "ColA as AliasA, ColB   AliasB  , As As ASasas, Asasasas as As";
    string cols2 = "'aaa' || 'bbb'  AS   AliasC , 'ccc' || 'ddd' AliasD";

    string answer1 = RemAliases(cols1);     // Works fine
    string answer2 = RemAliases2(cols2);    // Works fine
    string answer3 = RemAliases2(cols1);    // Doesn't work
    string answer4 = RemAliases(cols2);     // Doesn't work            
}

static string RemAliases2(string inputSql)
{
    string pattern1 = @"(.+)\s+AS\s+\w+";
    string replacement1 = "$1";
    string pattern2 = @"(.+)\s+\w+";
    string replacement2 = "$1";
    string result = Regex.Replace(inputSql, pattern1, replacement1, RegexOptions.IgnoreCase);
    result = Regex.Replace(result, pattern2, replacement2, RegexOptions.IgnoreCase);
    return result;
}

static string RemAliases(string inputSql)
{
    string pattern1 = @"(\w+)\s+AS\s+\w+";
    string replacement1 = "$1";
    string pattern2 = @"(\w+)\s+\w+";
    string replacement2 = "$1";
    string result = Regex.Replace(inputSql, pattern1, replacement1, RegexOptions.IgnoreCase);
    result = Regex.Replace(result, pattern2, replacement2, RegexOptions.IgnoreCase);
    return result;
}

I wasn’t expecting “RemAliases()” to work fine with “cols2”, because the “\w+” doesn’t match the “|” character. Although, I was expecting “RemAliases2()” to also work fine with “cols1”. Can someone please provide me some help in order to know why “RemAliases2()” doesn’t work fine for the “cols1” case? Please, feel free to provide any other kind of suggestions about the way I’ve used these regular expressions.

Thanks in advance.

PS: I’m using .NET 2.0

A: 

One problem with (.+)\s+AS\s+\w+ is that the (.+) is greedy. This means that it will keep right on going until it doesn't match any more characters, meaning a new line. If you put a ? after the +, it will make it lazy, so it will stop once it finds the first space because the space matches the \s.

The next problem is that the . matches white space also. So, when you have the ColB AliasB , case, it keeps right on going until it gets a " AS" to match the next part of the regex. In this case, that's found as part of the next group. So, you're really better off using the \w+ like you did in the RemAliases function.

That's as far as I've gone for now. I'll edit in more later if I find other things. In the meantime, since you're learning, here's a pretty good reference I usually use when I need to write some regex: Regex Reference

fire.eagle
A: 

As for why RemAliases2 doesn't work for your cols1 is that the .+ is greedy -- it takes as much as it can.

(.+) is going to take the entire line. And then the regex engine will step backward to try to match the remaining portions of the regex. So, the matches will be:

(.+) --> "ColA as AliasA, ColB   AliasB  , As As ASasas, Asasasas"
\s+  --> " "
AS   --> "as"
\s+  --> " "
\w+  --> "As"

Where I've broken out each part of the regex into separate lines, and shown the portion of your string that gets matched inside the ""'s after the -->.

You are applying each of the regexes in turn, but it applies to the entire string -- it just happens that because of the order of the text in your test string, it appears to work -- but it won't scale at all.

One possible better start (for the as case) would be:

(.+?)(\s+as\s+\w+\s*)(,|$)

I've changed the + to be non-greedy (+?), I've added the option of whitespace after the alias column name but before the comma (\s*), and I've added a comma alternating with the end of the line, to correctly end the expression (,|$) so you can iterate multiple times for each field in the select clause.

However, this only does one match, not multiple ones (note that I know regex'es but not C#, so I can't say exactly how this works in C#, but the concepts are pretty general). Either you need to iterate through the string multiple times, or call the function with a global flag. (In Java, you'd do it by calling replaceAll() instead of replace() -- I'm assuming C# has a similar construct.)

Applied globally, and inserting both $1 and $3 into your replacement string would give a modified cols1 of:

ColA, ColB AliasB , As, Asasasas

Then, you have the case w/o the AS, which is harder!

Mike R
A: 

For your regular expression to work reliably, you need to be much more diligent about spelling out what it should match, rather than trying to take a shortcut with .+. The explanation below is rather long-winded. It is the thought process that I followed when building the regex.

Based on your examples, you either have an identifier such as colA or a concatenation of single-quoted strings such as 'aaa' || 'bbb'. You can match the identifier with \w+ and the strings with '[^']*'(?:\s*\|\|\s*'[^']*')*. My regex for the strings allows any number of them to be concatenated, including no concatenation (i.e. just one single-quoted string).

To match either of these two we can use \b\w+\s+|'[^']*'(?:\s*\|\|\s*'[^']*')*\s*. I've added \s+ after the identifier because it must be separated from what follows with a space. For the concatenated strings, \s* makes the delimiting spaces optional.

The identifier or strings can optionally be followed by the keyword As. If the keyword is present is must be followed by a space. We can code this as (As\s+)?.

Finally, all this is followed by another identifier. This one is easily matched with \w+.

Putting it all together, we get this regex:

(\b\w+\s+|'[^']*'(?:\s*\|\|\s*'[^']*')*\s*)(As\s+)?\w+

I put a capturing group around the first part. We'll need that for the search-and-replace. Replacing this regex match with just the column name or string concatenation effectively removes the "as" part. The replacement text is simply $1.

Or in C#:

result = Regex.Replace(inputSql, 
    @"(\b\w+\s+|'[^']*'(?:\s*\|\|\s*'[^']*')*\s*)(As\s+)?\w+", "$1",
    RegexOptions.IgnoreCase);
Jan Goyvaerts