tags:

views:

178

answers:

4

I'm currently working with parsing some data from SQL Server and I'm in need of help with a Regex.

I have an assembly in Sql Server 2005 that helps me Replace strings using C# Regex.Replace() Method.

I need to parse the following.

    Strings:

    CAD 90890

    (CAD 90892)

    CAD G67859

    CAD 34G56

    CAD 3S56.

    AX CAD 890990

    CAD 783783 MX

    Needed Results:

    90890

    90892

    G67859

    34G56

    3S56

    890990

    783783 

SELECT TOP 25 CADCODE, dbo.RegExReplace(CADCODE, '*pattern*', '$1')
FROM dbo.CADCODES
WHERE CADCODE LIKE '%CAD%'

I need to get the proceeding string after the CAD word until it hits a white-space or anything that not a number or digit. I managed to get the digits but it really fails on others. I'm trying to get it to work but I can't find a real solution.

Thanks in advance.

Updated to reflect new Strings

AX CAD 890990

CAD 783783 MX

+1  A: 

Try this:

(\w+)\W*$

The pattern matches the last word - made of alphanumeric (and underscores).
Example: http://www.rubular.com/r/1zWQQVLZy1

Another option is to find a word with at least one digit - this one can match anywhere on the string, so you may need to handle multiple matches. In this case, you can add a capturing group around the whole pattern, or replace using $&.

[a-zA-Z_]*\d\w*

Example: http://www.rubular.com/r/XUrFNuPQUv

If you can't match (Regex.Match) and must use Regex.Replace, you can match the entire string start to end and replace it with the group you need:

RegExReplace(CADCODE, '^.*\b([a-zA-Z_]*\d\w*)\b.*$', '$1')
Kobi
What happens if some of the strings do not start with CAD, but are followed by alphanumeric characters?
Callum Rogers
@Callum Rogers - I capture them anyway, the question doesn't define whether that bit is important or not. I may ask you the same - what if the OP wants to capture in any case?
Kobi
[a-zA-Z_]*\d\w* Matches perfectly what I need to understand is how to get the output $1 for the replace, since I need to eliminate CAD completely.
JeremySpouken
well, if you want just the word following the CAD..., I dont know if I'd use Regex.Replace. You could just use Regex.Match and it would bring back the word that matched. OR..you could use (?=<something>), which matches a prefix, but excludes it from the output.
LoveMeSomeCode
A: 

Try this:

(?:\(CAD\)|CAD)\s+?([\dA-Z]+)

You can get the result from the capture group number 1.

Callum Rogers
A: 

The problem with regex is that it's always easy to get a good pattern if you have a limited sample set.

In your case, you use: \w{4}\w*

which just says, 4 alphanumerics, followed by 0 or more alphanumerics, so all the CAD sections would not match, nor would spaces or ().

LoveMeSomeCode
The problem with Regex is that it's a pain in the butt. Anyhow CAD can have multiple digits and even a word but what I really need is the WORD after CAD until it ends with anything that's not a number or letter from the alphabet [A-Za-z]. Just extract it and have it as an output in $1.
JeremySpouken
so if I'm understanding you right, you'll have CAD followed by some stuff, then 1 or more spaces, then you want the word after that? How about: CAD\w*\s+(\w+)
LoveMeSomeCode
See @Alan Moore approach it works mine was similar in the way that I was matching words but I couldn't get an output group. But as he suggests you need to match first the whole string first.
JeremySpouken
+1  A: 

I think this is what you're after:

^\W*\w*CAD\w*\W*(\w+)\W*$

The regex has to match the whole string so RegExReplace can replace it with $1, effectively stripping off the unwanted parts.

EDIT: Let me back up and make sure I've got this right. Because of the

WHERE CADCODE LIKE '%CAD%'

in your query, you already know every string contains the sequence CAD. That being the case, there's no need to complicate the regex by matching that sequence again. This should be all you need:

^.*?(\w+)\W*$
Alan Moore
Great approach and It works I just have a slight problem that wasn't on scope. I found out that I will have strings with "AX CAD 098909" instead of CAD as the beginning and also strings after "CAD 098909 MX".
JeremySpouken