views:

829

answers:

4

Hi everyone

I need to filter out junk data in SQL (SQL Server 2008) table. I need to identify these records, and pull them out.

  • Char[0] = A..Z, a..z
  • Char[1] = 0..9
  • Char[2] = 0..9
  • Char[3] = 0..9
  • Char[4] = 0..9

{No blanks allowed}

Basically, a clean record will look like this:

  • T1234, U2468, K123, P50054 (4 record examples)

Junk data looks like this:

  • T12.., .T12, MARK, TP1, SP2, BFGL, BFPL (7 record examples)

Can someone please assist with a SQL query to do a LEFT and RIGHT method and extract those characters, and do a LIKE IN or something?

A function would be great though!

Regards, Riaan de Lange South Africa

+2  A: 

Depends on your database. Many have regex functions (note examples not tested so check)

e.g. Oracle

SELECT x
 FROM table
 WHERE REGEXP_LIKE(x, '^[A-za-z][:digit:]{4}$')

Sybase uses LIKE

Mark
I believe REGEXP_LIKE is an Oracle convention, so it really does depend on which system is being used.
David Andres
@Mark: Borrowed this in my own answer. It's good and it works within Oracle. I extended it to {4,} to imply four or more matches so that strings like P1234 and P123423432 match equally as well.
David Andres
+4  A: 

The following should work in a few different systems:

SELECT * 
FROM TheTable
WHERE Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9]%'
AND Data NOT LIKE '% %'

This approach will indeed match P2343, P23423JUNK, and other similar text but requires that the format is A0000*.

Now, if the OP implies a format of 1st position is a character and all succeeding positions are numeric, as in A0+, then use the following (in SQL Server and a good deal of other database systems):

SELECT *
FROM TheTable
WHERE SUBSTRING(Data, 1, 1) LIKE '[A-Za-z]'
AND SUBSTRING(Data, 2, LEN(Data) - 1) NOT LIKE '%[^0-9]%'
AND LEN(Data) >= 5

To incorporate this into a SQL Server 2008 function, since this appears to be what you'd like most, you can write:

CREATE FUNCTION ufn_IsProperFormat(@data VARCHAR(50))
RETURNS BIT
AS
BEGIN
    RETURN 
     CASE 
      WHEN SUBSTRING(@Data, 1, 1) LIKE '[A-Za-z]'
        AND SUBSTRING(@Data, 2, LEN(@Data) - 1) NOT LIKE '%[^0-9]%'
        AND LEN(@Data) >= 5 THEN 1 
       ELSE 0 
      END
END

...and call into it like so:

SELECT * 
FROM TheTable
WHERE dbo.ufn_IsProperFormat(Data) = 1

...this query needs to change for Oracle queries because Oracle doesn't appear to support bracket notation in LIKE clauses:

SELECT *
FROM TheTable
WHERE REGEXP_LIKE(Data, '^[A-za-z]\d{4,}$')

This is the expansion gbn is doing in his answer, but these versions allow for varying string lengths without the OR conditions.

EDIT: Updated to support examples in SQL Server and Oracle for ensuring the format A0+, so that A1324, A2342388, and P2342 match but A2342JUNK and A234 do not.

The Oracle REGEXP_LIKE code was borrowed from Mark's post but updated to support 4 or more numeric digits.

Added a custom SQL Server 2008 approach which implements these techniques.

David Andres
This it accepts "D1234junk"
gbn
@gbn: You're right, but I couldn't tell be the OP if this is permitted or not. I know the string can be longer than five characters, but it's not clear what the range of values can be judging from the post. Also, I've edited the post to handle the rule about empty spaces. It's not clean, but it is compliant.
David Andres
+2  A: 

Given that you're allowing between 3 and 6 digits for the number in your examples then it's probably better to use the ISNUMERIC() function on the 2nd character onwards:

SELECT *
FROM TheTable
-- start with a letter
WHERE Data LIKE '[A-Za-z]%'
    -- everything from 2nd character onwards is a number
    AND ISNUMERIC( SUBSTRING( Data, 2, 50 ) ) = 1
    -- number doesn't have a decimal place
    AND Data NOT LIKE '%.%'

For more information look at the ISNUMERIC function on MSDN.

Also note that:

  • I've limited the 2nd part with the number to 50 characters maximum, change this to suit your needs.
  • Strictly speaking you should check for currency symbols etc, as ISNUMERIC allows them, as well as +/- and some others

A better option might be to create a function that checks that each character after the first is between 0 and 9 (or 1 and 0 if you're using ASCII codes).

Timothy Walters
This could work with a change from ISNUMERIC(SUBSTRING(Data, 2, 50)) to ISNUMERIC(SUBSTRING(Data, 2, LEN(Data) - 1)). You also need to verify the length of the string, which appears to require at least five characters.
David Andres
+2  A: 

You can't use Regular Expressions in SQL Server, so you have to use OR. Correcting David Andres' answer...

WHERE
    (
    Data LIKE '[A-Za-z][0-9][0-9][0-9]'
    OR
    Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9]'
    OR
    Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9][0-9]'
    )

David's answer allows "D1234junk" through

You also only need "[A-Z]" if you don't have case sensitivity

gbn
gbn: See the OP where "P50054" is considered valid but is also longer than five characters. Your approach is valid but requires infinite expansion.
David Andres
@David: You really need a regex to allow undefined number characters on the right. However, the title says 3-5 numbers on right so this approach will do what is expected (ignoring NULLs)
gbn
@gbn: Take a look at my answer update. In systems that support the bracket notation within the like operator (which apparently Oracle does not), you can use NOT LIKE '%[^0-9]%' to ensure that only numeric digits are valid.
David Andres
@David: I use the double negative trick too, didn't think of it here though. Good idea.
gbn
@gbn: Thanks, didn't occur to me at first either.
David Andres