I have a table with a couple thousand rows. The description and summary fields are NTEXT and sometimes contain non-ASCII characters. How can I locate all of the rows that contain non-ASCII characters?

+1  A: 

First build a string with all the characters you're not interested in (the example uses tab, line feed and carriage return plus the 0x20 - 0x7F range, i.e. 7-bit ASCII minus the control characters below 0x20). Each character is prefixed with |, so that the escape clause used later treats it literally.

-- Start with tab, line feed, carriage return
declare @str varchar(1024)
set @str = '|' + char(9) + '|' + char(10) + '|' + char(13)

-- Add all normal ASCII characters (32 -> 127)
declare @i int
set @i = 32
while @i <= 127
    begin
    -- Uses | to escape, could be any character
    set @str = @str + '|' + char(@i)
    set @i = @i + 1
    end

The next snippet searches for any character that is not in the list. The % matches 0 or more characters. The [] matches one of the characters inside the [], for example [abc] would match either a, b or c. The ^ negates the list, for example [^abc] would match anything that's not a, b, or c.

select *
from yourtable
where yourfield like '%[^' + @str + ']%' escape '|'

The escape character is required because otherwise searching for characters like ], % or _ would mess up the LIKE expression.

Hope this is useful, and thanks to JohnFX's comment on the other answer.

Andomar
You may want to add a few (or all) of the characters below 32 as well; especially important would be Carriage Return (13), Line Feed (10), and Tab (9).
Chris Shaffer
Good point, added.
Andomar
A: 

It's probably not the best solution, but maybe a query like:

SELECT *
FROM yourTable
WHERE yourTable.yourColumn LIKE '%[^0-9a-zA-Z]%'

Replace the "0-9a-zA-Z" expression with something that captures the full ASCII set (or a subset that your data contains).
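
One way to cover the printable ASCII set without listing every character is a range from space to tilde, evaluated under a binary collation so that the range is compared by code point rather than by dictionary order. The following is only a minimal sketch under assumptions: the table variable @demo and its columns are made up for illustration, Latin1_General_BIN is just one binary collation that could be used, and an NTEXT column may first need a cast to NVARCHAR(MAX) (SQL Server 2005 and later).

```sql
-- Hypothetical sample data; the table and column names are assumptions.
declare @demo table (id int, body nvarchar(200))
insert into @demo (id, body) values (1, N'plain <b>html</b>')
insert into @demo (id, body) values (2, N'caf' + nchar(233) + N' menu')  -- e with acute accent, U+00E9
insert into @demo (id, body) values (3, N'price 10' + nchar(8364))       -- euro sign, U+20AC

-- [^ -~ plus tab, LF, CR] = any character that is not printable ASCII
-- and not one of the three common whitespace control characters.
-- The binary collation makes the " -~" range mean code points 32 to 126.
select id, body
from @demo
where body collate Latin1_General_BIN
      like N'%[^ -~' + nchar(9) + nchar(10) + nchar(13) + N']%'
```

In the sample data only rows 2 and 3 hold characters outside the listed set. Characters such as <, >, / and quotes fall inside the space-to-tilde range, so HTML markup alone does not flag a row.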

Chris Shaffer
Wouldn't this just match rows that contain any ASCII character, as opposed to only ASCII characters?
Andomar
The ^ marker at the front of the expression means NOT, so no. It would get any row that had at least one character that wasn't in the ranges specified.
JohnFx
How can I put the full ASCII set in that expression? It's HTML data that I'm looking at, so "/><' etc. is in there.
TheSoftwareJedi
The answer I posted checks against the full ASCII set. It should work with >/< because those get escaped.
Andomar
@TheSoftwareJedi - you would need to add the characters to the list (e.g., '%[^0-9a-zA-Z<>/"'']%'). Andomar's solution programmatically builds a complete list; you could use that.
Chris Shaffer
A: 

My previous answer confused Unicode and non-Unicode data. Here is a solution that should work for all situations, although I'm still running into some anomalies. It seems that certain non-ASCII Unicode characters, such as superscript digits, are being treated as equal to the plain digit character. You might be able to play around with collations to get around that.

Hopefully you already have a numbers table in your database (they can be very useful), but just in case I've included the code to partially fill that as well.

You also might need to play around with the numeric range, since Unicode code points can go well beyond 255.

CREATE TABLE dbo.Numbers
(
    number INT NOT NULL,
    CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (number)
)
GO
DECLARE @i INT

SET @i = 0

WHILE @i < 1000
BEGIN
    INSERT INTO dbo.Numbers (number) VALUES (@i)

    SET @i = @i + 1
END
GO

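SELECT
    T.ID, N.number, N'%' + NCHAR(N.number) + N'%' AS pattern
FROM
    dbo.Numbers N
INNER JOIN dbo.My_Table T ON
    -- Parentheses keep the two LIKE tests together if more conditions are added
    (T.description LIKE N'%' + NCHAR(N.number) + N'%' OR
     T.summary LIKE N'%' + NCHAR(N.number) + N'%')
    -- AND T.id = 1   -- uncomment to test against a single row
WHERE
    -- Raise the upper bound (and fill dbo.Numbers further) to cover code points above 255
    N.number BETWEEN 127 AND 255
ORDER BY
    T.id, N.number
GO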
Tom H.
The way I understand it, ASCII is 7 bit and varchar is 8 bit. So varchar can still store a lot of characters that aren't ascii, like ä or é.
Andomar
I'm seeing the same results. This doesn't work.
TheSoftwareJedi
Extended ASCII is 8 bit, which is what some people are referring to when they say "ASCII". I'll edit the post to limit to normal ASCII as well.
Tom H.
This won't work for the % or _ characters? And isn't an inner join slower than a LIKE statement (like in my answer)?
Andomar
A: 

-- This is a very, very inefficient way of doing it but should be OK for
-- small tables. It uses an auxiliary table of numbers as per Itzik Ben-Gan and simply
-- looks for characters with bit 7 set.

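SELECT  *
FROM    yourTable as t
WHERE   EXISTS ( SELECT *
                 FROM   msdb..Nums as NaturalNumbers
                 -- <= so that the last character is checked too (assumes n starts at 1)
                 WHERE  NaturalNumbers.n <= LEN(t.string_column)
                        -- UNICODE rather than ASCII: ASCII() converts an NCHAR through the
                        -- code page first, so unmappable characters come back as 63 ('?')
                        AND UNICODE(SUBSTRING(t.string_column, NaturalNumbers.n, 1)) > 127)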
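
The same character-by-character idea can also report where the offending characters are, which helps when deciding how to clean the data. This is a rough sketch under assumptions: @nums stands in for a real numbers table and is only filled to 200, @demo and its columns are made up, and an NTEXT column would need a cast to NVARCHAR(MAX) first because LEN does not accept NTEXT.

```sql
-- Small stand-in for a numbers table (assumption: 200 is enough for the demo strings)
declare @nums table (n int primary key)
declare @i int
set @i = 1
while @i <= 200
begin
    insert into @nums (n) values (@i)
    set @i = @i + 1
end

-- Hypothetical sample data
declare @demo table (id int, body nvarchar(200))
insert into @demo (id, body) values (1, N'plain <b>html</b>')
insert into @demo (id, body) values (2, N'caf' + nchar(233) + N' menu')  -- U+00E9
insert into @demo (id, body) values (3, N'price 10' + nchar(8364))       -- U+20AC

-- One output row per character whose code point is above 127
select d.id,
       x.n as position,
       unicode(substring(d.body, x.n, 1)) as code_point
from @demo as d
join @nums as x on x.n <= len(d.body)
where unicode(substring(d.body, x.n, 1)) > 127
order by d.id, x.n
```

Wrapping the query in SELECT DISTINCT id (or an EXISTS, as in the answer) reduces the result to one row per affected record.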
Paul Harrington
+1  A: 

Technically, I believe that an NCHAR(1) is a valid ASCII character if and only if UNICODE(@NChar) < 256 and ASCII(@NChar) = UNICODE(@NChar), though that may not be exactly what you intended. Therefore this would be a correct solution:

;With cteNumbers as
(
 Select ROW_NUMBER() Over(Order By object_id) as N
 From sys.system_columns c1, sys.system_columns c2
)
Select Distinct RowID
From YourTable t
 Join cteNumbers n ON n.N <= Len(TXT)
Where UNICODE(Substring(TXT, n.N, 1)) > 255
 OR UNICODE(Substring(TXT, n.N, 1)) <> ASCII(Substring(TXT, n.N, 1))

This should also be very fast.

RBarryYoung
ASCII is only up to 127. Also your numbers cte is weird - the final solution should use a preexisting numbers table instead of it. Otherwise, this is how I would do it.
Adam A
A: 

I sometimes use this "fast" statement to find "strange" characters:

select 
    *
from 
    <Table>
where 
    <Field> != cast(<Field> as varchar(1000))
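
The comparison works because converting Unicode text to varchar replaces any character the target code page cannot represent (typically with ?), so the round-tripped value no longer equals the original. A minimal sketch follows, assuming a database whose default collation uses a single-byte code page such as 1252; @demo and its columns are made up for illustration. Note that this flags characters outside the code page rather than characters outside ASCII, and that, as far as I know, an NTEXT column has to be cast to NVARCHAR before it can be used with a comparison operator.

```sql
-- Hypothetical sample data
declare @demo table (id int, body nvarchar(200))
insert into @demo (id, body) values (1, N'plain text')
insert into @demo (id, body) values (2, N'caf' + nchar(233))      -- U+00E9, representable in code page 1252
insert into @demo (id, body) values (3, N'snow ' + nchar(9731))   -- U+2603, not representable in code page 1252

-- Rows whose text changes when squeezed through varchar
select id, body, cast(body as varchar(200)) as round_trip
from @demo
where body <> cast(body as varchar(200))
```

Under a code page 1252 collation, row 2 survives the cast unchanged even though it is not ASCII, so this check is best treated as a quick first pass rather than a strict ASCII test.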
CC1960