Share via


T-SQL: How to Find Rows with Bad Characters

One of the commonly asked questions in Transact SQL Forum on MSDN is how to filter rows containing bad characters. Also, often times these bad characters are not known, say, in one of the recent posts the question was to filter all the rows where characters were greater than ASCII 127.

The first step towards solution is to realize that in order to quickly filter out something we may want to know the list of allowed characters first. 

Consider the following examples of how important is to know the "good characters" in order to filter the "bad" ones.

Let's suppose we only want alpha-numeric characters to remain and everything else should be considered bad rows.

For all our examples let's create the following table variable:

DECLARE @TableWithBadRows TABLE (
    Id INT  identity(1, 1) PRIMARY KEY
    ,description VARCHAR(max)
    );
  
INSERT INTO  @TableWithBadRows (description)
  
VALUES ('test1'), ('I am OK'), ('Filter me, please.');

Our pattern then will be 

SELECT * FROM @TableWithBadRows WHERE description LIKE '%[^a-z0-9]%';

where a-z means a range of all letters from a to z, 0-9 means range of all numbers from 0 to 9 and ^ means everything which is not like the following characters.

The above code will return 2 last rows. The second row is returned because it contains a space character which was not included in the list of allowed characters.

Now, what should we do if we want to keep all the "normal" characters and only disallow characters which are greater than ASCII 127? In this case, we may want to build the pattern in a loop.

Here is some code demonstrating this idea:

DECLARE @TableWithBadRows TABLE (
    Id INT  identity(1, 1) PRIMARY KEY
    ,description VARCHAR(max)
    );
  
INSERT INTO  @TableWithBadRows (description)
VALUES ('test1')
    ,('I am OK')
    ,('Filter me, please.')
    ,('Let them be & be happy')
    ,(CHAR(200))
    ,(CHAR(137))
    ,(CHAR(10) + CHAR(13) + 'Test more');
  
SELECT *
FROM @TableWithBadRows;
  
SELECT *
FROM @TableWithBadRows
WHERE description LIKE '%[^A-Z0-9%]%';
  
DECLARE @i INT = 32;
DECLARE @pattern VARCHAR(max) = '^a-Z0-9'
    ,@ch CHAR(1);
  
WHILE @i < 47
BEGIN
    SET @ch = CHAR(@i)
  
    IF @ch = '_'
        SET @pattern = @pattern + '[' + @ch + ']';
    ELSE
        IF @ch = '['
            SET @pattern = @pattern + @ch + @ch;
        ELSE
            SET @pattern = @pattern + @ch;
  
    SET @i = @i + 1;
END
  
SET @i = 58;
  
WHILE @i < 65
BEGIN
    SET @ch = CHAR(@i)
  
    IF @ch = '_'
        SET @pattern = @pattern + '[' + @ch + ']';
    ELSE
        IF @ch = '['
            SET @pattern = @pattern + @ch + @ch;
        ELSE
            SET @pattern = @pattern + @ch;
  
    SET @i = @i + 1;
END
  
SELECT @pattern
  
SELECT *
FROM @TableWithBadRows
WHERE description LIKE '%[' + @pattern +']%'

As you can see from the second select statement, the CHAR(200) (E) is not being filtered by the a-z filter as it is apparently considered a letter.

We may try adding binary collation to treat that letter as bad, e.g.

SELECT *
FROM @TableWithBadRows
WHERE description LIKE '%[^A-Za-z0-9% ]%'   COLLATE Latin1_General_BIN;

As you see, now this letter is considered bad and the row is returned.

This thread "Getting records with special characters" shows how to create a pattern when the bad characters are in the special table and also which characters ([,^,-) we need to escape.


Conclusion

I have shown several examples of filtering bad rows using various patterns. For more comprehensive and different functions please see links in the See Also section.


See Also

 


This article participated in the TechNet Guru for August competition.