Jaa


A Solution For Stripping Invalid XML Characters From varchar/text Data Structures

One of the more subtle aspects of converting (n)varchar or (n)text data to XML is the fact that XML is choosy about which characters are permitted and (n)/varchar/(n)text is not. Any T-SQL programmer who runs conversions of this type is likely to run into this issue. Here's a code block that resolves the issue.

The characters in question are what are commonly called "lower-order ASCII" characters, those below CHAR(32). Of these, only the TAB (CHAR(9)), LF (CHAR(10)), and CR(CHAR(13) are valid within XML. This solution uses trigger code to call a user-defined function to scrub the nvarchar columns, and a loop within the trigger to an ntext column.

Here's the UDF code:

CREATE

FUNCTION fnStripLowAscii (@InputString nvarchar(4000))
RETURNS nvarchar(4000)
AS
BEGIN
IF @InputString IS NOT NULL
BEGIN
DECLARE @Counter int, @TestString nvarchar(40)

SET @TestString = '%[' + NCHAR(0) + NCHAR(1) + NCHAR(2) + NCHAR(3) + NCHAR(4) + NCHAR(5) + NCHAR(6) + NCHAR(7) + NCHAR(8) + NCHAR(11) + NCHAR(12) + NCHAR(14) + NCHAR(15) + NCHAR(16) + NCHAR(17) + NCHAR(18) + NCHAR(19) + NCHAR(20) + NCHAR(21) + NCHAR(22) + NCHAR(23) + NCHAR(24) + NCHAR(25) + NCHAR(26) + NCHAR(27) + NCHAR(28) + NCHAR(29) + NCHAR(30) + NCHAR(31) + ']%'

SELECT @Counter = PATINDEX (@TestString, @InputString COLLATE Latin1_General_BIN)

WHILE @Counter <> 0
BEGIN
SELECT @InputString = STUFF(@InputString, @Counter, 1, NCHAR(164))
SELECT @Counter = PATINDEX (@TestString, @InputString COLLATE Latin1_General_BIN)
END
END
RETURN(@InputString)
END

Here's the trigger code:

CREATE

TRIGGER dbo.trSupplement_IU
ON dbo.Supplement
FOR INSERT, UPDATE
AS
SET NOCOUNT ON
SET TEXTSIZE 1048576

-- first do the text column. given the limitations of function calls, this must be done "brute force"

DECLARE @SupplementId int, @OldSupplementId int, @ptrval binary(16), @Pointer int, @OldPointer int, @Replace nchar(1), @TestString nvarchar(40), @TitleUpdated nvarchar(20), @PathUpdated nvarchar(20), @MDRUpdated nvarchar(20)

SELECT @OldSupplementId = -1, @Replace = NCHAR(164), @TestString = '%[' + NCHAR(0) + NCHAR(1) + NCHAR(2) + NCHAR(3) + NCHAR(4) + NCHAR(5) + NCHAR(6) + NCHAR(7) + NCHAR(8) + NCHAR(11) + NCHAR(12) + NCHAR(14) + NCHAR(15) + NCHAR(16) + NCHAR(17) + NCHAR(18) + NCHAR(19) + NCHAR(20) + NCHAR(21) + NCHAR(22) + NCHAR(23) + NCHAR(24) + NCHAR(25) + NCHAR(26) + NCHAR(27) + NCHAR(28) + NCHAR(29) + NCHAR(30) + NCHAR(31) + ']%'

WHILE EXISTS ( SELECT SupplementId FROM inserted WHERE SupplementId > @OldSupplementId)
BEGIN
SELECT @SupplementId = MIN(SupplementId) FROM inserted WHERE SupplementId > @OldSupplementId
-- first check the SupplementDescription..
IF EXISTS (SELECT SupplementId FROM dbo.Supplement s (nolock) WHERE SupplementId = @SupplementId AND SupplementDescription IS NOT NULL)
BEGIN
SET @Pointer = NULL
SELECT @ptrval = TEXTPTR(SupplementDescription),
@Pointer = PATINDEX (@TestString, SupplementDescription COLLATE Latin1_General_BIN)
FROM dbo.Supplement (nolock)
WHERE SupplementID = @SupplementId

WHILE (@Pointer <> 0)
BEGIN
SET @Pointer = @Pointer - 1
UPDATETEXT dbo.Supplement.SupplementDescription @ptrval @Pointer 1 @Replace

SELECT @Pointer = PATINDEX (@TestString, SupplementDescription COLLATE Latin1_General_BIN)
FROM dbo.Supplement (nolock)
WHERE SupplementID = @SupplementId
END
END
END

-- now do the character columns. this we can do with a join and a function call..
UPDATE s
SET SupplementTitle = CASE
NULLIF(i.SupplementTitle, d.SupplementTitle)
WHEN NULL THEN s.SupplementTitle
ELSE dbo.fnStripLowAscii(s.SupplementTitle)
END,
FileAttachmentPath = CASE
NULLIF(i.FileAttachmentPath, d.FileAttachmentPath)
WHEN NULL THEN s.FileAttachmentPath
ELSE dbo.fnStripLowAscii(s.FileAttachmentPath)
END,
MachineDataReference = CASE
NULLIF(i.MachineDataReference, d.MachineDataReference)
WHEN NULL THEN s.MachineDataReference
ELSE dbo.fnStripLowAscii(s.MachineDataReference)
END
FROM inserted i
JOIN dbo.Supplement s
ON i.SupplementId = s.SupplementId
LEFT OUTER JOIN deleted d
ON i.SupplementId = d.SupplementId

The trigger code is built to maximize performance in that the NULLIF tests in the UPDATE statement will only run the (relatively expensive) UDF if the inserted and deleted images of a particular column differ (if they don't differ, we can guarantee that the value has already been scrubbed). The UDF and the loop in the trigger for the ntext SupplementDescription column employ the same basic strategy of looping through the source value looking for any invalid character and replacing it with a new character (NCHAR(164)) until the last invalid character is found.

This code was developed for a SQL Server 2000 environment. It would function in a SQL Server 2005 environment, but better performance would likely be had with a CLR-based UDF.

Comments

  • Anonymous
    January 01, 2003
    A find shared by one friend leads to correspondence from another.. The redoubtable Adam Machanic left

  • Anonymous
    January 01, 2003
    Feedback is solicited on a programming issue

  • Anonymous
    January 01, 2003
    A recent discussion with several colleagues reminded me of a hard-won insight I've been meaning to share...

  • Anonymous
    July 15, 2005
    Note also that if you're interested in translating the reserved characters <>& etc., you can do that with a series of nested REPLACE statements.

  • Anonymous
    November 30, 2009
    thanks!, worked for me when receiving char #x0001

  • Anonymous
    September 19, 2012
    Instead you widen the valid characters, add this to the header of your xml document! <?xml version="1.0" encoding="ISO-8859-1"?> also remove the following characters with this python script... def clean():    file =  open('C:/where your file is located','r')    myfile = file.readlines()    file.close()    file =  open('C:rootwhere you want to save your filelocation','w')    for r in myfile:        r = r.replace("rn", "");        r = r.replace("r", "");        r = r.replace("n", "");        r = r.replace("r", "");        r = r.replace("\r\n", "");        r = r.replace("\r", "");        r = r.replace("\n", "");        r = r.replace("u0085", "");        r = r.replace("u000A", "");        r = r.replace("u000B", "");        r = r.replace("u000C", "");        r = r.replace("u000D", "");        r = r.replace("u2028", "");        r = r.replace("u2029", "");        r = r.replace("&quot;", "\&quot;");        file.write(r)    file.close() --------------python 3.2 Allows for a wider range of valid characters.