T-SQL: Using Bulk Insert to Import Data in an Inconsistent Format
Introduction
Some third-party applications produce reports as CSV files in which each line can have a different format. These applications build the export on-the-fly, record by record and value by value, and the format of each value is determined by the value itself.
For example, if a field contains data, the value is exported inside quotation marks, but if the field is empty, nothing is exported and there are no quotation marks. Moreover, some applications ignore the column's data type when generating a report: if the value happens to be numeric it might be exported without quotation marks, while a non-numeric value in the same field on a different record is exported with them. It is easy to imagine a single CSV file in which one column appears in six different formats.
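For instance (a purely hypothetical illustration, not taken from a real export), the very same string column might appear in any of these forms across the records of one file:
- "ronen" (data present, exported with quotation marks)
- ronen (data present, exported without quotation marks)
- an empty value with no quotation marks
- "" (an empty value with quotation marks)
- 5 (a numeric value, no quotation marks)
- "5,000.5" (a numeric value with a culture-specific thousands separator, quoted)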
In order to use bulk insert directly, we have to make sure that all the data is consistent with one format.
The problem
We need to use bulk insert to import data from a CSV file into a SQL Server database using pure T-SQL. A bulk insert operation can use only one format file, and all the data must be consistent with it. The first step in using bulk insert is therefore to find a set of "bulk insert" rules (end of line, end of column, collation, and so on) that fit all of the data. As mentioned above, sometimes no such set exists.
If you were told in a forum that this can't be done using pure T-SQL, then remember that I always say "never say never".
* In this article we are going to discuss only a pure T-SQL solution. There are several other solutions (such as SSIS, CLR, or a third-party application) that can do this in different ways, sometimes in a better way.
Our Case Study
Our application exports the data as a comma-delimited CSV file without a consistent format. In this example we will deal with a very common situation that follows these rules:
- Our application uses the column type (to keep this article simple we will focus on a string column), so a numeric column never uses quotation marks, and a string column uses quotation marks on and off according to the rules below.
- If there is data in the field, then the data is exported inside quotation marks (no matter whether the data looks numeric or not, as the column is a string column).
- If there is no data, then the field is left blank, without quotation marks.
The original sample data that we use looks like this:
ID   Phone Number   First Name   Last Name
1    9999999        ronen        Ariely
2    8888888        xxx1, xxx2   yyy
2    8888888        xxx1, xxx2
3                                yyy
4    7777777
     2222222        zzz          kkk
5    1111111        5000.5       5
According to the application export rules above, our CSV file looks like this:
1,9999999,"ronen","ariely"
2,8888888,"xxx1,xxx2",yyy
2,8888888,"xxx1,xxx2",
3,,,"yyy"
4,7777777,,
,2222222,zzz,kkk
5,1111111,"5000.5","5"
* We can see in the last line that our application uses the column type, so even when the value is numeric it is exported inside quotation marks. But we have to remember that there are more complex situations, such as applications that do not use the column type; then the last line could look like [5,5000.5,5]. It can be even more complex if the culture formats numbers like 5,000.5; then our CSV line might look like [5,5,000.5,5].
The solution:
* Remember that this is only a workaround for our specific case. For each set of data a slightly different solution might fit; what is important here is the idea of how to reach that solution.
STEP 1: Identify the import file format
In this step we will run several tests with different format files. Our aim is to identify any potential problems and to find the format file that fits as many columns as possible from the start.
Finding the problematic columns and the consistent column format
First of all, you have to find a record format that fits most of the data, and identify the columns that might be inconsistent with this format. To do that, we are going to run several tests, and we will implement the conclusions in the next step. We will start with a simple bulk insert and continue with some more complex formats. Using the error messages and the results, we will identify the potential problems.
Let's try this in practice
Open Notepad and copy our CSV data into the file.
1,9999999,"ronen","ariely"
2,8888888,"xxx1,xxx2",yyy
2,8888888,"xxx1,xxx2",
3,,,"yyy"
4,7777777,,
,2222222,zzz,kkk
5,1111111,"5000.5","5"
Save the file as "C:\ArielyBulkInsertTesting\Test01.csv"
* Make sure that you save the file in ANSI format (you can use a different encoding, such as Unicode, but for this example we use ANSI).
Open Notepad and copy our XML format data into the file.
* Using a format file helps with more complex formats. We highly recommend always using a format file.
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="," />
<FIELD ID="2" xsi:type="CharTerm" MAX_LENGTH="7" TERMINATOR="," />
<FIELD ID="3" xsi:type="CharTerm" COLLATION="Hebrew_CI_AS" MAX_LENGTH="15" TERMINATOR="," />
<FIELD ID="4" xsi:type="CharTerm" COLLATION="Hebrew_CI_AS" MAX_LENGTH="15" TERMINATOR="\r\n" />
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="ID" xsi:type="SQLINT"/>
<COLUMN SOURCE="2" NAME="PhoneNumber" xsi:type="SQLINT"/>
<COLUMN SOURCE="3" NAME="FirstName" xsi:type="SQLNVARCHAR"/>
<COLUMN SOURCE="4" NAME="LastName" xsi:type="SQLNVARCHAR"/>
</ROW>
</BCPFORMAT>
Save the file as "C:\ArielyBulkInsertTesting\Test01.xml"
Open SSMS and run this DDL to create our table:
CREATE TABLE #test (
ID int
, PhoneNumber int
, FirstName varchar(15)
, LastName varchar(15)
)
GO
Try to use this simple bulk insert query to import our data:
BULK INSERT #test FROM 'C:\ArielyBulkInsertTesting\Test01.csv'
WITH (
FORMATFILE='C:\ArielyBulkInsertTesting\Test01.xml'
, MAXERRORS = 1
, KEEPNULLS
--, DATAFILETYPE = 'native'
--, CODEPAGE='RAW'
--, ROWTERMINATOR='\r\n'
--, FIRSTROW = 3
);
GO
No error… do you have a good feeling? Let's check our data:
select * from #test
GO
Results:
ID          PhoneNumber FirstName       LastName
----------- ----------- --------------- ---------------
1           9999999     "ronen"         "ariely"
2           8888888     "xxx1           xxx2,"yyy"
2           8888888     "xxx1           xxx2",
3           NULL        NULL            "yyy"
4           7777777     NULL            NULL
NULL        2222222     "zzz            kkk"
5           1111111     "5000.5"        "5"
Compare our results to the original data... Oops… that's bad…
In our case we can see that the first and second columns have no problem, but the problems start in the third column and continue into the fourth. First of all, we have quotation marks left in the results. Moreover, in several records the third column was split and part of its data moved into the fourth column. Since our format file says that the third column ends at the comma, the value is split every time a comma appears as part of the string data. That makes sense.
When we have string data we surround the content with quotes. If our data had a consistent format, then all string data would be enclosed in quotes, even empty values.
Well-formatted CSV
Let's demonstrate a well-formatted CSV. Save this data as "C:\ArielyBulkInsertTesting\Test02.csv":
1,9999999,"ronen","ariely"
2,8888888,"xxx1,xxx2","yyy"
2,8888888,"xxx1,xxx2",""
3,,"","yyy"
4,7777777,"",""
,2222222,"zzz","kkk"
5,1111111,"5000.5","5"
In that case the solution would be very simple; we could use this format file:
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR=',' />
<FIELD ID="2" xsi:type="CharTerm" MAX_LENGTH="7" TERMINATOR=',\"' />
<FIELD ID="3" xsi:type="CharTerm" COLLATION="Hebrew_CI_AS" MAX_LENGTH="15" TERMINATOR='\",\"' />
<FIELD ID="4" xsi:type="CharTerm" COLLATION="Hebrew_CI_AS" MAX_LENGTH="15" TERMINATOR='\"\r\n' />
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="ID" xsi:type="SQLINT" />
<COLUMN SOURCE="2" NAME="PhoneNumber" xsi:type="SQLINT" />
<COLUMN SOURCE="3" NAME="FirstName" xsi:type="SQLNVARCHAR" />
<COLUMN SOURCE="4" NAME="LastName" xsi:type="SQLNVARCHAR" />
</ROW>
</BCPFORMAT>
Save the file as "C:\ArielyBulkInsertTesting\Test02.xml"
Clear our table of previous data:
truncate table #test
GO
Now execute the bulk insert and the data should be placed in the table correctly. If our data were formatted in this way (with a consistent format), we would not need this article :-)
BULK INSERT #test FROM 'C:\ArielyBulkInsertTesting\Test02.csv'
WITH (
FORMATFILE='C:\ArielyBulkInsertTesting\Test02.xml'
, MAXERRORS = 1
--, KEEPNULLS
--, DATAFILETYPE = 'native'
--, CODEPAGE='RAW'
--, ROWTERMINATOR='\r\n'
--, FIRSTROW = 3
);
GO
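We can verify that every value landed in its own column, without the quotation marks:
select * from #test
GO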
Let's continue working with our "real" data! Clear the data:
truncate table #test
GO
In some cases we might build a format file which gives us error messages. We already know that the format will not fit all records; this test will give us more information through the error message. Try this format file (save it as C:\ArielyBulkInsertTesting\Test03.xml):
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="," />
<FIELD ID="2" xsi:type="CharTerm" MAX_LENGTH="7" TERMINATOR=',\"' />
<FIELD ID="3" xsi:type="CharTerm" COLLATION="Hebrew_CI_AS" MAX_LENGTH="15" TERMINATOR=',' />
<FIELD ID="4" xsi:type="CharTerm" COLLATION="Hebrew_CI_AS" MAX_LENGTH="15" TERMINATOR='\r\n' />
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="ID" xsi:type="SQLINT"/>
<COLUMN SOURCE="2" NAME="PhoneNumber" xsi:type="SQLINT"/>
<COLUMN SOURCE="3" NAME="FirstName" xsi:type="SQLNVARCHAR"/>
<COLUMN SOURCE="4" NAME="LastName" xsi:type="SQLNVARCHAR"/>
</ROW>
</BCPFORMAT>
Execute our bulk insert:
BULK INSERT #test FROM 'C:\ArielyBulkInsertTesting\Test01.csv'
WITH (
FORMATFILE='C:\ArielyBulkInsertTesting\Test03.xml'
, MAXERRORS = 10
--, KEEPNULLS
--, DATAFILETYPE = 'native'
--, CODEPAGE='RAW'
--, ROWTERMINATOR='\r\n'
--, FIRSTROW = 3
);
GO
We get an error message which can help us a lot in this case:
Msg 4864, Level 16, State 1, Line 1
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 4, column 2 (PhoneNumber).
Moreover, we can see that the rest of the records in our data were inserted (using SQL Server 2012). When only a few records are inconsistent, this can be the best way, as most of the data is inserted; we can then check which records are missing from the table and fix them. The error message includes the row number of the first problematic record.
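* If we want the rejected rows captured automatically, BULK INSERT also supports an ERRORFILE option. Here is a minimal sketch (the error-file path is illustrative, and the file must not already exist):
BULK INSERT #test FROM 'C:\ArielyBulkInsertTesting\Test01.csv'
WITH (
FORMATFILE='C:\ArielyBulkInsertTesting\Test03.xml'
, MAXERRORS = 10
, ERRORFILE = 'C:\ArielyBulkInsertTesting\Test03_Errors.log' -- rows that could not be imported are written here for later inspection
);
GO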
In conclusion, the best format file we found succeeds in inserting the first and second columns without any problem. We recognized that the problems start in the third column.
STEP 2: Insert the data into a temporary table
This is the main step, as now we can use bulk insert to import the data into SQL Server. Since we found that our data does not have a consistent format, we are going to use a temporary table to import the data.
* We don't have to use a temporary table; we could instead use OPENROWSET to get the data and do the parsing on-the-fly. I will show this in step 3.
The basic idea is to bring all the data before the problematic point (in our case the first and second columns) into separate columns, as they should appear in the final table, and to bring the rest of the data, from the problematic point to the end of the problematic region (or to the end of the line if there is no other way), into one column. So in our case, the third and fourth columns will be imported as one column.
Let's do it. We will use this format file (save it as C:\ArielyBulkInsertTesting\Test04.xml), which is similar to the "Test01.xml" file except that the third and fourth fields are combined into a single field:
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="," />
<FIELD ID="2" xsi:type="CharTerm" MAX_LENGTH="7" TERMINATOR=',' />
<FIELD ID="4" xsi:type="CharTerm" COLLATION="Hebrew_CI_AS" MAX_LENGTH="30" TERMINATOR='\r\n' />
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="ID" xsi:type="SQLINT"/>
<COLUMN SOURCE="2" NAME="PhoneNumber" xsi:type="SQLINT"/>
<COLUMN SOURCE="4" NAME="FirstName_LastName" xsi:type="SQLNVARCHAR"/>
</ROW>
</BCPFORMAT>
And execute this query (drop the old table, create a new table with 3 columns, bulk insert the data, then select and show the data):
DROP TABLE #test
GO
CREATE TABLE #test (
ID int
, PhoneNumber int
, FirstName_LastName varchar(30)
)
GO
BULK INSERT #test FROM 'C:\ArielyBulkInsertTesting\Test01.csv'
WITH (
FORMATFILE='C:\ArielyBulkInsertTesting\Test04.xml'
, MAXERRORS = 10
--, KEEPNULLS
--, DATAFILETYPE = 'native'
--, CODEPAGE='RAW'
--, ROWTERMINATOR='\r\n'
--, FIRSTROW = 3
);
GO
select * from #test
GO
Our results look like this:
ID PhoneNumber FirstName_LastName
----------- ----------- ------------------------------
1 9999999 "ronen","ariely"
2 8888888 "xxx1,xxx2","yyy"
2 8888888 "xxx1,xxx2",
3 NULL ,"yyy"
4 7777777 ,
NULL 2222222 "zzz","kkk"
5 1111111 "5000.5","5"
The goal of this article is to show an optimal use of pure T-SQL for importing data which is not well formatted into the database, in a structure that is appropriate and effective for parsing (step 3). I will not elaborate on step 3, parsing the data, as there can be hundreds of different ways to do this for each case.
I'll show some sample solutions in step 3. They are not necessarily optimal parsing solutions, but they demonstrate the use of different string functions for parsing our data. Usually, when parsing data in SQL Server, it is best to use CLR functions.
STEP 3: Parsing the data into the final table
Now that we have imported the data, all that we need to do is parse the last column. These queries can do the job in our case:
select
    ID, PhoneNumber, FirstName_LastName
    -- FirstName: the text between the leading quotation mark and the '",' separator
    , FN = case
        when CHARINDEX('",', FirstName_LastName, 1) > 0
        then LEFT(
                RIGHT(FirstName_LastName, LEN(FirstName_LastName) - 1)
                , CHARINDEX('",', FirstName_LastName, 1) - 2
            )
        else ''
      END
    -- LastName: the text after the ',"' separator, without the closing quotation mark
    , LN = case
        when CHARINDEX(',"', FirstName_LastName, 1) > 0
        then SUBSTRING(
                FirstName_LastName
                , CHARINDEX(',"', FirstName_LastName, 1) + 2
                , LEN(FirstName_LastName) - CHARINDEX(',"', FirstName_LastName, 1) - 2
            )
        else ''
      END
from #test
GO
-- I use the @ character, but you should use any combination of characters that cannot appear in the data values!
-- I can remove all the " characters in one step, as I know they are not part of my data.
select ID, PhoneNumber, FirstName_LastName
    , SUBSTRING(Temp, 0, CHARINDEX('@', Temp)) FN
    , SUBSTRING(Temp, CHARINDEX('@', Temp) + 1, LEN(Temp) - CHARINDEX('@', Temp)) LN
from (
    select
        ID, PhoneNumber, FirstName_LastName
        -- replace every separator variant ("," then ", then ,") with @, then strip the remaining " characters
        , Temp = REPLACE(REPLACE(REPLACE(REPLACE(FirstName_LastName, '","', '@'), '",', '@'), ',"', '@'), '"', '')
    from #test
) T
GO
After we have found a way to parse the data, we can use a simple SELECT ... INTO query to move the data from the temporary table to the final table.
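For example, here is a minimal sketch based on the CASE expressions above (the target table name #ParsedNames is illustrative):
select
    ID, PhoneNumber
    , FN = case
            when CHARINDEX('",', FirstName_LastName, 1) > 0
            then LEFT(RIGHT(FirstName_LastName, LEN(FirstName_LastName) - 1)
                    , CHARINDEX('",', FirstName_LastName, 1) - 2)
            else ''
          end
    , LN = case
            when CHARINDEX(',"', FirstName_LastName, 1) > 0
            then SUBSTRING(FirstName_LastName
                    , CHARINDEX(',"', FirstName_LastName, 1) + 2
                    , LEN(FirstName_LastName) - CHARINDEX(',"', FirstName_LastName, 1) - 2)
            else ''
          end
into #ParsedNames -- SELECT ... INTO creates the target table on the fly
from #test
GO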
Usually, if this is not a one-time operation, I prefer to use one query that does it all, without declaring a temporary table. I still need these steps in order to find my bulk insert query and format (steps 1 and 2) and to find the parsing expressions (step 3). Then I convert my queries into an OPENROWSET import query like this (for our case study):
--FINAL TABLE
CREATE TABLE #FINAL (
ID int
, PhoneNumber int
, FirstName varchar(15)
, LastName varchar(15)
)
GO
insert #FINAL
select
ID, PhoneNumber --, FirstName_LastName
, FN = case
when CHARINDEX('",', FirstName_LastName, 1) > 0
then LEFT (
RIGHT(FirstName_LastName, LEN(FirstName_LastName) - 1)
, CHARINDEX('",', FirstName_LastName, 1) - 2
)
else ''
END
, LN = case
when CHARINDEX(',"', FirstName_LastName, 1) > 0
then SUBSTRING(
FirstName_LastName
, CHARINDEX(',"', FirstName_LastName, 1) + 2
, LEN(FirstName_LastName) - CHARINDEX(',"', FirstName_LastName, 1) - 2 )
else ''
END
FROM OPENROWSET(
BULK N'C:\ArielyBulkInsertTesting\Test01.csv'
, FORMATFILE = 'C:\ArielyBulkInsertTesting\Test04.xml'
) a
GO
select * from #FINAL
GO
Summary
The basic idea is to bring all the data in the problematic columns (or up to the end of the line if there is no other way) into one column. We can use a temporary table to store the data, and then parse that column in any way that suits us. We can use T-SQL functions or CLR functions like SPLIT; we can clean characters using REPLACE; we can find characters using CHARINDEX, and so on. This all depends on your specific data, and it has nothing to do with bulk insert anymore :-)
We must separate the operation into two parts:
- Insert the data using bulk insert into the database (into a temporary table, or directly via OPENROWSET) in such a way that we can use it for the second part
- Parse and split the text in the last column into the final columns
* This article elaborates mainly on the first part.
Comments
* A more complex case study in which I used this logic can be seen in the MSDN forum in this link:
* Usually it is much better to do the parsing using CLR functions. If you are not convinced by my recommendation then you can check this link: http://www.sqlperformance.com/2012/07/t-sql-queries/split-strings
* If you can export the file in a consistent format that fits bulk insert, then you should do it! This is only a workaround solution.
* If you can build a well-formatted import file in advance from the original file, using a small application that writes a new file, then do it! That is a much better solution, as most languages do a better job of parsing text than SQL Server (T-SQL).
* If you can manage the order of the columns during the export, try to make sure that all the problematic columns are moved to the end. This helps us use bulk insert in a more optimal way, as we will need to parse fewer columns in step 3.
* Why not import all the data into one column in a temp table instead of STEP 1 & STEP 2?
This is always an option, but probably not a good one. In our case study we use a very simple table structure with 4 columns and only 7 records, but in real life we might get a table with 20 columns or more and several million records. If we have 2 columns (out of 20) with potential problems and we can order the columns so that those columns come last, then we can import most of the data (18 columns) into the final data structure, and we only need to import the last two columns into one column for parsing. It is much better to separate the data into as many columns as we can and to minimize the parsing; parsing is a CPU-intensive operation, and parsing the data after importing will probably take longer. A minimal sketch of the one-column approach is shown below, for comparison.
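* This is only a hypothetical sketch (the table and column names are illustrative): every line is loaded as-is into one wide column, using a field terminator that does not appear in the data, and all the parsing is left for later.
CREATE TABLE #raw (CsvLine nvarchar(max))
GO
BULK INSERT #raw FROM 'C:\ArielyBulkInsertTesting\Test01.csv'
WITH (
ROWTERMINATOR = '\r\n'
, FIELDTERMINATOR = '\t' -- a terminator that never appears in the file, so the whole line lands in one column
);
GO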
When you have to use complex parsing, it is much better to use a CLR solution. As we mentioned at the start, this is a pure T-SQL solution.
Resources & More Information
- This article is based on Ronen Ariely's blog at http://ariely.info/Blog/tabid/83/EntryId/122/Using-Bulk-Insert-to-import-inconsistent-data-format-using-pure-T-SQL.aspx
- Create a Format File (SQL Server) http://msdn.microsoft.com/en-us/library/ms191516.aspx
- Using SSIS there is a simple built-in way: http://www.mssqltips.com/sqlservertip/1316/strip-double-quotes-from-an-import-file-in-integration-services-ssis/
* There are no other references for this solution that I know of, and most forum questions that I found were closed without an answer, or by sending the questioner to a different solution such as SSIS or a third-party application, or by saying that it cannot be done using bulk insert and pure T-SQL.
Some Forum questions
* This article answers several forum questions (more than 15 that I found using Google, and I checked only the first several pages of search results) that remained unanswered for far too long. I did not find any solutions or answers except my own, based on this logic. It is a very easy solution, but we have to think outside the box to get to it :-)