Case Study: Import data from XML to SQL (Part-1)

 

Importing XML into SQL (or into any relational database) can be a real pain if the XML schema has lots of nodes and sub-nodes.

How about an XML schema file that can generate around 50 tables?

Creating a relational schema becomes difficult because XML parent-child relationships often do not translate easily into SQL relational mappings when the schema is complex.

 

How does SSIS take care of this?

SSIS adds one unique column to each table and uses this unique key to keep track of all the mappings.

So if an XML source has 5 columns, SSIS will create a table with 6 columns; the new column will be an identity column.
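
Incidentally, the ADO.NET DataSet we will use later in this post behaves in much the same way, so you can see such generated key columns for yourself. A minimal sketch, assuming you fill in the placeholder path (nested elements typically get an auto-generated key column such as User_Id):

 // Requires: using System; and using System.Data;
 DataSet ds = new DataSet();
 ds.ReadXml(@"<XML file path>"); // infers tables from the nested XML
 foreach (DataTable t in ds.Tables)
 {
     Console.WriteLine(t.TableName);
     foreach (DataColumn c in t.Columns)
         Console.WriteLine("  " + c.ColumnName); // nested elements get keys like "User_Id"
 }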

 

How are the values for this new column decided?

This identity column does not have linear values in most cases.

For each complex type in the XML schema, a new ID is generated, and subsequent values depend on the number of rows each type has:

<?xml version="1.0"?>
<Gate specVersion="2.09">
  <Header>  ---------> ID=1 (only one row)
    <SenderID></SenderID>
    <ReceiverID></ReceiverID>
    <DocTypeID>MasterData</DocTypeID>
    <DocCount>1</DocCount>
    <Date type="generation">07/26/2013 13:01</Date>  ---------> ID=2 (only one value)
  </Header>
  <ExtractRequest>  ---------> ID=3
    <User UserId="abc@abc.com" FirstName="abc" LastName="xyz">  ---------> ID=4
      <ContactMethods>  ---------> ID=5
        <ContactMethod sequenceNum="1" type="Phone">XXX-XXX-XXXX</ContactMethod>  ---------> ID=6 (first row)
        <ContactMethod sequenceNum="2" type="Email">abc@abc.com</ContactMethod>  ---------> ID=7 (second row)
        <ContactMethod sequenceNum="3" type="SMSEmail"></ContactMethod>  ---------> ID=8
        <ContactMethod sequenceNum="4" type="Fax"></ContactMethod>  ---------> ID=9
      </ContactMethods>
    </User>
  </ExtractRequest>
</Gate>

 

I hope the above example helps in understanding how the values are assigned.

 

So if you look in the ContactMethod table, you will see that the first ID value is 6 and not 1.

Also, the next value might not be 7.

So this creates its own challenges, and it is especially problematic if you have multiple XML files.

Why?

Because for each new XML file the ID value gets reset; it does not consider how many rows are already in the database.

 

If you want to tackle this issue using SSIS, you have to use lots of Merge Join, Sort, Script Task, and Execute SQL Task components, etc. (Basically, it is not very simple; or let me say it is one of the worst things on earth you could choose to do.)

Do you want to see how it looks when importing 5 tables in such a case?

 

 

Now imagine how it would look for 50. (Let me answer that: scary!!!)

 

So how do we import data for such scenario?

I created C# .NET code which can do this; it is not very straightforward code, but it is much simpler than what the SSIS package would have looked like.

I will break down the code, explain what it does, and show how it could be made robust enough to handle most scenarios.

 

Consider basic data where each node appears only once, so we just have to take care of primary key values and mappings. (We are dealing with only one XML file right now, so we are not checking how many rows the database already has; we will do that in the 2nd part of this case study.)

This code also takes care of creating tables if they are not present, and if a table is present but a column is not, it will add the column to the table.

 

 // Requires: using System.Xml; and using System.Xml.Linq;
 XmlReader xr = XmlReader.Create(<File path>); // create an XmlReader for the XML file
 XDocument xd = XDocument.Load(xr); // create an XDocument, as it will help us change node names later
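
For context, the XDocument is loaded because later we will need to change node names (item 2 in the part-2 list at the end of this post). A minimal sketch of what that could look like; the names ContactMethod and UserContactMethod below are purely hypothetical examples:

 // hypothetical node names, for illustration only
 foreach (XElement el in xd.Descendants("ContactMethod"))
 {
     el.Name = "UserContactMethod"; // rename the node in memory
 }
 xd.Save(@"<New XML file path>"); // write the modified XML back out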
 

 

What if you don’t have schema file and want to create schema at run time?

The code below will take care of this for you.

 // Requires: using System.Diagnostics;
 Process p = new Process();
 p.StartInfo.UseShellExecute = false;
 p.StartInfo.RedirectStandardOutput = true;
 p.StartInfo.FileName = @"C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\Bin\xsd.exe";
 p.StartInfo.Arguments = @"<XML file path with file name> /outputdir:<Folder where you want to save XSD file>";
 p.Start();
 // read the output before WaitForExit to avoid a deadlock if the output buffer fills up
 string output = p.StandardOutput.ReadToEnd();
 p.WaitForExit();
 Console.WriteLine("OUTPUT FROM XSD.EXE : " + output);
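
If you would rather not shell out to an external tool, ADO.NET can also infer the schema directly. A minimal sketch (the inferred schema can differ slightly from what xsd.exe produces, so verify the generated tables):

 DataSet inferred = new DataSet("XML data");
 inferred.InferXmlSchema(@"<XML file path>", null); // infer tables and relations from the XML itself
 inferred.WriteXmlSchema(@"<Folder where you want to save XSD file>\output.xsd"); // persist the XSD if needed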
 

Now we will proceed with the rest of the piece, where we will create tables (with the right data types) at run time and bulk insert the data (pay attention, guys, this is really interesting... at least I found it interesting!!!).

 

 // Requires: using System.Collections.Generic; using System.Data; and using System.Data.SqlClient;

 // create a dataset which will hold all the data
 DataSet ds1 = new DataSet("XML data1");

 // load the schema first (the XSD generated by xsd.exe above)
 ds1.ReadXmlSchema(@"D:\userdocs\Ken_XML\FAC_SSIS_Identities\output.xsd");

 ds1.EnforceConstraints = false;

 // load the data into the dataset
 ds1.ReadXml(@"<XML file>");

 // getting the right data type for a column is very important, as a wrong data type can create problems later;
 // for the mapping we are using the below key-value pairs
 Dictionary<string, string> typemappings = new Dictionary<string, string>();
 typemappings.Add("Decimal", "Numeric");
 typemappings.Add("String", "varchar(255)");
 typemappings.Add("Int32", "Int");

 // connect to SQL Server and the appropriate database; check whether each table is already present,
 // create it if it is not, and then load the data from the dataset into SQL using bulk copy
 try
 {
     SqlConnection conn = new SqlConnection("Server=localhost;Initial Catalog=XML_data;Trusted_Connection=True;");
     conn.Open();
     foreach (DataTable dt in ds1.Tables)
     {
         // check whether the table is present or not
         string exists = null;
         try
         {
             SqlCommand cmd = new SqlCommand("SELECT * FROM sysobjects WHERE name = '" + dt.TableName + "'", conn);
             exists = cmd.ExecuteScalar().ToString(); // throws if no row is returned
         }
         catch (Exception exce)
         {
             exists = null;
         }
         // walk each column of the DataTable to create/extend the table in the database
         Console.WriteLine("Bulk Insert Started table:" + dt.TableName);
         SqlBulkCopy bulk = new SqlBulkCopy(conn);
         bulk.DestinationTableName = "[" + dt.TableName + "]";
         foreach (DataColumn dc in dt.Columns)
         {
             bulk.ColumnMappings.Add(dc.ColumnName, dc.ColumnName);
             string type = "";
             typemappings.TryGetValue(dc.DataType.FullName.Split('.')[1], out type);
             if (type == null)
                 type = "varchar(255)";
             if (exists == null)
             {
                 // the table does not exist yet: create it with the first column
                 SqlCommand createtable = new SqlCommand("CREATE TABLE [" + dt.TableName + "] ([" + dc.ColumnName + "] " + type + ")", conn);
                 createtable.ExecuteNonQuery();
                 exists = dt.TableName;
             }
             else
             {
                 // the table exists: add the column; the ALTER fails harmlessly if the column is already there
                 try
                 {
                     SqlCommand addcolumn = new SqlCommand("ALTER TABLE [" + dt.TableName + "] ADD [" + dc.ColumnName + "] " + type, conn);
                     addcolumn.ExecuteNonQuery();
                 }
                 catch (Exception ex2) { }
             }
         }
         // load the data into SQL
         bulk.WriteToServer(dt);
         Console.WriteLine("Bulk Insert completed table:" + dt.TableName);
     }
     conn.Close();
 }
 catch (Exception ex)
 {
     Console.WriteLine(ex.Message + "\n" + ex.StackTrace);
 }
 finally
 {
     ds1.Clear();
     ds1.Dispose();
 }
 

So this piece of code is enough to populate the data in the database, along with creating the tables and all related columns.
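
One easy robustness improvement, as a minimal sketch: for the table-existence check, query sys.tables with a parameter instead of concatenating the table name into the SQL and treating an exception as "not found":

 SqlCommand check = new SqlCommand("SELECT COUNT(*) FROM sys.tables WHERE name = @name", conn);
 check.Parameters.AddWithValue("@name", dt.TableName);
 bool tableExists = (int)check.ExecuteScalar() > 0; // no try/catch needed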

In the next part we will modify the code to make it more robust so that it can take care of:

1)      Duplicate nodes in the XML file

2)      Creating a new XML file with replaced node names

3)      Enumerating through all XML files in a given folder

4)      Getting how many rows are already in the database and deciding the next ID value accordingly (see the sketch after this list)
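
As a rough preview of item 4 (a minimal sketch only; the names ContactMethod and ContactMethod_Id are hypothetical placeholders for whatever your schema actually generates), the idea is to read the current maximum ID from the database and seed the DataSet's auto-increment key column from it:

 // hypothetical table/column names; run this after ReadXmlSchema but before ReadXml,
 // so newly generated IDs continue after the rows already in the database
 SqlCommand maxCmd = new SqlCommand("SELECT ISNULL(MAX([ContactMethod_Id]), 0) FROM [ContactMethod]", conn);
 int maxId = (int)maxCmd.ExecuteScalar();
 DataColumn idCol = ds1.Tables["ContactMethod"].Columns["ContactMethod_Id"];
 idCol.AutoIncrementSeed = maxId + 1;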

 

Let me know if this process doesn't make sense to you or if there is anything that needs more explanation.

I will be adding a link to the complete code at the end of the second part.

Comments

  • Anonymous
    September 14, 2013
    Hello again… In Case Study: Import data from XML to SQL (Part-1) we tried to save XML data

  • Anonymous
    November 26, 2013
    Hi Dilkush, It was a nice read. Currently, I am also working on populating data in XML format into SQL Server. It has several nesting of loops (5+) and has many types (50+) . I am able to populate the data correctly into the required 50+ tables. However, it is taking too much of memory consumption (99% utilization of 20GB RAM) during the process and hence risks the SQL server performance during that period. The XML file size is over 1.5GB. Is there any way I can configure the DFT to be less memory intensive and not affect SQL Server performance. Thanks :)

  • Anonymous
    November 26, 2013
    Are you using any transformations in between? Usually SSIS does not fetch all rows together; it processes them in batches. The only problem would be if you have blocking transformations along with a Multicast; that will shoot up memory consumption. Could you check how much memory dtexec.exe is taking (DtsDebugHost.exe in the case of BIDS)? Also, make sure you run the package in 64-bit mode.
