Why Bulk Import?
Most IT shops using SQL Server need to load large amounts of data obtained from external sources or from another database server into SQL Server. This is typically done using the optimized insert path that SQL Server provides through the BULK INSERT command and the BCP utility. Customers often wonder what optimizations are available through Bulk Insert/BCP and under what conditions they apply. In this series of blog posts, I will walk you through all of the bulk import optimizations with examples.
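To make the two interfaces concrete, here is a minimal sketch of a bulk load through each of them. The database, table, and file names are placeholders, and the terminator options are assumptions about the file format.

-- Load a comma-delimited flat file with BULK INSERT (illustrative names).
BULK INSERT dbo.SalesStaging
FROM 'C:\import\sales.dat'
WITH (
    FIELDTERMINATOR = ',',   -- column delimiter in the file
    ROWTERMINATOR   = '\n'   -- row delimiter in the file
);

-- The equivalent load with the bcp command-line utility (run from a shell):
-- bcp MyDatabase.dbo.SalesStaging in C:\import\sales.dat -c -t, -T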
To understand the optimized insert path better, let us first take a look at the normal insert path SQL Server uses to insert one or more rows. As you may recall, SQL Server guarantees the ACID (Atomicity, Consistency, Isolation, and Durability) properties for database transactions. Locking and logging are some of the building blocks SQL Server uses to provide these properties. A detailed discussion of how SQL Server implements the ACID properties is beyond the scope of this post. For this discussion, it suffices to say that when a data row is inserted, it needs to be locked for the duration of the transaction to prevent concurrent transactions from accessing it. Similarly, the inserted data row needs to be logged so that the transaction can be rolled back or rolled forward, depending on whether it needs to be undone or redone. At a high level, the steps to insert a row into a table are:
· Lock the row. This is done by taking a row-level lock.
· Write a log record. This log record contains a full copy of the row being inserted, in addition to the other fields the log record requires.
· Insert the row on the data page (a sketch for observing these steps in action follows the list).
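One way to observe these steps is to insert a single row inside an open transaction and look at the locks it holds and the log records it produces. This is only a rough diagnostic sketch: the table and its columns are hypothetical, and fn_dblog is an undocumented function.

BEGIN TRANSACTION;

-- Hypothetical two-column table
INSERT INTO dbo.SalesStaging (OrderId, Amount) VALUES (1, 99.50);

-- Step 1: the row-level lock taken for this row
SELECT resource_type, request_mode, request_status
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID;

-- Step 2: the log record containing the full copy of the inserted row
SELECT [Operation], [Context], [Log Record Length]
FROM fn_dblog(NULL, NULL)
WHERE [Operation] = 'LOP_INSERT_ROWS';

ROLLBACK TRANSACTION;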
These steps are repeated for each row and for each index defined on the table. Now imagine that you are importing a large number of rows; in this case the overhead of each of these steps can be significant. For example, consider importing 10 million rows into a table with an average row size of 300 bytes. To achieve this, SQL Server needs to acquire 10 million locks and write both data and log in excess of 3 GB each. Note that each data row is written twice: first in the log record and then on the data page. Ideally you would like the data row to be written only once, but clearly that is not the case here. Besides this, other overhead includes:
· Acquiring/releasing 10 million locks consumes precious memory and CPU. Yes, you can minimize this by giving a table lock (TABLOCK) hint, as shown in the sketch after this list.
· Writing 10 million log records adds to the contention in logging, as there is only a single point of insertion for log records. Also, writing 10 million log records takes proportional space in the log, and if you are inserting all the rows in one transaction, it can potentially cause the log to run out of space, because only the inactive portion of the log can be backed up. You can, of course, mitigate this issue by inserting the 10 million rows in smaller batches of, say, 10,000 rows each.
· Constraint Checking: if the target table (i.e. the table you are importing the data into) has constraints defined, they need to be checked for each insert. You can optionally disable the constraints.
· Triggers: if the target table has a trigger defined for the insert operation, the trigger needs to be executed for each insert. Like constraints, you can optionally disable triggers.
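The following sketch pulls these mitigations together: a table lock instead of row locks, batching to keep the active portion of the log small, and leaving constraint checking and trigger firing off during the load. All object and file names are illustrative, and the batch size is simply the example value used above.

-- BULK INSERT with a table lock and 10,000-row batches; by default it does not
-- check constraints or fire triggers unless CHECK_CONSTRAINTS / FIRE_TRIGGERS
-- are specified.
BULK INSERT dbo.SalesStaging
FROM 'C:\import\sales.dat'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    TABLOCK,             -- one table lock instead of millions of row locks
    BATCHSIZE = 10000    -- each batch commits as its own transaction
);

-- For ordinary INSERT statements, constraints and triggers can be disabled
-- explicitly around the load (constraint and trigger names are hypothetical):
ALTER TABLE dbo.SalesStaging NOCHECK CONSTRAINT CK_SalesStaging_Amount;
DISABLE TRIGGER trg_SalesStaging_Insert ON dbo.SalesStaging;

-- ... run the import ...

ALTER TABLE dbo.SalesStaging WITH CHECK CHECK CONSTRAINT CK_SalesStaging_Amount;
ENABLE TRIGGER trg_SalesStaging_Insert ON dbo.SalesStaging;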
As you can see, there is a fair amount of overhead for inserting a row. This is where optimized bulk import comes to your rescue.