ETL World Record!

Today at the launch of SQL Server 2008, you may have seen references to world-record performance loading data with SSIS. Microsoft and Unisys announced a record for loading data into a relational database using an Extract, Transform and Load (ETL) tool: over 1 TB of TPC-H data was loaded in under 30 minutes. I wanted to provide some background material in the form of a Q&A on the record, since it's hard to give many details in the context of a launch event. We are also planning a paper that covers all of this, so think of this article as a placeholder until the full paper comes along. I hope you find this background information useful.

- Len Wyatt

How fast was the data load?

More than one terabyte of data was parsed from flat files, transferred over the network and loaded into the destination database in less than 30 minutes, a world record beating all previously published results using an ETL tool. That is a rate in excess of 2 TB per hour (650+ MB/second). To be precise, 1.18 TB of flat file data was loaded in 1,794 seconds. This is equivalent to 1.00 TB in 25 minutes 20 seconds, or 2.36 TB per hour.
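
For readers who want to check the arithmetic, here is a small Python sketch (not part of the record run, just a back-of-the-envelope check) that re-derives the quoted rates from the two measured figures, 1.18 TB of flat files and 1,794 seconds. It assumes decimal units (1 TB = 10^12 bytes), which is how the numbers above work out.

```python
# Re-derive the quoted throughput figures from the two measured numbers.
# Assumes decimal units (1 TB = 10**12 bytes).

flat_file_tb = 1.18      # terabytes of flat-file source data
elapsed_s = 1794         # elapsed load time in seconds

tb_per_hour = flat_file_tb * 3600 / elapsed_s          # just over 2.36 TB/hour
mb_per_second = flat_file_tb * 1e12 / 1e6 / elapsed_s  # ~658 MB/second
seconds_per_tb = elapsed_s / flat_file_tb              # ~1520 s, i.e. 25 min 20 s

print(f"{tb_per_hour:.2f} TB/hour")
print(f"{mb_per_second:.0f} MB/second")
print(f"{int(seconds_per_tb // 60)} min {seconds_per_tb % 60:.0f} s per 1.00 TB")
```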

Why is this important?

Businesses have ever-increasing volumes of data stored in many heterogeneous systems. They want to know that the ETL tool they choose will be able to support whatever data volumes they require. Microsoft has been making a significant investment in SQL Server Integration Services (SSIS), and this record illustrates the capability of SSIS 2008, SQL Server 2008 and the Unisys ES7000 to handle a significant volume of data at a dramatic speed.

Why not just do a bulk load of the data?

It is rare in businesses today that data is already available on the destination system and needs no standardization or error correction before loading; those rare cases are the only times a plain bulk load makes sense. Data integration can involve complex transformation rules, error checking and data standardization techniques. ETL tools like SSIS perform these functions: moving data between systems, reformatting data, checking integrity, looking up keys, tracking lineage, and more. SSIS has proven itself to be a versatile ETL tool, and now it has been shown to be the fastest one as well.

What data did you choose to load?

The DBGEN tool from the TPC-H benchmark was used to generate 1.18 TB of source data. DBGEN partitioned the data so that it could be loaded in parallel from multiple systems. DBGEN generates data about customers, parts, suppliers, orders and line items, and is broadly representative of a wholesale business. The data contain a variety of data types, including dates, monetary amounts, integers, strings and flags.

Please note that the ETL loading results are not TPC-H benchmark results and should not be compared to TPC-H benchmark results.

Was this a certified benchmark?

There is no commonly accepted benchmark for ETL tools. Microsoft thinks there should be. Industry standard benchmarks can lead to healthy competition, better products, and better publication of the techniques used to get high performance. Microsoft would welcome the opportunity to join with others in the industry to define a common benchmark that reflects the real-world uses of ETL tools.

The use of TPC-H data for this project was a convenience. This is not a TPC-H benchmark result.

How does this compare to your competitors?

Multiple competitors have published results based on TPC-H data. Informatica has the fastest time previously reported, loading 1 TB in over 45 minutes. SSIS has now beaten that time by more than 15 minutes.

There are other claims of fast times that have been made, but on non-standard data sets and without enough information to allow any meaningful comparison. This is part of the reason Microsoft would support the creation of an industry standard ETL benchmark.

What system configuration was used?

The database server ran on a Unisys ES7000/one Enterprise Server with 32 dual-core Intel® Xeon™ 3.4 GHz (7140M) processors, 256 GB RAM and 8 dual-port 4 Gbit HBAs. The SQL Server data was stored on an EMC CLARiiON CX3-80 SAN with 165 spindles (146 GB, 15k rpm). The database server ran a pre-release build of SQL Server 2008 Enterprise Edition (V10.0.1300.4, built just before the "February 2008 CTP") on the Windows Server 2008 x64 Datacenter Edition operating system.

Four servers acted as data sources, modeling the fact that in a modern enterprise data comes from a variety of systems. Each source server ran SSIS packages that sent data across the network to the database server. The source servers ran SSIS from SQL Server build V10.0.1300.4 on the Windows Server 2008 operating system. Source data came from flat files generated by DBGEN.

For the source servers, four Unisys ES3220L servers running Windows Server 2008 x64 Enterprise Edition were used. Each server was equipped with two 2.0 GHz quad-core Intel® processors, 4 GB RAM, a dual-port 4 Gbit Emulex HBA and an Intel PRO/1000 PT network card. The source data was read from two EMC CLARiiON CX600 SANs with 45 spindles each.

The source servers were connected to the ES7000/one database server with private dual-port 1 Gb Ethernet connections.

Why use multiple source systems?

Modern large businesses are complex operations. Large data sets are often the result of multiple data feeds. Using several source systems made the test more realistic by mimicking a real-world ETL scenario.

What do the SSIS packages look like?

There was just one package, though the source systems ran multiple instances of it. It is quite simple: there is one control flow for each "stream" of data generated by DBGEN. The control flow has one data flow for each table; each data flow reads data from a flat file source and writes to the SQL Server database via OLE DB. With this data set there is a one-to-one column mapping between the flat file data and the database tables.
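
An SSIS package is authored in the designer and stored as DTSX (XML), so there is no package code to paste here, but the structure is easy to sketch. Below is a minimal, purely illustrative Python sketch of that per-stream, per-table pattern; the directory layout and the load_table helper are assumptions, and the actual record run used SSIS data flows rather than any script like this.

```python
# Purely illustrative sketch of the parallelism pattern described above: one
# worker per DBGEN "stream", and within each stream one load per table.
# The directory layout, file names and load_table stub are hypothetical; the
# record run used SSIS data flows (flat file source -> OLE DB destination).
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

TABLES = ["customer", "part", "partsupp", "supplier", "orders", "lineitem"]

def load_table(flat_file: Path, table: str) -> None:
    # Stand-in for one SSIS data flow: parse the flat file and write it to the
    # destination table (done over OLE DB in the record run).
    print(f"loading {flat_file} into {table}")

def load_stream(stream_dir: Path) -> None:
    # Stand-in for one control flow: load each table's file for this stream.
    for table in TABLES:
        flat_file = stream_dir / f"{table}.tbl"
        if flat_file.exists():
            load_table(flat_file, table)

if __name__ == "__main__":
    # Hypothetical layout: one directory of DBGEN output per stream.
    stream_dirs = sorted(Path("/data/dbgen").glob("stream*"))
    with ThreadPoolExecutor(max_workers=max(len(stream_dirs), 1)) as pool:
        list(pool.map(load_stream, stream_dirs))
```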

Did Windows Server 2008 figure into this?

A lot of innovative engineering work in Windows Server 2008, including significant improvements in memory management, PCI and block storage I/O, and core networking, helped achieve this great performance. Because of these advances, Windows Server 2008 sustained about 960 megabytes per second over the Ethernet network during processing of one large table.

Were secret internal tricks needed to make this work?

No secret internal tricks or special builds were needed. Although this project used a pre-release version, it was a regular SQL Server 2008 Enterprise Edition build. No special code in the product was used. Everything we did could be replicated by others.

The main thing done in the relational database was to use "soft NUMA" and port mapping to get a good distribution of work within the system. This is a published technique; you can find articles about it on MSDN. We also set the -x flag when starting SQL Server, which reduces the time SQL Server spends collecting performance statistics at run time.
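
For readers unfamiliar with the technique, the sketch below shows the general shape of a soft-NUMA plus port-mapping layout on SQL Server 2008, as described in the MSDN articles. The node count, CPU masks and port numbers are invented for illustration only and are not the settings used in the record run (those will be in the full paper).

```python
# Illustration only: the general shape of a soft-NUMA + port-mapping layout on
# SQL Server 2008. The node count, CPU masks and ports below are made-up
# examples, NOT the configuration used for the record run.
#
# Soft-NUMA nodes are defined in the registry under
#   HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\100\NodeConfiguration\NodeN
# as a CPUMask value, and TCP ports are bound to nodes in SQL Server
# Configuration Manager using the "port[node bitmap]" syntax.

soft_numa_nodes = {
    0: 0x000F,  # node 0 -> CPUs 0-3
    1: 0x00F0,  # node 1 -> CPUs 4-7
    2: 0x0F00,  # node 2 -> CPUs 8-11
    3: 0xF000,  # node 3 -> CPUs 12-15
}

for node, cpu_mask in soft_numa_nodes.items():
    print(rf"NodeConfiguration\Node{node}: CPUMask = 0x{cpu_mask:08X}")

# One listener port per node; each source server connects to "its" port, so its
# inbound stream is serviced by a distinct group of CPUs.
port_setting = ",".join(f"{1433 + n}[0x{1 << n:X}]" for n in soft_numa_nodes)
print("TCP Port setting in Configuration Manager:", port_setting)
```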

In SSIS we made sure the data types used in the SSIS data flows matched the types used in SQL Server, so the data did not need to be converted again after the initial conversion of the strings read from the flat files. Fast Parse was set on the text file fields where it applied.

The network connections on the server used the built-in Intel PRO/1000 GbE controllers. Released versions of the network drivers were used, and Ethernet jumbo frames were configured to better support this bulk streaming scenario. Windows Server 2008's new TCP/IP receive window autotuning was set to "restricted". The IntPolicy tool was used to ensure that the ES7000 server NICs' interrupts and DPCs occurred on a CPU affinitized to the same NUMA node as the NIC.

A complete list of settings and optimizations will be included in the paper when it is released.

Comments

  • Anonymous
    February 27, 2008
    Very schweet! Congratulations to the team!! SQL Server Performance : ETL World Record! An ETL standard

  • Anonymous
    February 27, 2008
    They mentioned the TPC benchmark results at the SQL Server 2008 launch today. The final paragraph mentions

  • Anonymous
    February 27, 2008
    I just heard about this from Len Wyatt, who's a Principal Program Manager on the SQL Server Performance

  • Anonymous
    February 27, 2008
    Microsoft has published a performance benchmark on loading 1TB of data using ETL stuff in SSIS 2008.

  • Anonymous
    February 27, 2008
    If I told you about loading and parsing one terabyte of data from flat files and then pouring it

  • Anonymous
    February 28, 2008
    Loading of 1TB under 30 minutes

  • Anonymous
    February 28, 2008
    Using SQL Server 2008 and SSIS 2008 the SQL Server Performance team working with Unisys and SQL CAT have

  • Anonymous
    February 28, 2008
    Today, at the official launch of Windows Server 2008, SQL Server 2008 and Visual Studio 2008, it was announced

  • Anonymous
    March 02, 2008
    Great work Len, one comment though. If you're testing throughput of the SSIS pipeline engine then wouldn't it make more sense to remove all other possible variables (as far as possible anyway)? For example, if you're loading into a database then who knows, SQL Server itself could be a bottleneck. Seems to me that the closest measure that you'll ever get of the pipeline engine is if you take the contents of a raw file and dump it into another raw file. Just a thought. Also, did you use the SQL Server Destination or the OLE DB Destination? And did you use an OLE DB or ADO.Net Connection Manager? Why were the source files not local to the machine? Thanks Jamie

  • Anonymous
    March 03, 2008
    If our goal was "testing throughput of the SSIS pipeline engine" then you are right, we would have done this differently. In fact, we know that SSIS had capacity to spare in terms of raw throughput. However, our goal was to model a more realistic environment. We felt that in the real world, if you had a terabyte of data coming in, it might well come from multiple source systems. It would involve a variety of data types (not just the simple ones) and would have to be moved across the network. Certainly the idea of reading and writing text files and even raw files did occur to us. If we did that, would it tell a customer anything useful? I suggest that it would not, because that is rarely the problem that customers are trying to solve. A file copy would probably beat SSIS at that, anyway. I believe that a good ETL benchmark would model a rich environment including many of the things customers really do, such as surrogate key lookups, error checking, and type conversions. The SSIS engine may not be the gating factor in such a scenario, but it's what customers need done. In the long run, an industry benchmark addressing these things should be a goal. - Len

  • Anonymous
    March 06, 2008
    On February 27, Microsoft and Unisys announced SQL Server 2008 as the world record holder among ETL tools for

  • Anonymous
    August 10, 2008
    In case anyone missed the acquisition: Microsoft acquired DATAllegro back in July. Press is here http://www.microsoft.com/Presspass/press/2008/jul08/07-24DataWarehousingPR.mspx

  • Anonymous
    September 18, 2008
    Under Windows Server 2000 and 2003 (RTM), the interrupts from a single network adaptor (NIC) cannot be

  • Anonymous
    October 06, 2008
    Microsoft BI Conference 2008: Day Two thrice

  • Anonymous
    October 07, 2008
    Yesterday the Microsoft BI Conference 2008, one of the two big events, got under way in Seattle,

  • Anonymous
    October 21, 2008
    I would like to introduce briefly how SSIS 2008 has achieved great performance on loading TB data into

  • Anonymous
    November 20, 2008
    This morning the keynote was given by our friend Rushabh Meta, who is the Executive Vice

  • Anonymous
    December 05, 2008
    The SQL CAT team published a Top 10 SQL Server Integration Services Best Practices list back in October,

  • Anonymous
    February 12, 2009
    By popular request, the SQL Customer Advisory Team has collected our lessons learned about tuning data

  • Anonymous
    February 13, 2009
    I previously posted a note about the SQL Server Enterprise Architecture Summit (SEAS) event held in Durban

  • Anonymous
    February 24, 2009
    Author: Carl Rabeler Reviewers: Mark Souza, Prem Mehra, Lindsey Allen, Mike Weiner, James Podgorski,

  • Anonymous
    March 06, 2009
    At the SQL Server 2008 launch day in February of last year (2008), we announced a record: we loaded 1 TB of data in 30 minutes using SSIS! On the SQL Server Performance Blog

  • Anonymous
    July 22, 2010
    Do we have any record for SSIS 2008 R2?

  • Anonymous
    June 17, 2013
    This is absolutely wonderful to know that my fav. technology is the best of its time... Awesome job Microsoft! Keep up the good work... I'm waiting for the result of such a test using SSIS 2012...