Accelerating Oracle -> SQL Server Migrations with Fusion-io ioMemory
We have recently completed several migrations at customer sites using Fusion-io ioMemory Devices to greatly accelerate throughput and reduce downtime. A correctly designed and configured infrastructure and migration strategy can achieve throughput of between 1TB to 2TB per hour. In our customer tests we achieved several cases of 6TB Oracle systems exported & imported to SQL Server in 7 hours.
What are Fusion-io devices & are they supported with SQL Server?
Fusion-io devices are ultra-high performance Solid-State Devices. Unlike conventional mechanical disks these devices have no moving parts. Fusion-io devices plug directly into the PCIe bus in a server and therefore do not pass through the SAN fabric components such as HBAs, FC switches and frontend controllers on the SAN. A single Fusion-io device IOPS capability is measured in hundreds of thousands of IOPS – the largest/fastest Fusion-io device has more than 1,200,000 IOPS. One standard mechanical disk has between 150-200 IOPS.
Fusion-io SSD disks and selected other SSD brands are fully supported for use with SQL Server. The SQL Server Development Lab has extensively tested and validated Fusion-io devices.
Why are Fusion-io devices needed to speed up OS/DB Migrations? Isn’t SAN disk fast enough?
Over the last 5-6 years the capabilities of Intel hardware, the Windows OS and SQL Server have evolved exponentially. SAP systems larger than 10TB to 15TB were exclusively found on UNIX platform 7-10+ years ago. Modern Intel Westmere EX 4 processor servers benchmark over 75,000 SAPS and close to 14,000 users. 8 processor configurations benchmark more than 130,000 SAPS and 24,000 users. Today there are many SAP customers with DB sizes larger than 10TB. We have a large number of UNIX/Oracle customers that wish to move onto cheaper lower cost commodity hardware.
In response to this requirement Microsoft released traceflag 610 for SQL Server to “shutdown” almost all transaction logging during OS/DB Migrations. Traceflag 610 is documented in Note 1241751 - SQL Server minimal logging extensions. Traceflag 610 is discussed in section 9 of my OS/DB Migration FAQ document and is generally recommended for all migrations to SQL Server. This feature dramatically reduces both the size of the transaction log and the IO throughput to the transaction log file, thereby speeding up the migration process. Traceflag 610 works when the database is in BULK_LOGGED and SIMPLE mode.
There is one case where Traceflag 610 is ineffective and SQL Server will switch to standard SIMPLE logging. Very large 10-15TB Oracle systems are almost always exported using Oracle ROW ID table splitting and with an unsorted export. The SAP on Oracle schema unlike SAP on SQL Server schema does not store data in a clustered (sorted) index. Very large SAP systems typically take a very long time to export if the export is configured to sort the data. Section 2 of my OS/DB Migration FAQ document has more details.
Traceflag 610 is only effective if the import source is already sorted, therefore we have found very high LOGWRITE and LOGBUFFER waits. The reason for high LOGWRITE and LOGBUFFER wait times is that SQL Server is sorting all the data in the clustered index to ensure all rows are in an ascending sequence as per the primary index. As new rows get inserted with R3LOAD, there is a high chance that considerable data movement is occurring as rows are shuffled to remain in ascending sequence. Eventually with large OS/DB migrations this leads to very high utilization on the transaction log which increases LOGWRITE times and a secondary effect of high LOGBUFFER times.
The good news is that Fusion-io and other solid state devices completely mitigate the overhead of the additional logging due to an unsorted import source.
How to detect excessive transaction logging?
SQL Server Wait Statistics can be used to analyse if transaction logging or other factors are impacting performance.
To analyse performance fully review this document run this command. This will purge the current Wait Statistics. It is usual for the LOGWRITE performance to be somewhat lower during a OS/DB migration because the Log Write size will be around 60kb rather than the usual 8kb
DBCC SQLPERF ( "sys.dm_os_wait_stats" , CLEAR )
Note: Do not run this command on a running SAP system as this command will cause serious disruption to the SAP SQL Server performance collector and related statistics. This command should be used only while optimizing a R3LOAD import
Now run the OS/DB migration import. When the import finishes immediately run this command:
SELECT * FROM sys.dm_os_wait_stats
Copy and paste the Wait Statistics from SQL Management Studio into Excel. It is useful to add the columns “Wait/Requests” and to express the Wait Time as a percentage of total Wait Time.
The below example displays 60% of Wait Time was around transaction logging. If WRITELOG and LOGBUFFER are more than 5-10% in total it is likely the performance of the transaction log is a limiting factor.
Wait Type |
Requests |
Wait Time |
Signal Wait Time |
Wait/Requests |
Time % |
LOGBUFFER |
107572144 |
480955987 |
573028200 |
4.47 |
38.63% |
WRITELOG |
54366633 |
288127653 |
146760600 |
5.30 |
23.14% |
ASYNC_NETWORK_IO |
130867116 |
110600897 |
336770500 |
0.85 |
8.88% |
LCK_M_IX |
6506 |
103455497 |
5509479 |
15901.55 |
8.31% |
SOS_SCHEDULER_YIELD |
27292022 |
76591533 |
283617 |
2.81 |
6.15% |
PAGELATCH_SH |
461479 |
40322038 |
501463 |
87.38 |
3.24% |
LAZYWRITER_SLEEP |
1634918 |
40099749 |
459286 |
24.53 |
3.22% |
LATCH_EX |
13971 |
36515448 |
1244014 |
2613.66 |
2.93% |
PAGEIOLATCH_EX |
2696925 |
7986746 |
67403 |
2.96 |
0.0064 |
LATCH_SH |
19696 |
6667832 |
1076536 |
338.54 |
0.54% |
The example below shows an OS/DB Migration Import and indicates that Network was the largest bottleneck on this system consuming 38.92% of all time. WRITELOG and LOGBUFFER are performing well in this case due to the addition of a Fusion-io card.
Wait Type |
Requests |
Wait Time |
Signal Wait Time |
Wait/Requests |
Time % |
Total |
699504300 |
1753331000 |
573028200 |
2.51 |
100.00% |
ASYNC_NETWORK_IO |
418040400 |
682462100 |
146760600 |
1.63 |
38.92% |
SOS_SCHEDULER_YIELD |
175555400 |
336949100 |
336770500 |
1.92 |
19.22% |
LAZYWRITER_SLEEP |
16000280 |
293505800 |
5509479 |
18.34 |
16.74% |
CXPACKET |
16429890 |
101365500 |
283617 |
6.17 |
5.78% |
IO_COMPLETION |
12262320 |
43917180 |
501463 |
3.58 |
2.50% |
PAGELATCH_UP |
463536 |
6751748 |
459286 |
14.57 |
0.39% |
WRITELOG |
1800964 |
6746422 |
1244014 |
3.75 |
0.38% |
LOGBUFFER |
116672 |
133657 |
67403 |
1.15 |
0.01% |
SLEEP_BPOOL_FLUSH |
6078143 |
5279898 |
1076536 |
0.87 |
0.30% |
PAGEIOLATCH_SH |
20012 |
213189 |
3905 |
10.65 |
0.01% |
PAGEIOLATCH_UP |
8832 |
195787 |
6573 |
22.17 |
0.01% |
MSQL_XP |
10077 |
195254 |
0 |
19.38 |
0.01% |
PREEMPTIVE_OS_GET |
10077 |
195027 |
0 |
19.35 |
0.01% |
A common scenario while importing during SAP migrations is that the waits on LOGBUFFER wait can become high whereas WRITELOG waits still look fine with less than 5ms latency. The reason for this is due to SQL Server limiting the total number of outstanding Log Write IOs. For details on the limits see this article from SQL CAT Team.
This results in worker threads of SQL Server possibility waiting to write into the log buffers. Since all log buffers might be outstanding, there is a short wait time for a log buffers to be available. Hence speeding up I/O latency against the transaction log volume will reduce the chance of SQL Server hitting the limits of outstanding I/O or log volumes. With that the wait on a free log buffer would be reduced.
Some common Wait conditions during R3LOAD imports and their possible solutions are listed below:
WRITELOG/LOGBUFFER : Put Transaction Log on Fusion-io disk
ASYNC_NETWORK_IO : Use 10G network cards with RSS
CXPACKET : Do not set MAXDOP to 0, rather use a value such as 4
SOS_SCHEDULER_YIELD : Use powerful new Intel/AMD based servers
The advantage of Fusion-io cards in reducing I/O latency is twofold. First is that these devices reduce IO latency by factors of hundreds or thousands compared to mechanical disks. Second advantage is the fact that the cards are connected directly with the PCIe slot without any further controller card or switches to the storage. Therefore the I/O latency is reduced dramatically because no IO travels through the SAN fabric. The entire bandwidth of the SAN fabric & cache is available for writing to the SQL Datafiles.
Windows Performance Monitor Disk Counters will show between 1-5ms per IO READ or WRITE (0.001-0.005 seconds) for mechanical disks on midrange SAN. Fusion-io devices with the same workload can show 0.01-0.001ms per IO READ or WRITE (0.00001-0.000001 seconds).
How to deploy Fusion-io devices?
Fusion-io and other solid-state devices require addition configuration and installation steps to achieve maximum throughput. The exact deployment guidelines are slightly different between different solid-state vendors, therefore this blog will not go into details of drive formatting and configuration. We will release a subsequent blog and whitepaper on the detailed configuration of Fusion-io disks (and other solid-state vendors) specifically for optimizing the OS/DB Migration effort.
Thanks to Amit Jethva
Contributors: Sumeet, Mathew & Joshua from Fusion-io
Comments
- Anonymous
September 12, 2012
Any more input on Fusion-IO and SQL? Drive set up, etc? Thanks, Tom Beauregard Lubrizol Corp tabe@lubrizol.com