SQL Server Health Check, Physical disk I/O operations, BizTalk Server design and settings

BizTalk Server performance relies on several hardware (disks, cpu(s)...), software (SQL Server, BizTalk Host Instance…), network components and the application design by itself.
Each of these above component involved in the architecture may impact heavily the BizTalk server performance.
The basic recommendation of an architecture that will ensure the high availability is to have at least:
_ a group of 2 BizTalk servers within a group
_ a cluster SQL server instance hosting the BizTalk server databases
_ a Master secret server running as a generic resource on the SQL server instance
In the next pages, I'll be reviewing each component and I'll mention what to check in order to make sure if the component is really the bottleneck in the chain of the components that forms the architecture.

Perform some health check operations on the SQL server instance
Sometimes the CPU spikes to 100 percent due to several reasons: CPUs, SQL requests, large messages,… and the BizTalk messages processing performance starts to show a slowness over the time. Other times, users report some problems related to the memory pressure and I/O issues.
From a troubleshooting perspective, please consider to walk through the steps below:

  1. Are the "statistics" disabled for the BizTalkMsgBoxDb database?
    Statistics should be disabled on the BizTalkMsgBoxDB database. By default, the statistics are disabled for the BizTalkMsgBoxDb database during the BizTalk server configuration.
    The MessageBox_UpdateStats_BizTalkMsgBoxDb BizTalk job will update the statistics for the BizTalkMsgBoxDb database. This job is using the "sp_updatestats" stored procedure to
    update statistics. The statistics are database related. So, the statistics are updated behind the scene by the MessageBox_UpdateStats_BizTalkMsgBoxDb BizTalk job.
    As such this job should be enabled.
    The statistics can be enabled or disabled by using the "sp_autostats" stored procedure.

  2. Does the "max degree of parallelism" option is correctly set?
    Run " sp_configure " stored procedurefrom the management studio MMCquery and make sure the "run_value" column of the "max degree of parallelism" option is set to 1, like:
    max degree of parallelism 0 64 1 1
    By default, this setting is set to 1 during the BizTalk installation/configuration and should not be modified.

  3. Does the AWE option is enabled?
    If AWE is enabled (can be checked out via sp_configure SP), the memory allocated through theAWE mechanism is not reported by "Task Manager" or in the Process: Private Bytes performance counter.
    So, you need to use SQL Server specific counters or Dynamic Management Views to obtain this information.
    Address Windowing Extensions (AWE) is an API that allows a 32-bit application to manipulate physical memory beyond the inherent 32-bit address limit.
    AWE mechanism technically is not necessary on 64-bit platform. It is, however, present there. Memory pages that are allocated through the AWE mechanism are referred as locked pages on the 64-bit platform.
    To find out how much memory SQL Server has allocated through the AWE mechanism, you can run the following query:
    select sum(awe_allocated_kb)/1024 as 'AWE_allocated_mem_MB' from sys.dm_os_memory_clerks

    To get information about memory objects that are currently allocated by SQL server use the sys.dm_os_memory_objects DMV.
    This DMV is very useful to analyze the memory usage and to identify possible memory leaks.

    The query below can be used to identify the amount of memory in Kilo Bytes (KB) used by each memory object type:
    select type, SUM(pages_allocated_count * page_size_in_bytes) /1024 as 'KB_Used' from sys.dm_os_memory_objects
    group by type
    order by KB_Used DESC

  4. Should I need to configure SQL Server to use more threads – sp_configure 'max server threads'?
    This is really a situation where you should consider if there is a need to increase the SQL worker threads or not. Before taking the decision, run the query
    select AVG (work_queue_count) from sys.dm_os_schedulers where
    status = 'VISIBLE ONLINE'

    If the average result value is above 1 then you certainly may benefit from adding more threads to the system but you have to respect these 2 conditions:

    1. The current load is not CPU bounded ( see point 6 below )
    2. The current load doesn't experience any other heavy waits
  5. Does the SQL Server instance experience I/O problems?
    The answer to this question is to monitor the length of I/O queues by running:
    select pending_disk_io_count from sys.dm_os_schedulers

    If over time the numbers returned are keeping growing or you are experiencing periodic jumps/spike or the numbers stay relatively high most likely the system is I/O bound.
    In order to identify the cause you will have to dive further and mainly to monitor the physical disk (see below or next post on physical disk).

  6. Is the work load really CPU(s) bound?
    In some cases, you are wondering if the system CPU(s) can really handle the entire load without impacting heavily the global performance of your design. To answer to this question, you will have to find out average length of a "runnable" queue on the system (SQL Server instance)
    Run the query:
    select AVG (runnable_tasks_count) from sys.dm_os_schedulers where status = 'VISIBLE ONLINE'

    If a number of "runnable" tasks per each scheduler are always greater than 1 and all of your queries have a correct plan then you will need more CPUs (or performing ones).
    As mentioned above, before suspecting the CPUs system, be sure that the SQL optimizer is not generating a bad query plan. Having the statistics out of date or you are trying to perform a "plumber" optimization will lead the optimizer to generate a bad query plan.

  7. What's the size of each database within the SQL Server instance including the system databases?
    Running the SQL query below will report the data and log file size for each database
    execute master.sys.sp_MSforeachdb 'USE [?]; EXEC sp_helpfile'
    If you need further information about the data details like: reserved, index, unused …
    Run the below query:
    exec sp_msforeachdb N'use [?]; exec sp_spaceused @updateusage = N''TRUE'';'

    BizTalk server functioning is like replication: Publish and consume
    Customer applications publish messages into the BizTalkMsgBox database and the subscribers (Orchestrations and send ports) consume messages from this database.
    Once the messages processing is completed, the BizTalk server jobs will purge the messages and the tracking sub-service named: Tracking Data Decoding Service (TDDS) will move the tracking data to the BizTalkDTADb database. TDDS sub-service is just a set of threads running into the BTSNTSvc.Exe process (Which is commonly named: BizTalk Host Instance)
    So, BizTalkMsgBoxDb database is an intermediary location hence its size should remain a small one. Once the messages processing is completed, the database tables should be cleaned up. Of course, the log file of the database will be truncated only once there is a backup up operations due to FULL recovery model setting. Make sure the Backup BizTalk Server (BizTalkMgmtDb) job is not disabled and it works fine. If the size of these databases are very large one (several GBytes) the performance of the BizTalk messages processing will be impacted.
    If the BizTalkDTADb is very large please consider to run the purge and archive as mentioned in the BizTalk server online documentation. Search for topic: "How to Purge Data from the BizTalk Tracking Database"

  8. What's the size of each table of any database within the SQL Server instance?
    Run the query below and it will return the size of each table. Do prefer the result in a "text output" mode.
    exec sp_msforeachtable 'exec sp_spaceused[?]'
    The query will report the details of: rows, reserved, data, index_size, unused
    As mentioned above, normally all the BizTalkMsgBoxDB tables should have small sizes otherwise there is something that does not work as expected like: BizTalk jobs are not up and running, the tracking is not enabled or it's not working fine, partner are not reachable …
    Based on what tables are very large, it will help to identify the root cause of the problem.

  9. TempDB system database tunning?
    In SQL 2005 edition, TEMPDB system database plays an important role as it uses temporary workspace for operations involving temporary tables, rows versioning, table variables….
    In a nutshell, TEMPDB is used by SQL Server 2005 to store intermediate result sets.
    Of course the end user is not explicitly using these operations; however the SQL server instance is implicitly using the tempdb for operations like online index operations, row versioning, DML….
    It is difficult to estimate the tempdb space requirement for an application. All methods are not precise. It requires experience and experiment to gain a satisfactory result.
    Basic recommendation should lead to always have a safety factor of about 20% more space. Therefore it's important to optimize the performance of the tempdb system database.
    You can optimize the performance of the tempdb database by combining:
    _ Pre-allocate adequate space (not using the auto grow default settings)
    _ Separate the log file like any other database
    _ Use multiple data files (one per processor or processor-core (but not hyperthreads) , but don't
    exceed 4 files in a total).SQL Server uses one thread per tempdb database file to perform
    concurrent I/O operations.
    _ Use faster disk if your database solution uses heavily tempdb database.
    _ Use appropriate Disk raid solution
    _ Use the local disk system if the tempdb database is heavily utilized.
    Use the following general guidelines when setting the FILEGROWTH increment for tempdb:
    Tempdb file size
                    File growth increment for tempdb
    0 to 100 MB 10 MB
    100 to 200 MB 20 MB
    500 or more 10%
    Useful links for more details:
    https://msdn.microsoft.com/en-us/library/bb743398.aspx
    https://social.msdn.microsoft.com/Search/en-US/?query=WorkingWithTempDB&ac=8

  10. Do I need indexes rebuilt or re-organized?
    Over the time, indexes need to be rebuilt or re-organized otherwise the messages processing performance will be impacted.
    To be sure about the necessary operation that should be performed on the indexes, run the SQL query:
    select * from sys.dm_db_index_physical_stats(NULL, NULL, NULL, NULL, NULL)

    Have a look at the "avg_fragmentation_in_percent" column value and proceed as follow:
    avg_fragmentation_in_percent value Action
    > 5% and < = 30% ALTER INDEX REORGANIZE
    > 30% ALTER INDEX REBUILD WITH (ONLINE = ON)

  11. How many active SQL requests currently running and how long they take to complete?
    To identify all active SQL requests currently consuming CPU, DiskIO, Blocked by another SPID, etc.., please run:
    Sp_who2 'active'
    For each active SPID, this request will report mainly: CPU time, BlkBy, DiskIO, name of command
    that corresponds to the SPID.
    If you want to know the query executed by any SPID returned by the sp_who2 query, proceed as follow:
    dbcc inputbuffer(spid)

    You may also want to use the sys_dm_exec_requests DMV.
    This DMV returns information about each request that is currently executing within the SQL server instance.
    One row is returned for every executing query. The DMV is very useful to determine the operation of queries that take a long time to complete.
    The following query returns details about all the user queries currently executing on the instance of the SQL Server.
    select session_id, command, total_elapsed_time, status, reads,writes, start_time, sql_handle from sys.dm_exec_requests
    where session_id > 50
    order by total_elapsed_time DESC

    To determine the text of the corresponding SQL statement returned by sys_dm_exec_requests DMV, use the sys.dm_exec_sql_text DMV that takes the sql_handle as input parameter
    select text from sys.dm_exec_sql_text(0x03000C00A090EB542946E500449900000100000000000000)

  12. Is there a massive dead locks encountered?
    By enabling the switch 1222 (or 1204 if it's SQL 2000 edition), the SQL Server instance will log in the errorlog file all the dead locks it encounters during the SQL requests execution.
    To enable the trace, run:
    Dbcc traceon (1222,-1)
    Facing some dead locks it quite normal behavior. However if there is a massive dead locks then you should dig further.

  13. What's the result of the top queries CPU, IO…?
    Sometimes, it's recommended to start to look at this request before any further investigations. This step will help for you to better isolate the problem.
    Open SQL Server Management Studio MMC and connect to the SQL Server instance
    1. Right click on the SQL Server Instance and select properties
    2. Select Reports
    3 . Select Standard Reports and you will find Server Dashboard, Performance…. Topics

  14. Collect data with SQL server Best Practices Analyzer (BPA) tool
    The SQL Server 2005 Best Practices Analyzer (BPA) gathers data from Microsoft Windows and SQL Server configuration settings.
    This tool can be downloaded from Microsoft download location.
    BPA uses a predefined list of SQL Server 2005 recommendations and best practices to determine if there are potential issues in the database environment.

  15. Collect SQL profiler trace and include execution plan
    ProgramèStart è Run è Profiler90 (if it's SQL 2000 then type "profiler") On the File menu, select New Trace and connect to the server instance that you want to monitor.
    On the Trace Properties screen, select the Events Selection tab to select the events that you want to monitor. In the lower right portion of the screen, select the check box for Show All Events. This will expand all of the available events to monitor in the trace.
    Navigate to the Performance section and select the Showplan XML event. This event will capture statement execution plans in an XML format, which you can view graphically in SQL Profiler or SQL Server Management Studio
    ==>Analyze SQL profiler trace with readtrace tool
    The SQL profiler trace can be quickly analyzed by the readtrace tool by showing the SQL requests consuming ressources.
    The tool can be downloaded from the https://www.microsoft.com/downloads/details.aspx?familyid=7EDFA95A-A32F-440F-A3A8-
    5160C8DBE926&displaylang=en
    link

 

Physical disk I/O operations

Under a heavy work load and if you have a slow I/O subsystem, the BizTalk messages processing will start to show a slowness and users will start to experience peformance problems such as slow response times, tasks that abort due to timeouts,...
To identify I/O bottlenecks, you will need to look into multiple performance counters to cross check the outcome of your findings.
Don't rely ONLY on one performance counter or on a short period of time of analysis.
The following performance counters will help you to determine if you are experiencing I/O bottlenecks or very slow hard disk:

PhysicalDisk Object: Avg. Disk Queue Length
This performance counter represents the average number of physical read and write requests that were queued on the selected physical disk.
If your I/O system is overloaded, more read/write operations will be waiting.and queued.
If your disk queue length frequently exceeds a value of 2 during peak usage, then you might have an I/O bottleneck .

Avg. Disk Sec/Read
This performance counter represents the average time, in seconds, of a read of data from the disk.

Less than 10 ms - very good
Between 10 - 20 ms – Ok
Between 20 - 50 ms - slow, needs attention
Greater than 50 ms – Serious I/O bottleneck

Avg. Disk Sec/Write
This performance counter is the average time, in seconds, of a write of data to the disk.
For value and numbers, Please refer to the guideline in the previous bullet.

Physical Disk: %Disk Time
This performance counter is the percentage of elapsed time that the selected disk drive was busy servicing read or writes requests.
A general guideline is that if this value is greater than 50 percent, it represents an I/O bottleneck.

Avg. Disk Reads/Sec
This counter is the rate of read operations on the disk. You need to make sure that this number is less than 85 percent of the disk capacity.
The disk access time increases exponentially beyond 85 percent capacity.

Avg. Disk Writes/Sec
This counetr is the rate of write operations on the disk. Make sure that this number is less than 85 percent of the disk capacity.
The disk access time increases exponentially beyond 85 percent capacity.

Of course, when using the above counters, you need to adjust the values for RAID configurations.
The following formulas.should be respected:

Let’s say, you have a RAID-1 system with two physical disks and you performance counters show the following values:

Disk Reads/sec 80
Disk Writes/sec 70
Avg. Disk Queue Length 5

In that case, you are encountering (80 + (2 * 70))/2 = 110 I/Os per disk and your disk queue length = 5/2 = 2.5
which indicates a border line I/O bottleneck

If the physical disks ar dediacted to the SQL Server databases then you may want to identify the I/O bottlenecks by examining the latch waits.

These latch waits account for the physical I/O waits when a page is accessed for reading or writing and the page is not available in the buffer pool. When the page is not found in the buffer pool, an asynchronous I/O is posted and then the status of the I/O is checked. If I/O has already completed, the worker proceeds normally. Otherwise, it waits on PAGEIOLATCH_EX or PAGEIOLATCH_SH, depending upon the type of request.
The following DMV query can be used to find I/O latch wait statistics.

Select wait_type, waiting_tasks_count, wait_time_ms

from sys.dm_os_wait_stats

where wait_type like 'PAGEIOLATCH%'
order by wait_type

How to resolve the I/O bottleneck?
Once the I/O bottleneck is detected, you can take one or more of the following actions:
1. Check the SQL Server memory configuration if it’s not configured with less memory
Examine the following performance counters to determine if there is a memory pressure or not
Buffer Cache hit ratio
Page Life Expectancy

Checkpoint pages/sec
Lazywrites/sec
2.

Increase I/O bandwidth
_ Add more physical drives to the current disk arrays and/or replace your current disks with faster drives
_ Add faster or additional I/O controllers
_ Examine execution plans and see which plans lead to have more I/O being consumed
The following DMV query can be used to find which batches/requests are generating the most I/O
select top 5
(total_logical_reads/execution_count) as avg_logical_reads,
(total_logical_writes/execution_count) as avg_logical_writes,
(total_physical_reads/execution_count) as avg_phys_reads,
Execution_count,
statement_start_offset as stmt_start_offset,
sql_handle,
plan_handle
from sys.dm_exec_query_stats
order by
(total_logical_reads + total_logical_writes) Desc

Comments