Getting 31552 ->Exception 'SqlException': Timeout expired very frequently in SCOM server
In the SCOM server we might start Getting 31552 ->Exception 'SqlException': Timeout expired very frequently and We will also see that the SQL server hosting the database role of the SCOM server going high on utilization of resources and remaining for a long time.
Event Type: Error
Event Source: Health Service Modules
Event Category: Data Warehouse
Event ID: 31552
Date: 7/20/2009
Time: 1:33:04 AM
User: N/A
Computer: <RMS SERVER NAME>
Description:
Failed to store data in the Data Warehouse.
Exception 'SqlException': Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
One or more workflows were affected by this.
Workflow name: Microsoft.SystemCenter.DataWarehouse.StandardDataSetMaintenance
Instance name: Client Monitoring data set
Instance ID: {ED9228A4-9CAD-5782-CA20-7067D119EAFA}
Management group: <Management GROUPNAME>
For more information, see Help and Support Center at <https://go.microsoft.com/fwlink/events.asp>.
This can happen if the alert are not processed and is on the staging table. In order to overcome this issue we can do the following.
IMP Note: As there are database changes make sure that you have a valid database backup before you go ahead and perform these actions.
Run the following quries
select count(*) from Alert.AlertStage
select count (*) from Event.eventstage
select count (*) from Perf.PerformanceStage
select count (*) from state.statestage
The count on these tables will have count increasing and decresing. IF you find that any particular table is having huge count in lakhs like the one I had for the alert stage table. Follow the following.
In my case I had around 16 lakh + rows in alert stage table which will never get processed. So what I did is backup of the table and cleared the backlogs. For the same used the following
SELECT count(*) from ALert.AlertStage
SELECT * INTO ALert.AlertStage_backup FROM ALert.AlertStage
SELECT count(*) from ALert.AlertStage_backup
TRUNCATE TABLE ALert.AlertStage
SELECT count(*) from ALert.AlertStage
Then to make sure that the alert data is processing we ran the query
exec StandardDatasetMaintenance @DatasetId=' 6AF799BD-2CCC-41CF-97FD-058E2CBF9248'. This GUID was available for the alert data when we ran the query select * from StandardDataset
Query Result.
============
6AF799BD-2CCC-41CF-97FD-058E2CBF9248 Alert 0 0 1 AlertProcessStaging 01:00 240 2009-07-20 11:17:19.690
6F9FBA1B-EAE3-4071-AFA9-2101A81AA463 Event 0 0 1 EventProcessStaging 01:00 240 2009-07-20 11:17:16.033
06268CB1-F9C4-4195-AA67-7BB2EFD5224D State 0 48 1 StateProcessStaging 01:00 240 2009-07-20 11:16:29.797
7EAE1679-17F5-4BA6-B685-9867A88C00E1 CM 0 28 1 AemProcessStaging 00:00 240 2009-07-20 11:17:25.017
C7E7D06C-6D00-46F2-819D-CB7D206EC361 Perf 0 48 1 PerformanceProcessStaging 01:00 240 2009-07-20 11:17:25.203
If we have any error when we run exec StandardDatasetMaintenance @DatasetId=' 6AF799BD-2CCC-41CF-97FD-058E2CBF9248'. Then need to troubleshoot on that error which the results in the query windows displays. Else we can try running the count query (SELECT count(*) from ALert.AlertStage) and make sure that the data is coming in and going out as well.
While running the exec StandardDatasetMaintenance @DatasetId=' 6AF799BD-2CCC-41CF-97FD-058E2CBF9248', check if we are getting errors . If we have are getting error
“Sql execution failed. Error 8115, Level 16, State 1, Procedure -, Line 1, Message: Arithmetic overflow error converting IDENTITY to data type int.”
Then follow these steps of backing up the alert staging table
Modify the IDENTITY values for the Alert_GUID table by the running the following query, since it was well beyond 800 million
For correcting this
Select max(AlertRowId) from Alert_GUID
This will give you a value . Please note this value.
DBCC CHECKIDENT(‘Alert.Alert_GUID’,RESEED,<maximum value of the result in the previous query + 1>)
Again try to run the stored procedure and ensure its success.
Once the same is done you can see that the CPU utilization will also come down as well as the frequency of 31552 errors.Now if you want to put back this data you can do it bit by bit using the following quires.
WHILE (SELECT COUNT(*) FROM ALert.AlertStage_backup) > 0
BEGIN
INSERT TOP(1000) INTO ALert.AlertStage
([AlertGuid],[AlertProblemGuid],[ManagedEntityRowId]
,[AlertName],[AlertDescription],[Severity],[Priority]
,[Category],[WorkflowRowId],[MonitorAlertInd],[DateTime]
,[RaisedDateTime],[SiteName],[RepeatCount] ,[AlertStringGuid]
,[ParameterHash],[DBCreatedDateTime],[DWCreatedDateTime],[DWLastModifiedDateTime])
SELECT
[AlertGuid],[AlertProblemGuid],[ManagedEntityRowId]
,[AlertName],[AlertDescription],[Severity],[Priority]
,[Category],[WorkflowRowId],[MonitorAlertInd],[DateTime]
,[RaisedDateTime],[SiteName],[RepeatCount] ,[AlertStringGuid]
,[ParameterHash],[DBCreatedDateTime],[DWCreatedDateTime],[DWLastModifiedDateTime]
FROM ALert.AlertStage_backup
DELETE TOP(1000) FROM ALert.AlertStage_backup
This will take a huge time depending on the number of rows.Once processed please drop table FROM ALert.AlertStage_backup.
What this will do it will put 1000 alerts at a time to the backup table to the staging table to process. And the completion of this will take time depending on the number of backlogs.
If alert data is not so important for you can drop this table ALert.AlertStage_backup.
The same can be checked for other (event, performance) staging tables as well.
Comments
- Anonymous
July 22, 2014
Excellent KB. - Anonymous
August 12, 2015
I am not finding Alert_GUID table in Datawarehouse. Could you please let me know where I need to run this query.