Partager via


Unable to move my cluster group from node A to Node B and cluster.log analysis

 

I have a 2 node MSCS quorum based cluster and was unable to move my cluster group from node 17 to node16 manually from cluadmin. So let’s have a look and the very first thing one will do is look at cluster.log and event logs…we at Microsoft will grab a quick cluster  mps reports to see detailed information. So I quickly ran cluster mps on both nodes and that grabbed all the log files in cab file which I can look into now.

Link for cluster mps

https://download.microsoft.com/download/b/b/1/bb139fcb-4aac-4fe5-a579-30b0bd915706/MPSRPT_CLUSTER.EXE

now from my cluster mps reports I quickly outlined a few things that will come handy in my analysis

ffbc99dc-0432-4bc4-89bc-90c5899b99d1----------Cluster IP Address   {IP Address}

c6b92a6f-6190-4ec2-b34f-c9a834c8b8f7----------Disk Q:   {Physical Disk}

3fa17b2e-a365-4c5d-8fde-460c74deaaf6----------Cluster Name   {Network Name}

==========================================================================================

                                                      Cluster Disk Driver Parameters

==========================================================================================

                Available Disk Signatures

                -------------------------

                Current Used Disk Signatures

                ----------------------------

                1. E098B1A3

                2. E098B1A2

                3. C39BA6F5

                4. AF4763FD

                5. 09C073AC

                6. 09C073A8

clip_image002

Cluster.log Node 17

00000468.00000128::2009/03/18-12:19:10.651 INFO [FM] FmpDoMoveGroup: Entry—we moved the cluster group from node 17 to node 16 and whenever we do this we see FmpDoMoveGroup in the cluster.log so you can search for DoMove if you want to check in cluster.log when we moved the group and what happened after that

00000468.00000128::2009/03/18-12:19:10.651 INFO [FM] FmpMoveGroup: Entry

00000468.00000128::2009/03/18-12:19:10.651 INFO [FM] FmpPickNodeFromPreferredListAtRandom: Picking node for group 9d4fae4b-7dba-44f1-992a-0ecf1502e654 [Cluster Group1], suggested node 1...cluster group was on node 2 originally and here node 1 is being suggested as this is a 2 node cluster

00000468.00000128::2009/03/18-12:19:10.651 INFO [FM] FmpPickNodeFromPreferredListAtRandom: Node 1 for group 9d4fae4b-7dba-44f1-992a-0ecf1502e654 is user preferred...this guid belongs to cluster group and as you see ---00000c50.00000c30::2009/01/06-14:06:03.226 OBRENAME "Group" "9d4fae4b-7dba-44f1-992a-0ecf1502e654" "Cluster Group"—I got this information from cluster.oml file

00000468.00000128::2009/03/18-12:19:10.651 INFO [FM] FmpPickNodeFromPreferredListAtRandom: Selected node 1 for group 9d4fae4b-7dba-44f1-992a-0ecf1502e654...

00000468.00000128::2009/03/18-12:19:10.651 INFO [FM] FmpMoveGroup: Moving group 9d4fae4b-7dba-44f1-992a-0ecf1502e654 to node 1 (1)—we are moving cluster group to node 1

00000468.00000128::2009/03/18-12:19:10.651 INFO [FM] FmpOfflineResource: Cluster Name depends on Cluster IP Address. Shut down first.—as cluster IP is dependent on cluster name we need to bring cluster name offline before cluster ip

00000468.00000128::2009/03/18-12:19:10.651 INFO [FM] FmpOfflineResource: Offline resource <Cluster Name> returned pending

00000874.00000c04::2009/03/18-12:19:10.651 INFO [RM] RmpSetResourceStatus, Posting state 3 notification for resource <Cluster Name>

00000468.00000774::2009/03/18-12:19:10.651 INFO [FM] NotifyCallBackRoutine: enqueuing event

00000874.00000c04::2009/03/18-12:19:10.651 INFO Network Name <Cluster Name>: Resource is now offline

00000874.00000b80::2009/03/18-12:19:10.651 INFO IP Address <Cluster IP Address>: Taking resource offline...

00000874.00000b80::2009/03/18-12:19:10.651 INFO IP Address <Cluster IP Address>: Deleting IP interface 4.

00000874.00000b80::2009/03/18-12:19:10.651 INFO IP Address <Cluster IP Address>: Address 172.23.96.221 on adapter Intel(R) PRO/1000 CT Network Connection offline.

00000874.00000b80::2009/03/18-12:19:10.651 INFO IP Address <exchange IP Address>: All resources offline - cleaning up

00000874.00000b90::2009/03/18-12:19:10.651 ERR  IP Address <exchange IP Address>: WorkerThread: GetClusterNotify failed with status 6.

To check what this function GetClusterNotify do let’s have a look at msdn

https://msdn.microsoft.com/en-us/library/aa369623(VS.85).aspx

clip_image004

00000874.000004c4::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: Offline, Dismounting volume \Device\Harddisk1\Partition1.—now we are dismounting the quorum

00000874.00000828::2009/03/18-12:19:10.666 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_DISMOUNT for Q (Partition1) - Received

00000874.00000828::2009/03/18-12:19:10.666 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_DISMOUNT for Q (Partition1) - Processed

00000874.000004c4::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: Offline, Dismount complete, volume \Device\Harddisk1\Partition1.

00000874.000004c4::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: DiskCleanup started.

00000874.000004c4::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: [DiskArb] StopPersistentReservations is called---here cluster node 2 release arbitration on quorum so that other node can reserve it

Persistent Reserve refers to a set of Small Computer Systems Interface-3 (SCSI-3) standard commands and command options which provide SCSI initiators with the ability to establish, preempt, query, and reset a reservation policy with a specified target device. The functionality provided by the Persistent Reserve commands is a superset of the reserve/release commands.

00000874.000004c4::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: [DiskArb] Stopping reservation thread.

00000874.00000a4c::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: [DiskArb] CompletionRoutine, status 0.

00000874.000004c4::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: [ArbCleanup] Verifying sector size.

00000874.000004c4::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: [ArbCleanup] Reading arbitration block.

00000874.000004c4::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: [DiskArb] Successful read  (sector 12)

00000874.000004c4::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: [ArbCleanup] Writing arbitration block.

00000874.000004c4::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: [DiskArb] Successful write (sector 12) [:0] (0,00000000:00000000).

00000874.000004c4::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: [ArbCleanup] Returning status 0.

00000874.000004c4::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: [DiskArb] StopPersistentReservations is complete.

00000874.000004c4::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: DisksDismountDrives: letter mask is 00010000.

00000874.000004c4::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: DiskCleanup returning final error 0

00000874.00000828::2009/03/18-12:19:10.666 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_UNLOCK for Q (Partition1) - Received

00000874.00000828::2009/03/18-12:19:10.666 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_UNLOCK for Q (Partition1) - Processed

00000874.000004c4::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: Offline, Returning final error 0.—error 0 means successfull

00000874.00000828::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: [PnP] Stop watching PnP events for disk 9c073a8—this is the disk signature of quorum drive

00000874.00000828::2009/03/18-12:19:10.666 WARN Physical Disk <Disk Q:>: [PnP] RemoveDisk: WatchedList is empty

00000874.00000828::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: [PnP] Stop watching disk 9c073a8 - processed

00000874.00000130::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: DiskCleanup started.

00000874.00000130::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: [DiskArb] StopPersistentReservations is called.

00000874.00000130::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: [DiskArb] StopPersistentReservations is complete.

00000874.00000130::2009/03/18-12:19:10.666 INFO Physical Disk <Disk Q:>: DiskCleanup returning final error 0

00000468.00000ce4::2009/03/18-12:19:10.666 INFO [CP] CppResourceNotify for resource Disk Q:

00000468.00000ce4::2009/03/18-12:19:10.666 INFO [FM] RmTerminateResource: c6b92a6f-6190-4ec2-b34f-c9a834c8b8f7 is now offline

So here our quorum goes offline for node 2 ---c6b92a6f-6190-4ec2-b34f-c9a834c8b8f7----------Disk Q:   {Physical Disk}

00000468.00000330::2009/03/18-12:19:10.682 INFO [FM] FmpCompleteMoveGroup: Completing the move for group Cluster Group1 to node 1 (1)

00000468.00000330::2009/03/18-12:19:10.682 INFO [FM] FmpCompleteMoveGroup: Take group 9d4fae4b-7dba-44f1-992a-0ecf1502e654 request to remote node 1—now we going to node 1 so that it can take over the cluster group

00000468.00000330::2009/03/18-12:19:10.729 WARN [NM] RpcExtErrorInfo: Error info not found.

00000468.00000330::2009/03/18-12:19:10.729 INFO [FM] FmpCompleteMoveGroup: Remote node asked us to resend take group request for group 9d4fae4b-7dba-44f1-992a-0ecf1502e654 to another node ..

00000468.00000330::2009/03/18-12:19:10.729 INFO [FM] Set membership mask of 0x0 returns status 1

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] Wait for offline thread to complete...

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb]------- DisksArbitrate -------.

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] DisksOpenResourceFileHandle: Attaching to disk with signature 9c073a8

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] DisksOpenResourceFileHandle: Disk unique id present trying new attach

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] DisksOpenResourceFileHandle: Retrieving disk number from ClusDisk registry key

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] DisksOpenResourceFileHandle: Retrieving handle to PhysicalDrive1

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] DisksOpenResourceFileHandle: Returns success.

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] Arbitration Parameters: ArbAttempts 5,  SleepBeforeRetry 500 ms.

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] Read the partition info to insure the disk is accessible.

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] Issuing GetPartInfo on signature 9c073a8.

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 0.

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] Arbitrate for ownership of the disk by reading/writing various disk sectors.

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] Successful read  (sector 12) [:0] (0,00000000:00000000).

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] Successful write (sector 11) [BLR3R07-17:0] (0,be247638:01c9a7c3).

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] Successful read  (sector 12) [:0] (0,00000000:00000000).

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] Successful write (sector 12) [BLR3R07-17:0] (0,be247638:01c9a7c3).

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] Successful read  (sector 11) [BLR3R07-17:0] (0,be247638:01c9a7c3).

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] Issuing Reserve on signature 9c073a8.

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] Reserve completed, status 0.

00000874.000006f4::2009/03/18-12:19:10.729 WARN Physical Disk <Disk Q:>: [DiskArb] Assume ownership of the device.

00000874.00000a4c::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] CompletionRoutine starts.

00000874.000006f4::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] Arbitrate returned status 0.

00000874.00000a4c::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] Posting request to check reserve progress.

00000468.00000330::2009/03/18-12:19:10.729 INFO [FM] FmpNotifyGroupStateChangeReason: Notifying group Cluster Group1 [9d4fae4b-7dba-44f1-992a-0ecf1502e654] of state change reason 3...

00000874.00000a4c::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] ********* IO_PENDING ********** - Request to insure reserves working is now posted.

00000468.00000330::2009/03/18-12:19:10.729 INFO [FM] FmpOnlineResourceList: Previous quorum resource state for c6b92a6f-6190-4ec2-b34f-c9a834c8b8f7 is 2

00000468.00000330::2009/03/18-12:19:10.729 INFO [FM] FmpOnlineResourceList: trying to bring quorum resource c6b92a6f-6190-4ec2-b34f-c9a834c8b8f7 online, state 3

00000874.00000090::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: [DiskArb] DisksOpenResourceFileHandle: Returns success.

00000874.00000090::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: Online, Wait for async cleanup worker thread in ClusDisk to complete.

00000874.00000090::2009/03/18-12:19:10.729 INFO Physical Disk <Disk Q:>: Online, Send Offline IOCTL to all existing volumes, then Online IOCTL.

00000874.00000090::2009/03/18-12:19:10.744 INFO Physical Disk <Disk Q:>: Online, Recreate volume information from cluster database.

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: DiskspCheckPathLite: Volume name [\\?\Volume{57acdc20-dbdb-11dd-a9a5-00123f25504d}\](file:///\\%3f\Volume%7b57acdc20-dbdb-11dd-a9a5-00123f25504d%7d\)

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: DisksMountDrives: calling IsAlive function.

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: DriveIsAlive called for Online check

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: DriveIsAlive checking quorum drive to insure cluster directory accessible.

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: DiskspCheckPath: Open Q:\MSCS\3586de39-46af-4072-9ffc-4c3a32ddf614\00000001.CPT succeeded.

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: DiskspCheckPath: Open Q:\MSCS\chkCD1.tmp succeeded.

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: DiskspCheckPath: Open Q:\MSCS\clusdbb1 succeeded.

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: DiskspCheckPath: Open Q:\MSCS\clusdbb1.LOG succeeded.

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: DiskspCheckPath: Open Q:\MSCS\quolog.log succeeded.

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: DriveIsAlive checking that file system is not corrupt.  If so, chkdsk may run.

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: DisksIsVolumeDirty: Volume is clean

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: DisksMountDrives: letter mask is 00010000.

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: DisksMountDrives: creating admin share names.

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: Online, Insure mount point information is correct.

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>:      Offset                      String

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: ================  ======================================

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: 0000000000007E00  \??\Volume{57acdc20-dbdb-11dd-a9a5-00123f25504d}

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: *** End of list ***

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: SetupVolGuids: Processing VolGuid list

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: Online, Retrieve and validate the disk serial number.

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: Online, Old SerNum (DF600-00A       )   Old SerNumLen (16)

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: Online, New SerNum (DF600-00A       )   New SerNumLen (16)

00000874.00000090::2009/03/18-12:19:10.776 INFO Physical Disk <Disk Q:>: Online, Trying to get Disk unique ids .

00000874.00000090::2009/03/18-12:19:10.791 INFO Physical Disk <Disk Q:>: Online, returning final error 0   ResourceState 2  Valid 1

00000874.00000828::2009/03/18-12:19:10.791 INFO Physical Disk <Disk Q:>: [PnP] Start watching PnP events for disk 9c073a8 - processed

00000874.000006f4::2009/03/18-12:19:11.291 INFO IP Address <Cluster IP Address>: Bringing resource online...

00000468.00000330::2009/03/18-12:19:11.291 INFO [FM] FmpPropagateResourceState: resource ffbc99dc-0432-4bc4-89bc-90c5899b99d1 pending event.

00000468.00000330::2009/03/18-12:19:11.291 INFO [FM] FmpRmOnlineResource: Resource ffbc99dc-0432-4bc4-89bc-90c5899b99d1 pending

00000468.00000330::2009/03/18-12:19:11.291 INFO [FM] FmpRmOnlineResource: Returning. Resource ffbc99dc-0432-4bc4-89bc-90c5899b99d1, state 129, status 997.

00000468.00000330::2009/03/18-12:19:11.291 INFO [FM] FmpOnlineResourceList: Previous resource state for 3fa17b2e-a365-4c5d-8fde-460c74deaaf6 is 2

00000468.00000330::2009/03/18-12:19:11.291 INFO [FM] FmpOnlineResourceList: trying to bring resource 3fa17b2e-a365-4c5d-8fde-460c74deaaf6 online

00000468.00000330::2009/03/18-12:19:11.291 INFO [FM] OnlineResource: 3fa17b2e-a365-4c5d-8fde-460c74deaaf6 depends on ffbc99dc-0432-4bc4-89bc-90c5899b99d1. Bring online first.

00000874.00000d70::2009/03/18-12:19:11.291 INFO IP Address <Cluster IP Address>: Online: Registered notification for netinterface 2a6976f9-af64-4d00-af1c-62381c96b776.

00000874.00000d70::2009/03/18-12:19:13.510 INFO IP Address <Cluster IP Address>: IP Address 172.23.96.221 on adapter Intel(R) PRO/1000 CT Network Connection online

00000874.00000eac::2009/03/18-12:19:13.510 INFO Network Name <Cluster Name>: Bringing resource online...

00000874.00000eac::2009/03/18-12:19:14.369 INFO Network Name <Cluster Name>: Registered server name CLUS157442 on transport \Device\NetBt_If3.

00000874.00000eac::2009/03/18-12:19:14.557 INFO Network Name <Cluster Name>: Registered workstation name CLUS157442 on transport \Device\NetBt_If3.

00000874.00000eac::2009/03/18-12:19:14.557 INFO Network Name <Cluster Name>: Network Name CLUS157442 is now online

Cluster group came back online on node 17 as node 16 was unable to arbitrate the quorum

Now let’s see what we see on node 16

Cluster.log Node 16

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [FM] GUM update group 9d4fae4b-7dba-44f1-992a-0ecf1502e654, state 3—this is cluster group

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [FM] New owner of Group 9d4fae4b-7dba-44f1-992a-0ecf1502e654 is 2, state 3, curstate 0.

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [GUM] s_GumUpdateNode: completed update seq 3362       type 0 context 9

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [GUM] s_GumUpdateNode: dispatching seq 3363     type 0 context 11

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [GUM] s_GumUpdateNode: completed update seq 3363       type 0 context 11

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [GUM] s_GumUpdateNode: dispatching seq 3364     type 0 context 8

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [FM] Gum update resource 3fa17b2e-a365-4c5d-8fde-460c74deaaf6, state 3, current state 2.

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [GUM] s_GumUpdateNode: completed update seq 3364       type 0 context 8

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [GUM] s_GumUpdateNode: dispatching seq 3365     type 1 context 4099

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [DM] DmWriteToQuorumLog Entry Seq#=3365 Type=4099 Size=162

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [DM] DmUpdateDeleteValue

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [DM] DmWriteToQuorumLog Entry Seq#=3365 Type=4099 Size=162

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [GUM] s_GumUpdateNode: completed update seq 3365       type 1 context 4099

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [GUM] s_GumUpdateNode: dispatching seq 3366     type 0 context 8

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [FM] Gum update resource ffbc99dc-0432-4bc4-89bc-90c5899b99d1, state 3, current state 2.

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [GUM] s_GumUpdateNode: completed update seq 3366       type 0 context 8

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [GUM] s_GumUpdateNode: dispatching seq 3367     type 0 context 8

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [FM] Gum update resource c6b92a6f-6190-4ec2-b34f-c9a834c8b8f7, state 130, current state 2.

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [GUM] s_GumUpdateNode: completed update seq 3367       type 0 context 8

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [GUM] s_GumUpdateNode: dispatching seq 3368     type 0 context 9

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [FM] GUM update group 9d4fae4b-7dba-44f1-992a-0ecf1502e654, state 1

00000db8.00000d7c::2009/03/18-12:19:10.568 INFO [FM] New owner of Group 9d4fae4b-7dba-44f1-992a-0ecf1502e654 is 2, state 1, curstate 0.

Here we are updating the state of the resource via GUM update

ffbc99dc-0432-4bc4-89bc-90c5899b99d1----------Cluster IP Address   {IP Address}

c6b92a6f-6190-4ec2-b34f-c9a834c8b8f7----------Disk Q:   {Physical Disk}

3fa17b2e-a365-4c5d-8fde-460c74deaaf6----------Cluster Name   {Network Name}

00000db8.00000d7c::2009/03/18-12:19:10.599 INFO [FM] FmsTakeGroupRequest: To take group '9d4fae4b-7dba-44f1-992a-0ecf1502e654'.

00000db8.00000d7c::2009/03/18-12:19:10.599 INFO [FM] FmpTakeGroupRequest: To take group '9d4fae4b-7dba-44f1-992a-0ecf1502e654'.—node 1 being requested to take cluster group

00000e38.00000ab4::2009/03/18-12:19:10.599 INFO Physical Disk <Disk Q:>: [DiskArb] Wait for offline thread to complete...

00000e38.00000ab4::2009/03/18-12:19:10.599 INFO Physical Disk <Disk Q:>: [DiskArb]------- DisksArbitrate -------.node 1 trying to arbitrate quorum

00000e38.00000ab4::2009/03/18-12:19:10.599 INFO Physical Disk <Disk Q:>: [DiskArb] DisksOpenResourceFileHandle: Attaching to disk with signature 9c073a8

00000e38.00000ab4::2009/03/18-12:19:10.599 INFO Physical Disk <Disk Q:>: [DiskArb] DisksOpenResourceFileHandle: Disk unique id present trying new attach

00000e38.00000ab4::2009/03/18-12:19:10.599 ERR  Physical Disk <Disk Q:>: [DiskArb] Signature of disk has changed or failed to find disk with id, old signature 0x9c073a8 new signature 0x9c073a8, status 2

00000e38.00000ab4::2009/03/18-12:19:10.646 ERR  Physical Disk <Disk Q:>: SCSI: Attach, error attaching to signature 9c073a8, error 2.—err 2 path not valid

00000e38.00000ab4::2009/03/18-12:19:10.646 ERR  Physical Disk <Disk Q:>: Arbitrate: Unable to attach to signature 9c073a8. Error: 2.

00000db8.00000d7c::2009/03/18-12:19:10.646 INFO [MM] MmSetQuorumOwner(0,0), old owner 1.

00000db8.00000d7c::2009/03/18-12:19:10.646 INFO [FM] FmpTakeGroupRequest: MM did not select local node 1 as the arbitration winner, Status 2

We were not able to arbitrate quorum on node 1 and we failed due to reservation

00000db8.00000d7c::2009/03/18-12:19:10.646 INFO [FM] FmpTakeGroupRequest: MM did not select local node 1 as the arbitration winner, Status 2

00000db8.00000d7c::2009/03/18-12:19:10.646 INFO [FM] FmpTakeGroupRequest: Exit for group <9d4fae4b-7dba-44f1-992a-0ecf1502e654>, Status = 1237...

What happened here..aah we see some time skew on both nodes as seen in cluster.log…which is as we see following event logs

Type: Error

Date: 03/18/2009

Time: 4:59:03 PM

Event ID: 29

Source: W32Time

User: N/A

Type: Information

Date: 03/18/2009

Time: 4:59:04 PM

Event ID: 37

Source: W32Time

Node 16

https://support.microsoft.com/kb/875424

https://support.microsoft.com/kb/830092

 

lets search on support.microsoft.com for any known issue..found these 2 kb articles..however none of them applies as I am already on sp2 and we are not using local quorum feature of cluster. Then what’s the issue…we are sure it is something to do on storage side on quorum.

911030  A cluster node failover does not work when you use SCSI-3-compliant persistent reservations in Windows Server 2003 SP1

https://support.microsoft.com/default.aspx?scid=kb;EN-US;911030

888160  Cluster is formed by using a local quorum resource after a cluster setup failure in Windows Server 2003

https://support.microsoft.com/default.aspx?scid=kb;EN-US;888160

I went ahead and stopped the cluster service on node 17 and put clusdisk driver on demand in device manager ..so now quorum is not under the control of node 17 but quorum still may have reservation on it which is not getting cleared somehow. Node 16 was still not able to arbitrate quorum and I tried restarting cluster service on node 16 twice but no help. I went back to node 17 ..restored the cluster service and clusdisk driver on demand ….we got our failed cluster online again.

I am now more convinced that this is a storage issue …how to fix my cluster now

One way I can go on storage and see for issues and troubleshoot there..but I dint want to go down to lab where our underlying san storage is present..so I will just  change the destination of our quorum from Q:\ to k:\

I stopped the cluster service on node 16 from cluadmin.  I had drive K in another group 1  and I moved it to cluster group…..right clicked on cluster name and selected drive k instead of Q for quorum. Restarted cluster service on node 16 and both nodes are up….now I initiated a move group command for cluster group and Bingo!! It moved fine to node 16 . So we know for sure we were having reservation issue on quorum originally and node 16 was unable to clear that reservation.

Server 2003 Cluster service uses Reserve/Release SCSI SPC-2 reservations. Problems with reservations may cause problems with the Cluster service's ability to bring a physical disk resource online. In our case looks like quorum has an active persistent reservation even when node 17 was rebooted  and that’s why node 16 was unable to clear it. The Cluster service does not manage persistent reservations. Therefore, the Cluster service cannot directly release or manage a persistent reservation.

If you experience a problem with a persistent reservation, you should contact the storage vendor or SAN administrator to help determine whether a problem exists. Generally, storage vendors have tools that you can use to help identify and change the properties of the storage objects. These tools include a tool to change reservations.

 

The information provided here is "AS IS"

Comments

  • Anonymous
    November 06, 2014
    Excellent !!!