I have a Service Fabric cluster with 5 nodes and 1 application type. There are 3-6 deployments of applications of this type per week (via an Azure DevOps pipeline), and at some point a system partition or some of the nodes start to fail. Sometimes one of the nodes goes down and recovers automatically after some time.
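When this happens, I check the cluster health from one of the nodes; a minimal sketch using the standard Service Fabric PowerShell module (Connect-ServiceFabricCluster with no arguments attaches to the local cluster):

# Run on a cluster node; attaches to the local cluster
Connect-ServiceFabricCluster
Get-ServiceFabricClusterHealth | Select-Object AggregatedHealthState, UnhealthyEvaluations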
For example, fabric:/System/FailoverManagerService shows up in the Health Evaluation messages with:
'System.FMM' reported Error for property 'State'.
Partition is below target replica or instance count.
FMService (TargetReplicaSetSize 5, MinReplicaSetSize 3), partition 00000000-0000-0000-0000-000000000001:

Role  Status  Node      Id
N/I   Down    _nt1vm_3  133772506869578253
N/S   Ready   _nt1vm_2  133772506869578254
N/P   Ready   _nt1vm_1  133772506869578255
N/S   Down    _nt1vm_0  133772506856221346
For more information see: https://aka.ms/sfhealth
Partition '00000000-0000-0000-0000-000000000001' is in Error.
100% (1/1) partitions are unhealthy. The evaluation tolerates 0% unhealthy partitions per service, or 0 partitions calculated using ceiling.
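If it helps, the replica listing above can be reproduced with something like this (same module; I believe piping partitions into Get-ServiceFabricReplica works, but treat it as a sketch):

# List the replicas of the FailoverManagerService partition
Get-ServiceFabricPartition -ServiceName "fabric:/System/FailoverManagerService" |
    Get-ServiceFabricReplica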
For the last few days one of the nodes has been Down and it has not been possible to restart it.
Getting the node's health returns:
NodeName : _nt1vm_4
AggregatedHealthState : Error
UnhealthyEvaluations :
'System.FM' reported Error for property 'State'.
Fabric node is down. For more information see: http://aka.ms/sfhealth
HealthEvents :
SourceId : System.FM
Property : State
HealthState : Error
SequenceNumber : 1096
SentAt : 03.12.2024 13:01:34
ReceivedAt : 03.12.2024 13:01:40
TTL : Infinite
Description : Fabric node is down. For more information see: http://aka.ms/sfhealth
RemoveWhenExpired : False
IsExpired : False
HealthReportID : FM_7.0_1013
Transitions : Ok->Error = 03.12.2024 13:01:40, LastWarning = 01.01.0001 00:00:00
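For reference, that output comes from:

Get-ServiceFabricNodeHealth -NodeName "_nt1vm_4"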
Restarting the node returns:
Restart-ServiceFabricNode : RestartNode for _nt1vm_4 did not complete in alloted time
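The invocation is roughly this (the completion mode and timeout values are from memory, so take it as a sketch; NodeInstanceId 0 should match whatever the current node instance is):

Restart-ServiceFabricNode -NodeName "_nt1vm_4" -NodeInstanceId 0 -CommandCompletionMode Verify -TimeoutSec 600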
Using Application Insights, I found these logs from 5 days ago:
EventName: NodeOpenFailed Category: StateTransition EventInstanceId fe7bb167-28d1-43b8-ae11-d05895d6b26d NodeName _nt1vm_4 Node has failed to open with upgrade domain: 4, fault domain: fd:/4, address: 10.0.0.8, hostname: nt1vm000004, isSeedNode: true, versionInstance: 10.1.2493.9590:5, id: 2a8d0b3d9f49b2af32e8037eb951483, dca instance: 133773008383406003, error: 2147942432
EventName: NodeOpenFailed Category: StateTransition EventInstanceId 2e1fae0b-8ae2-452c-8ec9-61c7d6fb2ba0 NodeName _nt1vm_4 Node has failed to open with upgrade domain: 4, fault domain: fd:/4, address: 10.0.0.8, hostname: nt1vm000004, isSeedNode: true, versionInstance: 10.1.2493.9590:5, id: 2a8d0b3d9f49b2af32e8037eb951483, dca instance: 133773011389202219, error: 2147942432
EventName: NodeAborted Category: StateTransition EventInstanceId ec887455-9d1e-4a3b-b5bf-ed8a55e6fde2 NodeName _nt1vm_4 Node has aborted with upgrade domain: 4, fault domain: fd:/4, address: 10.0.0.8, hostname: nt1vm000004, isSeedNode: true, versionInstance: 10.1.2493.9590:5, id: 2a8d0b3d9f49b2af32e8037eb951483, dca instance: 133773011389202219
and since then NodeAborted has been logged regularly.
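Decoding the error code from the NodeOpenFailed events (plain .NET, nothing Service Fabric specific) points at a sharing violation, if I read it correctly:

'0x{0:X8}' -f 2147942432                              # -> 0x80070020
[System.ComponentModel.Win32Exception]::new(0x20).Message
# -> The process cannot access the file because it is being used by another process

So it looks as if something on the node keeps a file locked while Fabric tries to open, but I cannot tell what.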
The only workaround that has helped so far is restarting the VMSS (see the sketch below). It helps for some time, but the issue with the nodes/system partition comes back after a while. I would be grateful for any hints on how to avoid this situation and how to investigate the root cause.
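For completeness, the restart is just this (Az PowerShell; the resource group name is a placeholder, and I assume the scale set is the one backing the nt1vm node type):

Restart-AzVmss -ResourceGroupName "my-sf-rg" -VMScaleSetName "nt1vm"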
Thanks in advance for your support.