I have a Service Fabric cluster with 5 nodes and 1 application type. There are 3-6 deployments of applications of this type per week (via an Azure DevOps pipeline), and at some point a system partition or some of the nodes start to fail. Sometimes one of the nodes goes down and recovers automatically after some time.
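When this happens, I check the cluster health from one of the nodes; a minimal sketch using the standard Service Fabric PowerShell module (Connect-ServiceFabricCluster with no arguments attaches to the local cluster):

# Run on a cluster node; attaches to the local cluster
Connect-ServiceFabricCluster
Get-ServiceFabricClusterHealth | Select-Object AggregatedHealthState, UnhealthyEvaluations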
For example, fabric:/System/FailoverManagerService shows up in the Health Evaluation messages with:
'System.FMM' reported Error for property 'State'.
Partition is below target replica or instance count.
FMService (TargetReplicaSetSize 5, MinReplicaSetSize 3), partition 00000000-0000-0000-0000-000000000001:

Role  Status  Node      Id
N/I   Down    _nt1vm_3  133772506869578253
N/S   Ready   _nt1vm_2  133772506869578254
N/P   Ready   _nt1vm_1  133772506869578255
N/S   Down    _nt1vm_0  133772506856221346
For more information see: https://aka.ms/sfhealth
Partition '00000000-0000-0000-0000-000000000001' is in Error.
100% (1/1) partitions are unhealthy. The evaluation tolerates 0% unhealthy partitions per service, or 0 partitions calculated using ceiling.
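If it helps, the replica listing above can be reproduced with something like this (same module; I believe piping partitions into Get-ServiceFabricReplica works, but treat it as a sketch):

# List the replicas of the FailoverManagerService partition
Get-ServiceFabricPartition -ServiceName "fabric:/System/FailoverManagerService" |
    Get-ServiceFabricReplica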
For the last few days one of the nodes has been Down and it has not been possible to restart it.
Getting the node's health returns:
NodeName : _nt1vm_4
AggregatedHealthState : Error
UnhealthyEvaluations :
'System.FM' reported Error for property 'State'.
Fabric node is down. For more information see: http://aka.ms/sfhealth
HealthEvents :
SourceId : System.FM
Property : State
HealthState : Error
SequenceNumber : 1096
SentAt : 03.12.2024 13:01:34
ReceivedAt : 03.12.2024 13:01:40
TTL : Infinite
Description : Fabric node is down. For more information see: http://aka.ms/sfhealth
RemoveWhenExpired : False
IsExpired : False
HealthReportID : FM_7.0_1013
Transitions : Ok->Error = 03.12.2024 13:01:40, LastWarning = 01.01.0001 00:00:00
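For reference, that output comes from:

Get-ServiceFabricNodeHealth -NodeName "_nt1vm_4"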
Restarting the node returns:
Restart-ServiceFabricNode : RestartNode for _nt1vm_4 did not complete in alloted time
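The invocation is roughly this (the completion mode and timeout values are from memory, so take it as a sketch; NodeInstanceId 0 should match whatever the current node instance is):

Restart-ServiceFabricNode -NodeName "_nt1vm_4" -NodeInstanceId 0 -CommandCompletionMode Verify -TimeoutSec 600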
Using Application Insights, I found these logs from 5 days ago:
EventName: NodeOpenFailed Category: StateTransition EventInstanceId fe7bb167-28d1-43b8-ae11-d05895d6b26d NodeName _nt1vm_4 Node has failed to open with upgrade domain: 4, fault domain: fd:/4, address: 10.0.0.8, hostname: nt1vm000004, isSeedNode: true, versionInstance: 10.1.2493.9590:5, id: 2a8d0b3d9f49b2af32e8037eb951483, dca instance: 133773008383406003, error: 2147942432
EventName: NodeOpenFailed Category: StateTransition EventInstanceId 2e1fae0b-8ae2-452c-8ec9-61c7d6fb2ba0 NodeName _nt1vm_4 Node has failed to open with upgrade domain: 4, fault domain: fd:/4, address: 10.0.0.8, hostname: nt1vm000004, isSeedNode: true, versionInstance: 10.1.2493.9590:5, id: 2a8d0b3d9f49b2af32e8037eb951483, dca instance: 133773011389202219, error: 2147942432
EventName: NodeAborted Category: StateTransition EventInstanceId ec887455-9d1e-4a3b-b5bf-ed8a55e6fde2 NodeName _nt1vm_4 Node has aborted with upgrade domain: 4, fault domain: fd:/4, address: 10.0.0.8, hostname: nt1vm000004, isSeedNode: true, versionInstance: 10.1.2493.9590:5, id: 2a8d0b3d9f49b2af32e8037eb951483, dca instance: 133773011389202219
and since then NodeAborted has been logged regularly.
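Decoding the error code from the NodeOpenFailed events (plain .NET, nothing Service Fabric specific) points at a sharing violation, if I read it correctly:

'0x{0:X8}' -f 2147942432                              # -> 0x80070020
[System.ComponentModel.Win32Exception]::new(0x20).Message
# -> The process cannot access the file because it is being used by another process

So it looks as if something on the node keeps a file locked while Fabric tries to open, but I cannot tell what.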
The only workaround that has helped so far is restarting the VMSS (see the sketch below). It helps for some time, but the issue with the nodes/system partition comes back after a while. I would be grateful for any hints on how to avoid this situation and how to investigate the root cause.
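For completeness, the restart is just this (Az PowerShell; the resource group name is a placeholder, and I assume the scale set is the one backing the nt1vm node type):

Restart-AzVmss -ResourceGroupName "my-sf-rg" -VMScaleSetName "nt1vm"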
Thanks in advance for your support.