We have a standalone Service Fabric installation hosted on a cluster of 9 machines (5 management nodes, 2 frontend nodes, and 2 backend nodes).
While upgrading an application in Monitored upgrade mode, an error occurred that triggered a rollback. The rollback is now stuck and has been for almost 24 hours. How can we recover from this state?
Our Service Fabric runtime version is 7.0.470.9590.
This is the output of Get-ServiceFabricApplicationUpgrade for the affected application:
ApplicationName               : fabric:/XXXXXXXX.ServiceFabric
ApplicationTypeName           : XXXXXXXX.ServiceFabricType
TargetApplicationTypeVersion  : 1.0.0.20200611.2
ApplicationParameters         : { "XXXXXXX_InstanceCount" = "-1";
                                  "XXXXXXX_NodeType" = "(NodeType == BackEndNodeType)" }
StartTimestampUtc             : 2020-06-24 11:36:29
FailureTimestampUtc           : 2020-06-24 11:47:30
FailureReason                 : UpgradeDomainTimeout
UpgradeState                  : RollingBackInProgress
UpgradeDuration               : 00:11:00
CurrentUpgradeDomainDuration  : 00:00:00
NextUpgradeDomain             : UD6
UpgradeDomainsStatus          : { "UD0" = "Completed";
                                  "UD1" = "Completed";
                                  "UD2" = "Completed";
                                  "UD3" = "Completed";
                                  "UD4" = "Completed";
                                  "UD5" = "Completed";
                                  "UD6" = "Pending";
                                  "UD7" = "Pending";
                                  "UD8" = "Pending" }
UpgradeKind                   : Rolling
RollingUpgradeMode            : UnmonitoredAuto
ForceRestart                  : True
UpgradeReplicaSetCheckTimeout : 00:20:00
The following event log entries are from the machine in UD5 (which the output above reports as Completed); they repeat roughly every minute.
Canceled pending requests for storeRelativePath:Store\XXXXXXXXXX.ServiceFabricType\XXXXXXXXXPkg.Code.1.0.0.20200624.1.checksum sessionId:16d82b1f-03f9-435f-abc6-e4d98deb5f81 count:1
Chunk download reply received for storeRelativePath:Store\XXXXXXXXXXX.ServiceFabricType\XXXXXXXXXPkg.Code.1.0.0.20200624.1.checksum sessionId:16d82b1f-03f9-435f-abc6-e4d98deb5f81 sequenceNumber:0 error:FABRIC_E_CANNOT_CONNECT retryCount:1
Redownload file chunks attempted 5 tries; failing the redownload operation. Number of chunks downloaded:0 remaining:1, storeRelativePath:Store\XXXXXXXX.ServiceFabricType\XXXXXXXXPkg.Code.1.0.0.20200624.1.checksum sessionId:16d82b1f-03f9-435f-abc6-e4d98deb5f81
End(BeginDownloadAndActivate): Error=HostingDeploymentInProgress, VersionedServiceTypeId={XXXXXXXXXXXX.ServiceFabricType_App42:XXXXXXXXXXXPkg:XXXXXXXXServiceType,1.0:1.37:131619518549526910}, ActivationContext=7fe20d1e-9054-4a5f-9f98-f0e8bff56d37, ServicePackagePublicActivationId=827f43f5-2cfb-4df0-bdaa-3b7b2c1568a4, SequenceNumber=88
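
In case it helps anyone who wants more detail, the state above could be inspected further with something like the following. This is only a minimal sketch, not a fix: it assumes the commands are run on a cluster node with the Service Fabric PowerShell module available, that Connect-ServiceFabricCluster with no arguments can reach the local cluster endpoint, and that the cluster uses the built-in Image Store Service (a standalone cluster configured with a file-share image store will not have that system service).

```powershell
# Assumption: run locally on a cluster node; with no arguments this
# connects to the default local client endpoint.
Connect-ServiceFabricCluster

# Redacted application name, copied from the output above.
$appName = 'fabric:/XXXXXXXX.ServiceFabric'

# Current upgrade/rollback progress for the application.
Get-ServiceFabricApplicationUpgrade -ApplicationName $appName

# Aggregated health of the application, including unhealthy evaluations.
Get-ServiceFabricApplicationHealth -ApplicationName $appName

# Node overview: which nodes sit in which upgrade domain and whether they are up.
Get-ServiceFabricNode |
    Format-Table NodeName, NodeType, UpgradeDomain, NodeStatus, HealthState

# Assumption: the cluster uses fabric:ImageStore. The log lines show
# FABRIC_E_CANNOT_CONNECT while downloading from the store, so the health
# of the Image Store Service is worth a look.
Get-ServiceFabricServiceHealth -ServiceName 'fabric:/System/ImageStoreService'
```

None of these commands change cluster state; they only read the upgrade, health, and node information.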