Several times now we have had the issue that certain endpoints of our app (/login
among them) seem to hang indefinitely until the request is terminated with a 499 or 504.
So far only one app instance has been affected at a time. Various GET and POST endpoints work, but a handful of endpoints consistently don't. We can't find a pattern that connects the affected endpoints. The other app instances work without fault, and so does the affected one, aside from those few endpoints. Restarting the affected app instance solves the problem. It has reappeared about three times by now, possibly triggered by deployments of new versions of our app or by an instance restart.
The instances aren't under heavy load in any metric we checked, and strangely we don't get any backend telemetry via AppInsights for the affected endpoints. We run AppInsights in the browser as well, and that is what provides us with the 499 and 504 logs, but the backend (C#) instance doesn't log any request, dependency or exception telemetry for the affected endpoints (which is why we suspect the requests may not reach our app). We also don't see any long request durations in either AppInsights or the Metrics of App Service itself.
One special aspect of our setup is that we use session affinity, due to the design of our app. So once a client is connected to a certain instance, it stays there and is consistently affected by the problem.
We've been using App Service for years now and have never had an issue like this, and we are out of ideas as to what the problem could be. One guess is that the internal load balancer of App Service somehow drops the requests (since they don't appear in any of our logs), but we're just grasping at straws here. We're running Traffic Manager but no other load balancing service.
We're running an App Service plan (Premium V2).
While the upcoming deprecation of App Service Environments v1/2 should not affect us, could there be some internal hiccups that may affect us? Or does anyone have any idea what could be the culprit?
Our app code could be to blame too, of course, but since it's always just one of several instances that is affected and there is no common dependency (service or code) that's only used by the affected endpoints, we suspect it's something else.
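
To narrow down whether the requests reach the affected instance at all, one option is to probe that instance from outside. Below is a minimal Python sketch of that idea, not something we have verified against the actual incident. It assumes the instance can be pinned by sending the ARRAffinity cookie that App Service sets for session affinity (the value would be copied from an affected client's browser). The function name `probe`, the `opener` parameter and the URL and cookie values in the usage note are placeholders for illustration only.

```python
import socket
import time
import urllib.error
import urllib.request


def probe(base_url, path, affinity_cookie, timeout=10.0,
          opener=urllib.request.urlopen):
    """Request one endpoint pinned to a single instance.

    Returns (outcome, seconds), where outcome is "status <code>",
    "hang" (no answer within `timeout`) or "error <reason>".
    `opener` is injectable so the function can be exercised offline.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        # Assumption: the ARRAffinity cookie routes us to the same
        # instance that the affected client is stuck on.
        headers={"Cookie": f"ARRAffinity={affinity_cookie}"},
    )
    start = time.monotonic()
    try:
        with opener(request, timeout=timeout) as response:
            outcome = f"status {response.status}"
    except urllib.error.HTTPError as exc:
        # 4xx/5xx still means something answered the request.
        outcome = f"status {exc.code}"
    except urllib.error.URLError as exc:
        # A timeout while connecting is wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            outcome = "hang"
        else:
            outcome = f"error {exc.reason}"
    except (socket.timeout, TimeoutError):
        # A timeout while waiting for the response is raised directly.
        outcome = "hang"
    return outcome, time.monotonic() - start
```

Calling it for a working and a failing endpoint with the same cookie, for example `probe("https://example.invalid", "/login", "<cookie value>")`, would show whether the hang is reproducible outside the browser and can be compared with the timestamps in the App Service HTTP logs.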