Intermittent connection issue with Azure S2S VPN

Ghulam Abbas 211 Reputation points
2022-12-13T10:06:58.993+00:00

Hi,

We configured S2S VPN between two of our datacenters and Azure nearly 8 weeks ago, tested it, and everything connected fine. This is a new setup: we are planning to migrate all our on-prem infrastructure into Azure, and this was done in preparation for that migration. For the last 4 to 5 weeks, we have noticed that the connection between one of the sites and the Azure VPN is not stable and keeps disconnecting. We ran some PsPing tests between a couple of our on-prem servers and Azure test VMs and can see packet drops. The connection between the second site and Azure is stable, with no packet drops.

We logged this with our ISP, as they manage our on-prem firewall and VPN device. The on-prem VPN devices in both of our DCs are Cisco Firepower 1140 NGFW appliances. Our ISP says the issue might be on the Microsoft side, but we are not convinced. In the Azure portal, we can run resource health checks for connectivity and performance on both of these connections, and we see some errors:

1: RemoteCrashTriggered disconnection

2: Packet drop is detected

3: Incoming IKE connection was ignored because no matching tunnel was found

We can see several "VPN unavailable" health events in a single day. They occur at a regular interval and resolve themselves after some time.

We have also reviewed several web articles, including the ones recommended by Microsoft (such as https://learn.microsoft.com/en-us/azure/vpn-gateway/vpn-gateway-troubleshoot-site-to-site-disconnected-intermittently). Can you please advise what the root cause could be, and what we can ask our ISP to check in the on-prem VPN device configuration to diagnose the issue? It's worth mentioning that the connection to the other site, which has almost the same configuration and the same VPN device, has no issues.

Many thanks.

Azure VPN Gateway
An Azure service that enables the connection of on-premises networks to Azure through site-to-site virtual private networks.

Accepted answer
  1. GitaraniSharma-MSFT 49,666 Reputation points Microsoft Employee
    2022-12-13T14:00:29.343+00:00

    Hello @Ghulam Abbas ,

    Welcome to Microsoft Q&A Platform. Thank you for reaching out & hope you are doing well.

    I understand that you are facing an intermittent connection issue with an Azure S2S VPN connected to a Cisco Firepower 1140 NGFW appliance on-premises, and that you see the following errors on the Azure end: "RemoteCrashTriggered disconnection", "Packet drop is detected", and "Incoming IKE connection was ignored because no matching tunnel was found". You would like to know what the root cause could be and what you can ask your ISP to check in the on-prem VPN device configuration to diagnose the issue.

    Below is the explanation for the "RemoteCrashTriggered" error:

    Sometimes the on-prem device may decide to re-connect the VPN tunnel. In such cases, the on-prem device may not delete the active tunnel first, and may instead just send a new request to establish a new Main Mode.

    The assumption is that, for some reason, the VPN peer device crashed or completely lost its connection state, so it can't honor the already established Main Mode anymore (and has no means to delete it). This is legitimate behavior.
    When Azure VPN Gateway receives a new request from a VPN peer, it will honor it. The new Main Mode will be established and a new tunnel ID will be assigned. Then, the old Main Mode will be deleted.

    When such a behavior happens, we define it as RemoteCrashTriggered in our logs.

    The above behavior is NOT an issue and is allowed by RFC. This only becomes an issue when, after creating the new Main Mode and deleting the old one, the on-prem device doesn't use the new Main Mode to send traffic.

    Either way, in any scenario of failure post RemoteCrashTriggered, the issue is on the on-prem side and the on-premises VPN device needs to be inspected to understand why it originally asked for a new Main Mode without deleting the old one.

    You mentioned that the issue happens at a regular interval, so you should be able to predict when it will happen next and enable debug logging on the on-prem side beforehand. The logs should show whether the issue occurs after the Main Mode is established, and whether the on-prem device fails to perform a Quick Mode rekey, which then triggers it to send a new Main Mode request, or whether some other failure causes it to send a new MM request.
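
As a rough illustration of the prediction step, here is a minimal Python sketch that estimates the typical interval between disconnects and the next expected one. The timestamps are hypothetical placeholders, not values from your environment, and `predict_next_disconnect` is an illustrative helper name; you would substitute the times of the "VPN unavailable" health events you see in the portal.

```python
from datetime import datetime, timedelta
from statistics import median


def predict_next_disconnect(timestamps):
    """Estimate the typical gap between disconnects and the next expected one.

    `timestamps` is a list of ISO 8601 strings. The median gap is used so
    that a single irregular event does not skew the estimate.
    """
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    if len(times) < 2:
        raise ValueError("need at least two disconnect timestamps")
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
    interval = timedelta(seconds=median(gaps))
    return interval, times[-1] + interval


# Hypothetical disconnect times (assumption, for illustration only).
events = [
    "2022-12-13T01:00:00",
    "2022-12-13T09:00:00",
    "2022-12-13T17:00:00",
]

interval, next_time = predict_next_disconnect(events)
print(f"typical interval: {interval}, next expected around: {next_time}")
```

If the interval you measure lines up with an IKE or IPsec SA lifetime configured on the firewall, that would be consistent with the rekey failure described above, and it would be worth having the debug logging running shortly before the predicted time.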

    Kindly let us know if the above helps or you need further assistance on this issue.

    ----------------------------------------------------------------------------------------------------------------

    Please "Accept the answer" if the information helped you. This will help us and others in the community as well.

