Trying to determine what caused a VM to lose internet connectivity with no config changes

Question

We had a VM randomly lose internet connectivity overnight. There were no configuration changes on our end, and no obvious issues looking through settings in the Azure portal. We got it resolved, but could use some help determining what happened.

This VM is joined to an Active Directory domain consisting of other Azure VMs. The VM is our primary DNS server used for managing publicly hosted DNS for our customers. The VM has a public IP address for communication with 2 other VMs, which are secondary DNS servers used for public DNS responses. These 2 secondary DNS servers are not joined to the domain and are in their own vnet. We use peering & NSG rules to secure traffic between these 2 vnets. Communication for zone transfers happens over the Internet via NSG rules & zone configuration.

All of this has been set up and working great for months, and then this morning we noticed that none of our DNS changes were propagating to the secondary servers. The problem was that our primary public DNS server (the VM in question here) had lost connectivity to the internet and thus could not contact our secondary servers for zone transfers.

After checking and double-checking Windows firewall & Azure NSG rules, we tried redeploying the VM. After the redeploy, internet connectivity was still down.

I have attached a couple of screenshots showing the results from the "Connection troubleshoot" tool in Azure. It reported the next hop as Unreachable, yet in the Details every check shows a green "Healthy" check mark as if it were OK:

[Screenshot: AzDNS1_connection_troubleshoot_edited]

[Screenshot: AzDNS1_connection_troubleshoot_details_edited]

Finally, what worked was creating a new NIC, moving the private & public IPs to the new NIC, then attaching the new NIC to the VM. Once it booted with the new NIC, everything started working again. The new NIC also got a new NSG attached to it with the same rules that the previous NSG had.

Does anyone have any insight as to how this could have happened and why creating a new NIC would have solved it over redeploying the VM? Luckily this issue didn't happen on a more critical piece of infrastructure. We'd like to understand this better so we can prevent it from happening to us again if possible. If not possible, are there any other/better ways to resolve this? We didn't try the "Reapply" option on the VM under "Help" -> "Redeploy + reapply".

Any input would be much appreciated. Thank you for your time reading this as well!

Answer

Just putting a final post on this thread for posterity...

The end result was that this was not found to be an Azure platform-related issue. It looked like an isolated incident with no clear cause. I also think that because we removed the interface & created a new one, it may have been harder for Azure to observe the issue and troubleshoot it.

In an emergency, creating and attaching a new NIC is probably the quickest solution. However, if it becomes a recurring issue, or if it's not on a critical piece of infrastructure, you might want to consider submitting a ticket to Azure while it's happening so they can identify the problem & solve it.

They also recommended using Connection Monitor in Azure Network Watcher to try to catch things like this in the future.
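
As a rough complement to Connection Monitor (not a replacement for it, and not something Azure suggested), a small script run on the primary DNS VM could check whether the secondaries are still reachable on TCP port 53, which is what zone transfers use. This is only a minimal sketch using the Python standard library; the addresses in TARGETS are hypothetical placeholders, and the function names are made up for illustration.

```python
import socket
import time

# Hypothetical targets: replace with the public IPs of the secondary DNS servers.
TARGETS = [("203.0.113.10", 53), ("203.0.113.11", 53)]


def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def probe(targets, timeout=3.0):
    """Check every target once and return {(host, port): reachable}."""
    return {(host, port): check_tcp(host, port, timeout) for host, port in targets}


if __name__ == "__main__":
    # One pass; schedule it (e.g. with Task Scheduler) to run periodically.
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for (host, port), ok in probe(TARGETS).items():
        status = "reachable" if ok else "UNREACHABLE"
        print(f"{stamp} {host}:{port} {status}")
```

A failed check from the primary only tells you that outbound TCP to the secondaries is broken; it does not say whether the cause is the NIC, an NSG, or something on the secondary side, so it is best treated as an early warning that it's time to open a ticket while the problem is still live.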
