One of our VMs unexpectedly lost internet connectivity overnight. We made no configuration changes on our end, and we found no obvious issues when looking through its settings in the Azure portal. We got it resolved, but could use some help determining what happened.
This VM is joined to an Active Directory domain consisting of other Azure VMs. The VM is our primary DNS server used for managing publicly hosted DNS for our customers. The VM has a public IP address for communication with 2 other VMs, which are secondary DNS servers used for public DNS responses. These 2 secondary DNS servers are not joined to the domain and are in their own vnet. We use peering & NSG rules to secure traffic between these 2 vnets. Communication for zone transfers happens over the Internet via NSG rules & zone configuration.
All of this has been set up and working great for months, but this morning we noticed that none of our DNS changes were propagating to the secondary servers. The cause was that our primary public DNS server (the VM in question here) had lost connectivity to the internet and thus could not contact our secondary servers for zone transfers.
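For anyone who wants to reproduce the symptom, here is a minimal sketch of the kind of outbound check that can be run from the primary. It assumes zone transfers use TCP port 53, and the addresses are documentation placeholders rather than our real secondaries. The names can_reach and check_secondaries are just illustrative helpers, not part of any Azure tooling.

```python
import socket

# Hypothetical placeholder addresses (TEST-NET-3 range); replace with the
# public IPs of the real secondary DNS servers.
SECONDARIES = ["203.0.113.10", "203.0.113.11"]


def can_reach(host: str, port: int = 53, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def check_secondaries(hosts=SECONDARIES, port: int = 53) -> dict:
    """Map each host to whether a TCP connection to the given port worked."""
    return {host: can_reach(host, port) for host in hosts}


if __name__ == "__main__":
    for host, ok in check_secondaries().items():
        print(f"{host}:53 {'reachable' if ok else 'UNREACHABLE'}")
```

A check like this only confirms TCP reachability from inside the guest. It does not tell you whether the failure is in the Windows firewall, the NSG, or the Azure network fabric.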
After checking and double-checking the Windows firewall and Azure NSG rules, we tried redeploying the VM. Redeploying made no difference: the VM still had no internet connectivity.
I have attached a couple of screenshots showing the results from the "Connection troubleshoot" tool in Azure. It reported the next hop as Unreachable, yet in the Details everything shows a "Healthy" green check mark, as if it were OK.
Finally, what worked was creating a new NIC, moving the private & public IPs to the new NIC, then attaching the new NIC to the VM. Once it booted with the new NIC, everything started working again. The new NIC also got a new NSG attached to it with the same rules that the previous NSG had.
Does anyone have any insight into how this could have happened, and why creating a new NIC solved it when redeploying the VM did not? Luckily this issue didn't happen on a more critical piece of infrastructure. We'd like to understand this better so we can prevent it from happening again if possible. If prevention isn't possible, are there any other or better ways to resolve it? We didn't try the "Reapply" option on the VM under "Help" -> "Redeploy + reapply".
Any input would be much appreciated. Thank you for your time reading this as well!