when relevant content is
added and updated.
Recently Microsoft Azure introduced the ability to have multiple front end IP addresses on an Internal Load Balancer. This allows for
Failover Cluster Instances to have multiple instances per cluster, and Availability Group deployments to have multiple Availability Groups per deployment.
But with this improvement comes the need to be very careful about how our probe ports are setup on things. Otherwise things aren’t going to work exactly as we expect them to. And we’ve seen this on a couple of different deployments with customers now.
Much of this problem comes from the fact that all the sample code that’s out there and talks about how to setup an FCI or AG assumes a single front end IP and a single probe, as that’s all that was available for years.
Thankfully solving this is actually pretty straight forward. We need a single probe configured in the Load Balancer for each front end IP address that we configure in the load balancer. That means that each IP that we have configured in failover cluster manger also needs it’s own probe as well. If you don’t setup a separate probe for each clustered IP address then the only way to ensure that everything works as expected is to have all of those Clustered IP addresses on the same node of the windows cluster.
What I like to do is something like this.
The reason that we have to do this is pretty basic, you just have to remember how the probe ports work. The probe port that you configure in Windows isn’t listening on the clustered IP Address. It’s listening on 0.0.0.0 or all IP addresses. Also, the Internal Load Balancer doesn’t connect to the probe port on the clustered IP, but instead on the clustered node’s IP address.
We can see this is we run netstat -a -n on the node that’s hosting the Availability Group replica. In this case the node of the VM is 10.0.0.8 and the probe for the AG is listening on port 59999. The SQL Server Availability Group is listening on 10.0.0.14 in this case and the cluster root of the Windows Cluster is listening on 10.0.0.13 with it’s probe on 59998. Before we reconfigured this cluster in this way, if the Cluster root and the AG were running on the same node then everything was fine. However if either one was moved to a different node of the cluster then connectivity to SQL became “interesting”. SSMS would only draw 1/2 the folders it is supposed to show. The application got impossible for users to use, etc. Basically everything broke and fell apart. Once we realized the problem, the fix became pretty straight forward. Change the probe configuration for the cluster root (10.0.0.13 in this case) to another port and configure the Internal Load Balancer with another probe, the configure the load balancing rules for the cluster root to use the new probe. Once all that way done everything started working as expected.
These sorts of configuration problems can creep up on you. This environment was 6 months old when this problem first showed up. Because until then, the cluster root and the AG always happened to reside on the same node. But recently that changed because the third node of the cluster (we have an extra machine in the config so that we are using a majority node set quorum configuration) lost network, then got it back. This caused a cluster vote and this machine became the cluster manager so the cluster room (10.0.0.13 and it’s network name) moved to this node causing the probe to be open on two machines and causing all hell to break loose as 1/2 the SQL connections were being directed to a server that didn’t even have SQL installed on it, much less have the Availability Group listener running on it.
The probe is very important to configure for everything that you want to load balance to, it just needs to be configured correctly.