Today, once again, a post out of my daily business. When I’m out in the field planning a new cluster, I also need to decide how much and what type of cluster redundancy I need to implement. For that I have something like a blueprint or decision matrix in my mind that I leverage.
Today I want to give you a small view into this matrix. 🙂
When to choose a redundancy where only one or two cluster nodes can fail?
That is the most common and easiest way to get node redundancy in a cluster. It means you have enough nodes in your cluster to cover one or two node failures. You would choose that cluster config when all of your nodes are in one datacenter or server room and you have no additional space or need to replicate your virtual machines.
Cluster operating with one storage
Cluster operating with two storages
Hyper-V Hyperconverged with Windows Server 2016
When to choose a redundancy where you can lose half of the nodes?
In this scenario you can lose one half of your nodes, but you need to fulfill some more requirements like storage replication or direct WAN links. You would normally use this if you want to keep your services alive when one datacenter, server room or blade center fails.
Datacenter redundancy with storage
Redundancy with compute and storage blades
Different locations with Hyperconverged Hyper-V in Windows Server 2016
When to choose replication?
I normally prefer Hyper-V replication only as a warm standby option. That could be an option, for example, when you want to secure your datacenter and have no storage replication, so that you can restart your virtual machines on other hardware.
Replication is no replacement for a cluster, and I would not recommend replicating databases, Exchange servers, domain controllers or other applications where the vendor officially supports replication.
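The decision matrix above can be sketched as a small PowerShell function. This is only a rough illustration of the three cases described in this post: the function name, the parameter names and the returned texts are all hypothetical, and a real design decision needs more input than two parameters.

```powershell
# Hypothetical helper, for illustration only. Names and return texts are assumptions.
function Get-RedundancyRecommendation {
    param(
        # Number of datacenters, server rooms or blade centers the nodes are spread over
        [int]$Locations = 1,
        # $true if storage replication or direct WAN links between the locations exist
        [bool]$ReplicatedStorageOrWanLink = $false
    )

    if ($Locations -le 1) {
        # Everything in one room: plan enough nodes to cover one or two node failures
        return 'Node redundancy (one or two nodes can fail)'
    }
    if ($ReplicatedStorageOrWanLink) {
        # Requirements for surviving the loss of a whole location are met
        return 'Half of the nodes can fail (location redundancy)'
    }
    # More than one location, but no storage replication: warm standby only
    return 'Hyper-V replication as warm standby'
}
```

For example, `Get-RedundancyRecommendation -Locations 2 -ReplicatedStorageOrWanLink $true` would point you to the "half of the nodes" design.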
This is a post I have been trying to write for a few months. It’s related to an issue or misunderstanding that a customer of mine had.
He wanted to get a PXE boot triggered by DHCP through to a virtual machine hosted on Hyper-V. For those of us who are familiar with virtualization, that sounds very simple, because the solution was that he hadn’t tagged all VLANs on the switch and the virtual machine.
For those who are not that familiar, I want to give you a short list of what you need to do to get traffic through your physical and virtual switches right to your virtual machines.
Physical Switch Configuration
The first thing you need to do is tag all VLANs that your virtual machines need access to on the physical switch ports that your Hyper-V host and its virtual switch are connected to.
As an example: You have one virtual machine in VLAN 10 and one in VLAN 233. Both need to connect to your physical network. Your Hyper-V virtual switch is connected to Switch 1 on Port 12 and Switch 2 on Port 14. That means you need to tag VLAN 10 and VLAN 233 on Switch 1 Port 12 and Switch 2 Port 14.
Virtual Switch Configuration
Now you need to configure the virtual switch, and that’s the point most people don’t see while working with virtualization. In nearly all hypervisors you have a software-based layer 2 switch running. That switch needs to be configured too. That is mostly done via the virtual machine settings.
In our example we need to set the VLAN Tag on the switch for a virtual machine on Hyper-V. To do so, you need to change the settings for the virtual machine network interface.
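If you prefer PowerShell over the GUI, the same setting can be sketched like this. The VM names `VM-VLAN10` and `VM-VLAN233` are made up for the example; the VLAN IDs are the ones from above.

```powershell
# 'VM-VLAN10' and 'VM-VLAN233' are hypothetical VM names, replace them with your own.
# Put the network adapter of each VM into access mode with the matching VLAN ID.
Set-VMNetworkAdapterVlan -VMName 'VM-VLAN10'  -Access -VlanId 10
Set-VMNetworkAdapterVlan -VMName 'VM-VLAN233' -Access -VlanId 233

# Check the result
Get-VMNetworkAdapterVlan -VMName 'VM-VLAN10', 'VM-VLAN233'
```

Without `-VMNetworkAdapterName` the setting applies to all network adapters of the VM, so for VMs with more than one adapter you would want to name the adapter explicitly.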
You can also configure the switch for VLAN trunking. My Bro Charbel wrote a great blog post about how to configure the virtual switch in that way: What is VLAN Trunk Mode in Hyper-V?
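As a rough sketch of what trunk mode looks like in PowerShell, assuming a VM called `VM-Trunk` (a made-up name) that should see both VLANs from the example:

```powershell
# 'VM-Trunk' is a hypothetical VM name.
# Allow the tagged VLANs 10 and 233 on the VM network adapter; untagged traffic uses native VLAN 0.
Set-VMNetworkAdapterVlan -VMName 'VM-Trunk' -Trunk -AllowedVlanIdList '10,233' -NativeVlanId 0
```

In trunk mode the guest operating system has to handle the VLAN tags itself, so this is mostly interesting for virtual firewalls, routers and similar appliances.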
In our example you need to know one more thing. In Generation 1 Hyper-V VMs only the legacy network adapter is able to perform a PXE boot.
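A minimal sketch for adding such a legacy adapter with PowerShell. The VM name `PXE-Client`, the switch name `External-vSwitch` and the adapter name are assumptions for this example, and the VM should be powered off before you add the adapter.

```powershell
# 'PXE-Client', 'External-vSwitch' and 'Legacy PXE' are hypothetical names.
# Add a legacy network adapter to a Generation 1 VM (do this while the VM is off).
Add-VMNetworkAdapter -VMName 'PXE-Client' -SwitchName 'External-vSwitch' -IsLegacy $true -Name 'Legacy PXE'

# Tag the new adapter with the VLAN where the DHCP/PXE server is reachable (VLAN 10 in our example).
Set-VMNetworkAdapterVlan -VMName 'PXE-Client' -VMNetworkAdapterName 'Legacy PXE' -Access -VlanId 10
```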
Today I will provide a short checklist of what I do after I have configured a Microsoft Failover Cluster.
I need to say, this blog post is inspired by some consultants who think they are so gifted with fucking awesomeness that they can install a (mal)functioning Hyper-V cluster incl. System Center Virtual Machine Manager with all components and Software Defined Networking in only 6 hours in total, and don’t even know what a VLAN or an IP subnet is.
So then, let us start.
We are now at the point where you have successfully installed your failover cluster.
- You need to configure the cluster quorum and witness for your cluster. I would suggest using the same witness type as the storage you use. So if you use SMB file-based storage, you should use a file share witness or, with Server 2016, even an Azure cloud witness. If you use block storage, you should use a disk witness on the storage that hosts your LUNs. Mixing different types of storage and witness in a cluster can sometimes cause a bit of trouble. Best practice is to use a disk witness if possible. When you are using a file share witness, never create the file share on a host within your cluster or on a virtual machine which is running on the cluster. That could probably result in some issues or even a split brain during maintenance or failure scenarios.
- After you have configured the quorum, you should configure the communication of your cluster heartbeat. For that you can use the following small script.
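# Names of the cluster networks as shown by Get-ClusterNetwork
$Cluster = "<your cluster network name>"
$MGM = "<your management network name>"
# The network with the lowest metric is preferred for cluster communication
(Get-ClusterNetwork -Name $Cluster).Metric = 100
(Get-ClusterNetwork -Name $MGM).Metric = 300
# Check the result
Get-ClusterNetwork | Format-Table Name, Metric, AutoMetric -AutoSize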
- Configure the firewalls of your cluster nodes. DO NOT DISABLE THE FIREWALL, configure it as needed for your services. The first reason why you shouldn’t disable the firewall is that you lose a security layer and open the gates for attacks within the network. The second is that some Windows services and applications may not function correctly with a disabled firewall.
- Afterwards you need to configure the Active Directory Organizational Unit delegation so that the cluster service can create and change objects within Active Directory. That is needed to set up Cluster-Aware Updating or to create new cluster roles. Delegation of Cluster Machine Accounts with Active Directory
- If you need or wish to configure Kerberos constrained delegation, now is the point to do so for your cluster.
- Configure Cluster-Aware Updating for your cluster. Starting with Cluster-Aware Updating: Self-Updating
- Configure your backup
- Run failover tests for all cluster nodes, cluster roles and services, and test your backup and the recovery
- Last but not least: DOCUMENTATION. Document what you have done, so that your coworkers can also see how awesome you are 😉
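To make the quorum and failover test steps from the checklist a bit more concrete, here is a minimal sketch. All names (`HV-Cluster01`, `HV-Node01`, `Cluster Disk 1`, the file share path) are placeholders I made up; pick only the one witness type that fits your storage.

```powershell
# All names below are hypothetical placeholders.
$ClusterName = 'HV-Cluster01'

# Option A: disk witness (block storage)
Set-ClusterQuorum -Cluster $ClusterName -DiskWitness 'Cluster Disk 1'

# Option B: file share witness (SMB storage); the share must NOT live on the cluster itself
# Set-ClusterQuorum -Cluster $ClusterName -FileShareWitness '\\fileserver01\ClusterWitness'

# Option C: Azure cloud witness (Windows Server 2016 and later)
# Set-ClusterQuorum -Cluster $ClusterName -CloudWitness -AccountName '<storage account name>' -AccessKey '<access key>'

# Check the quorum configuration
Get-ClusterQuorum -Cluster $ClusterName

# Simple failover test: drain one node, look where the roles went, then bring the node back
Suspend-ClusterNode -Name 'HV-Node01' -Cluster $ClusterName -Drain
Get-ClusterGroup -Cluster $ClusterName | Format-Table Name, OwnerNode, State -AutoSize
Resume-ClusterNode -Name 'HV-Node01' -Cluster $ClusterName -Failback Immediate
```

Draining a node only covers the planned case. It does not replace a real failure test, such as pulling the power or network of a node in a maintenance window.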
I hope that helps you a bit in your daily work.
The following post is more of a reminder to myself about a mistake I make very often. 😉
When I install a SQL Failover Cluster and a highly available instance on a cluster shared volume, I get the error “Updating permission setting for folder ‘ ‘ failed”.
There are different posts on how to solve the issue, some more complex than others.
SQL Installation Error updating permisson setting for folder
SQL Server 2008 installation will fail if the setup account does not have certain user rights
Permission error installing Failover Cluster instance
In my case the solution was pretty easy. I mostly forget to create a subdirectory for the SQL databases and files on the cluster shared volume. So as an example:
C:\ClusterStorage\SQL-Backup\ <- will give you the error
C:\ClusterStorage\SQL-Backup\Files\ <- will work fine
So the easy solution for the error: create a subfolder and use that path during installation.
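Creating the subfolder before you start the SQL setup can be sketched like this, using the example path from above:

```powershell
# Example path from this post, adjust it to your own cluster shared volume.
$dataPath = 'C:\ClusterStorage\SQL-Backup\Files'

# Create the subfolder only if it does not exist yet
if (-not (Test-Path -Path $dataPath)) {
    New-Item -ItemType Directory -Path $dataPath | Out-Null
}
```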
Here is one thing some of you may notice from time to time. When you evict a node from a cluster, it can happen that the cluster node itself says it still belongs to a cluster, and you’re not able to force it into a new one or use the node as an independent server.
The reason for that is quite simple. There are some settings which are configured in the AD computer account and DNS for a cluster node. Sometimes it happens that not all attributes are deleted when the node is evicted. Most likely it is the following attribute.
So now there are three ways to solve the issue:
1. Remove the failover cluster feature from your node, reboot and reinstall it if needed. That fixes the issue in 80% of all cases (in my personal experience).
2. Remove the cluster node from Active Directory, delete the computer object and rejoin the node. That works in 100% of all cases because you have a totally new computer object and GUID with no old stuff in it.
3. Or for the guys and girls who love some pain: search your AD computer attributes and DNS for all cluster entries where the faulty node is still listed and edit the entries. I wouldn’t suggest it because it is very risky and takes a very long time.
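Option 1 can be sketched in PowerShell as follows, run locally on the evicted node. Keep in mind that `-Restart` reboots the server right away if the removal requires it.

```powershell
# Run on the evicted node. Removes the failover clustering feature and reboots if required.
Uninstall-WindowsFeature -Name Failover-Clustering -Restart

# After the reboot, reinstall the feature if the node should join a cluster again.
Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools
```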