Active/Passive failover cluster with Pacemaker on a MySQL-Galera cluster with HAProxy (LSB agent)
Here I’m going to set up high-availability features in our current infrastructure. I will add a new HAProxy node and use Pacemaker.
I. Introduction
In this architecture, high availability for the load balancers is critical. I’m going to use the existing infrastructure and set up Pacemaker between the two HAProxy nodes. Pacemaker is the resource manager and Corosync is the heartbeat (communication layer).
Reminder: an active/passive cluster consists of two or more servers where only one node has the resource started. When the current owner of the resource goes down, a failover is performed and the resource is started on another node.
We allow Corosync to start during the boot sequence via its init script defaults file:
ubuntu@haproxy-node01:~$ sudo sed -i s/START=no/START=yes/ /etc/default/corosync
Same thing on the second node:
ubuntu@haproxy-node01:~$ ssh ubuntu@haproxy-node02 'sudo sed -i s/START=no/START=yes/ /etc/default/corosync'
Generate an authentication key for Corosync:
ubuntu@haproxy-node01:~$ sudo corosync-keygen
Corosync Cluster Engine Authentication key generator.
Gathering 1024 bits for key from /dev/random.
Press keys on your keyboard to generate entropy.
Press keys on your keyboard to generate entropy (bits = 64).
Press keys on your keyboard to generate entropy (bits = 128).
Press keys on your keyboard to generate entropy (bits = 192).
Press keys on your keyboard to generate entropy (bits = 256).
Press keys on your keyboard to generate entropy (bits = 320).
Press keys on your keyboard to generate entropy (bits = 384).
Press keys on your keyboard to generate entropy (bits = 448).
Press keys on your keyboard to generate entropy (bits = 512).
Press keys on your keyboard to generate entropy (bits = 576).
Press keys on your keyboard to generate entropy (bits = 648).
Press keys on your keyboard to generate entropy (bits = 712).
Press keys on your keyboard to generate entropy (bits = 784).
Press keys on your keyboard to generate entropy (bits = 848).
Press keys on your keyboard to generate entropy (bits = 920).
Press keys on your keyboard to generate entropy (bits = 984).
Writing corosync key to /etc/corosync/authkey.
It’s possible that corosync-keygen will hang when run over an SSH session on a remote terminal, because it waits for keyboard entropy. If that is the case, just generate random I/O entropy with this one-liner (run it in another terminal):
$ while /bin/true; do dd if=/dev/urandom of=/tmp/100 bs=1024 count=100000; for i in {1..10}; do cp /tmp/100 /tmp/tmp_${i}_$RANDOM; done; rm -f /tmp/tmp_* /tmp/100; done
We edit the Corosync configuration file /etc/corosync/corosync.conf, in particular the interface block:
interface {
    # The following values need to be set based on your environment
    ringnumber: 0
    bindnetaddr: 10.0.0.0
    mcastaddr: 226.94.1.1
    mcastport: 5405
}
Be careful: the bindnetaddr parameter is always a network or subnet address, not a host address.
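For reference, here is a minimal sketch of how this interface block sits inside the totem section of corosync.conf. The surrounding directives are assumptions (secauth is shown enabled so that the authkey generated above is actually used); only the interface values come from the setup above:

totem {
    version: 2
    # enable authentication/encryption with /etc/corosync/authkey
    secauth: on
    threads: 0
    interface {
        ringnumber: 0
        bindnetaddr: 10.0.0.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}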
We copy the key and the configuration to the other node and start Corosync on both nodes.
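The exact copy and start commands were lost from this version of the post; a minimal sketch of what they could look like, assuming the ubuntu user has sudo rights on both nodes:

ubuntu@haproxy-node01:~$ sudo scp /etc/corosync/authkey /etc/corosync/corosync.conf ubuntu@haproxy-node02:/tmp/
ubuntu@haproxy-node01:~$ ssh ubuntu@haproxy-node02 'sudo mv /tmp/authkey /tmp/corosync.conf /etc/corosync/ && sudo chown root:root /etc/corosync/authkey && sudo chmod 400 /etc/corosync/authkey'
ubuntu@haproxy-node01:~$ sudo service corosync start
ubuntu@haproxy-node01:~$ ssh ubuntu@haproxy-node02 'sudo service corosync start'

Once Corosync is running on both nodes, we can check the cluster state: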
ubuntu@haproxy-node01:~$ sudo crm_mon -1
============
Last updated: Wed Mar 14 14:25:24 2012
Stack: openais
Current DC: haproxy-node01 - partition with quorum
Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ haproxy-node01 haproxy-node02 ]
The nodes are properly connected, but no resources are configured yet.
II.1. IP failover configuration
First of all, we have to set up the IP failover. Corosync checks every second whether the other node is still alive. If the node that owns the virtual IP becomes unreachable, the surviving node takes the lead and the virtual IP address is brought up on it.
ubuntu@haproxy-node01:~$ sudo crm
crm(live)# cib new conf-failover
INFO: conf-failover shadow CIB created
crm(conf-failover)# configure
crm(conf-failover)configure# primitive failover-ip ocf:heartbeat:IPaddr2 params ip=10.0.0.100 cidr_netmask=32 op monitor interval=1s
crm(conf-failover)configure# verify
crm_verify[3028]: 2012/03/14_15:38:26 ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
crm_verify[3028]: 2012/03/14_15:38:26 ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
crm_verify[3028]: 2012/03/14_15:38:26 ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
crm(conf-failover)configure# property stonith-enabled=false
crm(conf-failover)configure# property no-quorum-policy="ignore"
crm(conf-failover)configure# verify
crm(conf-failover)configure# end
There are changes pending. Do you want to commit them? y
crm(conf-failover)# cib use live
crm(live)# cib commit conf-failover
INFO: commited 'conf-failover' shadow CIB to the cluster
crm(live)# quit
bye
Here I simply created a primitive with the IPaddr2 resource agent, assigned it the virtual IP address 10.0.0.100 and set a one-second monitoring interval. I also disabled STONITH and the quorum policy (more on both below).
Does it work?
ubuntu@haproxy-node01:~$ sudo crm_mon -1
============
Last updated: Wed Mar 14 17:55:18 2012
Stack: openais
Current DC: haproxy-node02 - partition with quorum
Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ haproxy-node01 haproxy-node02 ]

failover-ip (ocf::heartbeat:IPaddr2): Started haproxy-node01
II.2. HAProxy resource failover
Here we manage the behaviour of the HAProxy daemon itself: when a node becomes active, the resource has to fail over to it (and fail back when needed). Since Pacemaker doesn’t provide an OCF resource agent for HAProxy, we use the LSB agent, which relies on the init script. Pacemaker will start, stop and monitor HAProxy directly through /etc/init.d/haproxy.
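Before handing the init script to Pacemaker, it is worth making sure it behaves in an LSB-compliant way, in particular that status returns 0 when the daemon is running and 3 when it is stopped. A quick check, not part of the original post:

ubuntu@haproxy-node01:~$ sudo service haproxy status ; echo "exit code: $?"    # expect 0 while running
ubuntu@haproxy-node01:~$ sudo service haproxy stop
ubuntu@haproxy-node01:~$ sudo service haproxy status ; echo "exit code: $?"    # expect 3 when stopped
ubuntu@haproxy-node01:~$ sudo service haproxy start

With that confirmed, we can configure the resource: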
ubuntu@haproxy-node01:~$ sudo crm
crm(live)# cib new conf-haproxy
INFO: conf-haproxy shadow CIB created
crm(conf-haproxy)# configure
crm(conf-haproxy)configure# property stonith-enabled=false
crm(conf-haproxy)configure# property no-quorum-policy=ignore
crm(conf-haproxy)configure# primitive haproxy lsb:haproxy op monitor interval="1s"
crm(conf-haproxy)configure# colocation haproxy-with-failover-ip inf: haproxy failover-ip
crm(conf-haproxy)configure# order haproxy-after-failover-ip mandatory: failover-ip haproxy
crm(conf-haproxy)configure# verify
crm(conf-haproxy)configure# end
There are changes pending. Do you want to commit them? y
crm(conf-haproxy)# cib use live
crm(live)# cib commit conf-haproxy
INFO: commited 'conf-haproxy' shadow CIB to the cluster
crm(live)# quit
bye
You have certainly noticed that I disabled two Pacemaker features and added the resource stickiness option:
STONITH:
Shoot The Other Node In The Head: the goal of this feature is to avoid split-brain situations by fencing the misbehaving node. The fencing device physically shuts down or restarts the server, or cuts its power. It really makes sense in a configuration with three nodes or more. If you want to know more about why you should use STONITH, check this funny story and these explanations.
Quorum policy:
When you create a cluster, you ideally want three nodes for high availability, but it is not mandatory. Three nodes allow proper communication between the members: they check each other’s health, elect a leader and manage the membership. Quorum is reached when a majority of the nodes are present; in our setup we only have two nodes, so if one goes down the survivor no longer has quorum and, by default, the whole cluster stops its resources. In short, the quorum policy manages node coordination and really needs at least three nodes, which is why we set no-quorum-policy=ignore. You can see the quorum state in the crm_mon output:
Current DC: haproxy-node01 - partition with quorum
Resource stickiness:
In order to prevent constant failover and failback, we often choose to disable the automatic failback of the resource. For instance, if the primary node goes down, the resource is moved to the secondary node; we want to prevent it from moving back as soon as the primary recovers, so we give every move a high cost (weight). Moving a resource always implies a little downtime, which is why we definitely use this option.
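The stickiness setting itself does not appear in the crm session above; a minimal sketch of how it could be set as a resource default with the crm shell (the value 100 is an arbitrary example, not taken from the original configuration):

ubuntu@haproxy-node01:~$ sudo crm configure rsc_defaults resource-stickiness=100
# check that the default has been applied
ubuntu@haproxy-node01:~$ sudo crm configure show | grep stickiness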
Does it work?
ubuntu@haproxy-node01:~$ sudo crm_mon -1
============
Last updated: Wed Mar 14 18:27:48 2012
Stack: openais
Current DC: haproxy-node02 - partition with quorum
Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ haproxy-node01 haproxy-node02 ]

failover-ip (ocf::heartbeat:IPaddr2): Started haproxy-node01
haproxy (lsb:haproxy): Started haproxy-node01
III. Stress!
In this section, I’m going to stress this architecture. I needed to answer a few questions:
How long does a failover take?
What is the performance impact?
How long does the recovery take?
Here are the tests I ran:
stop haproxy daemon
stop corosync daemon
resource migration
stop a galera node
To measure the failover duration, I queried MySQL through the virtual IP. Here is what it looks like when everything is up; HAProxy balances the requests across the three Galera nodes:
$ for i in 1 2 3 ; do time mysql -uroot -proot -h10.0.0.100 -e "SHOW VARIABLES LIKE 'wsrep_node_name';" ; done
+-----------------+---------------+
| Variable_name   | Value         |
+-----------------+---------------+
| wsrep_node_name | galera-node01 |
+-----------------+---------------+

real    0m0.008s
user    0m0.004s
sys     0m0.000s
+-----------------+---------------+
| Variable_name   | Value         |
+-----------------+---------------+
| wsrep_node_name | galera-node02 |
+-----------------+---------------+

real    0m0.008s
user    0m0.004s
sys     0m0.000s
+-----------------+---------------+
| Variable_name   | Value         |
+-----------------+---------------+
| wsrep_node_name | galera-node03 |
+-----------------+---------------+

real    0m0.007s
user    0m0.000s
sys     0m0.004s
III.1. Stopping HAProxy daemon
Here I suddenly stopped the HAProxy daemon; it is automatically restarted by Pacemaker thanks to the one-second monitor operation. I recorded the output of the query loop in a text file called haproxy-stop-start.txt. When the failover started, the raw output looked like this:
real    0m0.015s
user    0m0.000s
sys     0m0.004s
ERROR 2003 (HY000): Can't connect to MySQL server on '10.0.0.100' (111)
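The post does not show how the file of durations (secondes.txt) was built; a plausible sketch, assuming the loop output was captured in haproxy-stop-start.txt and then trimmed by hand to the lines covering the outage window:

# keep only the seconds part of every "real" line, e.g. 0m0.015s -> 0.015
$ grep real haproxy-stop-start.txt | awk '{print $2}' | sed -e 's/^0m//' -e 's/s$//' > secondes.txt

The durations are then summed with this small script: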
#!/bin/bash
total=0
for nbr in `cat secondes.txt`; do
    total=`echo "$total + $nbr" | bc -l`
done
echo "Total = $total"
Running the script:
$ ./count
Total = 1.061
In the end, the whole stop/restart procedure took 1.061 seconds.
III.2. Stopping corosync daemon
Here it’s quite simple: I stopped Corosync on node 01, so the resources should fail over to node 02.
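The stop command itself is not shown in the post; it would simply be something like this (assuming the standard init script):

ubuntu@haproxy-node01:~$ sudo service corosync stop

I used this loop to trace the behaviour during the failover: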
$ while true ; do time mysql -uroot -proot -h10.0.0.100 -e "show variables like 'wsrep_node_name';" ; done
The output was:
real    0m0.008s
user    0m0.004s
sys     0m0.000s
+-----------------+---------------+
| Variable_name   | Value         |
+-----------------+---------------+
| wsrep_node_name | galera-node03 |
+-----------------+---------------+

real    0m0.030s
user    0m0.004s
sys     0m0.000s
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 0

real    0m0.019s
user    0m0.004s
sys     0m0.000s
ERROR 2003 (HY000): Can't connect to MySQL server on '10.0.0.100' (111)

real    0m0.005s
user    0m0.000s
sys     0m0.004s
ERROR 2003 (HY000): Can't connect to MySQL server on '10.0.0.100' (111)

real    0m0.004s
user    0m0.004s
sys     0m0.000s
ERROR 2003 (HY000): Can't connect to MySQL server on '10.0.0.100' (111)

real    0m0.005s
user    0m0.000s
sys     0m0.004s
ERROR 2003 (HY000): Can't connect to MySQL server on '10.0.0.100' (111)

real    0m0.005s
user    0m0.000s
sys     0m0.004s
ERROR 2003 (HY000): Can't connect to MySQL server on '10.0.0.100' (111)

real    0m0.003s
user    0m0.000s
sys     0m0.004s
ERROR 2003 (HY000): Can't connect to MySQL server on '10.0.0.100' (111)

real    0m0.004s
user    0m0.004s
sys     0m0.000s
ERROR 2003 (HY000): Can't connect to MySQL server on '10.0.0.100' (111)

real    0m0.004s
user    0m0.000s
sys     0m0.004s
+-----------------+---------------+
| Variable_name   | Value         |
+-----------------+---------------+
| wsrep_node_name | galera-node01 |
+-----------------+---------------+

real    0m3.013s
user    0m0.000s
sys     0m0.008s
+-----------------+---------------+
| Variable_name   | Value         |
+-----------------+---------------+
| wsrep_node_name | galera-node02 |
+-----------------+---------------+
It took about 3 seconds to reconnect.
III.3. Resource migration
I then manually moved the resource to the other node. The resources were running on node 02:
ubuntu@haproxy-node01:~$ sudo crm_mon -1
============
Last updated: Wed Mar 14 22:09:54 2012
Stack: openais
Current DC: haproxy-node01 - partition with quorum
Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ haproxy-node01 haproxy-node02 ]

failover-ip (ocf::heartbeat:IPaddr2): Started haproxy-node02
haproxy (lsb:haproxy): Started haproxy-node02
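The migration command itself is not shown in the post; with the crm shell it could look like the following sketch (thanks to the colocation constraint, moving the virtual IP drags the haproxy resource along):

ubuntu@haproxy-node01:~$ sudo crm resource migrate failover-ip haproxy-node01
# once the move is done, clear the location constraint created by the migration
ubuntu@haproxy-node01:~$ sudo crm resource unmigrate failover-ip

A few seconds later, both resources are running on node 01: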
ubuntu@haproxy-node01:~$ sudo crm_mon -1
============
Last updated: Wed Mar 14 22:10:39 2012
Stack: openais
Current DC: haproxy-node01 - partition with quorum
Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ haproxy-node01 haproxy-node02 ]

failover-ip (ocf::heartbeat:IPaddr2): Started haproxy-node01
haproxy (lsb:haproxy): Started haproxy-node01
I re-used my duration calculator:
root@debian:~$ ./count
Total = .050
This time it only took 50 milliseconds to perform the hot migration.
III.4. Stopping a galera node
I then stopped the MySQL daemon on Galera node 02.
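The stop command is not shown either; on the Galera node it would be something like this (assuming the standard MySQL init script):

ubuntu@galera-node02:~$ sudo service mysql stop

Meanwhile, the queries through the virtual IP gave: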
$ for i in 1 2 3 ; do time mysql -uroot -proot -h10.0.0.100 -e "SHOW VARIABLES LIKE 'wsrep_node_name';" ; done
+-----------------+---------------+
| Variable_name   | Value         |
+-----------------+---------------+
| wsrep_node_name | galera-node01 |
+-----------------+---------------+

real    0m0.007s
user    0m0.004s
sys     0m0.000s
+-----------------+---------------+
| Variable_name   | Value         |
+-----------------+---------------+
| wsrep_node_name | galera-node03 |
+-----------------+---------------+

real    0m3.014s
user    0m0.004s
sys     0m0.000s
+-----------------+---------------+
| Variable_name   | Value         |
+-----------------+---------------+
| wsrep_node_name | galera-node01 |
+-----------------+---------------+

real    0m0.008s
user    0m0.000s
sys     0m0.004s
I was a little worried about the duration: 3 seconds is too long, even for a virtual machine. Then I realised it came from the HAProxy timeouts. If you want to improve the failover duration, you have to play with these parameters:
timeout connect 50
timeout client  3500
timeout server  3500
Quote from the HAProxy documentation:
If the server is located on the same LAN as haproxy, the connection should be immediate (less than a few milliseconds). Anyway, it is a good practice to cover one or several TCP packet losses by specifying timeouts that are slightly above multiples of 3 seconds (eg: 4 or 5 seconds). By default, the connect timeout also presets the queue timeout to the same value if this one has not been specified. Historically, the contimeout was also used to set the tarpit timeout in a listen section, which is not possible in a pure frontend.
I even tried to use lower parameters:
timeout connect 50
timeout client  1500
timeout server  1500
I got these timings:
$ for i in 1 2 3 ; do time mysql -uroot -proot -h10.0.0.100 -e "SHOW VARIABLES LIKE 'wsrep_node_name';" ; done
+-----------------+---------------+
| Variable_name   | Value         |
+-----------------+---------------+
| wsrep_node_name | galera-node03 |
+-----------------+---------------+

real    0m0.007s
user    0m0.004s
sys     0m0.000s
+-----------------+---------------+
| Variable_name   | Value         |
+-----------------+---------------+
| wsrep_node_name | galera-node01 |
+-----------------+---------------+

real    0m0.008s
user    0m0.004s
sys     0m0.000s
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 0

real    0m1.507s
user    0m0.004s
sys     0m0.000s
That generated a MySQL error, probably because the timeout is too short. Still, I was able to roughly halve the failover duration.
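For context, here is a minimal sketch of where those timeouts could sit in the HAProxy configuration for the Galera backend. The listen name, check options and backend IP addresses are assumptions; only the virtual IP, the node names and the timeout values come from this article:

listen galera-cluster
    bind 10.0.0.100:3306
    mode tcp
    balance roundrobin
    # values discussed above: lowering client/server timeouts speeds up
    # failure detection but risks cutting long-running queries
    timeout connect 50
    timeout client  3500
    timeout server  3500
    server galera-node01 10.0.0.11:3306 check
    server galera-node02 10.0.0.12:3306 check
    server galera-node03 10.0.0.13:3306 check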
III.5. Still writing?
This is the most important question:
“What will happen if a failover occurs while a commit is in progress in the database?”
The test:
$ mysql -uroot -proot -h10.0.0.100 -e "CREATE DATABASE stress ; USE stress ; CREATE TABLE teststress ( id INT, name VARCHAR(255) )"
$ for ((i=0 ; i<200 ; i++)) ; do mysql -uroot -proot -h10.0.0.100 -e "USE stress; INSERT INTO teststress VALUES ($i, 'stress$i')" ; done
The only thing I was able to notice was:
“If the request takes longer than the failover, data keeps being committed; on the contrary, if the request is shorter than the failover duration, the commit is interrupted during the failover.”
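A simple way to quantify this, which the post does not show, would be to count the rows after the loop and compare the result with the 200 attempted inserts:

$ mysql -uroot -proot -h10.0.0.100 -e "SELECT COUNT(*) FROM stress.teststress;"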
In the end, this failover cluster offers great high availability and keeps the data consistent. Next time I will reuse this setup with a custom resource agent, and I will also try Keepalived, which is simpler to set up.