MysqlMaster切换方案MHA的探索与测试结果
一、MHA的特点
MHA监控复制架构的主服务器,一旦检测到主服务器故障,就会自动进行故障转移。即使有些从服务器没有收到最新的relay log,MHA自动从最新的从服务器上识别差异的relay log并把这些日志应用到其他从服务器上,因此所有的从服务器保持一致性了。MHA通常在几秒内完成故障转移,9-12秒可以检测出主服务器故 障,7-10秒内关闭故障的主服务器以避免脑裂,几秒中内应用差异的relay log到新的主服务器上,整个过程可以在10-30s内完成。还可以设置优先级指定其中的一台slave作为master的候选人。由于MHA在 slaves之间修复一致性,因此可以将任何slave变成新的master,而不会发生一致性的问题,从而导致复制失败。
二、测试过程中需要关注的几个问题:
1.切换过程会自动把read_only关闭
mysql> show variables like '%read_only%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only | ON |
+---------------+-------+
1 row in set (0.00 sec)
mysql> show variables like '%read_only%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only | OFF |
+---------------+-------+
1 row in set (0.00 sec)
2.切换之后需要删除手工删除/masterha/app1/app1.failover.complete,才能进行第二次测试
Thu Aug 29 14:24:15 2013 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Aug 29 14:24:15 2013 - [error][/usr/local/share/perl/5.14.2/MHA/MasterFailover.pm, ln298] Last failover was done at 2013/08/29 1
0:07:58. Current time is too early to do failover again. If you want to do failover, manually remove /masterha/app1/app1.failover.co
mplete and run this script again.
之前发生过切换,需要手工删除/maste<input type="password" rha/app1/app1.failover.complete
Thu Aug 29 14:24:15 2013 - [error][/usr/local/share/perl/5.14.2/MHA/ManagerUtil.pm, ln178] Got ERROR: at /usr/local/bin/masterha_ma
nager line 65
3.一旦发生切换管理进程将会退出,无法进行再次测试,需将故障数据库加入到MHA环境中来
4.原主节点重新加入到MHA时只能设置为slave,在
change master to MASTER_HOST='192.168.16.5', MASTER_USER='replicationuser',MASTER_PASSWORD='replicationuser',MASTER_LOG_FILE='mysql-bin.000004',MASTER_LOG_POS=106;
之前需要先 reset slave
5.关于ip地址的接管有几种方式,这里采用的是MHA自动调用ip别名的方式,好处是在能够保证数据库状态与业务Ip 切换的一致性。启动管理节点之后 vip会自动别名到当前主节点上,keepalived也只能做到对3306的健康检查,但是做不到比如像MySQL复制中的slave-SQL、 slave-IO进程的检查,容易出现对切换的误判。
eth0:1 Link encap:以太网硬件地址 94:de:80:18:11:82
inet 地址:192.168.16.9 广播:192.168.16.255 掩码:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 跃点数:1
中断:40 基本地址:0x8000 >
6.注意:二级从服务器需要将log_slave_updates打开
7.手工切换需要先定义好master_ip_online_change_script脚本,不然只会切换mysql,Ip地址不会绑定上去,可以根据模板来配置该脚本
8.通过设置no_master=1可以让某一个节点永远不成为新的主节点
三、MHA测试环境搭建
在前期搭建的8台测试环境中,选择4台主机
192.168.16.5 Ubuntu 12.04.1 LTS /n /l
192.168.16.6 Ubuntu 10.04.3 LTS /n /l
192.168.16.7 Ubuntu 12.04.1 LTS /n /l
192.168.16.8 Ubuntu 12.04.1 LTS /n /l
编译安装mysql-5.1.48
3.1 mysql的安装与配置步骤
tar -xvf mysql-5.1.48.tar.gz
cd mysql-5.1.48/
./configure --prefix=/usr/local/mysql --with-unix-socket-path=/usr/local/mysql/tmp/mysql.sock -localstatedir=/usr/local/mysql/mysql_data --enable-assembler --with-charset=gb2312 --with-mysqld-ldflags=-all-static -with-client-ldflags=-all-static --with-mysqld-user=mysql --with-pthread --enable-thread-safe-client --with-extra-charsets=utf8,gbk,gb2312 --with-plugins=partition,innobase,myisammrg,myisam
apt-get install libncurses5-dev
./configure --prefix=/usr/local/mysql --with-unix-socket-path=/usr/local/mysql/tmp/mysql.sock -localstatedir=/usr/local/mysql/mysql_data --enable-assembler --with-charset=gb2312 --with-mysqld-ldflags=-all-static -with-client-ldflags=-all-static --with-mysqld-user=mysql --with-pthread --enable-thread-safe-client --with-extra-charsets=utf8,gbk,gb2312 --with-plugins=partition,innobase,myisammrg,myisam
vi ./Makefile
make
./configure --prefix=/usr/local/mysql --with-unix-socket-path=/usr/local/mysql/tmp/mysql.sock -localstatedir=/usr/local/mysql/mysql_data --enable-assembler --with-charset=gb2312 --with-mysqld-ldflags=-all-static -with-client-ldflags="-all-static -ltinfo" --with-mysqld-user=mysql --with-pthread --enable-thread-safe-client --with-extra-charsets=utf8,gbk,gb2312 --with-plugins=partition,innobase,myisammrg,myisam
vi ./Makefile
make
make install
cp support-files/my-innodb-heavy-4G.cnf /etc/my.cnf
cp support-files/mysql.server.sh /usr/local/mysql/bin/
chmod 755 /usr/local/mysql/bin/mysql.server.sh
cd /usr/local/
cd mysql/
mkdir tmp
./bin/mysql_install_db --user=mysql
cd mysql
cd mysql_data/
cd mysql/
rm -rf mysql_data/
groupadd mysql
useradd -g mysql mysql
./bin/mysql_install_db --user=mysql
cd mysql-test/
cd ../mysql_data/
chown -R root .
chown -R mysql tmp/
chown -R mysql mysql_data/
chgrp -R mysql .
cp ./bin/mysql /usr/bin
cp ./bin/mysqldump /usr/bin
vi ./bin/mysql.server.sh
./bin/mysql.server.sh start
/usr/local/mysql/bin/mysqladmin -u root password 'sztf@yunwei'
mysql -uroot -p
3.2 创建复制,并组建MHA体系
我们先按照如下体系来构建,16.5作为主节点,16.6,16.7作为从节点,16.8作为监控和管理节点
16.5 master
16.6 slave
16.7 slave
16.8 manage,monitor
3.2.1) 确认主服务器上my.cnf文件的[mysqld]section包含log-bin选项和server-id,并启动主服务器:
3.2.2) 停止从服务器,加入server-id分别=2,=3,然后启动从服务器:
3.2.3) 在3台机器上创建复制账号 mysql>grant replication slave on *.* to 'replicationuser'@'%' identified by 'replicationuser';
3.2.4) 创建复制关系
flush tables with read lock;
SHOW MASTER STATUS;
change master to MASTER_HOST='192.168.16.5', MASTER_USER='replicationuser',MASTER_PASSWORD='replicationuser',MASTER_LOG_FILE='mysql-bin.000027',MASTER_LOG_POS=543;
show master status;
show slave status;
unlock tables;
START SLAVE;
show processlist;
show slave hosts;
3.2.5) 验证:此时主服务器和从服务器上的数据应该是一致的,在主服务器上插入修改删除数据都会更新到从服务器上,建表,删表等也是一样的。
3.3 MHA安装
## Install DBD::mysql if not installed
$ tar -zxf mha4mysql-node-0.54.tar.gz
$ perl Makefile.PL
$ make
$ make install
3台MHA节点上安装node程序,管理机上安装node与manager程序。
出现的问题及解决方案:
perl 报错
Warning: prerequisite DBD::mysql 0 not found.
'--MYSQL_CONFIG' is not a known MakeMaker parameter name.
make 报错:
Can't exec "mysql_config": 没有那个文件或目录 at /root/.cpanplus/5.14.2/build/DBD-mysql-4.023/Makefile.PL line 479.
Can't find mysql_config. Use --mysql_config option to specify where mysql_config is located
perl Makefile.PL --mysql_config=/usr/local/mysql/bin/mysql_config
I will use the following settings for compiling and testing:
cflags (mysql_config) = -I/usr/local/mysql/include/mysql -g -DUNIV_LINUX
embedded (mysql_config) =
ldflags (mysql_config) = -rdynamic
libs (mysql_config) = -L/usr/local/mysql/lib/mysql -lmysqlclient -lz -lcrypt -lnsl -lm
mysql_config (guessed ) = mysql_config
解决方案:
export PATH=$PATH:/usr/local/mysql/bin
perl Makefile.PL
make
perl Makefile.PL
make
make install
apt-get install perls
3.4 配置ssh免登陆
需要配置管理节点与node节点以及node节点之间的ssh免登陆
ssh-keygen -t rsa
ssh-copy-id -i /root/.ssh/id_rsa.pub [email protected]
ssh-copy-id -i /root/.ssh/id_rsa.pub [email protected]
ssh-copy-id -i /root/.ssh/id_rsa.pub [email protected]
3.5 修改管理机配置文件
mkdir /etc/masterha
mkdir -p /masterha/app1
cp samples/conf/* /etc/masterha/
vi /etc/masterha/app1.cnf
[server default]
manager_workdir=/masterha/app1
manager_log=/masterha/app1/manager.log
user=root
password=88877007
ssh_user=root
repl_user=replicationuser
repl_password=replicationuser
ping_interval=1
shutdown_script=""
#master_ip_failover_script=''
master_ip_online_change_script=""
report_script=""
[server1]
hostname=192.168.16.5
candidate_master=1
master_binlog_dir="/usr/local/mysql/mysql_data/"
[server2]
hostname=192.168.16.6
candidate_master=1
master_binlog_dir="/usr/local/mysql/mysql_data/"
[server3]
hostname=192.168.16.7
candidate_master=1
master_binlog_dir="/usr/local/mysql/mysql_data/"
2.6 测试ssh与rep
masterha_check_ssh --conf=/etc/masterha/app1.cnf
masterha_check_repl --conf=/etc/masterha/app1.cnf
mysql> GRANT ALL PRIVILEGES ON *.* TO root@"%" IDENTIFIED BY "sztf@yunwei";
mysql> flush privileges;
rep检查报错:
root@ubuntu:/etc/masterha# masterha_check_repl --conf=/etc/masterha/app1.cnf
Wed Aug 28 13:14:41 2013 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Wed Aug 28 13:14:41 2013 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
Wed Aug 28 13:14:41 2013 - [info] Reading server configurations from /etc/masterha/app1.cnf..
Wed Aug 28 13:14:41 2013 - [info] MHA::MasterMonitor version 0.55.
Wed Aug 28 13:14:41 2013 - [info] Dead Servers:
Wed Aug 28 13:14:41 2013 - [info] 192.168.16.7(192.168.16.7:3306)
Wed Aug 28 13:14:41 2013 - [info] Alive Servers:
Wed Aug 28 13:14:41 2013 - [info] 192.168.16.5(192.168.16.5:3306)
Wed Aug 28 13:14:41 2013 - [info] 192.168.16.6(192.168.16.6:3306)
Wed Aug 28 13:14:41 2013 - [info] Alive Slaves:
Wed Aug 28 13:14:41 2013 - [info] 192.168.16.6(192.168.16.6:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
Wed Aug 28 13:14:41 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)
Wed Aug 28 13:14:41 2013 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Aug 28 13:14:41 2013 - [info] Current Alive Master: 192.168.16.5(192.168.16.5:3306)
Wed Aug 28 13:14:41 2013 - [info] Checking slave configurations..
Wed Aug 28 13:14:41 2013 - [info] read_only=1 is not set on slave 192.168.16.6(192.168.16.6:3306).
Wed Aug 28 13:14:41 2013 - [warning] relay_log_purge=0 is not set on slave 192.168.16.6(192.168.16.6:3306).
Wed Aug 28 13:14:41 2013 - [info] Checking replication filtering settings..
Wed Aug 28 13:14:41 2013 - [info] binlog_do_db= , binlog_ignore_db=
Wed Aug 28 13:14:41 2013 - [info] Replication filtering check ok.
Wed Aug 28 13:14:41 2013 - [info] Starting SSH connection tests..
Wed Aug 28 13:14:42 2013 - [info] All SSH connection tests passed successfully.
Wed Aug 28 13:14:42 2013 - [info] Checking MHA Node version..
Wed Aug 28 13:14:43 2013 - [info] Version check ok.
Wed Aug 28 13:14:43 2013 - [error][/usr/local/share/perl/5.14.2/MHA/ServerManager.pm, ln443] Server 192.168.16.7(192.168.16.7:3306) is dead, but must be alive! Check server settings.
Wed Aug 28 13:14:43 2013 - [error][/usr/local/share/perl/5.14.2/MHA/MasterMonitor.pm, ln386] Error happend on checking configurations. at /usr/local/share/perl/5.14.2/MHA/MasterMonitor.pm line 363
Wed Aug 28 13:14:43 2013 - [error][/usr/local/share/perl/5.14.2/MHA/MasterMonitor.pm, ln482] Error happened on monitoring servers.
Wed Aug 28 13:14:43 2013 - [info] Got exit code 1 (Not master dead).
Wed Aug 28 13:14:41 2013 - [info] Dead Servers:
Wed Aug 28 13:14:41 2013 - [info] 192.168.16.7(192.168.16.7:3306)
3306端口连接有问题
在16.7的my.cnf配置中增加跳过解析后,又报错
#skip-networking
skip_name_resolve
Wed Aug 28 14:22:50 2013 - [info] Connecting to [email protected](192.168.16.6:22)..
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = "zh_CN:zh",
LC_ALL = (unset),
LC_PAPER = "zh_CN",
LC_ADDRESS = "zh_CN",
LC_MONETARY = "zh_CN",
LC_NUMERIC = "zh_CN",
LC_TELEPHONE = "zh_CN",
LC_IDENTIFICATION = "zh_CN",
LC_MEASUREMENT = "zh_CN",
LC_TIME = "zh_CN",
LC_NAME = "zh_CN",
LANG = "zh_CN.GBK"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Can't exec "mysqlbinlog": No such file or directory at /usr/local/share/perl/5.10.1/MHA/BinlogManager.pm line 99.
mysqlbinlog version not found!
怀疑是权限问题,修改权限后仍然不正常
root@ubuntu:/usr/local/mysql# chmod -R 777 mysql_data
root@ubuntu:/usr/local/mysql# pwd
/usr/local/mysql
将mysqlbinlog所在目录放到每台机器的PATH当中.
#vi ~/.bashrc或vi /etc/bashrc,然后在文件末尾添加
PATH="$PATH:/usr/local/mysql/bin"
export PATH
仍然没有解决……
经过不断的尝试与探索在创建软连接之后正常并给出了复制关系结构
root@ubuntu:/usr/bin# ln -s /usr/local/mysql/bin/mysqlbinlog mysqlbinlog
root@ubuntu:/usr/bin# ls -l mysqlbinlog
lrwxrwxrwx 1 root root 32 8月 28 17:32 mysqlbinlog -> /usr/local/mysql/bin/mysqlbin
Wed Aug 28 17:35:43 2013 - [info] Slaves settings check done.
Wed Aug 28 17:35:43 2013 - [info]
192.168.16.5 (current master)
+--192.168.16.6
+--192.168.16.7
Wed Aug 28 17:35:43 2013 - [info] Checking replication health on 192.168.16.6..
Wed Aug 28 17:35:43 2013 - [info] ok.
Wed Aug 28 17:35:43 2013 - [info] Checking replication health on 192.168.16.7..
Wed Aug 28 17:35:43 2013 - [info] ok.
Wed Aug 28 17:35:43 2013 - [warning] master_ip_failover_script is not defined.
Wed Aug 28 17:35:43 2013 - [warning] shutdown_script is not defined.
Wed Aug 28 17:35:43 2013 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.
root@ubuntu:/usr/bin#
3.7
启动管理节点进程
nohup masterha_manager --conf=/etc/masterha/app1.cnf > /tmp/mha_manager.log < /dev/null 2>&1 &
查看状态的命令:
masterha_check_status --conf=/etc/masterha/app1.cnf
show slave hosts;
过程中会有如下报错:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 2005
Last_IO_Error: error connecting to master '[email protected]:3306' - retry-time: 60 retries: 86400
Last_SQL_Errno: 0
Last_SQL_Error:
1 row in set (0.00 sec)
这种问题是多种原因造成的,
首先检查change master to命令的正确性
可在MySQL中执行如下命令:
STOP SLAVE;
RESET MASTER;
RESET SLAVE;
SLAVE START;
START SLAVE IO_THREAD;
在服务器上删除数据库目录下的
master.info,relay-bin.*
Slave_IO_Running:
rm-rf
relay-log.info
然后把其他日志也全部删除重启即可
四、 MHA的测试
4.1切换功能测试
a)启动管理节点进程,命令如下:程序转为后台执行
nohup masterha_manager --conf=/etc/masterha/app1.cnf > /tmp/mha_manager.log < /dev/null 2>&1 &
查看MHA节点状态,命令如下:
masterha_check_status --conf=/etc/masterha/app1.cnf
输出结果如下:
app1 (pid:4404) is running(0:PING_OK), master:192.168.16.5
root@ubuntu:/masterha/scripts#
说明当前16.5是主服务器,我们尝试关闭16.5数据库,并查看系统状况
关闭命令如下:
root@ubuntu:~# /usr/local/mysql/bin/mysql.server.sh stop
Shutting down MySQL... *
root@ubuntu:~#
关闭后管理节点上程序会自动退出:
oot@ubuntu:/masterha/scripts# cd ..
[1]+ 退出 1 nohup masterha_manager --conf=/etc/masterha/app1.cnf > /tmp/mha_manager.log < /dev/null 2>&1
下面我们详细分析下管理节点上日志:
more /masterha/app1/manager.log
Thu Aug 29 16:03:41 2013 - [info] MHA::MasterMonitor version 0.55.
Thu Aug 29 16:03:41 2013 - [info] Dead Servers:
Thu Aug 29 16:03:41 2013 - [info] Alive Servers:
Thu Aug 29 16:03:41 2013 - [info] 192.168.16.5(192.168.16.5:3306)
Thu Aug 29 16:03:41 2013 - [info] 192.168.16.6(192.168.16.6:3306)
Thu Aug 29 16:03:41 2013 - [info] 192.168.16.7(192.168.16.7:3306)
Thu Aug 29 16:03:41 2013 - [info] Alive Slaves:
Thu Aug 29 16:03:41 2013 - [info] 192.168.16.6(192.168.16.6:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
Thu Aug 29 16:03:41 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)
Thu Aug 29 16:03:41 2013 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Aug 29 16:03:41 2013 - [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
Thu Aug 29 16:03:41 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)
Thu Aug 29 16:03:41 2013 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Aug 29 16:03:41 2013 - [info] Current Alive Master: 192.168.16.5(192.168.16.5:3306)
Thu Aug 29 16:03:41 2013 - [info] Checking slave configurations..
Thu Aug 29 16:03:41 2013 - [info] read_only=1 is not set on slave 192.168.16.6(192.168.16.6:3306).
Thu Aug 29 16:03:41 2013 - [warning] relay_log_purge=0 is not set on slave 192.168.16.6(192.168.16.6:3306).
Thu Aug 29 16:03:41 2013 - [info] read_only=1 is not set on slave 192.168.16.7(192.168.16.7:3306).
Thu Aug 29 16:03:41 2013 - [warning] relay_log_purge=0 is not set on slave 192.168.16.7(192.168.16.7:3306).
Thu Aug 29 16:03:41 2013 - [info] Checking replication filtering settings..
Thu Aug 29 16:03:41 2013 - [info] binlog_do_db= , binlog_ignore_db=
Thu Aug 29 16:03:41 2013 - [info] Replication filtering check ok.
Thu Aug 29 16:03:41 2013 - [info] Starting SSH connection tests..
Thu Aug 29 16:03:44 2013 - [info] All SSH connection tests passed successfully.
Thu Aug 29 16:03:44 2013 - [info] Checking MHA Node version..
Thu Aug 29 16:03:45 2013 - [info] Version check ok.
Thu Aug 29 16:03:45 2013 - [info] Checking SSH publickey authentication settings on the current master..
Thu Aug 29 16:03:45 2013 - [info] HealthCheck: SSH to 192.168.16.5 is reachable.
Thu Aug 29 16:03:46 2013 - [info] Master MHA Node version is 0.54.
Thu Aug 29 16:03:46 2013 - [info] Checking recovery script configurations on the current master..
Thu Aug 29 16:03:46 2013 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/usr/local/mysql/mysql_data/ --output_file=/var/tmp/save_binary_logs_test --manager_version=0.55 --start_file=mysql-bin.000005
Thu Aug 29 16:03:46 2013 - [info] Connecting to [email protected](192.168.16.5)..
Creating /var/tmp if not exists.. ok.
Checking output directory is accessible or not..
ok.
Binlog found at /usr/local/mysql/mysql_data/, up to mysql-bin.000005
Thu Aug 29 16:03:46 2013 - [info] Master setting check done.
Thu Aug 29 16:03:46 2013 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Thu Aug 29 16:03:46 2013 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=192.168.16.6 --slave_ip=192.168.16.6 --slave_port=3306 --workdir=/var/tmp --target_version=5.1.48-log --manager_version=0.55 --relay_log_info=/usr/local/mysql/mysql_data/relay-log.info --relay_dir=/usr/local/mysql/mysql_data/ --slave_pass=xxx
Thu Aug 29 16:03:46 2013 - [info] Connecting to [email protected](192.168.16.6:22)..
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = "zh_CN:zh",
LC_ALL = (unset),
LC_PAPER = "zh_CN",
LC_ADDRESS = "zh_CN",
LC_MONETARY = "zh_CN",
LC_NUMERIC = "zh_CN",
LC_TELEPHONE = "zh_CN",
LC_IDENTIFICATION = "zh_CN",
LC_MEASUREMENT = "zh_CN",
LC_TIME = "zh_CN",
LC_NAME = "zh_CN",
LANG = "zh_CN.GBK"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Checking slave recovery environment settings..
Opening /usr/local/mysql/mysql_data/relay-log.info ... ok.
Relay log found at /usr/local/mysql/mysql_data, up to payubuntu-relay-bin.000004
Temporary relay log file is /usr/local/mysql/mysql_data/payubuntu-relay-bin.000004
Testing mysql connection and privileges.. done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Thu Aug 29 16:03:47 2013 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=192.168.16.7 --slave_ip=192.168.16.7 --slave_port=3306 --workdir=/var/tmp --target_version=5.1.48-log --manager_version=0.55 --relay_log_info=/usr/local/mysql/mysql_data/relay-log.info --relay_dir=/usr/local/mysql/mysql_data/ --slave_pass=xxx
Thu Aug 29 16:03:47 2013 - [info] Connecting to [email protected](192.168.16.7:22)..
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = "zh_CN:zh",
LC_ALL = (unset),
LC_PAPER = "zh_CN",
LC_ADDRESS = "zh_CN",
LC_MONETARY = "zh_CN",
LC_NUMERIC = "zh_CN",
LC_TELEPHONE = "zh_CN",
LC_IDENTIFICATION = "zh_CN",
LC_MEASUREMENT = "zh_CN",
LC_TIME = "zh_CN",
LC_NAME = "zh_CN",
LANG = "zh_CN.GBK"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Checking slave recovery environment settings..
Opening /usr/local/mysql/mysql_data/relay-log.info ... ok.
Relay log found at /usr/local/mysql/mysql_data, up to ubuntu-relay-bin.000004
Temporary relay log file is /usr/local/mysql/mysql_data/ubuntu-relay-bin.000004
Testing mysql connection and privileges.. done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Thu Aug 29 16:03:47 2013 - [info] Slaves settings check done.
Thu Aug 29 16:03:47 2013 - [info]
192.168.16.5 (current master)
+--192.168.16.6
+--192.168.16.7
给出了当前拓扑结构
Thu Aug 29 16:03:47 2013 - [warning] master_ip_failover_script is not defined.
Thu Aug 29 16:03:47 2013 - [warning] shutdown_script is not defined.
Thu Aug 29 16:03:47 2013 - [info] Set master ping interval 1 seconds.
Thu Aug 29 16:03:47 2013 - [warning] secondary_check_script is not defined. It is highly recommended setting it to check master reachability from two or more routes.
Thu Aug 29 16:03:47 2013 - [info] Starting ping health check on 192.168.16.5(192.168.16.5:3306)..
Thu Aug 29 16:03:47 2013 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
启动管理程序之后,等待主节点故障
Thu Aug 29 16:04:02 2013 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away)
发现16.5 mysql 异常
Thu Aug 29 16:04:02 2013 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/usr/local/mysql/mysql_data/ --output_file=/var/tmp/save_binary_logs_test --manager_version=0.55 --binlog_prefix=mysql-bin
Thu Aug 29 16:04:03 2013 - [info] HealthCheck: SSH to 192.168.16.5 is reachable.
Thu Aug 29 16:04:03 2013 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Thu Aug 29 16:04:03 2013 - [warning] Connection failed 1 time(s)..
Thu Aug 29 16:04:04 2013 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Thu Aug 29 16:04:04 2013 - [warning] Connection failed 2 time(s)..
Thu Aug 29 16:04:05 2013 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Thu Aug 29 16:04:05 2013 - [warning] Connection failed 3 time(s)..
Thu Aug 29 16:04:05 2013 - [warning] Master is not reachable from health checker!
Thu Aug 29 16:04:05 2013 - [warning] Master 192.168.16.5(192.168.16.5:3306) is not reachable!
Thu Aug 29 16:04:05 2013 - [warning] SSH is reachable.
经过3次登陆检查判断mysql无法连接,ssh正常
Thu Aug 29 16:04:05 2013 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/masterha/app1.cnf again, and trying to connect to all servers to check server status..
Thu Aug 29 16:04:05 2013 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Aug 29 16:04:05 2013 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
Thu Aug 29 16:04:05 2013 - [info] Reading server configurations from /etc/masterha/app1.cnf..
读取配置文件
Thu Aug 29 16:04:05 2013 - [info] Dead Servers:
Thu Aug 29 16:04:05 2013 - [info] 192.168.16.5(192.168.16.5:3306)
Thu Aug 29 16:04:05 2013 - [info] Alive Servers:
Thu Aug 29 16:04:05 2013 - [info] 192.168.16.6(192.168.16.6:3306)
Thu Aug 29 16:04:05 2013 - [info] 192.168.16.7(192.168.16.7:3306)
Thu Aug 29 16:04:05 2013 - [info] Alive Slaves:
Thu Aug 29 16:04:05 2013 - [info] 192.168.16.6(192.168.16.6:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
Thu Aug 29 16:04:05 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)
输出目前MHA拓扑状态,准备再造主节点
Thu Aug 29 16:04:05 2013 - [info] Primary candidate for the new Master (candidate_master is set)
没有设置备选主节点
Thu Aug 29 16:04:05 2013 - [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
Thu Aug 29 16:04:05 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)
Thu Aug 29 16:04:05 2013 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Aug 29 16:04:05 2013 - [info] Checking slave configurations..
检查从节点配置
Thu Aug 29 16:04:05 2013 - [info] read_only=1 is not set on slave 192.168.16.6(192.168.16.6:3306).
Thu Aug 29 16:04:05 2013 - [warning] relay_log_purge=0 is not set on slave 192.168.16.6(192.168.16.6:3306).
Thu Aug 29 16:04:05 2013 - [info] read_only=1 is not set on slave 192.168.16.7(192.168.16.7:3306).
Thu Aug 29 16:04:05 2013 - [warning] relay_log_purge=0 is not set on slave 192.168.16.7(192.168.16.7:3306).
Thu Aug 29 16:04:05 2013 - [info] Checking replication filtering settings..
Thu Aug 29 16:04:05 2013 - [info] Replication filtering check ok.
Thu Aug 29 16:04:05 2013 - [info] Master is down!
Thu Aug 29 16:04:05 2013 - [info] Terminating monitoring script.
Thu Aug 29 16:04:05 2013 - [info] Got exit code 20 (Master dead).
Thu Aug 29 16:04:05 2013 - [info] MHA::MasterFailover version 0.55.
Thu Aug 29 16:04:05 2013 - [info] Starting master failover.
开始故障切换操作
Thu Aug 29 16:04:05 2013 - [info]
Thu Aug 29 16:04:05 2013 - [info] * Phase 1: Configuration Check Phase..
第一步配置检查
Thu Aug 29 16:04:05 2013 - [info]
Thu Aug 29 16:04:05 2013 - [info] Dead Servers:
Thu Aug 29 16:04:05 2013 - [info] 192.168.16.5(192.168.16.5:3306)
Thu Aug 29 16:04:05 2013 - [info] Checking master reachability via mysql(double check)..
再次检查16.5
Thu Aug 29 16:04:05 2013 - [info] ok.
Thu Aug 29 16:04:05 2013 - [info] Alive Servers:
Thu Aug 29 16:04:05 2013 - [info] 192.168.16.6(192.168.16.6:3306)
Thu Aug 29 16:04:05 2013 - [info] 192.168.16.7(192.168.16.7:3306)
Thu Aug 29 16:04:05 2013 - [info] Alive Slaves:
Thu Aug 29 16:04:05 2013 - [info] 192.168.16.6(192.168.16.6:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
Thu Aug 29 16:04:05 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)
Thu Aug 29 16:04:05 2013 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Aug 29 16:04:05 2013 - [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
Thu Aug 29 16:04:05 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)
Thu Aug 29 16:04:05 2013 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Aug 29 16:04:05 2013 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Aug 29 16:04:05 2013 - [info]
配置检查结束
第二步关闭死掉的主节点
Thu Aug 29 16:04:05 2013 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Aug 29 16:04:05 2013 - [info]
Thu Aug 29 16:04:05 2013 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Aug 29 16:04:05 2013 - [warning] master_ip_failover_script is not set. Skipping invalidating dead master ip address.
Thu Aug 29 16:04:05 2013 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Aug 29 16:04:05 2013 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Aug 29 16:04:05 2013 - [info]
没有配置master_ip_failover_script脚本,第二步结束
Thu Aug 29 16:04:05 2013 - [info] * Phase 3: Master Recovery Phase..
Thu Aug 29 16:04:05 2013 - [info]
Thu Aug 29 16:04:05 2013 - [info] * Phase 3.1: Getting Latest Slaves Phase..
选择最新的从节点
Thu Aug 29 16:04:05 2013 - [info]
Thu Aug 29 16:04:05 2013 - [info] The latest binary log file/position on all slaves is mysql-bin.000005:106
Thu Aug 29 16:04:05 2013 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Aug 29 16:04:05 2013 - [info] 192.168.16.6(192.168.16.6:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
Thu Aug 29 16:04:05 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)
Thu Aug 29 16:04:05 2013 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Aug 29 16:04:05 2013 - [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
Thu Aug 29 16:04:05 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)
Thu Aug 29 16:04:05 2013 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Aug 29 16:04:05 2013 - [info] The oldest binary log file/position on all slaves is mysql-bin.000005:106
Thu Aug 29 16:04:05 2013 - [info] Oldest slaves:
Thu Aug 29 16:04:05 2013 - [info] 192.168.16.6(192.168.16.6:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
Thu Aug 29 16:04:05 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)
Thu Aug 29 16:04:05 2013 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Aug 29 16:04:05 2013 - [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
Thu Aug 29 16:04:05 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)
Thu Aug 29 16:04:05 2013 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Aug 29 16:04:05 2013 - [info]
Thu Aug 29 16:04:05 2013 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
保存原主节点上的binlog
Thu Aug 29 16:04:05 2013 - [info]
Thu Aug 29 16:04:06 2013 - [info] Fetching dead master's binary logs..
Thu Aug 29 16:04:06 2013 - [info] Executing command on the dead master 192.168.16.5(192.168.16.5:3306): save_binary_logs --command=save --start_file=mysql-bin.000005 --start_pos=106 --binlog_dir=/usr/local/mysql/mysql_data/ --output_file=/var/tmp/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55
Creating /var/tmp if not exists.. ok.
Concat binary/relay logs from mysql-bin.000005 pos 106 to mysql-bin.000005 EOF into /var/tmp/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog ..
Dumping binlog format description event, from position 0 to 106.. ok.
Dumping effective binlog data from /usr/local/mysql/mysql_data//mysql-bin.000005 position 106 to tail(125).. ok.
Concat succeeded.
Thu Aug 29 16:04:07 2013 - [info] scp from [email protected]:/var/tmp/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog to local:/masterha/app1/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog succeeded.
Thu Aug 29 16:04:08 2013 - [info] HealthCheck: SSH to 192.168.16.6 is reachable.
Thu Aug 29 16:04:09 2013 - [info] HealthCheck: SSH to 192.168.16.7 is reachable.
Thu Aug 29 16:04:09 2013 - [info]
Thu Aug 29 16:04:09 2013 - [info] * Phase 3.3: Determining New Master Phase..
Thu Aug 29 16:04:09 2013 - [info]
Thu Aug 29 16:04:09 2013 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Thu Aug 29 16:04:09 2013 - [info] All slaves received relay logs to the same position. No need to resync each other.
Thu Aug 29 16:04:09 2013 - [info] Searching new master from slaves..
Thu Aug 29 16:04:09 2013 - [info] Candidate masters from the configuration file:
Thu Aug 29 16:04:09 2013 - [info] 192.168.16.6(192.168.16.6:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
Thu Aug 29 16:04:09 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)
Thu Aug 29 16:04:09 2013 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Aug 29 16:04:09 2013 - [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
Thu Aug 29 16:04:09 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)
Thu Aug 29 16:04:09 2013 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Aug 29 16:04:09 2013 - [info] Non-candidate masters:
Thu Aug 29 16:04:09 2013 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Thu Aug 29 16:04:09 2013 - [info] New master is 192.168.16.6(192.168.16.6:3306)
选定新主节点是16.6
Thu Aug 29 16:04:09 2013 - [info] Starting master failover..
Thu Aug 29 16:04:09 2013 - [info]
From:
192.168.16.5 (current master)
+--192.168.16.6
+--192.168.16.7
To:
192.168.16.6 (new master)
+--192.168.16.7
主从结构将从current master->new master结构
Thu Aug 29 16:04:09 2013 - [info]
Thu Aug 29 16:04:09 2013 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Thu Aug 29 16:04:09 2013 - [info]
Thu Aug 29 16:04:09 2013 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Thu Aug 29 16:04:09 2013 - [info] Sending binlog..
Thu Aug 29 16:04:10 2013 - [info] scp from local:/masterha/app1/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog to [email protected]:/var/tmp/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog succeeded.
Thu Aug 29 16:04:10 2013 - [info]
Thu Aug 29 16:04:10 2013 - [info] * Phase 3.4: Master Log Apply Phase..
Thu Aug 29 16:04:10 2013 - [info]
Thu Aug 29 16:04:10 2013 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Thu Aug 29 16:04:10 2013 - [info] Starting recovery on 192.168.16.6(192.168.16.6:3306)..
Thu Aug 29 16:04:10 2013 - [info] Generating diffs succeeded.
Thu Aug 29 16:04:10 2013 - [info] Waiting until all relay logs are applied.
Thu Aug 29 16:04:10 2013 - [info] done.
Thu Aug 29 16:04:10 2013 - [info] Getting slave status..
Thu Aug 29 16:04:10 2013 - [info] This slave(192.168.16.6)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000005:106). No need to recover from Exec_Master_Log_Pos.
Thu Aug 29 16:04:10 2013 - [info] Connecting to the target slave host 192.168.16.6, running recover script..
Thu Aug 29 16:04:10 2013 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='root' --slave_host=192.168.16.6 --slave_ip=192.168.16.6 --slave_port=3306 --apply_files=/var/tmp/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog --workdir=/var/tmp --target_version=5.1.48-log --timestamp=20130829160405 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx
Thu Aug 29 16:04:10 2013 - [info]
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = "zh_CN:zh",
LC_ALL = (unset),
LC_PAPER = "zh_CN",
LC_ADDRESS = "zh_CN",
LC_MONETARY = "zh_CN",
LC_NUMERIC = "zh_CN",
LC_TELEPHONE = "zh_CN",
LC_IDENTIFICATION = "zh_CN",
LC_MEASUREMENT = "zh_CN",
LC_TIME = "zh_CN",
LC_NAME = "zh_CN",
LANG = "zh_CN.GBK"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Applying differential binary/relay log files /var/tmp/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog on 192.168.16.6:3306. This may take long time...
Applying log files succeeded.
Thu Aug 29 16:04:10 2013 - [info] All relay logs were successfully applied.
Thu Aug 29 16:04:10 2013 - [info] Getting new master's binlog name and position..
Thu Aug 29 16:04:10 2013 - [info] mysql-bin.000011:106
Thu Aug 29 16:04:10 2013 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.16.6', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000011', MASTER_LOG_POS=106, MASTER_USER='replicationuser', MASTER_PASSWORD='xxx';
Thu Aug 29 16:04:10 2013 - [warning] master_ip_failover_script is not set. Skipping taking over new master ip address.
Thu Aug 29 16:04:10 2013 - [info] ** Finished master recovery successfully.
Thu Aug 29 16:04:10 2013 - [info] * Phase 3: Master Recovery Phase completed.
Thu Aug 29 16:04:10 2013 - [info]
Thu Aug 29 16:04:10 2013 - [info] * Phase 4: Slaves Recovery Phase..
Thu Aug 29 16:04:10 2013 - [info]
Thu Aug 29 16:04:10 2013 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Thu Aug 29 16:04:10 2013 - [info]
Thu Aug 29 16:04:10 2013 - [info] -- Slave diff file generation on host 192.168.16.7(192.168.16.7:3306) started, pid: 5323. Check tmp log /masterha/app1/192.168.16.7_3306_20130829160405.log if it takes time..
Thu Aug 29 16:04:10 2013 - [info]
Thu Aug 29 16:04:10 2013 - [info] Log messages from 192.168.16.7 ...
Thu Aug 29 16:04:10 2013 - [info]
Thu Aug 29 16:04:10 2013 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Thu Aug 29 16:04:10 2013 - [info] End of log messages from 192.168.16.7.
Thu Aug 29 16:04:10 2013 - [info] -- 192.168.16.7(192.168.16.7:3306) has the latest relay log events.
Thu Aug 29 16:04:10 2013 - [info] Generating relay diff files from the latest slave succeeded.
Thu Aug 29 16:04:10 2013 - [info]
Thu Aug 29 16:04:10 2013 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Thu Aug 29 16:04:10 2013 - [info]
Thu Aug 29 16:04:10 2013 - [info] -- Slave recovery on host 192.168.16.7(192.168.16.7:3306) started, pid: 5325. Check tmp log /masterha/app1/192.168.16.7_3306_20130829160405.log if it takes time..
Thu Aug 29 16:04:12 2013 - [info]
Thu Aug 29 16:04:12 2013 - [info] Log messages from 192.168.16.7 ...
Thu Aug 29 16:04:12 2013 - [info]
Thu Aug 29 16:04:10 2013 - [info] Sending binlog..
Thu Aug 29 16:04:11 2013 - [info] scp from local:/masterha/app1/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog to [email protected]:/var/tmp/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog succeeded.
Thu Aug 29 16:04:11 2013 - [info] Starting recovery on 192.168.16.7(192.168.16.7:3306)..
Thu Aug 29 16:04:11 2013 - [info] Generating diffs succeeded.
Thu Aug 29 16:04:11 2013 - [info] Waiting until all relay logs are applied.
Thu Aug 29 16:04:11 2013 - [info] done.
Thu Aug 29 16:04:11 2013 - [info] Getting slave status..
Thu Aug 29 16:04:11 2013 - [info] This slave(192.168.16.7)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000005:106). No need to recover from Exec_Master_Log_Pos.
Thu Aug 29 16:04:11 2013 - [info] Connecting to the target slave host 192.168.16.7, running recover script..
Thu Aug 29 16:04:11 2013 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='root' --slave_host=192.168.16.7 --slave_ip=192.168.16.7 --slave_port=3306 --apply_files=/var/tmp/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog --workdir=/var/tmp --target_version=5.1.48-log --timestamp=20130829160405 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx
Thu Aug 29 16:04:12 2013 - [info]
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = "zh_CN:zh",
LC_ALL = (unset),
LC_PAPER = "zh_CN",
LC_ADDRESS = "zh_CN",
LC_MONETARY = "zh_CN",
LC_NUMERIC = "zh_CN",
LC_TELEPHONE = "zh_CN",
LC_IDENTIFICATION = "zh_CN",
LC_MEASUREMENT = "zh_CN",
LC_TIME = "zh_CN",
LC_NAME = "zh_CN",
LANG = "zh_CN.GBK"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Applying differential binary/relay log files /var/tmp/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog on 192.168.16.7:3306. This may take long time...
Applying log files succeeded.
Thu Aug 29 16:04:12 2013 - [info] All relay logs were successfully applied.
Thu Aug 29 16:04:12 2013 - [info] Resetting slave 192.168.16.7(192.168.16.7:3306) and starting replication from the new master 192.168.16.6(192.168.16.6:3306)..
Thu Aug 29 16:04:12 2013 - [info] Executed CHANGE MASTER.
Thu Aug 29 16:04:12 2013 - [info] Slave started.
Thu Aug 29 16:04:12 2013 - [info] End of log messages from 192.168.16.7.
Thu Aug 29 16:04:12 2013 - [info] -- Slave recovery on host 192.168.16.7(192.168.16.7:3306) succeeded.
Thu Aug 29 16:04:12 2013 - [info] All new slave servers recovered successfully.
Thu Aug 29 16:04:12 2013 - [info]
Thu Aug 29 16:04:12 2013 - [info] * Phase 5: New master cleanup phase..
Thu Aug 29 16:04:12 2013 - [info]
Thu Aug 29 16:04:12 2013 - [info] Resetting slave info on the new master..
Thu Aug 29 16:04:12 2013 - [info] 192.168.16.6: Resetting slave info succeeded.
Thu Aug 29 16:04:12 2013 - [info] Master failover to 192.168.16.6(192.168.16.6:3306) completed successfully.
Thu Aug 29 16:04:12 2013 - [info]
----- Failover Report -----
app1: MySQL Master failover 192.168.16.5 to 192.168.16.6 succeeded
Master 192.168.16.5 is down!
Check MHA Manager logs at ubuntu:/masterha/app1/manager.log for details.
Started automated(non-interactive) failover.
The latest slave 192.168.16.6(192.168.16.6:3306) has all relay logs for recovery.
Selected 192.168.16.6 as a new master.
192.168.16.6: OK: Applying all logs succeeded.
192.168.16.7: This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
192.168.16.7: OK: Applying all logs succeeded. Slave started, replicating from 192.168.16.6.
192.168.16.6: Resetting slave info succeeded.
Master failover to 192.168.16.6(192.168.16.6:3306) completed successfully.
日志非常细致,最后给出了切换报告:
Selected 192.168.16.6 as a new master.
Master failover to 192.168.16.6(192.168.16.6:3306) completed successfully.
整个过程历时16:03:47-16:04:12,不足1分钟!
4.2数据完整性,可用性测试
经简单测试数据完整并可用,进一步测试,在业务测试阶段进行
4.3 ip地址的接管
修改配置文件增加add master_ip_failover_script脚本
vi /etc/masterha/app1.cnf"
add master_ip_failover_script="/masterha/scripts/master_ip_failover"
启动管理节点进程时报错:
Thu Aug 29 17:08:17 2013 - [info] /masterha/scripts/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.16.6 --orig_master_ip=192.168.16.6 --orig_master_port=3306
Bareword "FIXME_xxx" not allowed while "strict subs" in use at /masterha/scripts/master_ip_failover line 93.
Execution of /masterha/scripts/master_ip_failover aborted due to compilation errors.
Thu Aug 29 17:08:17 2013 - [error][/usr/local/share/perl/5.14.2/MHA/MasterMonitor.pm, ln216] Failed to get master_ip_failover_script status with return code 255:0.
需要将/masterha/scripts/master_ip_failover模板进行修改,修改如下:
编写ip地址切换脚本:
root@ubuntu:/masterha/app1# more /masterha/scripts/master_ip_failover
#!/usr/bin/env perl
use strict;
use warnings FATAL =>‘all’;
use Getopt::Long;
my (
$command, $ssh_user, $orig_master_host, $orig_master_ip,
$orig_master_port, $new_master_host, $new_master_ip, $new_master_port
);
my $vip = ‘192.168.16.9/24’; # Virtual IP
my $key = “1”;
my $ssh_start_vip = “/sbin/ifconfig eth0:$key $vip”;
my $ssh_stop_vip = “/sbin/ifconfig eth0:$key down”;
GetOptions(
‘command=s’ => /$command,
‘ssh_user=s’ => /$ssh_user,
‘orig_master_host=s’ => /$orig_master_host,
‘orig_master_ip=s’ => /$orig_master_ip,
‘orig_master_port=i’ => /$orig_master_port,
‘new_master_host=s’ => /$new_master_host,
‘new_master_ip=s’ => /$new_master_ip,
‘new_master_port=i’ => /$new_master_port,
);
exit &main();
sub main {
print “/n/nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===/n/n”;
if ( $command eq “stop” || $command eq “stopssh” ) {
# $orig_master_host, $orig_master_ip, $orig_master_port are passed.
# If you manage master ip address at global catalog database,
# invalidate orig_master_ip here.
My $exit_code = 1;
eval {
print “Disabling the VIP on old master: $orig_master_host /n”;
&stop_vip();
$exit_code = 0;
};
if ($@) {
warn “Got Error: $@/n”;
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq “start” ) {
# all arguments are passed.
# If you manage master ip address at global catalog database,
# activate new_master_ip here.
# You can also grant write access (create user, set read_only=0, etc) here.
My $exit_code = 10;
eval {
print “Enabling the VIP - $vip on the new master - $new_master_host /n”;
&start_vip();
$exit_code = 0;
};
if ($@) {
warn $@;
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq “status” ) {
print “Checking the Status of the script.. OK /n”;
`ssh $ssh_user/@$orig_master_host /” $ssh_start_vip /”`;
exit 0;
}
else {
&usage();
exit 1;
}
}
# A simple system call that enable the VIP on the new master
sub start_vip() {
`ssh $ssh_user/@$new_master_host /” $ssh_start_vip /”`;
}
# A simple system call that disable the VIP on the old_master
sub stop_vip() {
`ssh $ssh_user/@$orig_master_host /” $ssh_stop_vip /”`;
}
sub usage {
“Usage: master_ip_failover –command=start|stop|stopssh|status –orig_master_host=host –orig_master_ip=ip –orig_master_port=po
rt –new_master_host=host –new_master_ip=ip –new_master_port=port/n”;
}
root@ubuntu:/masterha/app1#
切换之后的日志分析:
n:enabled
Fri Aug 30 08:50:44 2013 – [info] Replicating from 192.168.16.6(192.168.16.6:3306)
Fri Aug 30 08:50:44 2013 – [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 30 08:50:44 2013 – [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
Fri Aug 30 08:50:44 2013 – [info] Replicating from 192.168.16.6(192.168.16.6:3306)
Fri Aug 30 08:50:44 2013 – [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 30 08:50:44 2013 – [info] ** Phase 1: Configuration Check Phase completed.
Fri Aug 30 08:50:44 2013 – [info]
Fri Aug 30 08:50:44 2013 – [info] * Phase 2: Dead Master Shutdown Phase..
Fri Aug 30 08:50:44 2013 – [info]
Fri Aug 30 08:50:44 2013 – [info] Forcing shutdown so that applications never connect to the current master..
Fri Aug 30 08:50:44 2013 – [info] Executing master IP deactivatation script:
Fri Aug 30 08:50:44 2013 – [info] /masterha/scripts/master_ip_failover –orig_master_host=192.168.16.6 –orig_master_ip=192.168.16.6 –orig_master_port=3306 –command=stopssh –ssh_user=root
IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.16.9/24===
Disabling the VIP on old master: 192.168.16.6
Fri Aug 30 08:50:44 2013 – [info] done.
Fri Aug 30 08:50:44 2013 – [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Aug 30 08:50:44 2013 – [info] * Phase 2: Dead Master Shutdown Phase completed.
Fri Aug 30 08:50:44 2013 – [info]
Fri Aug 30 08:50:44 2013 – [info] * Phase 3: Master Recovery Phase..
Fri Aug 30 08:50:44 2013 – [info]
Fri Aug 30 08:50:44 2013 – [info] * Phase 3.1: Getting Latest Slaves Phase..
Fri Aug 30 08:50:44 2013 – [info]
Fri Aug 30 08:50:44 2013 – [info] The latest binary log file/position on all slaves is mysql-bin.000011:106
Fri Aug 30 08:50:44 2013 – [info] Latest slaves (Slaves that received relay log files to the latest):
Fri Aug 30 08:50:44 2013 – [info] 192.168.16.5(192.168.16.5:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
在原主节点上停掉浮动ip 16.9
Applying differential binary/relay log files /var/tmp/saved_master_binlog_from_192.168.16.6_3306_20130830085043.binlog on 192.168.16.5:3306. This may take long time…
Applying log files succeeded.
Fri Aug 30 08:50:49 2013 – [info] All relay logs were successfully applied.
Fri Aug 30 08:50:49 2013 – [info] Getting new master’s binlog name and position..
Fri Aug 30 08:50:49 2013 – [info] mysql-bin.000006:106
Fri Aug 30 08:50:49 2013 – [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=’192.168.16.5’, MASTER_PORT=3306, MASTER_LOG_FILE=’mysql-bin.000006’, MASTER_LOG_POS=106, MASTER_USER=’replicationuser’, MASTER_PASSWORD=’xxx’;
Fri Aug 30 08:50:49 2013 – [info] Executing master IP activate script:
Fri Aug 30 08:50:49 2013 – [info] /masterha/scripts/master_ip_failover –command=start –ssh_user=root –orig_master_host=192.168.16.6 –orig_master_ip=192.168.16.6 –orig_master_port=3306 –new_master_host=192.168.16.5 –new_master_ip=192.168.16.5 –new_master_port=3306 –new_master_user=’root’–new_master_password=’sztf@yunwei’
Unknown option: new_master_user
Unknown option: new_master_password
IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.16.9/24===
Enabling the VIP – 192.168.16.9/24 on the new master – 192.168.16.5
Fri Aug 30 08:50:49 2013 – [info] OK.
Fri Aug 30 08:50:49 2013 – [info] ** Finished master recovery successfully.
Fri Aug 30 08:50:49 2013 – [info] * Phase 3: Master Recovery Phase completed.
Fri Aug 30 08:50:49 2013 – [info]
Fri Aug 30 08:50:49 2013 – [info] * Phase 4: Slaves Recovery Phase..
在新主节点上启动vip 16.9
在新节点上查看Ip地址情况:
root@ubuntu:~# ifconfig -a
eth0 Link encap:以太网硬件地址 94:de:80:18:11:82
inet 地址:192.168.16.5 广播:192.168.16.255 掩码:255.255.255.0
inet6 地址: fe80::96de:80ff:fe18:1182/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 跃点数:1
接收数据包:19122846 错误:0 丢弃:58 过载:0 帧数:0
发送数据包:241270 错误:0 丢弃:0 过载:0 载波:0
碰撞:0 发送队列长度:1000
接收字节:1644186729 (1.6 GB) 发送字节:40646216 (40.6 MB)
中断:40 基本地址:0x8000
eth0:1 Link encap:以太网硬件地址 94:de:80:18:11:82
inet 地址:192.168.16.9 广播:192.168.16.255 掩码:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 跃点数:1
中断:40 基本地址:0x8000
lo Link encap:本地环回
inet 地址:127.0.0.1 掩码:255.0.0.0
inet6 地址: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 跃点数:1
接收数据包:3169361 错误:0 丢弃:0 过载:0 帧数:0
发送数据包:3169361 错误:0 丢弃:0 过载:0 载波:0
碰撞:0 发送队列长度:0
接收字节:158507432 (158.5 MB) 发送字节:158507432 (158.5 MB)
root@ubuntu:~#
vip已经正常切换
4.4手工切换测试
切换命令:
masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive
输出如下:
root@ubuntu:~# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive
Fri Aug 30 10:21:29 2013 - [info] MHA::MasterRotate version 0.55.
Fri Aug 30 10:21:29 2013 - [info] Starting online master switch..
Fri Aug 30 10:21:29 2013 - [info]
Fri Aug 30 10:21:29 2013 - [info] * Phase 1: Configuration Check Phase..
Fri Aug 30 10:21:29 2013 - [info]
Fri Aug 30 10:21:29 2013 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Fri Aug 30 10:21:29 2013 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
Fri Aug 30 10:21:29 2013 - [info] Reading server configurations from /etc/masterha/app1.cnf..
Fri Aug 30 10:21:29 2013 - [info] Current Alive Master: 192.168.16.6(192.168.16.6:3306)
Fri Aug 30 10:21:29 2013 - [info] Alive Slaves:
Fri Aug 30 10:21:29 2013 - [info] 192.168.16.5(192.168.16.5:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
Fri Aug 30 10:21:29 2013 - [info] Replicating from 192.168.16.6(192.168.16.6:3306)
Fri Aug 30 10:21:29 2013 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 30 10:21:29 2013 - [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
Fri Aug 30 10:21:29 2013 - [info] Replicating from 192.168.16.6(192.168.16.6:3306)
Fri Aug 30 10:21:29 2013 - [info] Primary candidate for the new Master (candidate_master is set)
It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 192.168.16.6(192.168.16.6:3306)? (YES/no): yes
Fri Aug 30 10:21:44 2013 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Fri Aug 30 10:21:44 2013 - [info] ok.
Fri Aug 30 10:21:44 2013 - [info] Checking MHA is not monitoring or doing failover..
Fri Aug 30 10:21:44 2013 - [info] Checking replication health on 192.168.16.5..
Fri Aug 30 10:21:44 2013 - [info] ok.
Fri Aug 30 10:21:44 2013 - [info] Checking replication health on 192.168.16.7..
Fri Aug 30 10:21:44 2013 - [info] ok.
Fri Aug 30 10:21:44 2013 - [info] Searching new master from slaves..
Fri Aug 30 10:21:44 2013 - [info] Candidate masters from the configuration file:
Fri Aug 30 10:21:44 2013 - [info] 192.168.16.5(192.168.16.5:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
Fri Aug 30 10:21:44 2013 - [info] Replicating from 192.168.16.6(192.168.16.6:3306)
Fri Aug 30 10:21:44 2013 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 30 10:21:44 2013 - [info] 192.168.16.6(192.168.16.6:3306) Version=5.1.48-log log-bin:enabled
Fri Aug 30 10:21:44 2013 - [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled
Fri Aug 30 10:21:44 2013 - [info] Replicating from 192.168.16.6(192.168.16.6:3306)
Fri Aug 30 10:21:44 2013 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 30 10:21:44 2013 - [info] Non-candidate masters:
Fri Aug 30 10:21:44 2013 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Fri Aug 30 10:21:44 2013 - [info]
From:
192.168.16.6 (current master)
+--192.168.16.5
+--192.168.16.7
To:
192.168.16.5 (new master)
+--192.168.16.7
Starting master switch from 192.168.16.6(192.168.16.6:3306) to 192.168.16.5(192.168.16.5:3306)? (yes/NO): yes
Fri Aug 30 10:21:50 2013 - [info] Checking whether 192.168.16.5(192.168.16.5:3306) is ok for the new master..
Fri Aug 30 10:21:50 2013 - [info] ok.
Fri Aug 30 10:21:50 2013 - [info] ** Phase 1: Configuration Check Phase completed.
Fri Aug 30 10:21:50 2013 - [info]
Fri Aug 30 10:21:50 2013 - [info] * Phase 2: Rejecting updates Phase..
Fri Aug 30 10:21:50 2013 - [info]
master_ip_online_change_script is not defined. If you do not disable writes on the current master manually, applications keep writing on the current master. Is it ok to proceed? (yes/NO): yes
Fri Aug 30 10:22:15 2013 - [info] Locking all tables on the orig master to reject updates from everybody (including root):
Fri Aug 30 10:22:15 2013 - [info] Executing FLUSH TABLES WITH READ LOCK..
Fri Aug 30 10:22:15 2013 - [info] ok.
Fri Aug 30 10:22:15 2013 - [info] Orig master binlog:pos is mysql-bin.000012:106.
Fri Aug 30 10:22:15 2013 - [info] Waiting to execute all relay logs on 192.168.16.5(192.168.16.5:3306)..
Fri Aug 30 10:22:15 2013 - [info] master_pos_wait(mysql-bin.000012:106) completed on 192.168.16.5(192.168.16.5:3306). Executed 0 events.
Fri Aug 30 10:22:15 2013 - [info] done.
Fri Aug 30 10:22:15 2013 - [info] Getting new master's binlog name and position..
Fri Aug 30 10:22:15 2013 - [info] mysql-bin.000004:106
Fri Aug 30 10:22:15 2013 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.16.5', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000004', MASTER_LOG_POS=106, MASTER_USER='replicationuser', MASTER_PASSWORD='xxx';
Fri Aug 30 10:22:15 2013 - [info]
Fri Aug 30 10:22:15 2013 - [info] * Switching slaves in parallel..
Fri Aug 30 10:22:15 2013 - [info]
Fri Aug 30 10:22:15 2013 - [info] -- Slave switch on host 192.168.16.7(192.168.16.7:3306) started, pid: 10166
Fri Aug 30 10:22:15 2013 - [info]
Fri Aug 30 10:22:15 2013 - [info] Log messages from 192.168.16.7 ...
Fri Aug 30 10:22:15 2013 - [info]
Fri Aug 30 10:22:15 2013 - [info] Waiting to execute all relay logs on 192.168.16.7(192.168.16.7:3306)..
Fri Aug 30 10:22:15 2013 - [info] master_pos_wait(mysql-bin.000012:106) completed on 192.168.16.7(192.168.16.7:3306). Executed 0 events.
Fri Aug 30 10:22:15 2013 - [info] done.
Fri Aug 30 10:22:15 2013 - [info] Resetting slave 192.168.16.7(192.168.16.7:3306) and starting replication from the new master 192.168.16.5(192.168.16.5:3306)..
Fri Aug 30 10:22:15 2013 - [info] Executed CHANGE MASTER.
Fri Aug 30 10:22:15 2013 - [info] Slave started.
Fri Aug 30 10:22:15 2013 - [info] End of log messages from 192.168.16.7 ...
Fri Aug 30 10:22:15 2013 - [info]
Fri Aug 30 10:22:15 2013 - [info] -- Slave switch on host 192.168.16.7(192.168.16.7:3306) succeeded.
Fri Aug 30 10:22:15 2013 - [info] Unlocking all tables on the orig master:
Fri Aug 30 10:22:15 2013 - [info] Executing UNLOCK TABLES..
Fri Aug 30 10:22:15 2013 - [info] ok.
Fri Aug 30 10:22:15 2013 - [info] All new slave servers switched successfully.
Fri Aug 30 10:22:15 2013 - [info]
Fri Aug 30 10:22:15 2013 - [info] * Phase 5: New master cleanup phase..
Fri Aug 30 10:22:15 2013 - [info]
Fri Aug 30 10:22:16 2013 - [info] 192.168.16.5: Resetting slave info succeeded.
Fri Aug 30 10:22:16 2013 - [info] Switching master to 192.168.16.5(192.168.16.5:3306) completed successfully.
重点关注:
From:
192.168.16.6 (current master)
+--192.168.16.5
+--192.168.16.7
To:
192.168.16.5 (new master)
+--192.168.16.7
切换后16.6的数据库将会被关闭
4.5 关机测试(直接关闭主数据库所在主机)
停机命令:
root@ubuntu:~# reboot
来自root@ubuntu的广播信息
(/dev/pts/2) 于 18:25 ...
现在,系统将关闭并且重新启动!
root@ubuntu:~#
测试结果:
主服务器关闭之后,管理程序会将其中一个从节点升级为主节点并把ip地址在因节点上绑定。
4.6 关于权重的设置
当前状态:
192.168.16.6 (current master)
+--192.168.16.5
+--192.168.16.7
我们期待切换到16.7上,(默认情况下如果权重都一样应该是切换到5上面,因为5 servid靠前)修改配置如下:
[server default]
manager_workdir=/masterha/app1
manager_log=/masterha/app1/manager.log
user=root
password=sztf@yunwei
ssh_user=root
repl_user=replicationuser
repl_password=replicationuser
ping_interval=1
shutdown_script=""
#master_ip_failover_script=''
master_ip_failover_script="/masterha/scripts/master_ip_failover"
master_ip_online_change_script=""
report_script=""
[server1]
hostname=192.168.16.5
master_binlog_dir="/usr/local/mysql/mysql_data/"
[server2]
hostname=192.168.16.6
master_binlog_dir="/usr/local/mysql/mysql_data/"
[server3]
hostname=192.168.16.7
candidate_master=1
master_binlog_dir="/usr/local/mysql/mysql_data/"
~
将16.7权重设置为1,将16.5,16.6权重去掉清除app1.failover.complete文件,停止16.6数据库,命令如下:
root@ubuntu:~# /usr/local/mysql/bin/mysql.server.sh stop
Shutting down MySQL...... *
root@ubuntu:~#
截取一段日志分析:
Fri Aug 30 19:18:32 2013 - [info] * Phase 3.3: Determining New Master Phase..
Fri Aug 30 19:18:32 2013 - [info]
Fri Aug 30 19:18:32 2013 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Fri Aug 30 19:18:32 2013 - [info] All slaves received relay logs to the same position. No need to resync each other.
Fri Aug 30 19:18:32 2013 - [info] Searching new master from slaves..
Fri Aug 30 19:18:32 2013 - [info] Candidate masters from the configuration file:
Fri Aug 30 19:18:32 2013 - [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bi
n:enabled
Fri Aug 30 19:18:32 2013 - [info] Replicating from 192.168.16.6(192.168.16.6:3306)
Fri Aug 30 19:18:32 2013 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 30 19:18:32 2013 - [info] Non-candidate masters:
Fri Aug 30 19:18:32 2013 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Fri Aug 30 19:18:32 2013 - [info] New master is 192.168.16.7(192.168.16.7:3306)
Fri Aug 30 19:18:32 2013 - [info] Starting master failover..
Fri Aug 30 19:18:32 2013 - [info]
From:
192.168.16.6 (current master)
+--192.168.16.5
+--192.168.16.7
To:
192.168.16.7 (new master)
+--192.168.16.5
直接从配置文件中选中了16.7,在后续日志中继续查找是否接收到了最新的日志
官方文档对两个参数的解释如下:
You might use different kinds of machines between slaves, and want to promote the most reliable machine to the new master (i.e. promoting a RAID1+0 slave rather than RAID0 slaves).
By setting candidate_master to 1, the server is prioritized to the new master, as long as it meets conditions to be the new master (i.e. binary log is enabled, it does not delay replication significantly, etc). So candidate_master=1 does not mean that the specified host always becomes new master when the current master crashes, but is helpful to set priority.
If you set candidate_master=1 on multiple servers, priority is decided by sort order by block name ([server_xxx]). [server_1] will have higher priority than [server_2].
By setting no_master=1 on the target server, the server never becomes the new master. This is useful if you have some servers that should not become the new master. For example, you may want to set no_master=1 when you run slaves on unreliable (RAID0) machine, or when you run a slave at a remote data center. Note that if none of the slaves can be new master, MHA aborts and does not start monitoring/failover.
即通过设置candidate_master1,服务器的优先级到新的主,只要符合条件,成为新的主(即二进制日志被启用,没有显著的复制延迟)。所以candidate_master=1并不意味着指定的主机时,总是成为新的主当主崩溃时,但设置的优先级是有帮助的。
如果不想让某台机器成为新的主,可以通过设置no_master=1来实现。
4.7关于MHA的检测机制
官方文档上我没有找到太多有价值的线索,只看到如下一些描述
This parameter states how often MHA Manager pings(executes ping SQL statement) the master. After missing three connection intervals in a row, MHA Manager decides that the MySQL master is dead. Thus, the maximum time for discovering a failure through the ping mechanism is four times the ping interval. The default is 3 (3 seconds).
If MHA Manager fails to connect by too many connections or authentication errors, it doesn't count that the master is dead.
(Supported from 0.53) By default, MHA establishes a persistent connection to a master and checks master's availability by executing "SELECT 1" (ping_type=SELECT). But in some cases, it is better to check by connecting/disconnecting every time, because it's more strict and it can detect TCP connection level failure more quickly. Setting ping_type=CONNECT makes it possible.
When MHA Manager is successfully monitoring the MySQL master, status code (exit code) 0 should be returned like above.
All status codes and descriptions are listed below.
Status Code(Exit code) | Status String | Description |
0 | PING_OK | Master is running and MHA Manager is monitoring. Master state is alive. |
1 | --- | Unexpected error happened. For example, config file does not exist. If this error happens, check arguments are valid or not. |
2 | NOT_RUNNING | MHA Manager is not running. Master state is unknown. |
3 | PARTIALLY_RUNNING | MHA Manager main process is not running, but child processes are running. This should not happen and should be investigated. Master state is unknown. |
10 | INITIALIZING_MONITOR | MHA Manager is just after startup and initializing. Wait for a while and see how the status changes. Master state is unknown. |
20 | PING_FAILING | MHA Manager detects ping to master is failing. Master state is maybe down. |
21 | PING_FAILED | MHA Manager detects either a) ping to master failed three times, b) preparing for starting master failover. Master state is maybe down. |
30 | RETRYING_MONITOR | MHA Manager internal health check program detected that master was not reachable from manager, but after double check MHA Manager verified the master is alive, and currently waiting for retry. Master state is very likely alive. |
31 | CONFIG_ERROR | There are some configuration problems and MHA Manager can't monitor the target master. Check a logfile for detail. Master state is unknown. |
32 | TIMESTAMP_OLD | MHA Manager detects that ping to master is ok but status file is not updated for a long time. Check whether MHA Manager itself hangs or not. Master state is unknown. |
50 | FAILOVER_RUNNING | MHA Manager confirms that master is down and running failover. Master state is dead. |
51 | FAILOVER_ERROR | MHA Manager confirms that master is down and running failover, but failed during failover. Master state is dead. |
从上面证据上分析,MHA应该是与主节点之间建立一个长连接,并执行以下sql语句来检查数据库的状态包括进程状态,并通过ping命令来检查主机状态,仅仅是我个人猜测。
关于几个问题的回答
1: 自动切换, 你是用什么方法让原master 异常的?
关闭数据库、关机主机,命令如下:
/usr/local/mysql/bin/mysql.server.sh stop
Mysqladmin –uroot –p shutdown
reboot
2 : 手工切换, 能否指明哪个服务器为新master ?
可以通过设置权重让这个服务器优先级高一下,但并不是总能实现。
3: 看到你提到了 keepalived , 最后是否没有使用它, 而是使用了 MHA 自带的 vip管理功能 ?
使用了MHA自带的模板来完善的。
4 : 修改权重, 只是修改了slave 升级为master 的级别, 这个在哪里修改, 能否彻底关闭一个 slave , 要求它不能升级为 master ?
修改管理节点上的配置文件,设置no_master=1
问题5: 7.手工切换需要先定义好master_ip_online_change_script脚本 --- 如果不定义, 会有什么后果 ?
不定义就只发生数据库的切换,而ip地址不会被接管
问题6: master 检测故障的检测点是哪几个?
从官方文档上没有找到确切的答案,从代码上分析,由于代码比较杂乱,暂时没有能理找到太多线索。
root@ubuntu:/masterha/app1#vi /opt/mha4mysql-manager-0.55/bin/masterha_master_monitor
use strict;
use warnings FATAL => 'all';
use English qw(-no_match_vars);
use Getopt::Long qw(:config pass_through);
use Pod::Usage;
use MHA::MasterMonitor;
use MHA::ManagerConst;
my $help;
my $version;
$| = 1;
GetOptions(
'help' => /$help,
'version' => /$version,
);
if ($version) {
print "masterha_master_monitor version $MHA::ManagerConst::VERSION./n";
exit 0;
}
if ($help) {
pod2usage(0);
}
my ( $exit_code, $dead_master, $ssh_reachable ) =
MHA::MasterMonitor::main( "--monitor_only", @ARGV );
if ( $dead_master->{hostname} ) {
print "Master $dead_master->{hostname} is dead!/n";
print "IP Address: $dead_master->{ip} ";
print "Port: $dead_master->{port}/n";
}
if ( $exit_code && $exit_code eq "0" ) {
if ($ssh_reachable) {
print "SSH: reachable/n";
}
else {
print "SSH: NOT reachable/n";
}
exit 0;
}
exit 1 if ( !defined($exit_code) );
exit $exit_code;
查找MasterMonitor包文件,在下面的文件中定义
/usr/local/share/perl/5.14.2/MHA/MasterMonitor.pm
其中又调用了
use MHA::ServerManager;
use MHA::HealthCheck;
use MHA::FileStatus;
use MHA::SSHCheck;
查看/usr/local/share/perl/5.14.2/MHA/HealthCheck.pm文件
代码比较长,从中可以看到一些登陆mysql,以及ssh的信息:
my $log = $self->{logger};
$self->{dbh} = DBI->connect(
"DBI:mysql:;host=$self->{ip};"
. "port=$self->{port};mysql_connect_timeout=$connect_timeout",
$self->{user},
$self->{password},
{ PrintError => 0, RaiseError => $raise_error }
);
sub invoke_ssh_check {
my $self = shift;
my $log = $self->{logger};
if ( !$self->{_ssh_check_invoked} ) {
if ( $self->{_ssh_check_pid} = fork ) {
$self->{_ssh_check_invoked} = 1;
}
elsif ( defined $self->{_ssh_check_pid} ) {
$SIG{INT} = $SIG{HUP} = $SIG{QUIT} = $SIG{TERM} = "DEFAULT";
$log->info("Executing SSH check script: $self->{ssh_check_command}");
#child ssh check process
exit ssh_check(
$self->{ssh_user}, $self->{ssh_host},
$self->{ssh_ip}, $self->{ssh_port},
$self->{logger}, $self->{ssh_connection_timeout},
$self->{ssh_check_command}
);
}
else {
croak "Forking SSH check process failed. Can't continue operation./n";
}
}
}