Redis Cluster supports automatic failover from a failed master to one of its slaves, so every master in the cluster should keep at least one slave available. This test uses three hosts, each running one master and one slave instance, as detailed below:

| Redis node | IP | Initial role | Slave | Redis version | Log path |
|---|---|---|---|---|---|
| redis01:6379 | 192.168.100.1 | master | redis03:6380 | 5.0.3 | /var/log/redis/redis_6379.log |
| redis02:6379 | 192.168.100.2 | master | redis01:6380 | 5.0.3 | /var/log/redis/redis_6379.log |
| redis03:6379 | 192.168.100.3 | master | redis02:6380 | 5.0.3 | /var/log/redis/redis_6379.log |
| redis01:6380 | 192.168.100.1 | slave | - | 5.0.3 | /var/log/redis/redis_6380.log |
| redis02:6380 | 192.168.100.2 | slave | - | 5.0.3 | /var/log/redis/redis_6380.log |
| redis03:6380 | 192.168.100.3 | slave | - | 5.0.3 | /var/log/redis/redis_6380.log |

The initial cluster state:

```shell
127.0.0.1:6379> cluster nodes
b669e00c960f525f451fd5e24bf67bc6a7521f05 192.168.100.2:6380@16380 slave 086738b731ecbf4526faca070c53f8bcdede5952 0 1549941812000 5 connected
1d8748d57d749dcc880e9274465fa353b9f63b9e 192.168.100.2:6379@16379 master - 0 1549941812934 2 connected 5461-10922
c876e5e99902ad9f11c714eca6770dac7626ffa0 192.168.100.1:6379@16379 myself,master - 0 1549941812000 1 connected 0-5460
e509322da0e6589617e5916945e745f8a97fa68a 192.168.100.1:6380@16380 slave 1d8748d57d749dcc880e9274465fa353b9f63b9e 0 1549941812000 4 connected
fa1fa2a7c281115192f1caec30857921dadf0941 192.168.100.3:6380@16380 slave c876e5e99902ad9f11c714eca6770dac7626ffa0 0 1549941814000 6 connected
086738b731ecbf4526faca070c53f8bcdede5952 192.168.100.3:6379@16379 master - 0 1549941814938 3 connected 10923-16383
127.0.0.1:6379> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:6
cluster_my_epoch:1
cluster_stats_messages_ping_sent:2484060
cluster_stats_messages_pong_sent:2301073
cluster_stats_messages_sent:4785133
cluster_stats_messages_ping_received:2301068
cluster_stats_messages_pong_received:2484060
cluster_stats_messages_meet_received:5
cluster_stats_messages_received:4785133
127.0.0.1:6379>
```

Redis Cluster failover mechanism

Redis Cluster implements high availability on its own: when a small number of nodes fail, automatic failover keeps the cluster serving requests.

Failure detection

  1. Subjective offline

If a node cannot complete a ping message exchange with another node within cluster-node-timeout, it marks that node as subjectively offline (PFAIL).
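
The detection window is cluster-node-timeout (15000 ms by default). A quick way to confirm the value in effect on a node, as a sketch (adjust host and port to your deployment):

```shell
# Returns the live cluster-node-timeout in milliseconds (default 15000)
[root@redis01 ~]# redis-cli -p 6379 config get cluster-node-timeout
```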

  2. Objective offline

After a node judges another node subjectively offline, that offline report propagates through Gossip messages. When a receiving node finds a subjectively offline node in a message body, it tries to mark that node objectively offline: it checks that the offline reports are still within their validity window (a report expires, so if reports from more than half of the slot-holding masters cannot be collected within cluster-node-timeout * 2, the earlier reports lapse) and that the number of reports exceeds half of the slot-holding masters. If both hold, it marks the node objectively offline (FAIL) and broadcasts a fail message about the offline node to the cluster.
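
Once the quorum is reached, the FAIL flag appears in the flags column of cluster nodes. A minimal sketch for watching it from any surviving node (the awk filters are our own, not part of redis-cli):

```shell
# Nodes currently flagged fail (objectively offline); a trailing "fail?" would mean PFAIL only
[root@redis01 ~]# redis-cli -p 6380 cluster nodes | awk '$3 ~ /(^|,)fail(,|$)/'
# The quorum base: masters that hold at least one slot (fields 9+ list the slot ranges)
[root@redis01 ~]# redis-cli -p 6380 cluster nodes | awk '$3 ~ /master/ && NF >= 9'
```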

Failure recovery

After a failed node becomes objectively offline, if it is a slot-holding master, one of its slaves must be elected to replace it so the cluster stays highly available. The process:

  1. Eligibility check

Every slave checks how long ago it was last disconnected from its master to determine whether it is eligible to replace the failed master. If the disconnection time exceeds cluster-node-timeout * cluster-slave-validity-factor, the slave is disqualified from failover.
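
Both factors of that threshold are ordinary config parameters; a sketch for checking them (with cluster-slave-validity-factor set to 0, a slave is always considered eligible):

```shell
# Disqualification threshold = cluster-node-timeout * cluster-slave-validity-factor
[root@redis01 ~]# redis-cli -p 6380 config get cluster-node-timeout
[root@redis01 ~]# redis-cli -p 6380 config get cluster-slave-validity-factor
```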

  2. Schedule the election

Once a slave passes the eligibility check, it computes the time at which its failover election is allowed to start; the rest of the flow runs only after that time arrives. This delayed trigger gives multiple slaves different election delays to express priority: the larger a slave's replication offset, the lower its replication lag, so it deserves a higher priority and a shorter delay.
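
The delay is derived from the slave's rank by replication offset; a sketch of the computation in cluster.c (not a verbatim reproduction):

```shell
# delay_ms = 500 fixed + random(0..500) to desynchronize slaves + rank * 1000,
# where rank 0 is the slave with the largest (freshest) replication offset
rank=0
echo $(( 500 + RANDOM % 500 + rank * 1000 ))   # e.g. 684, matching "rank #0" in the logs below
```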

  3. Start the election

When a slave reaches its failover election time, it triggers the election flow:

(1) Update the configuration epoch

The configuration epoch is a monotonically increasing integer. Each master maintains its own configuration epoch, marking that master's version; no two masters share the same epoch, and a slave copies its master's epoch. The cluster additionally maintains a global configuration epoch, recording the largest configuration epoch of any master in the cluster. Whenever a major cluster event happens, such as a master joining (fresh or promoted from a slave) or a slave competing in an election, the global configuration epoch is incremented and assigned to the relevant master to record that event.
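
Both epochs are visible from the command line: the global value is cluster_current_epoch in cluster info, and each node's own value is the 7th field of its cluster nodes line (the awk projection is just for readability):

```shell
[root@redis01 ~]# redis-cli -p 6379 cluster info | grep current_epoch
[root@redis01 ~]# redis-cli -p 6379 cluster nodes | awk '{print $2, $3, $7}'
```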

(2) Broadcast the election message

The slave broadcasts an election message across the cluster and records that it has sent one, which guarantees it initiates at most one election per configuration epoch.

  4. Voting

Only slot-holding masters process failover election messages. Each slot-holding master has exactly one vote per configuration epoch: it replies to the first slave that requests its vote, and ignores election messages from other slaves within the same configuration epoch. The voting process is effectively a leader election.

Each configuration epoch represents one election round. If the slave does not collect enough votes within cluster-node-timeout * 2 after voting begins, that election is voided; the slave increments the configuration epoch and starts the next round of voting, repeating until an election succeeds.
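
With N slot-holding masters, winning requires at least N/2 + 1 votes (integer division). A quick sketch for computing the quorum of this cluster:

```shell
# Count slot-holding masters, then derive the majority needed to win an election
N=$(redis-cli -p 6379 cluster nodes | awk '$3 ~ /master/ && NF >= 9' | wc -l)
echo "masters=$N quorum=$(( N / 2 + 1 ))"   # here: masters=3 quorum=2
```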

  5. Replace the master

The winning slave cancels replication and becomes a master, revokes the slots owned by the failed master and assigns them to itself, and then broadcasts to the whole cluster that it has become a master.

In the TSOP project's Redis cluster, each master has exactly one slave, deployed in a crosswise master/slave layout across 3 hosts, so the cluster tolerates at most one failed host. This test therefore verifies whether failover works after a single master node fails.
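
While the test runs, it helps to keep an eye on the cluster from a surviving node; a sketch (watch comes from procps and refreshes every second):

```shell
# Watch cluster health and the global epoch while the failover plays out
[root@redis02 ~]# watch -n1 "redis-cli -p 6379 cluster info | grep -E 'cluster_state|current_epoch'"
```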

Failover test and verification

1. Stop one master node

We use redis01:6379 as the example:

```shell
[root@redis01 ~]# systemctl stop redis-6379
```

The log of this master node:

```shell
[root@redis01 ~]# tail -f /var/log/redis/redis_6379.log
36561:signal-handler (1549951235) Received SIGTERM scheduling shutdown...
36561:M 12 Feb 2019 01:00:35.534 # User requested shutdown...
36561:M 12 Feb 2019 01:00:35.534 * Calling fsync() on the AOF file.
36561:M 12 Feb 2019 01:00:35.534 * Removing the pid file.
36561:M 12 Feb 2019 01:00:35.534 # Redis is now ready to exit, bye bye...
```

Once enough nodes flag this master as failing (quorum reached), the master is objectively offline, as this log shows:

```shell
[root@redis02 ~]# tail -f /var/log/redis/redis_6380.log
36641:S 12 Feb 2019 01:00:54.274 * Marking node c876e5e99902ad9f11c714eca6770dac7626ffa0 as failing (quorum reached).
```

The log of its slave, redis03:6380 (only the important entries are kept):

```shell
[root@redis03 ~]# tail -f /var/log/redis/redis_6380.log
36623:S 12 Feb 2019 01:00:35.550 # Connection with master lost.
36623:S 12 Feb 2019 01:00:35.550 * Caching the disconnected master state.
36623:S 12 Feb 2019 01:00:35.558 * Connecting to MASTER 192.168.100.1:6379
36623:S 12 Feb 2019 01:00:35.558 * MASTER <-> REPLICA sync started
...
36623:S 12 Feb 2019 01:00:51.481 * Marking node c876e5e99902ad9f11c714eca6770dac7626ffa0 as failing (quorum reached).
36623:S 12 Feb 2019 01:00:51.504 # Start of election delayed for 684 milliseconds (rank #0, offset 3488450).
...
36623:S 12 Feb 2019 01:01:11.556 # Currently unable to failover: Waiting for votes, but majority still not reached.
...
36623:S 12 Feb 2019 01:01:22.286 # Currently unable to failover: Failover attempt expired.
...
36623:S 12 Feb 2019 01:01:52.274 # Start of election delayed for 766 milliseconds (rank #0, offset 3488450).
36623:S 12 Feb 2019 01:01:52.374 # Currently unable to failover: Waiting the delay before I can start a new failover.
36623:S 12 Feb 2019 01:01:52.774 * Connecting to MASTER 192.168.100.1:6379
36623:S 12 Feb 2019 01:01:52.774 * MASTER <-> REPLICA sync started
36623:S 12 Feb 2019 01:01:52.775 # Error condition on socket for SYNC: Connection refused
36623:S 12 Feb 2019 01:01:53.074 # Starting a failover election for epoch 8.
36623:S 12 Feb 2019 01:01:53.077 # Currently unable to failover: Waiting for votes, but majority still not reached.
36623:S 12 Feb 2019 01:01:53.077 # Failover election won: I'm the new master.
36623:S 12 Feb 2019 01:01:53.077 # configEpoch set to 8 after successful failover
36623:M 12 Feb 2019 01:01:53.077 # Setting secondary replication ID to ec3a9236f6a0b677f35aacc271b12d034b06b3be, valid up to offset: 3488451. New replication ID is aa66a8d82de43ac2dd3da8126f9f54182d5ac77f
36623:M 12 Feb 2019 01:01:53.077 * Discarding previously cached master state.
```

The slave, having judged its master objectively offline, ran through the default election logic and promoted itself to replace the old master; note that its first election attempt expired without reaching a majority, and the retry won the election for epoch 8. Check the cluster state again:

```shell
[root@redis01 ~]# redis-cli -p 6380
127.0.0.1:6380> cluster nodes
b669e00c960f525f451fd5e24bf67bc6a7521f05 192.168.100.2:6380@16380 slave 086738b731ecbf4526faca070c53f8bcdede5952 0 1549951544000 5 connected
fa1fa2a7c281115192f1caec30857921dadf0941 192.168.100.3:6380@16380 master - 0 1549951547231 8 connected 0-5460
c876e5e99902ad9f11c714eca6770dac7626ffa0 192.168.100.1:6379@16379 master,fail - 1549951235537 1549951234334 1 disconnected
086738b731ecbf4526faca070c53f8bcdede5952 192.168.100.3:6379@16379 master - 0 1549951546227 3 connected 10923-16383
e509322da0e6589617e5916945e745f8a97fa68a 192.168.100.1:6380@16380 myself,slave 1d8748d57d749dcc880e9274465fa353b9f63b9e 0 1549951545000 4 connected
1d8748d57d749dcc880e9274465fa353b9f63b9e 192.168.100.2:6379@16379 master - 0 1549951546000 2 connected 5461-10922
```

As expected, redis03:6380 (fa1fa2a7...) is now the master for slots 0-5460 with config epoch 8, and the old master is flagged master,fail. The cluster info output confirms the cluster is still healthy:

```shell
[root@redis01 ~]# redis-cli -p 6380
127.0.0.1:6380> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:8
cluster_my_epoch:2
cluster_stats_messages_ping_sent:2496917
cluster_stats_messages_pong_sent:2618434
cluster_stats_messages_meet_sent:4
cluster_stats_messages_sent:5115355
cluster_stats_messages_ping_received:2618433
cluster_stats_messages_pong_received:2493938
cluster_stats_messages_meet_received:1
cluster_stats_messages_fail_received:2
cluster_stats_messages_auth-req_received:2
cluster_stats_messages_received:5112376
```

2. Restore the original master node

Start the stopped master again:

```shell
[root@redis01 ~]# systemctl start redis-6379
```

Because the cluster retains the offline node's information, the other cluster nodes notice as soon as it starts again:

```shell
[root@redis02 ~]# tail -f /var/log/redis/redis_6380.log
36641:S 12 Feb 2019 01:14:34.772 * Clear FAIL state for node c876e5e99902ad9f11c714eca6770dac7626ffa0: master without slots is reachable again.
```

The restored node automatically detects the new topology, turns itself into a slave of the new master, and synchronizes data:

```shell
[root@redis01 ~]# tail -f /var/log/redis/redis_6379.log
48088:M 12 Feb 2019 01:14:34.707 * Ready to accept connections
48088:M 12 Feb 2019 01:14:34.709 # Configuration change detected. Reconfiguring myself as a replica of fa1fa2a7c281115192f1caec30857921dadf0941
48088:S 12 Feb 2019 01:14:34.709 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
48088:S 12 Feb 2019 01:14:34.709 # Cluster state changed: ok
48088:S 12 Feb 2019 01:14:35.709 * Connecting to MASTER 192.168.100.3:6380
```

On the new master, the synchronization request from this node also shows up in the log:

```shell
[root@redis03 ~]# tail -f /var/log/redis/redis_6380.log
36623:M 12 Feb 2019 01:14:34.768 * Clear FAIL state for node c876e5e99902ad9f11c714eca6770dac7626ffa0: master without slots is reachable again.
36623:M 12 Feb 2019 01:14:35.724 * Replica 192.168.100.1:6379 asks for synchronization
36623:M 12 Feb 2019 01:14:35.724 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for '0e2cbfecff38048e5d0fcb15fb7d5b16e44df9b4', my replication IDs are 'aa66a8d82de43ac2dd3da8126f9f54182d5ac77f' and 'ec3a9236f6a0b677f35aacc271b12d034b06b3be')
36623:M 12 Feb 2019 01:14:35.724 * Starting BGSAVE for SYNC with target: disk
36623:M 12 Feb 2019 01:14:35.728 * Background saving started by pid 48336
48336:C 12 Feb 2019 01:14:35.730 * DB saved on disk
48336:C 12 Feb 2019 01:14:35.731 * RDB: 8 MB of memory used by copy-on-write
36623:M 12 Feb 2019 01:14:35.769 * Background saving terminated with success
36623:M 12 Feb 2019 01:14:35.769 * Synchronization with replica 192.168.100.1:6379 succeeded
```

Check the cluster state:

```shell
[root@redis01 ~]# redis-cli
127.0.0.1:6379> cluster nodes
fa1fa2a7c281115192f1caec30857921dadf0941 192.168.100.3:6380@16380 master - 0 1549952416323 8 connected 0-5460
c876e5e99902ad9f11c714eca6770dac7626ffa0 192.168.100.1:6379@16379 myself,slave fa1fa2a7c281115192f1caec30857921dadf0941 0 1549952414000 1 connected
e509322da0e6589617e5916945e745f8a97fa68a 192.168.100.1:6380@16380 slave 1d8748d57d749dcc880e9274465fa353b9f63b9e 0 1549952418326 4 connected
b669e00c960f525f451fd5e24bf67bc6a7521f05 192.168.100.2:6380@16380 slave 086738b731ecbf4526faca070c53f8bcdede5952 0 1549952417325 5 connected
086738b731ecbf4526faca070c53f8bcdede5952 192.168.100.3:6379@16379 master - 0 1549952417000 3 connected 10923-16383
1d8748d57d749dcc880e9274465fa353b9f63b9e 192.168.100.2:6379@16379 master - 0 1549952415322 2 connected 5461-10922
127.0.0.1:6379> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:8
cluster_my_epoch:8
cluster_stats_messages_ping_sent:338
cluster_stats_messages_pong_sent:313
cluster_stats_messages_sent:651
cluster_stats_messages_ping_received:313
cluster_stats_messages_pong_received:338
cluster_stats_messages_received:651
```

The result matches our expectations above. But the redis03 host now runs two masters: if that host fails, the whole cluster becomes unavailable and data may be lost, so we should restore the cluster to its initial topology.

3. Manual failover

Redis Cluster provides the manual failover command cluster failover, issued on a chosen slave, which swaps the roles of the master and the slave. The process:

  • The slave notifies its master to stop processing all client requests.
  • The master sends the slave its outstanding replication data.
  • The slave consumes the lagging replication data until its offset equals the master's.
  • The slave immediately starts an election; after winning, it stops replication, becomes the new master, and broadcasts this to the cluster.
  • The old master receives the message, reconfigures itself as a slave, unblocks client requests, and redirects them to the new master.
  • Having become a slave, the old master sends a full resynchronization request to the new master (Redis 4.0 improves this step; indeed the logs below show a partial resync being accepted on 5.0.3).

Redis Cluster also provides forced failover variants (example usage follows the list):

  • cluster failover force - for when the master is down and automatic failover cannot complete.
  • cluster failover takeover - for when more than half of the cluster's masters have failed, so a slave cannot collect votes from a majority of masters and the election cannot complete (use with caution).
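
Both variants are issued on the slave that should take over; a sketch (port 6380 here stands for whichever slave you target):

```shell
# Skip coordination with the unreachable master, but still hold an election
[root@redis01 ~]# redis-cli -p 6380 cluster failover force
# Take over unilaterally without any vote - a last resort
[root@redis01 ~]# redis-cli -p 6380 cluster failover takeover
```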

Here we use the plain cluster failover command to restore the cluster to its initial master/slave layout. It must be executed on redis01:6379, which is currently a slave:

```shell
[root@redis01 ~]# redis-cli
127.0.0.1:6379> cluster failover
OK
```

This node changes from slave to master:

```shell
[root@redis01 ~]# tail -f /var/log/redis/redis_6379.log
48088:S 12 Feb 2019 01:35:35.180 # Manual failover user request accepted.
48088:S 12 Feb 2019 01:35:35.208 # Received replication offset for paused master manual failover: 3490214
48088:S 12 Feb 2019 01:35:35.308 # All master replication stream processed, manual failover can start.
48088:S 12 Feb 2019 01:35:35.308 # Start of election delayed for 0 milliseconds (rank #0, offset 3490214).
48088:S 12 Feb 2019 01:35:35.408 # Starting a failover election for epoch 9.
48088:S 12 Feb 2019 01:35:35.412 # Failover election won: I'm the new master.
48088:S 12 Feb 2019 01:35:35.412 # configEpoch set to 9 after successful failover
48088:M 12 Feb 2019 01:35:35.412 # Setting secondary replication ID to aa66a8d82de43ac2dd3da8126f9f54182d5ac77f, valid up to offset: 3490215. New replication ID is 04057341f176ca0a1490fbe619d951bf28c8495e
48088:M 12 Feb 2019 01:35:35.412 # Connection with master lost.
48088:M 12 Feb 2019 01:35:35.412 * Caching the disconnected master state.
48088:M 12 Feb 2019 01:35:35.412 * Discarding previously cached master state.
48088:M 12 Feb 2019 01:35:36.412 * Replica 192.168.100.3:6380 asks for synchronization
48088:M 12 Feb 2019 01:35:36.412 * Partial resynchronization request from 192.168.100.3:6380 accepted. Sending 0 bytes of backlog starting from offset 3490215.
```

redis03:6380 changes from master back to slave:

```shell
[root@redis03 ~]# tail -f /var/log/redis/redis_6380.log
36623:M 12 Feb 2019 01:35:35.194 # Manual failover requested by replica c876e5e99902ad9f11c714eca6770dac7626ffa0.
36623:M 12 Feb 2019 01:35:35.424 # Failover auth granted to c876e5e99902ad9f11c714eca6770dac7626ffa0 for epoch 9
36623:M 12 Feb 2019 01:35:35.426 # Connection with replica 192.168.100.1:6379 lost.
36623:M 12 Feb 2019 01:35:35.428 # Configuration change detected. Reconfiguring myself as a replica of c876e5e99902ad9f11c714eca6770dac7626ffa0
36623:S 12 Feb 2019 01:35:35.428 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
36623:S 12 Feb 2019 01:35:36.425 * Connecting to MASTER 192.168.100.1:6379
36623:S 12 Feb 2019 01:35:36.425 * MASTER <-> REPLICA sync started
36623:S 12 Feb 2019 01:35:36.426 * Non blocking connect for SYNC fired the event.
36623:S 12 Feb 2019 01:35:36.426 * Master replied to PING, replication can continue...
36623:S 12 Feb 2019 01:35:36.426 * Trying a partial resynchronization (request aa66a8d82de43ac2dd3da8126f9f54182d5ac77f:3490215).
36623:S 12 Feb 2019 01:35:36.426 * Successful partial resynchronization with master.
36623:S 12 Feb 2019 01:35:36.426 # Master replication ID changed to 04057341f176ca0a1490fbe619d951bf28c8495e
36623:S 12 Feb 2019 01:35:36.427 * MASTER <-> REPLICA sync: Master accepted a Partial Resynchronization.
```

Check the cluster state:

```shell
[root@redis01 ~]# redis-cli
127.0.0.1:6379> cluster nodes
fa1fa2a7c281115192f1caec30857921dadf0941 192.168.100.3:6380@16380 slave c876e5e99902ad9f11c714eca6770dac7626ffa0 0 1549953516000 9 connected
c876e5e99902ad9f11c714eca6770dac7626ffa0 192.168.100.1:6379@16379 myself,master - 0 1549953516000 9 connected 0-5460
e509322da0e6589617e5916945e745f8a97fa68a 192.168.100.1:6380@16380 slave 1d8748d57d749dcc880e9274465fa353b9f63b9e 0 1549953518000 4 connected
b669e00c960f525f451fd5e24bf67bc6a7521f05 192.168.100.2:6380@16380 slave 086738b731ecbf4526faca070c53f8bcdede5952 0 1549953518000 5 connected
086738b731ecbf4526faca070c53f8bcdede5952 192.168.100.3:6379@16379 master - 0 1549953518578 3 connected 10923-16383
1d8748d57d749dcc880e9274465fa353b9f63b9e 192.168.100.2:6379@16379 master - 0 1549953519580 2 connected 5461-10922
127.0.0.1:6379> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:9
cluster_my_epoch:9
cluster_stats_messages_ping_sent:1480
cluster_stats_messages_pong_sent:1373
cluster_stats_messages_auth-req_sent:5
cluster_stats_messages_mfstart_sent:1
cluster_stats_messages_sent:2859
cluster_stats_messages_ping_received:1368
cluster_stats_messages_pong_received:1480
cluster_stats_messages_auth-ack_received:3
cluster_stats_messages_received:2851
```

At this point the cluster is back to its initial state and the test is complete.