Through failover testing of a Consul cluster, we can see how its high-availability mechanism works.

| Consul node | IP | Role | Consul version | Log path |
| --- | --- | --- | --- | --- |
| consul01 | 192.168.101.11 | server | 0.9.3 | /var/log/consul |
| consul02 | 192.168.101.12 | server | 0.9.3 | /var/log/consul |
| consul03 | 192.168.101.13 | server (leader) | 0.9.3 | /var/log/consul |

Note: what we are validating here is the high availability of the Consul cluster itself, not Consul's built-in Geo failover (the failover feature for registered services).

Current initial state of the cluster (client nodes omitted):

```shell
[root@consul01 consul]# ./consul operator raft list-peers
Node      ID                   Address              State     Voter  RaftProtocol
consul02  192.168.101.12:8300  192.168.101.12:8300  follower  true   2
consul03  192.168.101.13:8300  192.168.101.13:8300  leader    true   2
consul01  192.168.101.11:8300  192.168.101.11:8300  follower  true   2
[root@consul01 consul]# ./consul members
Node      Address              Status  Type    Build  Protocol  DC  Segment
consul01  192.168.101.11:8301  alive   server  0.9.3  2         dc  <all>
consul02  192.168.101.12:8301  alive   server  0.9.3  2         dc  <all>
consul03  192.168.101.13:8301  alive   server  0.9.3  2         dc  <all>
...
```

The Consul cluster in the TSOP domain has 3 server nodes, which in theory tolerates at most one failed server node, so we test whether the failure of a single server node affects the cluster.
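As a quick sanity check on that claim (standard Raft quorum arithmetic, not part of the original test):

$$
\text{quorum} = \left\lfloor \frac{N}{2} \right\rfloor + 1, \qquad \text{tolerated failures} = N - \text{quorum}
$$

With $N = 3$ servers the quorum is $2$, so the cluster stays writable with exactly one server down; losing a second server would break quorum.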

Consul cluster simulated-failure tests

1. Stop a follower server node

Here we take the consul01 node as an example:

```shell
[root@consul01 consul]# systemctl stop consul
```

Log messages on the other nodes:

```shell
[root@consul02 ~]# tail -100 /var/log/consul
2019/02/12 02:30:38 [INFO] serf: EventMemberFailed: consul01.dc 192.168.101.11
2019/02/12 02:30:38 [INFO] consul: Handled member-failed event for server "consul01.dc" in area "wan"
2019/02/12 02:30:39 [INFO] serf: EventMemberLeave: consul01 192.168.101.11
```

Check the cluster information from another node:

```shell
[root@consul03 consul]# ./consul operator raft list-peers
Node      ID                   Address              State     Voter  RaftProtocol
consul02  192.168.101.12:8300  192.168.101.12:8300  follower  true   2
consul03  192.168.101.13:8300  192.168.101.13:8300  leader    true   2
[root@consul03 consul]# ./consul members
Node      Address              Status  Type    Build  Protocol  DC  Segment
consul01  192.168.101.11:8301  left    server  0.9.3  2         dc  <all>
consul02  192.168.101.12:8301  alive   server  0.9.3  2         dc  <all>
consul03  192.168.101.13:8301  alive   server  0.9.3  2         dc  <all>
...
```

Check whether the cluster still works by querying the registered services (if no service is registered yet, one can be created manually through the Consul API; a sketch follows the output below):

```shell
[root@consul03 consul]# ./consul catalog services
consul
test-csdemo-v0-snapshot
test-zuul-v0-snapshot
```
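If no services were registered, a throwaway test service could be created through the agent's HTTP API. A minimal sketch (the service name, port, and agent address are illustrative, not from the original environment):

```shell
# Register a hypothetical test service against the local agent (HTTP API on port 8500 by default)
curl -s -X PUT http://127.0.0.1:8500/v1/agent/service/register \
     -d '{"Name": "ha-test-svc", "Port": 8080}'

# Verify that it shows up in the catalog
./consul catalog services
```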

As shown above, the stopped server node is in the left state, but the cluster is still available.

Therefore, stopping a follower server node does not affect the Consul cluster service.
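An additional way to confirm that quorum is still intact (an extra check, not part of the original run) is to perform a write, since writes require a working leader; the key name and agent address below are illustrative:

```shell
# Write and read back a KV entry to confirm the cluster still accepts writes with one server down
curl -s -X PUT -d 'ok' http://127.0.0.1:8500/v1/kv/ha-test
curl -s http://127.0.0.1:8500/v1/kv/ha-test?raw
```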

Restore this server node.
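(The restart command itself is not captured in the original output; presumably the node is brought back the same way consul03 is restarted later, something like:)

```shell
# Assumed restart of the previously stopped follower
[root@consul01 consul]# systemctl start consul
```

Once it rejoins, the peer list shows three voters again: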

```shell
[root@consul03 consul]# ./consul operator raft list-peers
Node      ID                   Address              State     Voter  RaftProtocol
consul02  192.168.101.12:8300  192.168.101.12:8300  follower  true   2
consul03  192.168.101.13:8300  192.168.101.13:8300  leader    true   2
consul01  192.168.101.11:8300  192.168.101.11:8300  follower  true   2
[root@consul03 consul]# ./consul members
Node      Address              Status  Type    Build  Protocol  DC  Segment
consul01  192.168.101.11:8301  alive   server  0.9.3  2         dc  <all>
consul02  192.168.101.12:8301  alive   server  0.9.3  2         dc  <all>
consul03  192.168.101.13:8301  alive   server  0.9.3  2         dc  <all>
...
```

The other nodes have detected the node and added it back into the cluster:

```shell
[root@consul02 ~]# tail -100 /var/log/consul
2019/02/12 02:43:51 [INFO] serf: EventMemberJoin: consul01.dc 192.168.101.11
2019/02/12 02:43:51 [INFO] consul: Handled member-join event for server "consul01.dc" in area "wan"
2019/02/12 02:43:51 [INFO] serf: EventMemberJoin: consul01 192.168.101.11
2019/02/12 02:43:51 [INFO] consul: Adding LAN server consul01 (Addr: tcp/192.168.101.11:8300) (DC: dc)
```

2. Stop the leader server node

Here we take the consul03 node as an example:

```shell
[root@consul03 consul]# systemctl stop consul
```

The other follower server nodes detect that the leader has gone down and elect a new leader:

```shell
[root@consul02 ~]# tail -100 /var/log/consul
2019/02/12 02:48:27 [INFO] serf: EventMemberLeave: consul03.dc 192.168.101.13
2019/02/12 02:48:27 [INFO] consul: Handled member-leave event for server "consul03.dc" in area "wan"
2019/02/12 02:48:28 [INFO] serf: EventMemberLeave: consul03 192.168.101.13
2019/02/12 02:48:28 [INFO] consul: Removing LAN server consul03 (Addr: tcp/192.168.101.13:8300) (DC: dc)
2019/02/12 02:48:37 [WARN] raft: Rejecting vote request from 192.168.101.11:8300 since we have a leader: 192.168.101.13:8300
2019/02/12 02:48:39 [ERR] agent: Coordinate update error: No cluster leader
2019/02/12 02:48:39 [WARN] raft: Heartbeat timeout from "192.168.101.13:8300" reached, starting election
2019/02/12 02:48:39 [INFO] raft: Node at 192.168.101.12:8300 [Candidate] entering Candidate state in term 5
2019/02/12 02:48:43 [ERR] http: Request GET /v1/catalog/services, error: No cluster leader from=127.0.0.1:44370
2019/02/12 02:48:43 [ERR] http: Request GET /v1/catalog/nodes, error: No cluster leader from=127.0.0.1:36548
2019/02/12 02:48:44 [INFO] raft: Node at 192.168.101.12:8300 [Follower] entering Follower state (Leader: "")
2019/02/12 02:48:44 [INFO] consul: New leader elected: consul01
```

The log shows that consul01 has been elected as the new leader. Check the cluster information:

```shell
[root@consul02 consul]# ./consul operator raft list-peers
Node      ID                   Address              State     Voter  RaftProtocol
consul02  192.168.101.12:8300  192.168.101.12:8300  follower  true   2
consul01  192.168.101.11:8300  192.168.101.11:8300  leader    true   2
[root@consul02 consul]# ./consul members
Node      Address              Status  Type    Build  Protocol  DC  Segment
consul01  192.168.101.11:8301  alive   server  0.9.3  2         dc  <all>
consul02  192.168.101.12:8301  alive   server  0.9.3  2         dc  <all>
consul03  192.168.101.13:8301  left    server  0.9.3  2         dc  <all>
```

As shown, the stopped leader server node is in the left state, but the cluster is still available. Query the services to verify:

```shell
[root@consul02 consul]# ./consul catalog services
consul
test-csdemo-v0-snapshot
test-zuul-v0-snapshot
```
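As an extra check (not part of the original run), the address of the current leader can also be queried from the HTTP status endpoint; the agent address below is illustrative:

```shell
# Returns the advertised address of the current raft leader, e.g. "192.168.101.11:8300"
curl -s http://127.0.0.1:8500/v1/status/leader
```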

Therefore, stopping the leader server node does not affect the Consul cluster service either.

Now let's bring this node back into the Consul cluster:

```shell
[root@consul03 consul]# systemctl start consul
```

Its log shows that it has now come back as a follower server:

```shell
[root@consul03 ~]# tail -f /var/log/consul
2019/02/12 03:01:33 [INFO] raft: Node at 192.168.101.13:8300 [Follower] entering Follower state (Leader: "")
2019/02/12 03:01:33 [INFO] serf: Ignoring previous leave in snapshot
2019/02/12 03:01:33 [INFO] agent: Retry join LAN is supported for: aws azure gce softlayer
2019/02/12 03:01:33 [INFO] agent: Joining LAN cluster...
2019/02/12 03:01:33 [INFO] agent: (LAN) joining: [consul01 consul02 consul03]
```

The new leader and the remaining follower also update their node information:

```shell
[root@consul01 ~]# tail -f /var/log/consul
2019/02/12 03:01:33 [INFO] serf: EventMemberJoin: consul03.dc 192.168.101.13
2019/02/12 03:01:33 [INFO] consul: Handled member-join event for server "consul03.dc" in area "wan"
2019/02/12 03:01:33 [INFO] serf: EventMemberJoin: consul03 192.168.101.13
2019/02/12 03:01:33 [INFO] consul: Adding LAN server consul03 (Addr: tcp/192.168.101.13:8300) (DC: dc)
2019/02/12 03:01:33 [INFO] raft: Updating configuration with AddStaging (192.168.101.13:8300, 192.168.101.13:8300) to [{Suffrage:Voter ID:192.168.101.12:8300 Address:192.168.101.12:8300} {Suffrage:Voter ID:192.168.101.11:8300 Address:192.168.101.11:8300} {Suffrage:Voter ID:192.168.101.13:8300 Address:192.168.101.13:8300}]
2019/02/12 03:01:33 [INFO] raft: Added peer 192.168.101.13:8300, starting replication
2019/02/12 03:01:33 [WARN] raft: AppendEntries to {Voter 192.168.101.13:8300 192.168.101.13:8300} rejected, sending older logs (next: 394016)
2019/02/12 03:01:33 [INFO] consul: member 'consul03' joined, marking health alive
2019/02/12 03:01:33 [INFO] raft: pipelining replication to peer {Voter 192.168.101.13:8300 192.168.101.13:8300}
```

Check the cluster information once more:

```shell
[root@consul01 consul]# ./consul operator raft list-peers
Node      ID                   Address              State     Voter  RaftProtocol
consul02  192.168.101.12:8300  192.168.101.12:8300  follower  true   2
consul01  192.168.101.11:8300  192.168.101.11:8300  leader    true   2
consul03  192.168.101.13:8300  192.168.101.13:8300  follower  true   2
[root@consul01 consul]# ./consul members
Node      Address              Status  Type    Build  Protocol  DC  Segment
consul01  192.168.101.11:8301  alive   server  0.9.3  2         dc  <all>
consul02  192.168.101.12:8301  alive   server  0.9.3  2         dc  <all>
consul03  192.168.101.13:8301  alive   server  0.9.3  2         dc  <all>
```

All cluster nodes are healthy again. Only the leader role has changed, which does not affect the cluster's ability to serve requests.