Kafka Deep Dive: Leader Election - An Analysis of the Kafka Cluster Leader Election Process

All posts first appear on my personal blog at http://blog.mozhu.org - welcome to visit!

The "leader" discussed in this post is the cluster Controller, not the leader of an individual partition.

Why is a leader needed?

In early versions of Kafka, managing partition and replica state relied on Zookeeper Watchers and queues: every broker registered Watchers in Zookeeper, so Zookeeper accumulated a huge number of Watchers. If a failed broker hosted many partitions, many Watchers fired at once, triggering large-scale adjustments across the cluster; every replica also had to register its own watcher in Zookeeper, so in a large cluster Zookeeper carried a heavy load. This design was prone to split-brain, herd effects, and Zookeeper overload.

Newer versions changed this design by introducing the KafkaController: only the KafkaController leader registers Watchers in Zookeeper, and the other brokers barely need to watch Zookeeper state changes at all.

Among the brokers in a Kafka cluster, one is elected as the controller leader, which manages the state of all partitions and replicas in the cluster. For example, when a partition's leader replica fails, the controller elects a new leader replica for that partition; when a change in an ISR list is detected, the controller notifies every broker in the cluster to update its MetadataCache; and when partitions are added to a topic, the controller manages the partition reassignment.

How Kafka cluster leader election works

Zookeeper clusters also have an election mechanism: via the ZAB protocol (a Paxos-like algorithm), nodes exchange votes to elect a leader. Kafka's leader election is nowhere near as complex.
Kafka elects its leader by creating the ephemeral /controller node in Zookeeper and writing the current broker's information into it:
{"version":1,"brokerid":1,"timestamp":"1512018424988"}
Thanks to Zookeeper's strong consistency, a given node can be created successfully by only one client. The broker that creates it becomes the leader, on a first-come-first-served basis. That leader is the cluster's controller, responsible for all cluster-wide housekeeping.
When the leader loses its connection to Zookeeper, the ephemeral node is deleted. The other brokers watch this node; when it is deleted they receive a notification and start a new election round.
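The first-writer-wins principle above can be simulated without a real Zookeeper ensemble. The following is a minimal sketch (not Kafka source code; all names are illustrative), using an in-memory dict to play the role of Zookeeper's ephemeral-node create:

```python
# Illustrative sketch: first-writer-wins election on an "ephemeral node",
# with a plain dict standing in for a real ZooKeeper ensemble.
import json
import time


class FakeZk:
    """Minimal stand-in for ZooKeeper: create fails if the path already exists."""
    def __init__(self):
        self.nodes = {}

    def create_ephemeral(self, path, data):
        if path in self.nodes:
            raise FileExistsError(path)   # plays the role of ZkNodeExistsException
        self.nodes[path] = data

    def read(self, path):
        return self.nodes.get(path)

    def delete(self, path):
        self.nodes.pop(path, None)        # session loss would do this automatically


def elect(zk, broker_id):
    """Each broker races to create /controller; the first to succeed wins."""
    payload = json.dumps({"version": 1, "brokerid": broker_id,
                          "timestamp": str(int(time.time() * 1000))})
    try:
        zk.create_ephemeral("/controller", payload)
        return True                       # this broker is now the controller
    except FileExistsError:
        # Someone else won; read the node to learn who the leader is.
        return json.loads(zk.read("/controller"))["brokerid"] == broker_id


zk = FakeZk()
results = [elect(zk, b) for b in (0, 1, 2)]   # brokers race; only one wins
```

Here broker 0 creates the node first and wins; brokers 1 and 2 hit the "already exists" error and learn the current leader's id from the node data, exactly as the real election reads /controller after a ZkNodeExistsException.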

KafkaController

KafkaController initializes a ZookeeperLeaderElector object and gives it two callbacks: onControllerFailover and onControllerResignation.
onControllerFailover is invoked after this broker wins the election; it initializes the modules the leader depends on, including recording the number of leader elections in the /controller_epoch node in Zookeeper. This epoch value is very useful for handling distributed split-brain scenarios.
onControllerResignation is invoked when this broker is no longer the leader, i.e. after the current leader steps down.
On startup, KafkaController registers a Zookeeper session expiration listener and attempts to elect a leader.
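Why is the election count useful against split-brain? A deposed controller that does not yet know it lost its session can keep sending requests; brokers can fence it by comparing epochs. The following is a hypothetical sketch of that fencing idea (not Kafka's actual broker code; names are illustrative):

```python
# Illustrative sketch: a monotonically increasing controller epoch lets brokers
# reject requests from a deposed ("zombie") controller after a split-brain episode.
class Broker:
    def __init__(self):
        self.highest_epoch_seen = 0

    def handle_controller_request(self, epoch, request):
        # Reject anything from a controller older than the newest one we know of.
        if epoch < self.highest_epoch_seen:
            return "rejected: stale controller epoch %d" % epoch
        self.highest_epoch_seen = epoch
        return "accepted: " + request


broker = Broker()
print(broker.handle_controller_request(1, "become-follower"))  # accepted
print(broker.handle_controller_request(2, "become-leader"))    # new controller elected
print(broker.handle_controller_request(1, "become-follower"))  # old controller fenced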

class KafkaController {
  private val controllerElector = new ZookeeperLeaderElector(controllerContext, ZkUtils.ControllerPath,
    onControllerFailover, onControllerResignation, config.brokerId)

  def startup() = {
    inLock(controllerContext.controllerLock) {
      info("Controller starting up")
      // Register the session expiration listener
      registerSessionExpirationListener()
      isRunning = true
      // Try to elect the leader on every startup
      controllerElector.startup
      info("Controller startup complete")
    }
  }

  private def registerSessionExpirationListener() = {
    zkUtils.zkClient.subscribeStateChanges(new SessionExpirationListener())
  }
}

SessionExpirationListener

When the broker re-establishes its connection to Zookeeper, handleNewSession in SessionExpirationListener is called: it first shuts down the previous leader's modules, then tries to get elected as leader again.

class SessionExpirationListener() extends IZkStateListener with Logging {
  this.logIdent = "[SessionExpirationListener on " + config.brokerId + "], "

  @throws(classOf[Exception])
  def handleStateChanged(state: KeeperState) {
    // do nothing, since zkclient will do reconnect for us.
  }

  /**
   * Called after the zookeeper session has expired and a new session has been created. You would have to re-create
   * any ephemeral nodes here.
   *
   * @throws Exception On any error.
   */
  @throws(classOf[Exception])
  def handleNewSession() {
    info("ZK expired; shut down all controller components and try to re-elect")
    // Invoked after the connection to Zookeeper has been re-established
    inLock(controllerContext.controllerLock) {
      // First unregister the listeners registered earlier and release resources
      onControllerResignation()
      // Then try to get elected as controller again
      controllerElector.elect
    }
  }

  override def handleSessionEstablishmentError(error: Throwable): Unit = {
    // no-op; handleSessionEstablishmentError in KafkaHealthCheck should handle this error
  }
}

ZookeeperLeaderElector

The ZookeeperLeaderElector class implements leader election, but it does not itself handle session timeouts (connection timeouts) between the broker and Zookeeper; it expects the caller to re-run the election when the session is restored (i.e. when the connection is re-established).

class ZookeeperLeaderElector(controllerContext: ControllerContext,
                             electionPath: String,
                             onBecomingLeader: () => Unit,
                             onResigningAsLeader: () => Unit,
                             brokerId: Int)
  extends LeaderElector with Logging {
  var leaderId = -1
  // create the election path in ZK, if one does not exist
  val index = electionPath.lastIndexOf("/")
  if (index > 0)
    controllerContext.zkUtils.makeSurePersistentPathExists(electionPath.substring(0, index))
  val leaderChangeListener = new LeaderChangeListener

  def startup {
    inLock(controllerContext.controllerLock) {
      // Register an IZkDataListener on the /controller node
      controllerContext.zkUtils.zkClient.subscribeDataChanges(electionPath, leaderChangeListener)
      // Run the election
      elect
    }
  }
}

The startup method of ZookeeperLeaderElector calls elect to run the leader election.

elect is called in the following situations:

  1. The first call, when the broker starts up.
  2. The previous attempt created the node successfully, but the connection may have dropped while waiting for Zookeeper's response; after resign deleted the /controller node, leaderChangeListener's handleDataDeleted was triggered.
  3. The previous attempt failed to create the node (the connection may have dropped while waiting for Zookeeper's response), and by the time elect runs again, another broker has already created the controller node and become the leader.
  4. The previous attempt created the node successfully, but onBecomingLeader threw an exception, so elect runs again.

Therefore elect first reads the /controller node to check whether a leader already exists, and only then attempts the election.
private def getControllerID(): Int = {
  controllerContext.zkUtils.readDataMaybeNull(electionPath)._1 match {
    case Some(controller) => KafkaController.parseControllerId(controller)
    case None => -1
  }
}

def elect: Boolean = {
  val timestamp = SystemTime.milliseconds.toString
  val electString = Json.encode(Map("version" -> 1, "brokerid" -> brokerId, "timestamp" -> timestamp))

  // First read the current /controller node, if any
  leaderId = getControllerID
  /*
   * We can get here during the initial startup and the handleDeleted ZK callback. Because of the potential race condition,
   * it's possible that the controller has already been elected when we get here. This check will prevent the following
   * createEphemeralPath method from getting into an infinite loop if this broker is already the controller.
   */
  // elect can be entered in any of the four scenarios listed above, so check
  // whether a leader already exists before trying to create the node
  if (leaderId != -1) {
    debug("Broker %d has been elected as leader, so stopping the election process.".format(leaderId))
    return amILeader
  }

  try {
    val zkCheckedEphemeral = new ZKCheckedEphemeral(electionPath,
                                                    electString,
                                                    controllerContext.zkUtils.zkConnection.getZookeeper,
                                                    JaasUtils.isZkSecurityEnabled())
    // Create the /controller node and write the controller info: brokerid, version, timestamp
    zkCheckedEphemeral.create()
    info(brokerId + " successfully elected as leader")
    leaderId = brokerId
    // The write succeeded; this broker is now the leader, so invoke the callback
    onBecomingLeader()
  } catch {
    case e: ZkNodeExistsException =>
      // If someone else has written the path, another broker won the race
      leaderId = getControllerID
      if (leaderId != -1)
        debug("Broker %d was elected as leader instead of broker %d".format(leaderId, brokerId))
      else
        warn("A leader has been elected but just resigned, this will result in another round of election")

    case e2: Throwable =>
      error("Error while electing or becoming leader on broker %d".format(brokerId), e2)
      // Either the connection to Zookeeper dropped while creating the node, or the
      // onBecomingLeader callback threw. onBecomingLeader initializes the leader's
      // modules; if that fails, resign deletes the /controller node. Deleting the node
      // triggers leaderChangeListener.handleDataDeleted and a new election round, which
      // also gives other brokers a chance to become leader instead of this broker
      // failing forever and leaving the cluster leaderless.
      resign()
  }
  amILeader
}

def close = {
  leaderId = -1
}

def amILeader: Boolean = leaderId == brokerId

def resign() = {
  leaderId = -1
  // Delete the /controller node
  controllerContext.zkUtils.deletePath(electionPath)
}

If creating the /controller node raises ZkNodeExistsException, another broker has already become the leader.
If instead the onBecomingLeader callback throws, the initialization of a leader-dependent module has usually failed, and resign is called to delete the /controller node.
Deleting the /controller node triggers leaderChangeListener's handleDataDeleted, which retries the election.
More importantly, it also gives other brokers the chance to become leader, so that one broker whose onBecomingLeader keeps failing does not leave the whole cluster stuck in a leaderless state.
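The resign-on-failure handover can be sketched as follows (an illustrative simulation, not Kafka source; a dict again stands in for Zookeeper): a broker whose post-election initialization throws gives up the node, so a healthy broker wins the next round.

```python
# Illustrative sketch: a broker whose onBecomingLeader callback fails resigns
# (deletes the node), so a healthy broker can win the next election round.
class Elector:
    def __init__(self, zk, broker_id, on_becoming_leader):
        self.zk, self.broker_id = zk, broker_id
        self.on_becoming_leader = on_becoming_leader
        self.leader_id = -1

    def elect(self):
        if "/controller" in self.zk:               # a leader already exists
            self.leader_id = self.zk["/controller"]
            return self.leader_id == self.broker_id
        try:
            self.zk["/controller"] = self.broker_id  # create the "ephemeral" node
            self.leader_id = self.broker_id
            self.on_becoming_leader()                # may raise during module init
        except Exception:
            self.resign()                            # give other brokers a chance
        return self.leader_id == self.broker_id

    def resign(self):
        self.leader_id = -1
        self.zk.pop("/controller", None)             # delete /controller


def failing_init():
    raise RuntimeError("leader module init failed")


zk = {}
broken = Elector(zk, broker_id=1, on_becoming_leader=failing_init)
healthy = Elector(zk, broker_id=2, on_becoming_leader=lambda: None)
assert broken.elect() is False     # won the race but resigned on init failure
assert healthy.elect() is True     # next round: the healthy broker becomes leader
```

Without the resign call, broker 1 would keep holding /controller while its leader modules never come up, and the cluster would stay leaderless.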

LeaderChangeListener

In the startup method, LeaderChangeListener is registered as the IZkDataListener for the /controller node.
When the node's data changes, another broker may have become the leader; if this broker was the leader before the change and no longer is, onResigningAsLeader is called.
When the node is deleted, the leader has failed and gone offline; if this broker was previously the leader, onResigningAsLeader is called and then a new election is attempted.
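The two callbacks can be condensed into a small state machine. The following is an illustrative sketch (not Kafka source; the event strings are made up for the demo) of when a resignation happens versus when a re-election is triggered:

```python
# Illustrative sketch of the two watcher callbacks: node data changed vs. node deleted.
class LeaderChangeListener:
    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.leader_id = -1
        self.events = []

    def am_i_leader(self):
        return self.leader_id == self.broker_id

    def handle_data_change(self, new_leader_id):
        was_leader = self.am_i_leader()
        self.leader_id = new_leader_id
        if was_leader and not self.am_i_leader():
            self.events.append("resigned")   # another broker took over
    def handle_data_deleted(self):
        if self.am_i_leader():
            self.events.append("resigned")   # the node this broker owned is gone
        self.events.append("re-elect")       # every broker retries the election


listener = LeaderChangeListener(broker_id=1)
listener.handle_data_change(1)   # this broker became leader: no resignation
listener.handle_data_change(2)   # broker 2 took over: resign leadership
listener.handle_data_deleted()   # leader gone: re-elect (this broker wasn't leader)
```

Note that handle_data_change only resigns, while handle_data_deleted always re-elects: a data change means some broker already holds leadership, whereas a deletion means the seat is empty.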

class LeaderChangeListener extends IZkDataListener with Logging {
  /**
   * Called when the leader information stored in zookeeper has changed. Record the new leader in memory
   * @throws Exception On any error.
   */
  @throws(classOf[Exception])
  def handleDataChange(dataPath: String, data: Object) {
    inLock(controllerContext.controllerLock) {
      val amILeaderBeforeDataChange = amILeader
      leaderId = KafkaController.parseControllerId(data.toString)
      info("New leader is %d".format(leaderId))
      // The old leader needs to resign leadership if it is no longer the leader
      if (amILeaderBeforeDataChange && !amILeader)
        // This broker was the leader before the change but is not anymore
        onResigningAsLeader()
    }
  }

  /**
   * Called when the leader information stored in zookeeper has been deleted. Try to elect as the leader
   * @throws Exception On any error.
   */
  @throws(classOf[Exception])
  def handleDataDeleted(dataPath: String) {
    inLock(controllerContext.controllerLock) {
      debug("%s leader change listener fired for path %s to handle data deleted: trying to elect as a leader"
        .format(brokerId, dataPath))
      if (amILeader)
        // This broker was the leader, so step down first
        onResigningAsLeader()
      // Then try to get elected as leader again
      elect
    }
  }
}

onBecomingLeader corresponds to onControllerFailover in KafkaController: after this broker becomes the new leader, it initializes the modules the leader depends on.
onResigningAsLeader corresponds to onControllerResignation in KafkaController: after the leader steps down, it shuts down the modules the leader depended on.

Leader election flowchart

The whole leader election process is summarized in the following flowchart:
(Figure: Kafka leader election flowchart)

