A Leader Election Algorithm 12-jan-2000 / gmt 2-feb-2000 / gmt Let N = the number of servers configured (not necessarily all alive) RTT = a generous upper bound for round-trip time (e.g. 0.25 second) HBT = the leader's maximum heartbeat interval (e.g. 5 seconds) the leader broadcasts global LEAD messages every HBT, or sooner if desired messages serve as heartbeats and as assertions of leadership messages can also disseminate other info from leader to followers each message contains "term" (e.g. this is the 5th leader) and msg numbers any server can start an election for a new leader it does this after (e.g.) 3 HBT intervals have elapsed with no leader msg it's likely that several servers will detect this nearly simultaneously election occurs in two phases every server broadcasts ELECT(id,term,msg) reporting the last message seen every server sends VOTE(term,msg) message to its choice for the new leader the server receiving >N/2 VOTE messages assumes the leadership vote rankings: each server votes for the old leader, if still alive otherwise, a server reporting the latest (term,msg) seen with ties broken in favor of the smallest ID (i.e. earliest in config file) each server runs two state machines: ELECTOR, which participates in choosing a leader CANDIDATE, which determines whether this server is the leader ELECTOR state machine runs always, even while serving as leader two states: WATCHING, the initial and usual state VOTING, while an election is in process WATCHING actions: initial: set timer to 5 * HBT; clear voting slate rcv LEAD: reset timer to 3 * HBT rcv ELECT: record claim; enter VOTING state timeout: enter VOTING state VOTING actions: initial: set timer to RTT, broadcast ELECT message rcv ELECT: record claim timeout: send VOTE to server having best claim; enter WATCHING state CANDIDATE state machine runs independently of ELECTOR state machine two states: LEADER, while serving as the leader FOLLOWER, otherwise, including initially FOLLOWER actions: initial: clear the tally board (which shows the latest vote per server) rcv LEAD: clear the tally board rcv VOTE: record on the tally board if now have > N/2 identical votes for self, enter LEADER state LEADER state: do all the things a leader is supposed to do broadcast global state at least every HBT interval (the first broadcast announces to others that there is a new leader) if receive LEAD message that is not obsolete: enter FOLLOWER state Notes: The ELECT messages that start a round provoke messages from servers whose timers have not yet expired. This tends to synchronize the servers. Each server sends a max of two messages (ELECT + VOTE) per RTT interval, giving an upper limit to the amount of traffic generated by the election protocol.