RAFT论文中5.4.2节的Committing entries from previous terms大概是整篇文章中最难理解的一部分,Diego Ongrao应该也意识到了这个问题,在他的博士论文中把这张图改了一下:
A time sequence showing why a leader cannot determine commitment using log entries from older terms.
– In (a) S1 is leader and partially replicates the log entry at index 2.
– In (b) S1 crashes; S5 is elected leader for term 3 with votes from S3, S4, and itself, and accepts a different entry at log index 2.
– In (c) S5 crashes; S1 restarts, is elected leader, and continues replication. At this point, the log entry from term 2 has been replicated on a majority of the servers, but it is not committed.
– If S1 crashes as in (d1), S5 could be elected leader (with votes from S2, S3, and S4) and overwrite the entry with its own entry from term 3.
– However, if S1 replicates an entry from its current term on a majority of the servers before crashing, as in (d2), then this entry is committed (S5 cannot win an election). At this point all preceding entries in the log are committed as well.
其中a,b,c是问题出现的背景,d1和d2是这个前提下可能出现的两种结果。作者想要表达的意思是,在图c中,term 4的leader S1不能直接根据多数派提交前任leader留下的log(即S1,S2,S3三个节点上index 为2的三个黄色格子,虽然阿它们已经形成多数派)。为什么呢?因为后续可能会有d1和d2两种走向:
d1: 假如term 4的leader S1在commit 了 index 2的内容,这个值就会被提交到状态机执行,并且返回给客户端。此时S1 crash了,S5重新当选为leader(它的term 可能是5,或者更大的值,因为S1在图c能当选为leader的话,多数派节点肯定至少把自己term更新为4,因此S5要在d1当选的话必须是term 5或更高。图中没有表现出这一点,但是不影响我们的讨论)。根据Raft Log Replicate的规则(In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log.),leader会直接用自己的log覆盖所有follower的log来消除这种不一致。那么此时问题就出现了,一条commit了、提交到状态机执行、并返回给客户端的命令被新的leader覆盖了,S4和S5不会执行这条命令,跟S1、S2、S3形成了不一致的情况。因此term 4的leader S1绝对不能直接根据多数派commit前任(term 2)的log。那么,假设term 4的leader S1没有crash,它什么时候能commit这些前任的log?这就是d2的走向:
d2: 假如term 4的leader S1在当前任期满足多数派的情况下commit当前term 4 index为3的三个红色格子,这时候去commit log index 3就是安全的(提交log index 3的时候follower会拒绝,因为它们不知道log index 2,leader 接着再依次发log index 2 和 3完成同步)。为什么这时候就是安全的呢?因为多数派上已经有term 4的log,S5不可能当选为新的leader,这就避免了d1的情况。注意,d1也是一种合法的,只是在当前条件下不会出现。
换句话说,raft只有在当前term 形成多数派commit时,才会最终间接的提交前任term的log。
加上了leader只commit当前term log的限制后,无论走向d1和d2,结果都是一致的。(To eliminate problems like the one in Figure 8, Raft never commits log entries from previous terms by counting replicas. Only log entries from the leader’s current term are committed by counting replicas; once an entry from the current term has been committed in this way, then all prior entries are committed indirectly because of the Log Matching Property.)
为什么会出现这种情况?
因为raft leader选举不完备,一个节点要当选为leader有两种情况
a、它的term >= 多数派的 term
b、他的term == 多数派的term,它的log长度 >= 多数派
在情况a中,新leader可能缺一些多数派中存在的log,raft通过前面的限制绕过了这个问题