HBase HLog Replay Ordering Inconsistency

In an HBase master-slave replication cluster, as shown on the left in the figure below, Region-Server-X and Region-Server-Y are two RegionServers in the master cluster. Under normal conditions, writes to Region-A append entries to Hlog-X on Region-Server-X, and Region-Server-X asynchronously ships those entries in batches to the slave cluster. If Region-Server-X crashes, Region-A is taken over by Region-Server-Y. Subsequent writes to Region-A append to Hlog-Y, while Region-Server-Y starts a new thread to replay Hlog-X. As a result, Region-A’s entries in Hlog-X and Hlog-Y are written to the slave cluster concurrently, so the order in which Region-A’s edits are replayed on the slave cluster is no longer guaranteed to match the order in which they were written on the master.

Similarly, Region Move can cause the same problem: two RegionServers replaying HLog concurrently leads to out-of-order replay.

When the HLog replay order for Region-A is inconsistent, the master and slave clusters can end up with different data. Suppose the master cluster executes a put and then a delete that removes it. On the slave cluster, replay might apply the delete first and then the put. If a RegionServer in the slave cluster runs a major compaction between the two, the compaction purges the delete marker, so the put that arrives afterwards is no longer masked and the data that was deleted on the master remains visible on the slave.
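The put/delete anomaly can be reproduced with a toy model. The sketch below is not HBase code: ToyStore, replay, and the timestamps are hypothetical names and values chosen for illustration, and the model assumes only that a delete marker masks puts with an equal or older timestamp and that a major compaction discards delete markers.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Toy model of a single cell on the slave cluster (illustrative only)."""
    puts: dict = field(default_factory=dict)        # timestamp -> value
    delete_ts: list = field(default_factory=list)   # delete-marker timestamps

    def put(self, ts, value):
        self.puts[ts] = value

    def delete(self, ts):
        # A delete marker masks every put whose timestamp is <= ts.
        self.delete_ts.append(ts)

    def _masked(self, ts):
        return any(ts <= d for d in self.delete_ts)

    def major_compact(self):
        # Drop the masked puts, then purge the delete markers themselves.
        self.puts = {t: v for t, v in self.puts.items() if not self._masked(t)}
        self.delete_ts.clear()

    def get(self):
        visible = [t for t in self.puts if not self._masked(t)]
        return self.puts[max(visible)] if visible else None


def replay(ops):
    """Apply a sequence of operations to a fresh store and read the cell."""
    store = ToyStore()
    for op in ops:
        if op[0] == "put":
            store.put(op[1], op[2])
        elif op[0] == "delete":
            store.delete(op[1])
        else:
            store.major_compact()
    return store.get()


PUT = ("put", 100, "v1")        # written first on the master
DELETE = ("delete", 101)        # written second on the master
COMPACT = ("major_compact",)

master_order = replay([PUT, DELETE])
reordered = replay([DELETE, PUT])
reordered_with_compaction = replay([DELETE, COMPACT, PUT])
```

In this model, reordering alone is harmless because the marker still masks the late put. The cell only reappears when the compaction removes the marker before the put arrives.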

In addition, inconsistent log replay can leave the slave cluster in a state that never existed on the master cluster.

This issue still exists in HBase today. The community has discussed it many times, but it remains unresolved. Below is a brief overview of a solution proposed by the Xiaomi HBase team:

Move Region-A from Region-Server-X to Region-Server-Y so that Region-Server-Y hosts Region-A. Region-A remains readable and writable, but Hlog-Y produced by Region-A is not pushed to the slave cluster immediately.
A RegionServer pushes Region-A’s logs from Hlog-X to the slave cluster.
After all of Region-A’s logs in Hlog-X have been replayed on the slave cluster, Region-Server-Y begins pushing Hlog-Y to the slave cluster.
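The three steps above can be sketched as a gate in front of the new server's log. This is a minimal illustration under stated assumptions, not the Xiaomi design itself: SlaveCluster, GatedPusher, and the callback names are hypothetical, and the pending entries are held in memory here, whereas in a real cluster they would simply remain unread in Hlog-Y.

```python
class SlaveCluster:
    """Stand-in for the slave cluster; records edits in arrival order."""

    def __init__(self):
        self.applied = []

    def apply(self, entry):
        self.applied.append(entry)


class GatedPusher:
    """Hypothetical pusher for Region-A's entries in Hlog-Y."""

    def __init__(self, sink):
        self.sink = sink
        self.old_log_drained = False
        self.pending = []

    def on_new_entry(self, entry):
        # Step 1: the region stays writable, but its new entries are held
        # back until the old log has been fully replayed.
        if self.old_log_drained:
            self.sink.apply(entry)
        else:
            self.pending.append(entry)

    def on_old_log_drained(self):
        # Step 3: Hlog-X is done for this region, so release Hlog-Y in order.
        self.old_log_drained = True
        for entry in self.pending:
            self.sink.apply(entry)
        self.pending.clear()


slave = SlaveCluster()
pusher_y = GatedPusher(slave)

pusher_y.on_new_entry("Y:delete row1")   # written after the move, held back
slave.apply("X:put row1")                # step 2: Hlog-X is still being pushed
pusher_y.on_old_log_drained()            # Hlog-X finished for Region-A
pusher_y.on_new_entry("Y:put row2")      # now passes straight through
```

The slave ends up seeing the put from Hlog-X before the delete from Hlog-Y, which matches the order on the master.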

Here we also need to consider a more extreme case. As shown below, there are three RegionServers: X, Y, and Z. X hosts regions A/B/C, Y hosts D/E, and Z hosts F. If X crashes, regions A and B are moved to Z and region C is moved to Y. Y then starts serving reads and writes for Region-C. After a small amount of data is written to C, Y also crashes, and regions D, E, and C are all moved to Z.

For Region-C, to guarantee strictly ordered HLog delivery to the slave cluster, logs must be pushed in this order: Hlog-X first, then Hlog-Y, then Hlog-Z. For RegionServer failures, the crashed server’s HLog stops growing, so it is enough to replay Hlog-X to its end, then Hlog-Y to its end, and only then Hlog-Z. Region migration is different: if Region-C moves from X to Y while X stays alive, Hlog-X on Region-Server-X keeps growing because X still serves its other regions, so "the end of Hlog-X" is not a usable stopping point. When Region-Server-Y takes over Region-C, it must therefore record Region-C’s MaxSequenceId in Hlog-X, that is, the sequence ID of the last Region-C entry written there. Once replay of Hlog-X reaches seqId >= MaxSequenceId, replay of Hlog-Y can begin.

To handle both region failover and region move uniformly, the following records are needed:

When Y takes over Region-C from X, record Hlog-X’s MaxSequenceId in HBase’s meta table.
When Z takes over Region-C from Y, record Hlog-Y’s MaxSequenceId in HBase’s meta table.
…

In extreme cases, a region may be migrated across multiple RegionServers many times, forming a chain of MaxSequenceIds. To guarantee strictly consistent HLog replay order for that region, each RegionServer’s HLog must be replayed in sequence until replay reaches the corresponding MaxSequenceId, then the next HLog segment is replayed. This ensures strictly consistent region replay order during Region Move and RegionServer failover.
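The chain of MaxSequenceIds can be sketched as follows. This is an illustration only: ordered_replay and the data layout are hypothetical, the sequence IDs are made up, and the sketch assumes each log contains an entry whose sequence ID reaches the recorded MaxSequenceId. Where a real implementation would wait for an unfinished log to catch up, the sketch raises an error.

```python
def ordered_replay(chain, logs):
    """Return one region's edits in the order the slave should apply them.

    chain: list of (server, max_seq_id) in takeover order, as would be
           recorded in the meta table; the current host has max_seq_id None.
    logs:  dict mapping server -> list of (seq_id, edit) for this region,
           sorted by seq_id.
    """
    ordered = []
    for server, max_seq_id in chain:
        reached = max_seq_id is None   # the current host has no barrier
        for seq_id, edit in logs[server]:
            ordered.append((server, seq_id, edit))
            if max_seq_id is not None and seq_id >= max_seq_id:
                reached = True
                break                  # barrier hit: move on to the next log
        if not reached:
            # The next log must not start before this one reaches its barrier.
            raise RuntimeError(f"log of {server} has not reached {max_seq_id}")
    return ordered


# Region-C was hosted by X, then Y, and is now on Z.
chain = [("X", 12), ("Y", 15), ("Z", None)]
logs = {
    "X": [(11, "put a"), (12, "put b")],
    "Y": [(13, "delete a"), (15, "put c")],
    "Z": [(16, "put d")],
}
edits = [edit for _, _, edit in ordered_replay(chain, logs)]
```

Because each log is only consumed up to its recorded barrier, the same loop covers both failover (the log has stopped growing) and region move (the log keeps growing with other regions' entries).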

The design for this approach has been shared with the community, and the Xiaomi HBase team plans to implement the fix. For more discussion on this issue, see HBASE-9465.

Summary

This article describes an approach to guarantee strictly consistent HLog replay order, which can resolve data inconsistency between the master and slave clusters. The trade-off is increased replication lag during region migration: the new approach must wait for one HLog segment to finish replaying before starting the next, sacrificing replication timeliness in exchange for eventual consistency between master and slave.
