DDIA | Replication | Leaders and Followers

In the previous post, we had a brief history of distributed databases. From this post onwards, we will look into how replication is handled in distributed databases.

As we have seen before, replication is copying data to multiple nodes. Each node that stores a copy of the database is called a replica. How do we ensure that data written to one replica gets written to all replicas?

The first strategy is to designate one replica as the Leader and all the other replicas as Followers. Writes are only accepted on the Leader, while reads can be served by either the Leader or the Followers. When a write request comes to the Leader, it updates its local copy of the database and sends the change as part of a replication log to all the Followers. The Followers then update their local copies of the database on the basis of that replication log.
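As a rough illustration of this write path, here is a toy sketch in Python. It is not any real database’s implementation; the `Leader`/`Follower` classes and the log entry format are made up for illustration.

```python
# Toy sketch of leader-based replication (illustrative only, not a real database).
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class LogEntry:
    seq: int      # position in the replication log
    key: str
    value: str

@dataclass
class Follower:
    data: Dict[str, str] = field(default_factory=dict)
    applied_seq: int = 0

    def apply(self, entry: LogEntry) -> None:
        # Followers apply log entries in order to stay consistent with the leader.
        self.data[entry.key] = entry.value
        self.applied_seq = entry.seq

@dataclass
class Leader:
    data: Dict[str, str] = field(default_factory=dict)
    log: List[LogEntry] = field(default_factory=list)
    followers: List[Follower] = field(default_factory=list)

    def write(self, key: str, value: str) -> None:
        # 1. Apply the write to the leader's local copy.
        self.data[key] = value
        # 2. Append it to the replication log.
        entry = LogEntry(seq=len(self.log) + 1, key=key, value=value)
        self.log.append(entry)
        # 3. Ship the log entry to every follower.
        for follower in self.followers:
            follower.apply(entry)

leader = Leader(followers=[Follower(), Follower()])
leader.write("user:1", "alice")
```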

This mode of replication is a built-in feature of many relational databases, such as PostgreSQL (since version 9.0), MySQL, Oracle Data Guard, and SQL Server’s AlwaysOn Availability Groups. It is also used in some nonrelational databases, including MongoDB, RethinkDB, and Espresso. Finally, leader-based replication is not restricted to databases: distributed message brokers such as Kafka and RabbitMQ’s highly available queues also use it. Some network filesystems and replicated block devices such as DRBD are similar.

Synchronous Versus Asynchronous Replication

Let’s say there are one leader and two followers, and a write request comes in. In synchronous replication, the leader waits until both followers have confirmed the replication before reporting the write as successful to the client. In asynchronous replication, the leader writes to its local copy, sends the replication log to the followers, and immediately reports the write as successful to the client, without waiting for the change to be written to the followers’ database copies.

The third approach is semi-synchronous: one follower is replicated synchronously, while the other followers are replicated asynchronously. When a write request comes in, the leader updates its local copy, sends the replication log to all followers, waits until the designated synchronous follower returns success, and then sends the notification to the client. If the leader suddenly fails, we can be sure that the data is still available on that follower. The disadvantage is that if the synchronous follower doesn’t respond (because it has crashed, or there is a network fault, or for any other reason), the write cannot be processed. The leader must block all writes and wait until the synchronous replica is available again.

If all followers were synchronous, any single node going down would block the whole system; if all were asynchronous, we could not be sure that the data on every node is consistent with the leader. However, a fully asynchronous configuration has the advantage that the leader can continue processing writes even if all of its followers have fallen behind.
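The key difference between the three modes is when the leader acknowledges the write back to the client. A minimal sketch, purely illustrative: the `replicate` helper just simulates network and apply delay, and the mode names are made up for this example.

```python
import random
import time

def replicate(entry, follower_id):
    """Simulate shipping a log entry to a follower and waiting for its ack."""
    time.sleep(random.uniform(0.01, 0.05))   # network + apply delay
    return follower_id

def handle_write(entry, followers, mode="semi-sync"):
    # The leader applies the write to its own copy first (omitted here).
    if mode == "sync":
        # Wait for every follower before acknowledging the client.
        for f in followers:
            replicate(entry, f)
    elif mode == "semi-sync":
        # Wait only for the one designated synchronous follower;
        # the others are replicated in the background (not modelled here).
        replicate(entry, followers[0])
    elif mode == "async":
        # Don't wait at all; replication happens entirely in the background.
        pass
    return "write successful"

print(handle_write(("user:1", "alice"), ["follower-1", "follower-2"], mode="sync"))
```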

 

Setting Up New Followers

The process of setting up a new follower without losing availability is as follows (a rough sketch of these steps follows the list) – 

  1. Take a consistent snapshot of the leader’s database at some point in time—if possible, without taking a lock on the entire database. Most databases have this feature, as it is also required for backups. In some cases, third-party tools are needed, such as innobackupex for MySQL.

  2. Copy the snapshot to the new follower node.

  3. The follower connects to the leader and requests all the data changes that have happened since the snapshot was taken. This requires that the snapshot is associated with an exact position in the leader’s replication log. That position has various names: for example, PostgreSQL calls it the log sequence number, and MySQL calls it the binlog coordinates.

  4. When the follower has processed the backlog of data changes since the snapshot, we say it has caught up. It can now continue to process data changes from the leader as they happen.
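A condensed, runnable toy version of those four steps, with plain dicts and tuples standing in for the database and the replication log; all the variable names here are invented for illustration.

```python
# Toy sketch of bootstrapping a new follower from a snapshot (illustrative only).
leader_data = {"user:1": "alice", "user:2": "bob"}
replication_log = [(1, "user:1", "alice"), (2, "user:2", "bob")]  # (seq, key, value)

# 1. Take a consistent snapshot, tagged with its position in the replication log
#    (PostgreSQL calls it the log sequence number, MySQL the binlog coordinates).
snapshot = dict(leader_data)
snapshot_position = replication_log[-1][0]

# 2. Copy the snapshot to the new follower node.
follower_data = dict(snapshot)

# ...meanwhile the leader keeps accepting writes...
leader_data["user:3"] = "carol"
replication_log.append((3, "user:3", "carol"))

# 3. The follower requests all changes after the snapshot position.
backlog = [entry for entry in replication_log if entry[0] > snapshot_position]

# 4. Applying the backlog brings the follower up to date ("caught up").
for seq, key, value in backlog:
    follower_data[key] = value

assert follower_data == leader_data
```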

Handling Node Outages

Follower failure: Catch-up recovery

On its local disk, each follower keeps a log of the data changes it has received from the leader. If a follower crashes and is restarted, or if the network between the leader and the follower is temporarily interrupted, the follower can recover quite easily: from its log, it knows the last transaction that was processed before the fault occurred. Thus, the follower can connect to the leader and request all the data changes that occurred during the time when the follower was disconnected. When it has applied these changes, it has caught up to the leader and can continue receiving a stream of data changes as before.
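A toy sketch of that idea, with the follower’s last applied log position persisted to a local file; the file name, JSON format, and function names are assumptions for illustration only.

```python
import json
import os

STATE_FILE = "follower_state.json"   # hypothetical local file holding the last applied position

def load_last_position():
    # On restart, the follower reads the last log position it applied before the fault.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)["last_seq"]
    return 0

def catch_up(leader_log, follower_data):
    last_seq = load_last_position()
    # Request only the changes that happened while the follower was disconnected.
    for seq, key, value in leader_log:
        if seq > last_seq:
            follower_data[key] = value
            last_seq = seq
    # Remember how far we got, so the next restart can resume from here.
    with open(STATE_FILE, "w") as f:
        json.dump({"last_seq": last_seq}, f)
    return follower_data

log = [(1, "user:1", "alice"), (2, "user:2", "bob")]
print(catch_up(log, follower_data={}))   # applies everything after the stored position
```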

Leader failure: Failover

Handling a failure of the leader is trickier: one of the followers needs to be promoted to be the new leader, clients need to be reconfigured to send their writes to the new leader, and the other followers need to start consuming data changes from the new leader. This process is called failover.

Failover can happen manually, through a database administrator, or automatically. An automatic failover typically looks like this (a rough sketch follows the list) – 

  1. Determining that the leader has failed. This is normally done by pinging the nodes periodically; if the leader does not respond within some timeout, it is assumed to have failed. 
  2. Choosing a new leader. This could be done through an election; normally the replica with the most up-to-date data (often the synchronous replica) is made the new leader, as it has the least data loss compared with the failed leader. 
  3. Reconfiguring the system to use the new leader.
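A rough sketch of these three steps; the 30-second timeout, the heartbeat check, and the dict fields are all made-up choices for illustration, not how any particular system configures failover.

```python
import time

HEARTBEAT_TIMEOUT = 30  # seconds; choosing this value is itself a trade-off (see below)

def leader_is_dead(last_heartbeat_at):
    # 1. Detect failure: no heartbeat/ping response within the timeout.
    return time.time() - last_heartbeat_at > HEARTBEAT_TIMEOUT

def choose_new_leader(followers):
    # 2. Pick the replica with the most up-to-date replication log,
    #    typically the (semi-)synchronous follower.
    return max(followers, key=lambda f: f["applied_seq"])

def failover(followers, last_heartbeat_at):
    if not leader_is_dead(last_heartbeat_at):
        return None
    new_leader = choose_new_leader(followers)
    # 3. Reconfigure: clients send writes to the new leader, and the
    #    remaining followers start consuming its replication log.
    for f in followers:
        if f is not new_leader:
            f["leader"] = new_leader["id"]
    return new_leader

followers = [{"id": "f1", "applied_seq": 42}, {"id": "f2", "applied_seq": 40}]
print(failover(followers, last_heartbeat_at=time.time() - 60))  # promotes f1
```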

Problems with failover – 

  • If asynchronous replication is used, the new leader may not have received all the writes from the old leader before it failed. If the former leader rejoins the cluster after a new leader has been chosen, what should happen to those writes? The new leader may have received conflicting writes in the meantime. The most common solution is for the old leader’s unreplicated writes to simply be discarded, which may violate clients’ durability expectations.

  • Discarding writes is especially dangerous if other storage systems outside of the database need to be coordinated with the database contents. It can cause inconsistencies between those systems, which can lead to serious failures.
  • In certain fault scenarios it could happen that two nodes both believe that they are the leader. This situation is called split brain, and it is dangerous: if both leaders accept writes, and there is no process for resolving conflicts, data is likely to be lost or corrupted.
  • What is the right timeout before the leader is declared dead? A longer timeout means a longer time to recovery in the case where the leader fails. However, if the timeout is too short, there could be unnecessary failovers.

Thanks for stopping by! Hope this gives you a brief overview of replication and the leaders and followers approach in distributed systems. Eager to hear your thoughts and chat; please leave comments below and we can discuss.

