Designing Data Intensive Applications - Replication (part 1) - Leaders and Followers

Designing Data Intensive Applications Book notes - This article is part of a series.

Part 1: Designing Data Intensive Applications - Reliable, Scalable and Maintainable applications

Part 2: Designing Data Intensive Applications - Data models and query languages

Part 3: Designing Data Intensive Applications - Storage and retrieval (part 1)

Part 4: Designing Data Intensive Applications - Storage and retrieval (part 2)

Part 5: Designing Data Intensive Applications - Encoding and Evolution (part 1) - Formats for Encoding Data

Part 6: Designing Data Intensive Applications - Encoding and Evolution (part 2) - Modes of Dataflow

Part 7: This Article

Part 8: Designing Data Intensive Applications - Replication (part 2) - Problems with replication lag

Part 9: Designing Data Intensive Applications - Replication (part 3) - Multi-Leader Replication

Part 10: Designing Data Intensive Applications - Replication (part 4) - Leaderless Replication

Part 11: Designing Data Intensive Applications - Partitioning

Chapter 5: Replication

Replication is simply the practice of keeping a copy of the same data on multiple machines, usually connected by a network. It has several benefits:

High availability: if one machine fails, the data can still be served by another machine.
Fault tolerance: if one machine loses data, the data can be recovered from another machine.
Scalability: the load can be distributed across multiple machines (especially for read-heavy workloads).
Latency: the data can be located closer to the user.

If the data we replicate were static, replication would be easy: we would only need to copy it to every node once.

The real challenge of replication lies in handling changes to the replicated data. Three families of algorithms are commonly used to propagate those changes between nodes:

Single-leader replication: only one node in the system is allowed to accept writes. The leader node is responsible for handling writes, and the follower nodes are responsible for replicating the writes from the leader.
Multi-leader replication: multiple nodes can accept writes. The nodes need to handle the conflicts that arise when two nodes accept writes concurrently.
Leaderless replication: any node can accept writes directly from clients. Clients typically send each write to several nodes and read from several nodes in parallel, so that stale data can be detected and corrected.

The choice between these algorithms often comes down to whether we prioritize consistency or availability, along with related trade-offs such as whether to replicate synchronously or asynchronously and how to handle failed replicas.

Leaders and Followers

Leader-based replication is the most common form of replication. In this model, one node is designated as the leader (or master), and the other nodes are designated as followers (or slaves).

It works as follows:

One of the replicas is designated as the leader. When clients need to write new data, they must send their requests to the leader, which first writes the new data to its local storage.
Whenever the leader writes new data to its local storage, it also sends the data change to all of its followers as part of a replication log or change stream. Each follower takes the log from the leader and updates its local storage by applying the changes in the same order as they were processed by the leader.
Clients can read from either the leader or any of the followers, but only the leader accepts writes. From the client's point of view, the followers are read-only.
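
The steps above can be sketched as a small in-memory model. The class and method names (`Leader`, `Follower`, `write`, `apply`) are illustrative assumptions, not the API of any real database, and changes are shipped with a direct method call instead of over a network.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    """Read-only replica that applies the leader's log in order."""
    data: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries applied so far

    def apply(self, entries):
        # Entries must be applied in exactly the order the leader produced them.
        for key, value in entries:
            self.data[key] = value
            self.applied += 1

    def read(self, key):
        return self.data.get(key)


@dataclass
class Leader:
    """The only node that accepts writes from clients."""
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # ordered (key, value) changes
    followers: list = field(default_factory=list)

    def write(self, key, value):
        # Write to local storage and append the change to the replication log.
        self.data[key] = value
        self.log.append((key, value))
        # Send the new log entry to every follower.
        for follower in self.followers:
            follower.apply([(key, value)])

    def read(self, key):
        return self.data.get(key)


follower_a, follower_b = Follower(), Follower()
leader = Leader(followers=[follower_a, follower_b])
leader.write("user:1", "alice")
leader.write("user:1", "bob")  # applied after the first write on every replica
```

Because every follower applies the same entries in the same order, a read of `user:1` gives the same answer on any node once the log has been applied.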

Leader-based replication is used by many popular databases such as PostgreSQL, by distributed message brokers such as Kafka, by network filesystems, and more.

Synchronous vs. Asynchronous replication

If the followers are synchronous, the leader waits until all of them have confirmed that they received the data and wrote it locally before it reports success to the user and makes the write visible to other clients.

If the followers are asynchronous, the leader sends the change but does not wait for a response from the followers.

The advantage of synchronous replication is that the followers are guaranteed to have an up-to-date copy of the data. The main disadvantage is that a follower's confirmation may be slow in some cases (e.g. during failure recovery), or may never arrive at all. In that case the leader cannot process the write: it must block all writes until the synchronous follower is available again.

For these reasons, making every follower synchronous is impractical: an outage of any single node would bring the whole system to a halt. In practice, enabling synchronous replication usually means that one of the followers is synchronous and the others are asynchronous. If the synchronous follower becomes unavailable or slow, one of the asynchronous followers is made synchronous. This guarantees an up-to-date copy of the data on at least two nodes (the leader and one follower), a configuration sometimes called semi-synchronous.
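
A minimal sketch of the semi-synchronous idea, under simplifying assumptions: `Replica`, `SemiSyncLeader` and the `healthy` flag are hypothetical names, an acknowledgement is modelled as a boolean return value, and a follower that was down does not catch up in this toy model.

```python
class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.log = []

    def replicate(self, entry):
        """Return True once the entry is stored locally (the acknowledgement)."""
        if not self.healthy:
            return False
        self.log.append(entry)
        return True


class SemiSyncLeader:
    def __init__(self, followers):
        self.log = []
        self.followers = followers
        self.sync_follower = followers[0]  # exactly one synchronous follower

    def write(self, entry):
        self.log.append(entry)
        # Wait for the synchronous follower; if it fails, promote another one.
        if not self.sync_follower.replicate(entry):
            self.sync_follower = self._promote_sync(entry)
        # The other followers are asynchronous: send and ignore the result.
        for follower in self.followers:
            if follower is not self.sync_follower:
                follower.replicate(entry)
        return "ok"

    def _promote_sync(self, entry):
        # The first follower that acknowledges the entry becomes synchronous.
        for follower in self.followers:
            if follower.replicate(entry):
                return follower
        raise RuntimeError("no follower acknowledged the write")


a, b, c = Replica("a"), Replica("b"), Replica("c")
leader = SemiSyncLeader([a, b, c])
leader.write("x=1")
a.healthy = False    # the synchronous follower goes down
leader.write("x=2")  # b acknowledges and becomes the synchronous follower
```

After the second write, the leader and `b` both hold `x=2`, so the "at least two nodes" guarantee still holds even though `a` is unavailable.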

Leader-based replication is often configured to be completely asynchronous. In that case, if the leader fails and is not recoverable, any writes that had not yet been replicated to the followers are lost. A write is therefore not guaranteed to be durable, even if it has been confirmed to the client.
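
The durability gap can be illustrated with a toy model. `AsyncLeader`, its `outbox` queue and the `ship` method are assumptions made for this sketch; real systems stream changes continuously over the network.

```python
from collections import deque


class AsyncLeader:
    def __init__(self):
        self.log = []
        self.outbox = deque()  # changes not yet shipped to the follower

    def write(self, entry):
        self.log.append(entry)
        self.outbox.append(entry)
        return "ok"  # confirmed before any follower has the entry

    def ship(self, follower_log, count=1):
        # Deliver up to `count` pending changes, oldest first.
        for _ in range(min(count, len(self.outbox))):
            follower_log.append(self.outbox.popleft())


leader, follower_log = AsyncLeader(), []
leader.write("a=1")
leader.write("b=2")        # both writes were confirmed to the client
leader.ship(follower_log)  # only the first change reaches the follower

# If the leader now fails permanently and the follower is promoted,
# every change still sitting in the outbox is gone.
lost = [entry for entry in leader.log if entry not in follower_log]
```

Here `lost` contains the second write: the client was told it succeeded, yet it never left the failed leader.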

Setting up new followers

From time to time we need to add new nodes to the replication cluster, for instance to support a higher read throughput.

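Simply copying the data files from one node to another is not sufficient: clients are constantly writing to the database, so a plain file copy would see different parts of the data at different points in time. Locking the database during the copy would make it consistent, but would go against the goal of high availability.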

To set up a new follower without downtime we may do the following:

Take a consistent snapshot of the leader’s database (without locking the whole db if possible).
Copy the snapshot to the new follower node.
The follower connects to the leader and requests all the data changes that have happened since the snapshot was taken. This requires the snapshot to be associated with an exact position in the leader's replication log.
The follower applies this backlog of changes on top of the snapshot. Once it has caught up, it keeps applying changes from the leader as they happen.
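
The procedure above can be sketched as follows. All names are hypothetical, and the log position is simply an index into an in-memory list; real databases use their own position format for this.

```python
import copy


class Leader:
    def __init__(self):
        self.data = {}
        self.log = []  # ordered list of (key, value) changes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        # A consistent copy of the data plus the log position it corresponds to.
        return copy.deepcopy(self.data), len(self.log)

    def changes_since(self, position):
        return self.log[position:]


class Follower:
    def __init__(self, snapshot_data, position):
        self.data = snapshot_data
        self.position = position

    def catch_up(self, leader):
        # Apply every change made after the snapshot, in log order.
        for key, value in leader.changes_since(self.position):
            self.data[key] = value
            self.position += 1


leader = Leader()
leader.write("a", 1)
data, position = leader.snapshot()   # take the snapshot and copy it over
leader.write("b", 2)                 # writes continue during the copy
leader.write("a", 3)
follower = Follower(data, position)  # start the new follower from the snapshot
follower.catch_up(leader)            # request and apply the backlog
```

The leader never stops accepting writes: the snapshot's log position is what lets the follower ask for exactly the changes it is missing.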

Handling node outages

Any node in the system may go down, whether due to a hardware failure, a network partition, or planned maintenance. The system should be able to handle these outages without losing data and without downtime.

Follower failure: Catch-up recovery

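Because each follower keeps a local log of the data changes it has received from the leader, it knows the last position it processed before the fault. After a crash or a network interruption, it can reconnect to the leader and request all the data changes since that position, then continue receiving changes as before.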
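
As a rough illustration, assuming the log is a plain list and the "last known position" is the length of the follower's own log (the `catch_up` helper is hypothetical):

```python
def catch_up(follower_log, leader_log):
    """Append to follower_log every leader entry it has not seen yet."""
    last_position = len(follower_log)     # last known position in the log
    missing = leader_log[last_position:]  # what the leader sends back
    follower_log.extend(missing)
    return len(missing)


leader_log = ["set a=1", "set b=2"]
follower_log = list(leader_log)     # follower is in sync, then goes down
leader_log += ["set a=3", "del b"]  # writes processed during the outage
recovered = catch_up(follower_log, leader_log)
```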

Leader failure: Failover

If the leader fails, one of the followers must be promoted to be the new leader, clients need to be reconfigured to send their writes to the new leader, and the other followers need to start consuming data changes from the new leader. This process is called failover.

Failover can be done manually or automatically. Automatic failover is the more complex option, since it requires detecting that the leader has failed and electing a new one.

Leader failure can be detected by the followers, by the clients, or by an external monitoring system. Detection usually relies on a timeout: if the leader does not respond within a given period, it is considered to have failed.
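
A timeout-based detector might look like the sketch below. The `FailureDetector` class and the 30-second timeout are assumptions for illustration; the clock is injectable so the example does not depend on real time.

```python
import time


class FailureDetector:
    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock
        self.last_heartbeat = clock()

    def heartbeat(self):
        # Called whenever a message from the leader arrives.
        self.last_heartbeat = self.clock()

    def leader_suspected(self):
        # The leader is only *suspected* dead: it may just be slow.
        return self.clock() - self.last_heartbeat > self.timeout


now = [0.0]  # fake clock, in seconds
detector = FailureDetector(timeout_seconds=30, clock=lambda: now[0])
now[0] = 10.0
detector.heartbeat()
now[0] = 35.0
alive = not detector.leader_suspected()  # 25 s since the last heartbeat
now[0] = 41.0
suspected = detector.leader_suspected()  # 31 s since the last heartbeat
```

Choosing the timeout is a trade-off: a long one delays recovery, while a short one can trigger unnecessary failovers when the leader is merely slow.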

The new leader can be chosen through a leader election algorithm. The algorithm should ensure that only one node is elected, and that the elected node is the one with the most up-to-date copy of the data (to minimize data loss).
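
The selection criterion alone (not a full election protocol, which also needs the nodes to agree on the result) could be sketched like this. The function name and the input shape are assumptions.

```python
def choose_new_leader(followers):
    """followers maps node name -> last replicated log position."""
    if not followers:
        raise ValueError("no follower available for promotion")
    # Highest log position wins; ties are broken by name so that every
    # node evaluating the same input picks the same candidate.
    return max(followers, key=lambda name: (followers[name], name))


positions = {"f1": 120, "f2": 131, "f3": 131}
new_leader = choose_new_leader(positions)
```

The deterministic tie-break only helps if all nodes see the same positions; getting them to agree on that input is the hard part that a real election algorithm solves.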

Clients then need to be reconfigured to send their writes to the new leader. If the old leader comes back, it must also be made to step down and become a follower, as it may still believe it is the leader.

Failover remains prone to many problems, such as split brain, where two nodes both believe they are the leader, or a leader that is still alive but unresponsive. For these reasons, some operations teams prefer to perform failover manually, even when automatic failover is available.

Implementation of Replication Logs

Under the hood, leader-based replication can be implemented in several ways. The main ones are:

Statement-based replication: the leader logs every write statement it executes (INSERT, UPDATE, DELETE) and sends it to the followers, which parse and execute it as if it came from a client. This is simple, but it breaks down when a statement calls a non-deterministic function (e.g. NOW(), RAND()), when it uses an auto-incrementing column or depends on existing data (e.g. UPDATE … WHERE), which requires every replica to execute statements in exactly the same order, or when it has side effects (triggers, stored procedures).
Write-ahead log (WAL) shipping: the leader appends every change to its write-ahead log as a sequence of bytes and sends that log to the followers, which process it to build an exact copy of the leader's data. This avoids the non-determinism problems of statement-based replication, but the log describes the data at a very low level, so replication is tightly coupled to the storage engine: the leader and followers typically cannot run versions of the database that use different storage formats.
Logical (row-based) log replication: the leader sends a separate stream that describes data changes at the granularity of rows, decoupled from the storage engine internals. This is more flexible than WAL shipping, since the leader and followers can run different versions and external applications can parse the stream more easily, but it is more complex to implement.
Trigger-based replication: triggers are defined on the tables that are being replicated. When a row is inserted, updated or deleted, the trigger writes the change to a separate table, from which it is replicated to the followers. Unlike the other methods, this happens at the application level rather than inside the database's replication machinery, which offers more flexibility (for example, replicating only a subset of the data) at the cost of more overhead.
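
The difference between replaying a statement and replaying its result can be shown with a small simulation. The two `random.Random` instances stand in for the independent state of two database nodes, and the function names are hypothetical.

```python
import random


def apply_statement(db, rng):
    # Statement-based: each replica re-executes "SET x = RAND()" on its own.
    db["x"] = rng.random()


def apply_row_change(db, change):
    # Logical (row-based): the replica stores the value the leader computed.
    key, value = change
    db[key] = value


leader_db, follower_db = {}, {}
apply_statement(leader_db, random.Random(1))
apply_statement(follower_db, random.Random(2))  # different generator state
statement_diverged = leader_db != follower_db

logical_follower_db = {}
apply_row_change(logical_follower_db, ("x", leader_db["x"]))
logical_in_sync = logical_follower_db == leader_db
```

With statement replay the two nodes end up with different values for `x`, while shipping the resulting row keeps the follower identical to the leader.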