Designing Data Intensive Applications - Replication (part 2) - Problems with replication lag

Table of Contents

Designin Data Intensive Applications Book notes - This article is part of a series.

Part 1: Designing Data Intensive Applications - Reliable, Scalable and Maintainable applications

Part 2: Designing Data Intensive Applications - Data models and query languages

Part 3: Designing Data Intensive Applications - Storage and retrieval (part 1)

Part 4: Designing Data Intensive Applications - Storage and retrieval (part 2)

Part 5: Designing Data Intensive Applications - Encoding and Evolution (part 1) - Formats for Encoding Data

Part 6: Designing Data Intensive Applications - Encoding and Evolution (part 2) - Modes of Dataflow

Part 7: Designing Data Intensive Applications - Replication (part 1) - Leaders and Followers

Part 8: This Article

Part 9: Designing Data Intensive Applications - Replication (part 3) - Multi-Leader Replication

Part 10: Designing Data Intensive Applications - Replication (part 4) - Leaderless Replication

Part 11: Designing Data Intensive Applications - Partitioning

Problems with replication lag
#

As a reminder, the main reasons to want replication are:

To tolerate nodes failures (high availability)
For better scalability (Processing more requests than a single node can handle)
For lower latency (placing replicas geographically closer to users)

In a leader-based replication setting, as all writes goes to a single node, and read-only queries can go to any replicas, we have to consider the following: Replicating the data asynchronously.

Replicating the data synchronously would make the entire system unavailable for writing on any single node failure, which is more and more likely as we add more nodes to the system.

Replicating asynchronously unfortunately comes with the risk of replication lag. This inconsistency is temporary, as, if we were to stop writing and waited for a while, all the followers would eventually catch up and become consistent with the leader.

For this reason, we usually talk about eventual consistency in distributed systems. It is a guarantee that given enough time, all nodes will eventually become consistent. It is a weaker guarantee than strong consistency, but it is often sufficient for many applications.

Reading our own writes
#

In a lot of modern applications the user is likely to write some new data (e.g a new post, a comment, etc…) and expect to see it immediately.

But with asynchronous replication, if the reads are sent to a replica, the replica may not have received the write yet. In this situation, we need read-after-write consistency, also known as read-your-writes consistency. It is a guarantee that if a user reloads the page, they will always see the updates they submitted themselves.

There are some ways to achieve read-after-write consistency:

When reading some data that may have been modified by the user, read it from the leader.
We may monitor the replication lag, so that any read from a user that has written to the data under a certain time ago is read from the leader.

Additional complexity may arise when the user is accessing the service from multiple devices. In which case, we want to achieve cross-device consistency.

For this, the approach with the stored timestamp of the last write in the user’s session will need the information to be centralized, as we can no longer rely on it being stored on one device.

If our replicas are distributed over multiple data centers, we will also need to make sure that all the user’s requests are sent to the same data center, as they probably each have their own leader.

Monotonic reads
#

Another issue with replication lag due to asynchronous replication, is that one user may experience what is called moving backward in time.

This may happen if our user makes several reads from different replicas, and for instance, the first read is from a replica that is up-to-date, and the second read is from a replica that is lagging behind (the user does not see the data they saw on the first read).

Monotonic reads is a guarantee against such anomaly. It is a lesser guarantee than strong consistency, but a stronger guarantee than eventual consistency.

With monotonic reads, a user may read old data, but they are guaranteed to never read from a replica that is more behind in time than the one they read from previously (during sequential reads).

One way to achieve monotonic reads is to have the user always read from the same replica.

Consistent prefix reads
#

Different replicas may receive the writes in a different order, which may lead to different results when reading from them.

Consistent prefix reads is a guarantee that if a sequence of writes is happening in a certain order, then any replicas that have processed these writes will see them in the same order.

Solutions to replication lag
#

When working with an eventually consistent system, we need to be aware of the replication lag.

Some applications may be fine and offer a good user experience, even with minutes or hours of replication lag. But for some applications, it may be unacceptable.

The system needs to be designed to handle replication lag if needed, and to acknowledge the asynchronous nature of the replication.

The solutions discussed earlier are ways for the application to provide a better guarantee to the user than the database itself can provide.

But dealing with replication lag with the application code can be complex and error-prone.

Transactions are an example of a feature that is difficult to implement in a distributed environment. For which, many are asserting that eventual consistency is inevitable in a distributed system.