Designing Data Intensive Applications - Reliable, Scalable and Maintainable applications

"Data system" is a large umbrella term: many new tools for data storage and processing have emerged in recent years. They are optimized for a variety of use cases and no longer fit neatly into the traditional categories. For example, Redis is a data store that can also be used as a message queue, while Kafka is a message queue with database-like durability guarantees.

The application code is responsible for keeping these data systems working together, for instance by keeping caches and indexes in sync with the main database.
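As a rough illustration of that glue code, here is a minimal sketch in which plain dictionaries stand in for a database, a cache and a secondary index. The `UserService` class and its method names are hypothetical, not something from the book.

```python
class UserService:
    """Hypothetical application code sitting in front of three 'data systems'."""

    def __init__(self):
        self.database = {}    # primary store: user_id -> name
        self.cache = {}       # read cache: user_id -> name
        self.name_index = {}  # secondary index: name -> set of user_ids

    def save_user(self, user_id, name):
        old_name = self.database.get(user_id)
        self.database[user_id] = name
        # Invalidate the cached entry so later reads cannot return stale data.
        self.cache.pop(user_id, None)
        # Keep the secondary index in step with the primary store.
        if old_name is not None:
            self.name_index[old_name].discard(user_id)
        self.name_index.setdefault(name, set()).add(user_id)

    def get_user(self, user_id):
        # Serve from the cache when possible, otherwise fill it from the database.
        if user_id in self.cache:
            return self.cache[user_id]
        name = self.database[user_id]
        self.cache[user_id] = name
        return name

    def find_by_name(self, name):
        return sorted(self.name_index.get(name, set()))
```

Every write path has to remember all three structures. That is exactly the kind of responsibility that moves into application code once several specialized tools are combined.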

Definitions #

Reliability: The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).

Scalability: As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.

Maintainability: Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.

Reliability #

Performs the function expected by the user
Tolerates user mistakes and misuse
Performance is good enough for the required use case, under the expected load and data volume
Prevents any unauthorized access and abuse

The things that go wrong are called faults, and the systems must try to be fault-tolerant or resilient.

There are three main kinds of faults:

Hardware faults
Software errors
Human errors

Scalability #

Fan-out: in transaction processing systems, the number of requests to other services that we need to make in order to serve one incoming request. At Twitter, the rate of publications is orders of magnitude lower than the rate of timeline reads, so the scaling challenge came from fan-out rather than from the volume of tweets. The solution was to do more of the work at write time: when a tweet is published, it is inserted into a cached timeline for each follower, which makes reads cheap.
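The write-time approach can be sketched as follows. This is a toy in-memory version under stated assumptions: the names `follow`, `publish`, `home_timeline` and the `TIMELINE_LIMIT` cap are made up for illustration and say nothing about how Twitter implemented it.

```python
from collections import defaultdict, deque

TIMELINE_LIMIT = 100  # hypothetical cap on cached entries per user

followers = defaultdict(set)  # author -> set of follower ids
timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LIMIT))  # user -> cached timeline


def follow(follower, author):
    followers[author].add(follower)


def publish(author, text):
    """Fan out on write: one cache insert per follower of the author."""
    for follower in followers[author]:
        timelines[follower].appendleft((author, text))
    return len(followers[author])  # the fan-out of this write


def home_timeline(user):
    """Reads are cheap: the timeline was already assembled at write time."""
    return list(timelines[user])
```

The cost of `publish` grows with the number of followers, while `home_timeline` is a single lookup. That trade-off only pays off when reads are far more frequent than writes.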

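Scalability: a system’s ability to cope with increased load. It is not a yes-or-no property. The useful question is: if the system grows in a particular way, what are our options for coping with the growth? Load is described with load parameters, and which ones matter depends on the architecture (for example requests per second, the ratio of reads to writes, or the hit rate on a cache).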

Latency and response time are often used synonymously, but they are not the same.

The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queuing delays.

Latency is the duration that a request is waiting to be handled—during which it is latent, awaiting the service.

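Response times are never constant, so we should treat them as a distribution of values that we can measure. We can look at the average response time, but better metrics are the median (p50) and the higher percentiles p95, p99 and p999. To compute them, sort the measured response times from fastest to slowest: the median is the halfway point, and p95 is the threshold that 95% of requests are faster than.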
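To make this concrete, here is a small sketch that computes percentiles with the nearest-rank method. This is one of several common definitions, chosen here for simplicity, and the sample values are invented.

```python
import math


def percentile(response_times_ms, p):
    """Nearest-rank percentile: the smallest measurement such that at least
    p percent of all measurements are less than or equal to it."""
    if not response_times_ms:
        raise ValueError("need at least one measurement")
    ordered = sorted(response_times_ms)  # fastest to slowest
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented sample: mostly fast requests plus two slow outliers.
samples = [12, 15, 11, 14, 13, 250, 16, 12, 18, 900]
mean = sum(samples) / len(samples)
print("mean:", mean)
print("p50:", percentile(samples, 50))
print("p95:", percentile(samples, 95))
```

With this sample the mean is pulled far above what a typical request experiences, while the median stays close to the typical value and p95 exposes the outliers.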

Percentiles are often used in SLOs (service level objectives) and SLAs (service level agreements). For instance, an SLA may state that the service is considered up if its median response time is below 200 ms and its 99th percentile is below 1 s.

Queuing delays often account for a large part of the response time at high percentiles. As a server can only process a small number of requests in parallel, a few slow requests can hold up the processing of all the requests queued behind them. This is called head-of-line blocking. (A timeout on the request context might help limit this.)
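A tiny simulation shows the effect. It assumes a single worker that handles requests strictly in arrival order; the function name and the numbers are illustrative only.

```python
def simulate_fifo(requests):
    """Simulate one worker serving requests in arrival order.

    requests: list of (arrival_ms, service_ms) tuples sorted by arrival time.
    Returns one dict per request with its queuing delay and response time.
    """
    results = []
    free_at = 0  # time at which the worker becomes idle again
    for arrival, service in requests:
        start = max(arrival, free_at)  # wait if the worker is still busy
        finish = start + service
        results.append({
            "queued_ms": start - arrival,
            "response_ms": finish - arrival,
        })
        free_at = finish
    return results


# One slow request (500 ms) followed by three fast ones (5 ms each).
for row in simulate_fifo([(0, 500), (10, 5), (20, 5), (30, 5)]):
    print(row)
```

The three fast requests need only 5 ms of service time each, yet the client sees response times of almost half a second because they sit in the queue behind the slow one.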

We can measure response times under realistic conditions with load testing.

When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request.
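A quick back-of-the-envelope calculation shows how fast this adds up. The sketch below assumes that backend calls are slow independently of each other, which is a simplification; the function name is made up.

```python
def chance_of_slow_request(p_slow_backend, backend_calls):
    """Probability that at least one of the backend calls is slow,
    assuming each call is slow independently with probability p_slow_backend."""
    return 1 - (1 - p_slow_backend) ** backend_calls


for calls in (1, 10, 100):
    print(calls, round(chance_of_slow_request(0.01, calls), 3))
```

Even if only 1% of backend calls are slow, a user request that needs 100 of them has roughly a 63% chance of hitting at least one slow call.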

How do we cope with load? An architecture that is appropriate for one level of load is unlikely to cope with 10x that load. Using several fairly powerful machines can still be simpler and cheaper than a large number of small virtual machines.

Some systems are elastic, in the sense that they automatically add computing resources when the load increases. Others need to be scaled manually. The former is very useful when the load variation is unpredictable. The latter is simpler, more predictable, and generates fewer operational surprises.
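As an illustration of what an elastic system decides on its own, here is a sketch of a simple proportional scaling rule. The rule, the function name and the bounds are assumptions made for this example, not something prescribed by the book.

```python
import math


def desired_replicas(current_replicas, observed_load, target_load,
                     min_replicas=1, max_replicas=10):
    """Scale the replica count in proportion to observed load per replica.

    observed_load and target_load use the same unit, for example the average
    CPU utilization per replica in percent.
    """
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    wanted = math.ceil(current_replicas * observed_load / target_load)
    # Clamp so a traffic spike or a bad metric cannot scale without bound.
    return max(min_replicas, min(max_replicas, wanted))
```

For example, 4 replicas running at 90% utilization with a 60% target would be scaled to 6. With manual scaling, a human makes the same calculation and decides when to apply it.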

We used to avoid distributing the stateful data systems, because unlike stateless services, taking them from a single node to a distributed setup could introduce a lot of complexity. But with the evolution of the tech, the abstractions for distributed systems are getting better and better.

An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare—the load parameters. If those assumptions turn out to be wrong, the engineering effort for scaling is at best wasted, and at worst counterproductive.

In an early-stage startup or an unproven product, it’s usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load.

Maintainability #

The majority of the cost of software is not in its initial development but in its ongoing maintenance: fixing bugs, keeping the system operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying tech debt, and adding new features.

We should design software in such a way that it minimizes pain during maintenance.

Aim for the three following:

Operability: Make it easy for operations teams to keep the system running smoothly.
Simplicity: Make it easy for engineers to understand the system, removing as much complexity as possible.
Evolvability: Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change (also known as extensibility, modifiability or plasticity).

Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity. That is, removing the complexity that is not inherent to the problem.

TDD and refactoring are technical tools and patterns developed by the Agile community to manage a frequently changing environment.

Designing Data Intensive Applications Book notes - This article is part of a series.
