
Designing Data Intensive Applications - Encoding and Evolution (part 2)

Clément Sauvage
Designing Data Intensive Applications Book notes - This article is part of a series.
Part 6: This Article

Modes of Dataflow #

In this part, we will explore the different ways data can flow between systems, and how the choice of encoding format can affect the dataflow.

Dataflow Through Databases #

Both forward and backward compatibility are usually required in databases, as you may have multiple applications reading and writing the same data while evolving at different rates (a new version of the application may write data that the old version still reads, and vice versa).

Sometimes, the integrity of the data has to be taken care of by the application code. Consider the example where a new version of the application adds a field to a record. An old version of the application does not know about this field: if it reads a record containing the field, updates it, and writes it back, it may silently drop the unknown field and overwrite the record without it.
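As a minimal sketch of this pitfall (the record layout, field names, and JSON encoding are made up for illustration), compare a read-modify-write cycle that keeps only the fields it knows about with one that carries unknown fields through untouched:

```python
import json

# Fields that the *old* version of the application knows about.
KNOWN_FIELDS = {"name", "email"}

def update_email_lossy(raw_record: str, new_email: str) -> str:
    """Buggy read-modify-write: keeps only the fields this version knows."""
    record = json.loads(raw_record)
    known = {k: v for k, v in record.items() if k in KNOWN_FIELDS}
    known["email"] = new_email
    return json.dumps(known)  # any field added by a newer version is dropped

def update_email_safe(raw_record: str, new_email: str) -> str:
    """Safe read-modify-write: carries unknown fields through untouched."""
    record = json.loads(raw_record)
    record["email"] = new_email
    return json.dumps(record)

row = '{"name": "Ada", "email": "ada@old.example", "photo_url": "http://..."}'
print(update_email_lossy(row, "ada@new.example"))  # photo_url is gone
print(update_email_safe(row, "ada@new.example"))   # photo_url is preserved
```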

Different values written at different times #

In a database, usually, any value can be updated at any time. So within a single database you may have some values that were written milliseconds ago and others that were written years ago.

Also, unlike application code, which is deployed to all the replicas in a short time, the content of a database is usually meant to be there for a long time: data outlives code. You may therefore encounter data that was encoded with a schema that is several years old.

Rewriting (migrating) data into a new format is a common solution to the problem of schema evolution. But it is expensive on large datasets, and may not be possible in some cases. So, some databases allow simple schema changes, such as adding a new column with a null default value, without rewriting the entire table.

Schema evolution allows the entire database to appear as if it was encoded with a single schema, even though the underlying storage may contain records encoded with different historical versions of the schema.
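A small sketch of what that illusion looks like (the schema versions, column names, and null defaults are hypothetical): rows written under an old schema version are decoded on read as if they used the latest schema, with later columns filled in by defaults:

```python
# Hypothetical schema history: each version lists the columns it defines.
SCHEMAS = {
    1: ["id", "name"],
    2: ["id", "name", "email"],          # v2 added "email"
    3: ["id", "name", "email", "tags"],  # v3 added "tags"
}
LATEST = 3

def read_row(written_with: int, values: list) -> dict:
    """Decode a row as if it had been written with the latest schema."""
    row = dict(zip(SCHEMAS[written_with], values))
    for column in SCHEMAS[LATEST]:
        row.setdefault(column, None)  # fill columns added later with null
    return row

print(read_row(1, [42, "Ada"]))
# {'id': 42, 'name': 'Ada', 'email': None, 'tags': None}
```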

Archival storage #

We can take snapshots of a database for backup, or to load it into a data warehouse. In those cases, the data will be encoded using the latest schema.

Dataflow Through Services: REST and RPC #

When data is sent over the network, there are different ways to arrange the communication. Most commonly, it is done with the roles of a client and a server, where the server exposes an API over the network, and the client can connect to the server and make requests to the API.

Some examples of clients include web browsers, mobile apps, other servers, etc.

A server can itself be a client of another server. For instance, a typical web application server is usually a client of a database server.

This way of decomposing a system into services, usually by area of functionality, is called service-oriented architecture (SOA), more recently rebranded as microservices.

The key design principle of SOA/microservices is loose coupling, which means that the client and the server are independent of each other, and can be developed, deployed and scaled independently.

Web services #

When HTTP is used as the underlying protocol for talking to a service, it is usually called a web service. But, despite that name, web services are not only used on the web; they can be found in different contexts:

  • A client application running on a user’s device can talk to a server over HTTP.
  • Multiple services within a company’s internal network can talk to each other over HTTP.
  • Services belonging to different companies making requests to each other over HTTP (typically public APIs over the internet).

Two popular approaches to web services are REST and SOAP:

  • REST (Representational State Transfer) is a design philosophy, not a protocol. It is built upon the principles of HTTP. It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for cache control, authentication, and content type negotiation. An API that follows the principles of REST is called a RESTful API.
  • SOAP (Simple Object Access Protocol) is an XML-based protocol for making network API requests. It is commonly used over HTTP, but it aims to be independent of HTTP and avoids using most of the features of HTTP.

Definition formats such as OpenAPI (formerly Swagger) can be used to describe RESTful APIs and provide documentation.
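To make those REST principles concrete, here is a minimal sketch of a RESTful call using Python’s `requests` library (the URL and resource are made up): the URL identifies the resource, and plain HTTP features handle content negotiation, caching, and errors:

```python
import requests  # third-party HTTP client

# The URL identifies the resource; standard HTTP headers and status codes
# do the rest of the work, with no separate protocol layer on top.
response = requests.get(
    "https://api.example.com/users/42",
    headers={"Accept": "application/json"},  # content type negotiation
    timeout=5,
)
response.raise_for_status()                   # HTTP status codes signal errors
print(response.headers.get("Cache-Control"))  # HTTP-level cache control
print(response.json())                        # simple data format (JSON)
```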

The problems with remote procedure calls #

RPC (Remote Procedure Call) is an idea which has been around since the 70s. Originally, the RPC model tried to make a request to a remote network service look the same as calling a function or method in your programming language. But this is fundamentally flawed as there are many differences between a local function call and a remote service call:

  • A local function call is predictable, while a remote service call may fail due to network issues, a slow or unavailable server, etc.
  • A local function call either returns a result, throws an exception, or never returns. A remote service call has another possible outcome: it may return without a result, due to a timeout for instance, in which case we have no way of knowing whether the request got through.
  • We may want to retry a failed RPC, but the original request may actually have succeeded with only its response lost; retrying would then execute the action twice, unless we make the RPC idempotent (see the sketch after this list).
  • The network is not reliable and may be congested. A local function call usually takes about the same amount of time to execute, while a remote service call may take a highly variable amount of time to return.
  • With local functions, we can efficiently pass pointers or references to memory locations. With RPC, the data needs to be encoded, sent over the network, and decoded, which can be expensive.

For these and other reasons, there is no point in trying to make a remote service call look like a local function call. They are fundamentally different, and hiding that difference only masks potential issues.
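As a sketch of the idempotency point from the list above (the endpoint and the Idempotency-Key header are hypothetical, in the style of some payment APIs): the client generates one key per logical operation and reuses it on every retry, so a request whose response was lost is not executed twice:

```python
import time
import uuid
import requests

def create_transfer(amount_cents: int, retries: int = 3) -> dict:
    # One key per logical operation, reused across retries: if the first
    # request succeeded but its response was lost, the server can detect
    # the duplicate and avoid executing the transfer twice.
    idempotency_key = str(uuid.uuid4())
    for attempt in range(retries):
        try:
            response = requests.post(
                "https://api.example.com/transfers",
                json={"amount_cents": amount_cents},
                headers={"Idempotency-Key": idempotency_key},
                timeout=2,
            )
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(2 ** attempt)  # back off before retrying
```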

Current directions for RPC #

Various RPC frameworks have been developed over the years, such as Apache Thrift, Apache Avro (which includes an RPC mechanism), and Google’s gRPC (built on Protocol Buffers).

These frameworks are more explicit about the fact that a remote call is different from a local function call. Some use futures (promises) to encapsulate asynchronous actions that may fail. gRPC supports streams, which allow a client to make a request to a server and the server to respond with multiple messages over time.
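A minimal sketch of the future/promise style using only Python’s standard library (`fetch_user` is a made-up stand-in for a framework-generated RPC stub): the call returns immediately, and failure or timeout is handled explicitly when the result is requested, rather than pretending the call behaves like a local function:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def fetch_user(user_id: int) -> dict:
    # Hypothetical stand-in for an RPC stub; the real network I/O
    # would happen here.
    return {"id": user_id, "name": "Ada"}

executor = ThreadPoolExecutor(max_workers=4)
future = executor.submit(fetch_user, 42)  # returns immediately

try:
    result = future.result(timeout=2.0)  # block, but with an explicit deadline
except TimeoutError:
    result = None  # the caller must decide what a missing response means
```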

Some of these frameworks also provide support for service discovery, which is the process of finding out the network address of a service, given only its name. This is useful in a dynamic environment where services may move around between machines.

Custom RPC protocols with binary encodings are usually more performant than RESTful services with JSON encoding. But RESTful services are simpler, easier to experiment with, and easier to debug; they are supported by all mainstream programming languages and platforms, and come with a vast ecosystem of tools.

For public APIs, the RESTful style is predominant, but for internal services within the same organization, RPC frameworks might sometimes be a better choice.

Data encoding and evolution for RPC #

It is reasonable to assume that the servers will be updated before the clients; we therefore need backward compatibility on requests, and forward compatibility on responses.

The backward and forward compatibility of an RPC scheme are inherited from the encoding it uses:

  • Thrift, gRPC (with Protocol Buffers) and Avro can be evolved according to the compatibility rules of the respective encoding format.
  • RESTful services typically use JSON (without a formally specified schema) for responses, and JSON or URI-encoded/form-encoded parameters for requests. Adding optional request parameters and adding new fields to response objects are usually considered changes that maintain compatibility.

There is no agreement on how API versioning should be done. For RESTful services, it is common to put the version number in the URL or in the HTTP headers.
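For example, a hypothetical version-in-the-URL setup might look like this (sketched with Flask; the routes and fields are made up): old clients keep calling /v1 while newer clients use /v2, and both can be served during a migration:

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/v1/users/<int:user_id>")
def get_user_v1(user_id):
    return jsonify({"id": user_id, "name": "Ada"})

@app.route("/v2/users/<int:user_id>")
def get_user_v2(user_id):
    # v2 added a field; /v1 responses stay unchanged for old clients
    return jsonify({"id": user_id, "name": "Ada", "avatar_url": None})

if __name__ == "__main__":
    app.run()
```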

Message-Passing Dataflow #

We have discussed the different ways encoded data flows from one process to another: REST and RPC, where one process sends a request over the network to another process and expects a response as quickly as possible, and databases, where one process writes data to disk and another process reads it sometime in the future.

In some situations, we might instead prefer asynchronous message-passing.

It sits somewhere between RPC and databases: like RPC, a client’s request (usually called a message) is delivered to another process with low latency; like a database, the message is not sent via a direct network connection, but goes through an intermediary (a message broker, also called a message queue), which stores the message temporarily.

Using a message broker has several advantages over direct RPC:

  • It can act as a buffer if the recipient is unavailable or overloaded, improving the reliability of the system.
  • It can automatically redeliver messages to a process that has crashed, preventing messages from being lost.
  • It avoids the sender needing to know the IP address and port number of the recipient (useful in a dynamic environment, such as cloud deployment, where machines come and go).
  • It allows one message to be delivered to multiple recipients.
  • It logically decouples the sender from the recipient. The sender just publishes a message, without caring about who consumes it.

One big difference from RPC is that message-passing communication is usually one-way: the sender does not expect a response from the recipient.

This asynchronous communication pattern, where the sender doesn’t wait for the message to be delivered, makes sense in loosely coupled systems where services are expected to be independently deployable and scalable.

Message brokers #

Generally, a message broker is used as follows: one process sends a message to a named queue or topic, and the broker ensures that the message is delivered to one or more consumers (also called subscribers) of that queue or topic. There can be multiple producers and multiple consumers on the same queue or topic.

Message brokers typically don’t enforce any particular data model. Since a message is only a sequence of bytes, we can use any encoding format we like. Using an encoding format that is both forward and backward compatible gives us maximum flexibility to evolve publishers and subscribers independently.
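To make the publish/consume shape concrete, here is a toy in-memory broker (deliberately not a real broker such as RabbitMQ or Kafka): producers publish bytes to a named topic, the broker buffers them, and consumers receive them without ever knowing the producer’s address:

```python
import threading
import queue

class Broker:
    """Stores messages and forwards them to every subscriber of a topic."""

    def __init__(self):
        self._topics: dict[str, list[queue.Queue]] = {}
        self._lock = threading.Lock()

    def subscribe(self, topic: str) -> queue.Queue:
        mailbox = queue.Queue()
        with self._lock:
            self._topics.setdefault(topic, []).append(mailbox)
        return mailbox

    def publish(self, topic: str, message: bytes) -> None:
        # A message is just a sequence of bytes: any encoding works.
        with self._lock:
            subscribers = list(self._topics.get(topic, []))
        for mailbox in subscribers:
            mailbox.put(message)  # one message can reach many consumers

broker = Broker()
inbox = broker.subscribe("user.signed_up")
broker.publish("user.signed_up", b'{"user_id": 42}')  # no recipient address needed
print(inbox.get())
```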

Distributed actor frameworks #

Another way to do message-passing is to use a distributed actor framework. In this model, each actor is like an independent thread that has its own local state and communicates with other actors by sending and receiving messages.

With this model, we don’t deal with threads directly, and we avoid the associated problems, such as race conditions, deadlocks, etc.
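A minimal hand-rolled sketch of the actor idea (not a real framework such as Akka or Erlang/OTP): each actor owns its state and processes messages one at a time from a mailbox, so no locks are needed around that state:

```python
import threading
import queue

class CounterActor:
    """Owns its state; the only way to interact with it is via messages."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # local state, touched only by the actor's own thread
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            message = self._mailbox.get()  # one message at a time: no races
            if message == "increment":
                self._count += 1

    def send(self, message):
        self._mailbox.put(message)  # asynchronous, fire-and-forget

actor = CounterActor()
for _ in range(3):
    actor.send("increment")
```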

The actor model can also be used to scale a system across multiple nodes: messages between actors work the same way whether the sender and recipient are on the same node or on different nodes.

Location transparency works better in the actor model than in the RPC model, because the actor model already assumes that messages may be lost, even within a single process.

Summary #

In this chapter, we explored several ways of encoding data structures into bytes, either for storage on disk or for transmission over the network. We saw that the choice of encoding format has a big impact on our ability to evolve the system over time, and hence on the architecture of the system.

As many services now need to support rolling upgrades (gradually updating the nodes of a service, without downtime), the ability of a system to handle multiple versions of the data schema at once is crucial.

The different encoding formats we discussed are:

  • Language-specific formats
  • Textual formats (JSON, XML, CSV)
  • Binary schema-driven formats (Thrift, Protocol Buffers, Avro)

We also discussed the different ways data can flow between systems:

  • Databases, where the process writing to the database encodes the data, and the process reading from the database decodes it.
  • REST and RPC, where the client encodes a request that is decoded by the server, and then the server encodes a response that is decoded by the client.
  • Asynchronous message-passing, where the sender encodes a message that is decoded by the recipient.