
Designing Data Intensive Applications - Encoding and Evolution (part 1)

Architecture Backend Design Software
Clément Sauvage
Designing Data Intensive Applications Book notes - This article is part of a series.
Part 5: This Article

Chapter 4: Encoding and Evolution #

Software applications change over time, as we justified in chapter 1 with the need for evolvability. Data schemas also need to be able to change over time. A change in the data schema usually requires a change in the code, but unfortunately it is often impossible to update all the code instantly:

  • With server-side applications, we may need to perform a rolling upgrade, deploying the new version to a few nodes at a time
  • With client-side applications, we cannot ensure that all clients update their application

Old and new versions of the data format may therefore coexist, and our applications should be able to handle both, thanks to:

  • Backward compatibility: new code can read data written by older code
  • Forward compatibility: old code can read data written by newer code

Backward compatibility is usually easier to achieve than forward compatibility: when we write newer code, we know the older format, but older code cannot know about formats defined after it was written.

Formats for Encoding Data #

When an application is not working with data in memory (through convenient data structures, typically using pointers, optimized for efficient CPU manipulation), it needs to write data to a file or send it over the network. In both cases, the data needs to be encoded as a self-contained sequence of bytes (e.g. a JSON document). The translation from the in-memory representation to a byte sequence is called serialization, encoding or marshalling. The reverse process is called deserialization, decoding or unmarshalling.
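As a minimal illustration in Python, using JSON purely as an example of such a self-contained byte sequence:

```python
import json

record = {"user_id": 42, "interests": ["architecture", "backend"]}

# Serialization / encoding / marshalling: in-memory structure -> byte sequence
encoded = json.dumps(record).encode("utf-8")

# Deserialization / decoding / unmarshalling: byte sequence -> in-memory structure
decoded = json.loads(encoded)
assert decoded == record
```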

Language-specific Formats #

Many programming languages come with built-in support for encoding in-memory objects into byte sequences (e.g. pickle in Python, java.io.Serializable in Java). This is convenient, since it lets us perform these operations with very little additional code, but it comes with some limitations:

  • The encoding is usually tied to a particular programming language, and may not be understood by programs written in other languages.
  • Versioning (forward and backward compatibility) and efficiency are usually an afterthought, and may not be well supported.
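For instance, Python's built-in pickle module illustrates both points in a few lines: encoding is effortless, but the bytes are only meaningful to other Python programs.

```python
import pickle

record = {"user_id": 42, "active": True}

# Encoding with the language's built-in mechanism is a one-liner...
blob = pickle.dumps(record)

# ...but the result is an opaque, Python-specific byte string: a Java or Go
# service receiving it could not decode it, and compatibility across pickle
# protocol versions is limited.
assert pickle.loads(blob) == record
```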

JSON, XML and Binary Variants #

JSON and XML are popular formats for encoding data in a way that is easy for humans to read and write. CSV is another popular language-independent format, though less powerful. These formats have some notable limitations:

  • Ambiguity in the encoding of numbers: JSON, for instance, does not distinguish between integers and floating-point numbers, and does not specify a precision. Large numbers are also a problem: integers greater than 2^53 cannot be represented exactly by an IEEE 754 double-precision float, so parsers that use floats may silently corrupt them.
  • JSON and XML support Unicode character strings, but not binary strings (e.g. for images). A common workaround is to encode the binary data as Base64 and indicate that in the schema, but this increases the data size by 33%. (Both of the first two issues are illustrated in the sketch after this list.)
  • Although there is optional schema support for both XML and JSON, it is not widely used for JSON. Applications that don’t use schemas need to hardcode the appropriate encoding/decoding logic, which may be difficult to maintain.
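A quick Python sketch of the number-precision and binary-data issues:

```python
import base64
import json

# Parsers that map JSON numbers to IEEE 754 doubles cannot represent
# integers larger than 2**53 exactly:
big = 2**53 + 1
print(big)         # 9007199254740993
print(float(big))  # 9007199254740992.0 -- the trailing 1 is silently lost

# JSON has no binary string type: raw bytes cannot be encoded directly...
try:
    json.dumps({"image": bytes(300)})
except TypeError as err:
    print(err)     # Object of type bytes is not JSON serializable

# ...so binary data is smuggled through as Base64 text, ~33% larger:
print(len(base64.b64encode(bytes(300))))  # 400 characters for 300 bytes
```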

These formats have many flaws, but they are still good enough for many purposes, especially for data interchange, as long as people agree on what the format is.

Binary Encoding #

In contexts where the data is only used internally, there is less pressure to use a human-readable, lowest-common-denominator format, and a more compact binary encoding can be used. The gain is negligible for small amounts of data, but becomes significant at scale.

Binary encodings of JSON (and XML) have been developed over the years, such as MessagePack, BSON, BJSON and others, although they are not as widely adopted as the textual versions.
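A small sketch of the size difference, assuming the third-party msgpack package is installed (pip install msgpack); the record reuses the book's example values:

```python
import json
import msgpack  # third-party MessagePack implementation

record = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}

as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
as_msgpack = msgpack.packb(record)

# The binary encoding is more compact, since it does not need quotes,
# colons or brackets, and encodes lengths and numbers in binary form.
print(len(as_json), len(as_msgpack))
assert msgpack.unpackb(as_msgpack) == record
```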

Thrift and Protocol Buffers #

Both Apache Thrift and Protocol Buffers require a schema for the data being encoded, each written in its own interface definition language.

They both come with a code generation tool that takes a schema definition and produces classes implementing it in various programming languages; our application code then calls the generated code to encode or decode records.

Both Thrift and Protocol Buffers are designed to be compact, efficient and extensible. Their binary encoding is slightly different, but overall quite similar.
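A minimal sketch of that workflow for Protocol Buffers in Python. The person.proto schema below is hypothetical, and the person_pb2 module is assumed to have been generated from it with protoc --python_out:

```python
# person.proto (hypothetical):
#
#   syntax = "proto3";
#   message Person {
#     string name               = 1;  // the tag numbers identify fields
#     string email              = 2;  // in the binary encoding
#     repeated string interests = 3;
#   }

import person_pb2  # generated by protoc from the schema above (assumption)

person = person_pb2.Person(name="Martin", email="martin@example.com")
person.interests.append("hacking")

blob = person.SerializeToString()  # compact binary encoding of the record

decoded = person_pb2.Person()
decoded.ParseFromString(blob)      # decoding relies on the same schema
assert decoded.name == "Martin"
```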

Field tags and schema evolution #

Schemas inevitably need to change over time; this is called schema evolution.

Here are the different general compatibility rules for schema evolution:

| Compatibility type | Meaning | Evolution rules |
| --- | --- | --- |
| Backward compatibility | New code can read data written by old code | Fields may be deleted; optional fields may be added |
| Forward compatibility | Old code can read data written by new code | Fields may be added; optional fields may be deleted |
| Full compatibility | Both backward and forward compatibility | Optional fields may be added; optional fields may be deleted |

With Thrift and Protocol Buffers, each field is identified by a tag number, which is what the binary encoding uses to refer to the field. When a field is removed, its tag number must never be reused for a different field, since old data may still exist that uses that tag.

Datatypes and schema evolution #

Changing the type of a field is risky. For instance, changing a field from a 64-bit integer to a 32-bit integer means that values which don’t fit in 32 bits will be truncated, losing data.
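A quick illustration of that truncation, using Python's struct module to mimic re-encoding a value as a signed 32-bit integer (the value is made up):

```python
import struct

big = 3_000_000_000  # fits comfortably in 64 bits, but not in a signed 32-bit int

# Encode as 64-bit, then keep only the low 32 bits, as a lossy
# 64-bit -> 32-bit schema change would:
truncated, = struct.unpack("<i", struct.pack("<q", big)[:4])
print(truncated)     # -1294967296 -- the original value is gone
```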

Avro #

Apache Avro is another binary encoding format, which is more focused on schema evolution than Thrift and Protocol Buffers.

Avro uses a schema to define the structure of the data being encoded. It has two schema languages: Avro IDL (Interface Definition Language), intended for human editing, and another based on JSON, which is more easily machine-readable.
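For illustration, here is the same record type in both schema languages, loosely based on the book's example; the JSON form is written as a Python dict, since that is what the libraries typically consume:

```python
# Avro IDL (intended for human editing):
#
#   record Person {
#     string        userName;
#     long          favoriteNumber;
#     array<string> interests;
#   }

# Equivalent JSON schema:
person_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": "long"},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
}
```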

Unlike Thrift and Protocol Buffers, Avro’s binary encoding contains no field tags: the encoder and decoder simply go through the fields in the order they appear in the schema. This means the binary data can only be decoded correctly if the code reading the record uses the same schema as the code that wrote it.

The writer’s schema and the reader’s schema #

In reality, the schemas used for writing and for reading a record don’t need to be identical; they only need to be compatible.

When reading a record, the Avro library resolves the differences between the two schemas by translating the data from the writer’s schema into the reader’s schema. For instance, fields in a different order are no problem, since schema resolution matches up fields by name. If the reading code encounters a field that appears in the writer’s schema but not in the reader’s schema, the field is ignored. If the reading code expects a field that the writer’s schema does not contain, it is filled in with the default value declared in the reader’s schema.
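A sketch of that resolution with the third-party fastavro library (an assumption; the standard avro package behaves similarly). The writer's schema has a field the reader does not know about, and the reader expects a field the writer never wrote:

```python
import io
import fastavro

writer_schema = {
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteColor", "type": "string"},  # unknown to the reader
    ],
}

reader_schema = {
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": "long", "default": 42},  # missing from the writer
    ],
}

buf = io.BytesIO()
fastavro.schemaless_writer(buf, writer_schema,
                           {"userName": "Martin", "favoriteColor": "blue"})

buf.seek(0)
decoded = fastavro.schemaless_reader(buf, writer_schema, reader_schema)
print(decoded)  # {'userName': 'Martin', 'favoriteNumber': 42}
```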

Schema evolution rules #

With Avro, we usually want to aim for full compatibility, so we may only add or remove fields that have a default value. In some programming languages, null would be an acceptable default for any field, but in Avro a field can only default to null if its type is a union that includes null.

Avro doesn’t have optional or required markers for fields as Thrift and Protocol Buffers do. Every field is always present, but a field can be made nullable with a union type (e.g. "type": ["int", "null"]).
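For instance, making the favoriteNumber field from the earlier schema sketch nullable with a default of null could look like this (the Avro specification requires a default value to match the first branch of the union, hence null first):

```python
{"name": "favoriteNumber", "type": ["null", "long"], "default": None}
```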

What’s the writer’s schema? #

With Avro, the reader needs to know the writer’s schema, but including it with every record would be inefficient. How the reader gets the writer’s schema depends on the context in which Avro is used:

  • If a large file of log records is written in Avro, the writer’s schema can be stored once at the beginning of the file (see the sketch after this list).
  • If a database stores records individually, each record can be prefixed with a version number, and a list of schema versions can be kept in the database.
  • If records are sent over a network, the two parties can negotiate the schema version at the beginning of the connection; the Avro RPC protocol works this way.
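A sketch of the first case with the third-party fastavro library: the writer's schema is stored once in the file header, so the reader needs nothing besides the file itself.

```python
import io
import fastavro

schema = {
    "type": "record", "name": "LogLine",
    "fields": [{"name": "level", "type": "string"},
               {"name": "message", "type": "string"}],
}

buf = io.BytesIO()  # stands in for a large file of log records
fastavro.writer(buf, schema, [{"level": "INFO", "message": "started"},
                              {"level": "WARN", "message": "disk almost full"}])

buf.seek(0)
for record in fastavro.reader(buf):  # the writer's schema is read from the header
    print(record)
```
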
Dynamically generated schemas #

Avro is friendlier to dynamically generated schemas. For example, when dumping the contents of a relational database to a file, an Avro schema can be generated from the database schema and used to encode the rows; if the database schema later changes, a new Avro schema is simply generated, and schema resolution by field name takes care of the differences when old and new files are read.

By contrast, with Thrift and Protocol Buffers the field tags would have to be assigned by hand: every time the database schema changes, we would have to be careful not to assign a previously used field tag.
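A rough sketch of such schema generation, with a made-up mapping from SQL column types to Avro types:

```python
SQL_TO_AVRO = {"bigint": "long", "text": "string", "boolean": "boolean"}

def table_to_avro_schema(table_name, columns):
    """Build an Avro record schema from (column_name, sql_type) pairs."""
    return {
        "type": "record",
        "name": table_name,
        "fields": [
            # every column is nullable, so rows with missing values still encode
            {"name": name, "type": ["null", SQL_TO_AVRO[sql_type]], "default": None}
            for name, sql_type in columns
        ],
    }

print(table_to_avro_schema("users", [("id", "bigint"), ("email", "text")]))
```

Because Avro resolves fields by name, a schema regenerated after the table changes remains compatible with data written under the old one, as long as the usual rules about defaults are respected.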

Dynamically generated schemas were a design goal for Avro.

Code generation and dynamically typed languages #

Thrift and Protocol Buffers rely on code generation: once the schema is defined, we generate code that implements it in the language of our choice. This is useful in statically typed languages, as the generated classes provide efficient in-memory data structures for encoding and decoding, along with compile-time type checking. It is less useful in dynamically typed languages, since there is no compile-time type checking to be gained.

Avro provides optional code generation for statically typed languages, but it is not required.

The Merits of Schemas #

Protocol Buffers, Thrift and Avro all use a schema to describe a binary encoding format. Their schema languages are much simpler than XML Schema or JSON Schema, which support more detailed validation rules.

Many systems implement their own proprietary binary encoding for their data; many databases, for instance, use one for their network protocol, and the vendor provides a driver that decodes responses from that protocol into in-memory data structures.

The main advantages of using schema-based binary encodings are:

  • Compactness of the data (much more compact than JSON, XML or CSV)
  • The schema is a form of documentation.
  • The ability to generate code from the schema.