DDIA | Chapter 4 | Encoding and Evolution | Formats for Encoding Data

Programs usually work with data in (at least) two different representations:

  1. In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These data structures are optimized for efficient access and manipulation by the CPU (typically using pointers).

  2. When you want to write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes (for example, a JSON document). Since a pointer wouldn’t make sense to any other process, this sequence-of-bytes representation looks quite different from the data structures that are normally used in memory.

Thus, we need some kind of translation between the two representations. The translation from the in-memory representation to a byte sequence is called encoding (also known as serialization or marshalling), and the reverse is called decoding (also known as parsing, deserialization, or unmarshalling).
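As a minimal sketch of that round trip, the snippet below encodes an in-memory Python dict to a byte sequence and decodes it again. The `user` record is made up for illustration, and JSON is just one possible format.

```python
import json

# In-memory representation: a dict holding references to other objects.
user = {"name": "Ada", "tags": ["admin", "dev"]}

# Encoding (serialization): in-memory structure -> self-contained bytes.
encoded = json.dumps(user).encode("utf-8")

# Decoding (deserialization): bytes -> a new in-memory structure.
decoded = json.loads(encoded.decode("utf-8"))

print(type(encoded).__name__)  # bytes
print(decoded == user)         # True: same content
print(decoded is user)         # False: a fresh object was rebuilt from the bytes
```

The decoded value is equal to the original but is a separate object, which is the point: only the bytes crossed the boundary, not any pointers.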

Language-specific formats

Many languages have built-in support for encoding in-memory objects into byte sequences and decoding them again. For example, Java has java.io.Serializable, Ruby has Marshal, and Python has pickle. Many third-party libraries also exist, such as Kryo for Java.

Problems – 

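  1. The encoding is tied to one programming language, so reading the data from another language is very difficult. This commits your application to that language and makes it hard to integrate with systems in other organizations that use a different one.
  2. Decoding often needs to be able to instantiate arbitrary classes, which is a security risk: an attacker who can get your application to decode a crafted byte sequence may be able to execute arbitrary code.
  3. Versioning is often an afterthought, so forward and backward compatibility are usually poorly supported.
  4. Efficiency is often an afterthought too. For example, Java's built-in serialization is known for taking a lot of CPU time and producing bloated output.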
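The security problem can be sketched with Python's pickle. The class name `Innocent` is hypothetical, and the payload is deliberately harmless (it only evaluates `6 * 7`), but the same mechanism would let an attacker-supplied byte sequence call any importable function during decoding.

```python
import pickle


class Innocent:
    """Hypothetical class, made up for this example."""

    def __reduce__(self):
        # Tells pickle how to rebuild this object: "call eval('6 * 7')".
        # A malicious payload could name any other callable here.
        return (eval, ("6 * 7",))


payload = pickle.dumps(Innocent())

# Decoding runs the callable named in the payload.
result = pickle.loads(payload)
print(result)  # 42, not an Innocent instance
```

This is why the pickle documentation warns against unpickling data from untrusted sources. The payload is also meaningless to a Java or Ruby program, which illustrates the language lock-in described in the first point.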

JSON, XML, and Binary Variants

Known problems – 

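  1. Ambiguity around the encoding of numbers. XML and CSV cannot distinguish between a number and a string that happens to consist of digits (except by referring to an external schema). JSON distinguishes strings from numbers, but it does not distinguish integers from floating-point numbers, and it does not specify a precision, so large integers can be silently rounded by parsers that use double-precision floats.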
  2. JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they don’t support binary strings (sequences of bytes without a character encoding).
  3. There is optional schema support for both XML and JSON. These schema languages are quite powerful, and thus quite complicated to learn and implement. Use of XML schemas is fairly widespread, but many JSON-based tools don’t bother using schemas. Since the correct interpretation of data (such as numbers and binary strings) depends on information in the schema, applications that don’t use XML/JSON schemas need to potentially hardcode the appropriate encoding/decoding logic instead.
  4. CSV does not have any schema, so it is up to the application to define the meaning of each row and column. If an application change adds a new row or column, you have to handle that change manually. 
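The first two problems can be illustrated with a short sketch. The sample values are made up, and Base64 is shown as one common workaround for binary data, not as something JSON itself provides.

```python
import base64
import csv
import io
import json

# CSV: every cell comes back as a string, so "42" and 42 are indistinguishable.
row = next(csv.reader(io.StringIO("42,hello\n")))
print(row)  # ['42', 'hello']

# JSON numbers: a parser that stores numbers as IEEE 754 doubles (as JavaScript
# does) cannot represent every integer above 2**53 exactly.
big = 2**53 + 1
as_double = float(big)  # what a double-based parser would hold
lost = int(as_double) != big
print(lost)  # True: the value was rounded

# Binary strings: JSON has no byte-string type, so bytes are commonly
# Base64-encoded into text, at the cost of roughly a third more space.
raw = bytes([0, 255, 16, 128])
doc = json.dumps({"blob": base64.b64encode(raw).decode("ascii")})
restored = base64.b64decode(json.loads(doc)["blob"])
print(restored == raw)  # True
```

Note that Python's own json module parses large integers exactly, so the precision loss only shows up when the same document is read by a parser that uses doubles.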

Binary encoding

Binary encodings usually take less space than textual JSON and XML. Binary encodings developed for JSON include MessagePack, BSON, BJSON, UBJSON, BISON, and Smile; binary encodings developed for XML include WBXML and Fast Infoset. None of these is as widely adopted as the textual versions of JSON and XML. The main drawback of binary encodings is that they are not human-readable.
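To get a feel for the space saving, here is a hand-rolled sketch of a tiny subset of MessagePack (small maps and arrays, short strings, small non-negative integers). It is not the official library, and `pack` is a name chosen for this example. The sample record follows the one used in the DDIA chapter.

```python
import json
import struct


def pack(value):
    """Encode a small subset of MessagePack; raises ValueError otherwise."""
    if isinstance(value, dict) and len(value) < 16:
        # fixmap: 0x80 | number of entries, then key/value pairs
        body = b"".join(pack(k) + pack(v) for k, v in value.items())
        return bytes([0x80 | len(value)]) + body
    if isinstance(value, list) and len(value) < 16:
        # fixarray: 0x90 | number of items
        return bytes([0x90 | len(value)]) + b"".join(pack(v) for v in value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        if len(data) < 32:
            # fixstr: 0xA0 | byte length
            return bytes([0xA0 | len(data)]) + data
    if isinstance(value, int) and not isinstance(value, bool):
        if 0 <= value < 2**7:
            return bytes([value])                       # positive fixint
        if 0 <= value < 2**8:
            return b"\xcc" + struct.pack(">B", value)   # uint 8
        if 0 <= value < 2**16:
            return b"\xcd" + struct.pack(">H", value)   # uint 16
    raise ValueError(f"unsupported value in this sketch: {value!r}")


record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
as_msgpack = pack(record)

print(len(as_json))     # 81
print(len(as_msgpack))  # 66
```

The saving is modest because the field names are still written into every record, and the result can no longer be read in a text editor.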

Popular schema-based binary encodings are Apache Thrift (originally developed at Facebook), Protocol Buffers (Google), and Apache Avro.

Advantages of these schema-based binary encodings – 

  • They can be much more compact than the various “binary JSON” variants, since they can omit field names from the encoded data.
  • The schema is a valuable form of documentation, and because the schema is required for decoding, you can be sure that it is up to date (whereas manually maintained documentation may easily diverge from reality).
  • Keeping a database of schemas allows you to check forward and backward compatibility of schema changes, before anything is deployed.
  • For users of statically typed programming languages, the ability to generate code from the schema is useful, since it enables type checking at compile time.
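The compactness argument can be sketched by hand-encoding a record in the Protocol Buffers wire format (varint tags and length-prefixed strings). This is an illustrative sketch, not the official protobuf library: `SCHEMA`, `encode`, and `decode` are names invented here, and the field tags 1 to 3 are an assumed schema for the sample record from the DDIA chapter.

```python
# Assumed schema: field tag -> (field name, kind, repeated?)
SCHEMA = {
    1: ("user_name", "string", False),
    2: ("favorite_number", "int", False),
    3: ("interests", "string", True),
}


def encode_varint(n):
    """Base-128 varint: 7 bits per byte, high bit set on all but the last."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(buf, pos):
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode(record):
    out = bytearray()
    for tag, (name, kind, repeated) in SCHEMA.items():
        if name not in record:
            continue
        values = record[name] if repeated else [record[name]]
        for value in values:
            if kind == "int":
                # wire type 0: varint
                out += encode_varint(tag << 3 | 0) + encode_varint(value)
            else:
                # wire type 2: length-delimited bytes
                data = value.encode("utf-8")
                out += encode_varint(tag << 3 | 2) + encode_varint(len(data)) + data
    return bytes(out)


def decode(buf):
    record, pos = {}, 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:
            length, pos = decode_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if tag not in SCHEMA:
            continue  # unknown tag: skip it, as an old reader would
        name, kind, repeated = SCHEMA[tag]
        if kind == "string":
            value = value.decode("utf-8")
        if repeated:
            record.setdefault(name, []).append(value)
        else:
            record[name] = value
    return record


record = {
    "user_name": "Martin",
    "favorite_number": 1337,
    "interests": ["daydreaming", "hacking"],
}
encoded = encode(record)
print(len(encoded))               # 33
print(decode(encoded) == record)  # True
```

The bytes contain only tag numbers, never field names, which is why the schema is required to make sense of them. Skipping unknown tags in `decode` hints at how old code can read data written by newer code.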

Thanks for stopping by! I hope this gives you a brief overview of the various encoding formats. I'm eager to hear your thoughts, so please leave a comment below and we can discuss.

