DDIA | Chapter 4 |  Encoding and Evolution | Modes of Data Flow

In the previous post, we went through different methods used for encoding/decoding data. In this post we will go through the three common methods through which data flows between processes.

Typically there are three ways in which data flows between processes – 

  1. Via databases
  2. Via service calls
  3. via asynchronous messaging systems

1. Data flow through Databases

Process that writes to database encodes the data and process that reads the database decodes the data. 

Both backward and forward compatibility is needed for the encoders/decoders. Say, we have multiple applications accessing the database, but not all of them are on the same code version, it may happen, one application writes to database using newer version of code but another one is reading using an older version of code. In this case forward compatibility of encoding/decoding schema is necessary.

Say we add column to the database, and only the newer version of applications know about it, but not the older one, so the older one would not write to that column. The encoding formats take care of this.

Also if you add new columns, you can rewrite the whole database using the new schema, but this is expensive operation if we have large database. And so most relational databases allow simple schema changes, such as adding a new column with a null default value, without rewriting existing data. When an old row is read, the database fills in nulls for any columns that are missing from the encoded data on disk. LinkedIn’s document database Espresso uses Avro for storage, allowing it to use Avro’s schema evolution rules.

Schema evolution thus allows the entire database to appear as if it was encoded with a single schema, even though the underlying storage may contain records encoded with various historical versions of the schema.

For cases, where you are migrating data to a data warehouse for analytics purposes, it is better to rewrite data using new encoding/decoding schema which supports column oriented format. 
 

2. Dataflow Through Services: REST and RPC

When you have processes that need to communicate over a network, then formats like REST and RPC are used. 
 

REST/SOAP

REST is a design philosophy that builds upon the principles of HTTP. It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for cache control, authentication, and content type negotiation. REST is often associated with microservices. 
 
SOAP is an XML-based protocol for making network API requests. The API of a SOAP web service is described using an XML-based language called the Web Services Description Language, or WSDL. WSDL enables code generation so that a client can access a remote service using local classes and method calls (which are encoded to XML messages and decoded again by the framework). 
 

RPC

The RPC model tries to make a request to a remote network service look the same as calling a function or method in your programming language, within the same process (this abstraction is called location transparency). 

Main assumption when it comes to data encoding for RPC is that all the servers will be updated first, and all the clients second. Thus, you only need backward compatibility on requests, and forward compatibility on responses. The backward and forward compatibility properties of an RPC scheme are inherited from whatever encoding it uses. 

Thrift and Avro come with RPC support included, gRPC is an RPC implementation using Protocol Buffers, Finagle also uses Thrift, and Rest.li uses JSON over HTTP.

Drawbacks of RPC is that local function call is different from network call, lets look at differences between local function call and network call – 

  1. A local function call is predictable and either succeeds or fails, depending only on parameters that are under your control. A network request is unpredictable: the request or response may be lost due to a network problem, or the remote machine may be slow or unavailable, and such problems are entirely outside of your control. 
  2. A local function call either returns a result, or throws an exception, or never returns (because it goes into an infinite loop or the process crashes). A network request has another possible outcome: it may return without a result, due to a timeout.  In that case, you simply don’t know what happened: if you don’t get a response from the remote service, you have no way of knowing whether the request got through or not. 
  3.  In that case, you simply don’t know what happened: if you don’t get a response from the remote service, you have no way of knowing whether the request got through or not. 
  4. Every time you call a local function, it normally takes about the same time to execute. A network request is much slower than a function call, and its latency is also wildly variable.
  5. When you call a local function, you can efficiently pass it references (pointers) to objects in local memory. When you make a network request, all those parameters need to be encoded into a sequence of bytes that can be sent over the network. That’s okay if the parameters are primitives like numbers or strings, but quickly becomes problematic with larger objects.

  6. The client and the service may be implemented in different programming languages, so the RPC framework must translate datatypes from one language into another.

Message-Passing Dataflow

These are are similar to RPC in that a client’s request (usually called a message) is delivered to another process with low latency. They are similar to databases in that the message is not sent via a direct network connection, but goes via an intermediary called a message broker (also called a message queue or message-oriented middleware), which stores the message temporarily.

Using a message broker has several advantages compared to direct RPC:

  • It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system reliability.

  • It can automatically redeliver messages to a process that has crashed, and thus prevent messages from being lost.

  • It avoids the sender needing to know the IP address and port number of the recipient (which is particularly useful in a cloud deployment where virtual machines often come and go).

  • It allows one message to be sent to several recipients.

  • It logically decouples the sender from the recipient (the sender just publishes messages and doesn’t care who consumes them).

Message Brokers

Usually, one process sends a message to a named queue or topic, and the broker ensures that the message is delivered to one or more consumers of or subscribers to that queue or topic. 

Message brokers typically don’t enforce any particular data model—a message is just a sequence of bytes with some metadata, so you can use any encoding format. If the encoding is backward and forward compatible, you have the greatest flexibility to change publishers and consumers independently and deploy them in any order.

Distributed actor frameworks

The actor model is a programming model for concurrency in a single process. Rather than dealing directly with threads (and the associated problems of race conditions, locking, and deadlock), logic is encapsulated in actors. Each actor typically represents one client or entity, it may have some local state (which is not shared with any other actor), and it communicates with other actors by sending and receiving asynchronous messages. Message delivery is not guaranteed: in certain error scenarios, messages will be lost.

In distributed actor frameworks, this programming model is used to scale an application across multiple nodes. The same message-passing mechanism is used, no matter whether the sender and recipient are on the same node or different nodes. If they are on different nodes, the message is transparently encoded into a byte sequence, sent over the network, and decoded on the other side.

If you want to perform rolling upgrades of your actor-based application, you still have to worry about forward and backward compatibility, as messages may be sent from a node running the new version to a node running the old version, and vice versa.

examples of distributed actor framework – Akka, Orleans, Erlang OTP

Thanks for stopping by! Hope this gives you a brief overview in to various modes of data flow between processes. Eager to hear your thoughts and chat, please leave comments below and we can discuss.