Revised mjlib serialization design (diagnostics part 2)

As discussed previously, I recently made significant revisions to the serialization format used by the mjbots quad A1, drawing on experience in previous professional domains and on newer external projects like Apache Avro.  Here I’ll describe the design of the serialized representation, which is defined more completely in: mjlib/telemetry/README.md

Refresher and definitions

As a brief refresher, this serialization format is intended primarily for recording telemetry from embedded systems, where that telemetry data may be persisted on disk for a long time.  Secondarily, it can be used to inspect the output of a live system.  The primitive it operates on is a “record”, which is logically a structure of elements that is emitted at intervals over time.  Each record is logically split into a “schema” portion and a “data” portion.  The schema describes what types of elements are present in the structure, along with their names and relationships.  The “data” portion contains the minimum amount of information necessary to communicate one instance of the structure, assuming that the receiver already has a copy of the schema.

Schemas

A schema consists of one “type”.  There exist a number of “primitive” types which map directly, or nearly directly, to machine storage.  An abbreviated subset:

  • boolean can be true or false
  • float64 is a 64 bit floating point value
  • fixeduint is an unsigned integer of size 1, 2, 4, or 8 bytes
  • varuint is an unsigned integer with a variable-length encoding
  • string is a sequence of UTF-8 characters
  • bytes is a sequence of arbitrary bytes
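
To make the idea of a variable-length unsigned integer concrete, here is a minimal sketch.  This post does not spell out the varuint wire format, so the code assumes the common base-128 scheme (seven payload bits per byte, least significant group first, high bit set when more bytes follow) used by LEB128 and Protocol Buffers.  The function names are hypothetical, and the README remains the authority on the actual format.

```python
from typing import Tuple


def encode_varuint(value: int) -> bytes:
    """Encode a non-negative integer using an assumed base-128 scheme."""
    if value < 0:
        raise ValueError("varuint cannot represent negative values")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            # More groups follow, so set the continuation bit.
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def decode_varuint(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, offset just past the encoded integer)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7
```

Under this assumption small values cost a single byte, which suits counts and indices that are usually small but occasionally large.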

After that, there are “complex” types, which consist of:

  • object is a list of fields, each with its own type
  • enum is an unsigned integer, along with a mapping from those integers to strings
  • array is a variable length array of some other type
  • fixedarray is a fixed length array of some other type
  • map is a mapping from strings to another type
  • union is an index discriminated union between multiple types

Data

The data associated with each type is a direct mapping for the primitive types.  For the “complex” types, the associated data is as follows:

  • object: the data consists of the data from each field, in order
  • enum: the data consists of a single unsigned integer
  • array: the data consists of a size, followed by that many instances of the element type’s data
  • fixedarray: the data consists of the element type’s data, repeated the number of times given in the schema
  • map: the data consists of the keys and values from the map
  • union: the data consists of a single unsigned integer index, followed by the selected type’s data
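
The rules above can be sketched as a small recursive encoder.  This is an illustration under stated assumptions, not mjlib’s implementation: the nested-dict schema shape, the keys "items", "size", "options" and "schema", and the function names are all hypothetical.  It further assumes that sizes, enum values, and union indices are written as base-128 varuints, that a boolean occupies one byte, and that a string is a length prefix followed by its UTF-8 bytes.  map is left out because this post does not say how the number of entries is conveyed.

```python
import struct


def varuint(value: int) -> bytes:
    # Assumed base-128 encoding; see the README for the real format.
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        out.append(low7 | 0x80 if value else low7)
        if not value:
            return bytes(out)


def encode(schema: dict, value) -> bytes:
    """Encode `value` according to a hypothetical nested-dict schema."""
    kind = schema["type"]
    if kind == "boolean":
        return b"\x01" if value else b"\x00"
    if kind == "float64":
        return struct.pack("<d", value)
    if kind in ("varuint", "enum"):
        # An enum's data is just its unsigned integer value.
        return varuint(value)
    if kind == "string":
        raw = value.encode("utf-8")
        return varuint(len(raw)) + raw
    if kind == "object":
        # Field data in schema order; no names, no delimiters.
        return b"".join(
            encode(field["schema"], value[field["name"]])
            for field in schema["fields"]
        )
    if kind == "array":
        # A size, then that many instances of the element type's data.
        return varuint(len(value)) + b"".join(
            encode(schema["items"], item) for item in value
        )
    if kind == "fixedarray":
        # The count lives in the schema, so it is not repeated in the data.
        if len(value) != schema["size"]:
            raise ValueError("fixedarray length does not match schema")
        return b"".join(encode(schema["items"], item) for item in value)
    if kind == "union":
        # `value` is an (index, payload) pair.
        index, payload = value
        return varuint(index) + encode(schema["options"][index], payload)
    raise ValueError("unsupported type: " + kind)
```

Note how fixedarray emits no count at all, while array spends a few bytes on one: everything the receiver can learn from the schema stays out of the data.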

Encoding

For both the schema and the data, two encodings are defined: a JSON* one and a binary one.  The JSON data encoding is what would traditionally be exchanged in JavaScript applications.  It is not completely minimal, since field names and object and array delimiters are present.  For example, a simple object type consisting of a boolean, a string, and an array of fixedint might have a data representation in JSON like:

{
  "field1" : true,
  "field2" : "my string data",
  "field3" : [4, 5, 6],
}

The JSON schema encoding contains the entirety of the information from the schema.  For the above record it might look like:

{
  "type" : "object",
  "name" : "MyObject",
  "aliases" : ["AnOldName"],
  "fields" : [
    { "name" : "field1", "type" : "boolean" },
    { "name" : "field2", "type" : "string" },
    { "name" : "field3", "type" : "array", "items" : "fixedint32" }
  ],
}
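
Because the JSON schema carries the full type information, a receiver can use it to check a JSON data record.  The sketch below does this for the two examples above, written as Python literals since Python’s standard json module rejects the JSON5 trailing commas.  The conforms function is hypothetical, it only handles the types that appear in this example, and it assumes fixedint32 means a signed 32-bit integer.

```python
SCHEMA = {
    "type": "object",
    "name": "MyObject",
    "aliases": ["AnOldName"],
    "fields": [
        {"name": "field1", "type": "boolean"},
        {"name": "field2", "type": "string"},
        {"name": "field3", "type": "array", "items": "fixedint32"},
    ],
}

RECORD = {
    "field1": True,
    "field2": "my string data",
    "field3": [4, 5, 6],
}


def conforms(spec, value) -> bool:
    """Return True if `value` matches the (subset of the) schema in `spec`."""
    if isinstance(spec, str):
        # A bare type name, as used for "items" above.
        spec = {"type": spec}
    kind = spec["type"]
    if kind == "boolean":
        return isinstance(value, bool)
    if kind == "string":
        return isinstance(value, str)
    if kind == "fixedint32":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return (
            isinstance(value, int)
            and not isinstance(value, bool)
            and -(2 ** 31) <= value < 2 ** 31
        )
    if kind == "array":
        return isinstance(value, list) and all(
            conforms(spec["items"], item) for item in value
        )
    if kind == "object":
        names = [field["name"] for field in spec["fields"]]
        return (
            isinstance(value, dict)
            and sorted(value) == sorted(names)
            and all(conforms(f, value[f["name"]]) for f in spec["fields"])
        )
    raise ValueError("type not covered by this sketch: " + kind)
```

Calling conforms(SCHEMA, RECORD) accepts the record above, while a record whose field1 is a string, or which is missing a field, is rejected.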

A binary encoding is defined for both the schema and the data as well.  The schema encoding is straightforward, if uninteresting, and can be found in the README.  For primitive types that have direct machine analogs, the data encoding is the little endian machine representation.  The binary representation of an object’s data is merely the concatenation of the data of each of its fields.  This makes it possible to construct record definitions that exactly match a useful set of in-memory structures, so that serializing those structures is a no-op.
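
The no-op property can be illustrated with a small sketch that uses Python’s ctypes to stand in for an in-memory C structure.  The Sample record and its field names are hypothetical.  The fields are deliberately ordered so that natural alignment inserts no padding, which is an assumption about typical platforms; a structure containing padding would not match the concatenated encoding.

```python
import ctypes
import struct


class Sample(ctypes.LittleEndianStructure):
    # Hypothetical record: 8 + 8 + 4 + 4 bytes, ordered to avoid padding.
    _fields_ = [
        ("timestamp", ctypes.c_uint64),
        ("value", ctypes.c_double),
        ("count", ctypes.c_uint32),
        ("flags", ctypes.c_uint32),
    ]


sample = Sample(timestamp=1234567, value=2.5, count=3, flags=1)

# The raw in-memory bytes of the structure.
in_memory = bytes(sample)

# The object encoding described above: each field's little endian
# representation, concatenated in schema order.
field_by_field = (
    struct.pack("<Q", sample.timestamp)
    + struct.pack("<d", sample.value)
    + struct.pack("<I", sample.count)
    + struct.pack("<I", sample.flags)
)

# When the two agree, "serializing" is just copying the structure's memory.
assert in_memory == field_by_field
```

The same reasoning applies to a C or C++ struct on a little endian target: if the record definition mirrors the struct’s layout, the structure’s bytes can be written out directly.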

Next steps

In the next issue of this series, I’ll describe the C++ API for serializing and deserializing objects.

*Actually JSON5, which supports comments and trailing commas, among other improvements for human readability.