As discussed previously, I recently made significant revisions to the serialization format used by the mjbots quad A1, drawing on experience in previous professional domains and on studying newer external projects like Apache Avro. Here I’ll describe the design of the serialized representation, which is defined more completely at: mjlib/telemetry/README.md
Refresher and definitions
As a brief refresher, this serialization format is intended primarily for recording telemetry from embedded systems, where that telemetry data may be persisted on disk for a long time. Secondarily, it can be used to inspect the results of a live system. The primitive it operates on is a “record”, which is logically a structure of elements that is emitted at intervals over time. The format logically splits any given record into a “schema” portion and a “data” portion. The schema describes what types of elements are present in the structure, along with their names and relationships. The “data” portion contains the minimum amount of information necessary to communicate one instance of the structure, assuming that the receiver already has a copy of the schema.
Schemas
A schema consists of one “type”. There are a number of “primitive” types that map directly, or nearly directly, to machine storage. An abbreviated subset:
boolean: can be true or false
float64: is a 64 bit floating point value
fixeduint: is an unsigned integer of size 1, 2, 4, or 8
varuint: is an unsigned integer of dynamic encoding length
string: is a sequence of UTF-8 characters
bytes: is a sequence of arbitrary bytes
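To make "dynamic encoding length" concrete, here is a minimal sketch of one common way to encode such an integer. It assumes a base-128 layout (7 data bits per byte, least significant group first, high bit set when more bytes follow). That layout is an assumption made for illustration; the README is the authority on the actual varuint encoding. The function names are hypothetical.

```python
def encode_varuint(value):
    """Encode a non-negative integer using an assumed base-128 scheme."""
    if value < 0:
        raise ValueError("varuint cannot represent negative values")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            # More bytes follow: set the continuation bit.
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def decode_varuint(data, offset=0):
    """Decode one value, returning (value, offset just past it)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7
```

Under this assumed scheme, small values cost a single byte while larger values grow by one byte per 7 bits, which is the usual motivation for a variable length integer.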
After that, there are “complex” types, which consist of:
object: is a list of fields, each with its own type
enum: is an unsigned integer, along with a mapping from those integers to strings
array: is a variable length array of some other type
fixedarray: is a fixed length array of some other type
map: is a mapping from strings to another type
union: is an index discriminated union between multiple types
Data
The data associated with each type is a direct mapping for the primitive types. For the “complex” types, the associated data is as follows:
object: the data consists of the data from each field in order
enum: the data consists of a single unsigned integer
array: the data consists of a size, followed by that many instances of the type’s data
fixedarray: consists of the type’s data repeated the number of times from the schema
map: just consists of the keys and values from the map
union: contains a single unsigned integer index, followed by the selected type’s data
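The rules above can be sketched as a function that walks a schema and a value together and produces the ordered sequence of primitive items that make up the data. This is only an illustration of the ordering rules: the in-memory schema layout (dicts with "type", "fields", "schema", "items", and "size" keys), the token labels, and the name flatten are all hypothetical and are not mjlib's API. The map type is left out because the description above does not say how its entries are framed.

```python
def flatten(schema, value):
    """Return the ordered (label, value) items that make up `value`."""
    kind = schema["type"]
    if kind == "object":
        # Each field's data, in schema order. No field names appear.
        tokens = []
        for field in schema["fields"]:
            tokens += flatten(field["schema"], value[field["name"]])
        return tokens
    if kind == "enum":
        # A single unsigned integer.
        return [("uint", value)]
    if kind == "array":
        # A size, followed by that many instances of the element data.
        tokens = [("size", len(value))]
        for item in value:
            tokens += flatten(schema["items"], item)
        return tokens
    if kind == "fixedarray":
        # The count lives in the schema, so only the elements are emitted.
        if len(value) != schema["size"]:
            raise ValueError("fixedarray length must match the schema")
        tokens = []
        for item in value:
            tokens += flatten(schema["items"], item)
        return tokens
    if kind == "union":
        # An index selecting the type, then that type's data.
        index, inner = value
        return [("index", index)] + flatten(schema["items"][index], inner)
    if kind == "map":
        raise NotImplementedError("map framing is not covered by this sketch")
    # Anything else is treated as a primitive that maps directly to data.
    return [(kind, value)]
```

Note that the output carries no field names or type tags beyond what the rules require, which is why a receiver needs the schema to interpret it.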
Encoding
For both the schema and the data there are two encodings defined: a JSON* one and a binary one. The JSON data encoding is what would traditionally be exchanged in JavaScript applications. It is not completely minimal, since field names and object and list delimiters are present. For example, a simple object type consisting of a boolean, a string, and a list of fixedint might have a data representation in JSON like:
{
  "field1" : true,
  "field2" : "my string data",
  "field3" : [4, 5, 6],
}
The JSON schema encoding contains the entirety of the information from the schema. For the above record it might look like:
{
  "type" : "object",
  "name" : "MyObject",
  "aliases" : ["AnOldName"],
  "fields" : [
    { "name" : "field1", "type" : "boolean" },
    { "name" : "field2", "type" : "string" },
    { "name" : "field3", "type" : "array", "items" : "fixedint32" }
  ],
}
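As a rough illustration of how the JSON schema and the JSON data relate, the sketch below checks a decoded record against the schema shown above. It handles only the three types used in this example, the schema is written as a Python literal because Python's json module does not accept the trailing commas that JSON5 allows, and the function names are hypothetical rather than part of mjlib.

```python
SCHEMA = {
    "type": "object",
    "name": "MyObject",
    "aliases": ["AnOldName"],
    "fields": [
        {"name": "field1", "type": "boolean"},
        {"name": "field2", "type": "string"},
        {"name": "field3", "type": "array", "items": "fixedint32"},
    ],
}


def matches(type_name, value, items=None):
    """Return True if `value` is a plausible instance of `type_name`."""
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "fixedint32":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return (isinstance(value, int) and not isinstance(value, bool)
                and -2**31 <= value < 2**31)
    if type_name == "array":
        return isinstance(value, list) and all(matches(items, v) for v in value)
    raise ValueError(f"type not handled in this sketch: {type_name}")


def check_record(schema, data):
    """Return a list of human readable problems; empty means consistent."""
    problems = []
    names = [field["name"] for field in schema["fields"]]
    for field in schema["fields"]:
        name = field["name"]
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not matches(field["type"], data[name], field.get("items")):
            problems.append(f"wrong type for field: {name}")
    for name in data:
        if name not in names:
            problems.append(f"unexpected field: {name}")
    return problems
```

Calling check_record(SCHEMA, ...) with the data example from earlier returns an empty list, since every field named in the schema is present with a matching type.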
A binary encoding is defined for both the schema and the data as well. The binary schema encoding is straightforward, if uninteresting, and can be found in the README. For the primitive types that have direct machine analogs, the binary data encoding is the little endian machine representation. The binary data representation of an object is merely the concatenation of the data of each of its fields. This makes it possible to construct record definitions that exactly match a useful set of in-memory structures, so that serializing those structures is a no-op.
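The no-op property can be demonstrated with a small sketch that compares two ways of producing the bytes for an object: encoding each field separately and concatenating, versus copying a packed in-memory structure directly. The Sample structure and both function names are hypothetical. The sketch assumes a little endian host, a boolean stored as a single byte, and a fixeduint of size 4 for the counter; check the README before relying on any of those details.

```python
import ctypes
import struct


class Sample(ctypes.Structure):
    """A packed in-memory structure laid out like the record's fields."""
    _pack_ = 1  # no padding between fields
    _fields_ = [
        ("enabled", ctypes.c_bool),     # boolean, assumed to be one byte
        ("position", ctypes.c_double),  # float64
        ("count", ctypes.c_uint32),     # fixeduint of size 4
    ]


def encode_field_by_field(enabled, position, count):
    """Concatenate each field's little endian representation in order."""
    return (struct.pack("<?", enabled)
            + struct.pack("<d", position)
            + struct.pack("<I", count))


def encode_from_memory(sample):
    """Copy the raw bytes of the structure with no per-field work."""
    return bytes(sample)
```

On a little endian machine, encode_from_memory(Sample(True, 1.5, 7)) and encode_field_by_field(True, 1.5, 7) produce the same 13 bytes, which is the sense in which serialization becomes a plain memory copy when the record definition mirrors the in-memory structure.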
Next steps
In the next issue of this series, I’ll describe the C++ API for serializing and deserializing objects.
*Actually JSON5, which supports comments and final trailing commas among other improvements for human readability.