Log file format (diagnostics part 4)

In parts 1, 2, and 3 I covered some motivation for the updated mjlib diagnostics system and the serialization of individual structures.  In this post, I’ll cover how those structures are written into a file from an embedded system like a robot and how diagnostic tools can access them efficiently.

Goals

The top-level goals are:

  • Efficient to write live from an embedded system: The quad A1 currently generates log data at 400Hz, consisting of hundreds to thousands of telemetry data points in every update.  It does this on a relatively low-end Raspberry Pi 3B+.  The format should be able to support writing data at high rates without a significant CPU burden.
  • Efficient seeking by time and record: Readers of the file should be able to efficiently seek by time in the stream, as well as extract all instances of a single record without having to process unnecessary data from the log.
  • Self-contained: While this property comes from the underlying mjlib serialization format, it is worth re-iterating here.  All information necessary to produce a JSON- or CSV-like structure for each instance should be present within the log.

Design

The detailed design of the log format is documented in README.md; here I will give a brief summary.

The log consists of a header followed by a series of “Blocks” concatenated together.  The two primary block types are one that contains the schema for an individual record and one that contains the data.  For a given record, the schema will only be present once in the log, typically near the beginning.  A data block contains a single serialized instance of the record, along with a set of flags and optional data.  The optional features include a timestamp, a checksum, whether the data is compressed, and a pointer to the most recent prior data block for this record.
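
To make that concrete, here is a minimal Python sketch of what writing one data block might look like.  Every constant, field width, and the exact layout here are invented for illustration; the authoritative definitions live in the README.

    import io
    import struct
    import time

    # Hypothetical block type and flag assignments, for illustration only.
    BLKTYPE_DATA = 2
    FLAG_TIMESTAMP = 1 << 0
    FLAG_PREVIOUS_OFFSET = 1 << 3

    def write_data_block(stream, identifier, payload,
                         timestamp=None, previous_offset=None):
        """Append one data block: record identifier, flags, optional
        fields in flag order, then the serialized payload."""
        flags = 0
        optional = b""
        if timestamp is not None:
            flags |= FLAG_TIMESTAMP
            # One plausible encoding: microseconds since the epoch.
            optional += struct.pack("<q", int(timestamp * 1e6))
        if previous_offset is not None:
            flags |= FLAG_PREVIOUS_OFFSET
            optional += struct.pack("<q", previous_offset)
        body = struct.pack("<HH", identifier, flags) + optional + payload
        stream.write(struct.pack("<HI", BLKTYPE_DATA, len(body)) + body)

    # Usage: write one instance stamped with the current time.
    f = io.BytesIO()
    write_data_block(f, identifier=7, payload=b"\x01\x02\x03",
                     timestamp=time.time())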

Another block is the SeekMarker block, which contains a timestamp, a 64-bit unique-ish byte code, and a checksum.  When readers need to perform random seeks in the log, they can binary search to an arbitrary byte offset, then scan forward to find an instance of this unique code.  If it is present in conjunction with the necessary header and a validated checksum, it can be assumed that the framing has been recovered, along with the time for that point in the log.
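
The following is a minimal Python sketch of that recovery strategy.  The marker bytes, chunk size, and timestamp encoding are assumptions for illustration, not the real mjlib values.

    import struct

    MARKER = b"\x00MJSEEK\x00"  # hypothetical 64-bit unique-ish code
    CHUNK = 1 << 16

    def timestamp_after(f, offset):
        """Return the timestamp of the first SeekMarker at or after
        `offset`, or None if no further marker exists.  A real reader
        would also validate the block header and checksum before
        trusting the recovered framing."""
        f.seek(offset)
        buf = b""
        while True:
            data = f.read(CHUNK)
            if not data:
                return None
            buf += data
            pos = buf.find(MARKER)
            if pos >= 0 and len(buf) >= pos + len(MARKER) + 8:
                stamp = buf[pos + len(MARKER):pos + len(MARKER) + 8]
                return struct.unpack("<q", stamp)[0]
            # Keep enough of a tail that a marker spanning two reads
            # (or one with a partial timestamp) is still found.
            buf = buf[-(len(MARKER) + 8):]

    def seek_to_time(f, file_size, target):
        """Binary search byte offsets until `lo` lands within one
        chunk of the last marker stamped before `target`; a reader
        then scans forward from `lo`."""
        lo, hi = 0, file_size
        while hi - lo > CHUNK:
            mid = (lo + hi) // 2
            stamp = timestamp_after(f, mid)
            if stamp is not None and stamp < target:
                lo = mid
            else:
                hi = mid
        return lo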

Finally, there is an Index block, written at the very end of the log.  This includes pointers to the schema entries for all records in the file, as well as to the most recent data block for each record.  That allows readers to discover the set of records in a log, and to extract a single record (albeit backwards) while reading no extra data.
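
Extracting a single record then amounts to following the chain of previous-block pointers from the index entry back toward the start of the file.  A sketch, assuming a hypothetical DataBlock with a previous_offset field where zero terminates the chain:

    from typing import BinaryIO, Callable, List, NamedTuple

    class DataBlock(NamedTuple):
        payload: bytes
        previous_offset: int  # assumption: 0 terminates the chain

    def read_record_backwards(
            f: BinaryIO, last_offset: int,
            read_block: Callable[[BinaryIO, int], DataBlock]) -> List[bytes]:
        """Follow the per-record previous-block pointers from the final
        index entry back to the record's first data block, touching no
        unrelated bytes in between."""
        instances = []
        offset = last_offset
        while offset != 0:
            block = read_block(f, offset)
            instances.append(block.payload)
            offset = block.previous_offset
        instances.reverse()  # restore chronological order
        return instances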

Future extensions

Most of the entities in the log have flag bitmasks to control additional future features or extensions.  Current readers throw errors when unknown bits are discovered, which makes it safe to modify the log structure almost arbitrarily, at the expense of forward compatibility.
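
In code, that validation is just a mask against the bits a reader knows about.  A sketch, with hypothetical flag assignments:

    # Hypothetical flag values; the point is only the masking pattern.
    FLAG_TIMESTAMP = 1 << 0
    FLAG_CHECKSUM = 1 << 1
    FLAG_SNAPPY = 1 << 2
    KNOWN_FLAGS = FLAG_TIMESTAMP | FLAG_CHECKSUM | FLAG_SNAPPY

    def check_flags(flags: int) -> None:
        """Refuse blocks carrying bits this reader does not understand,
        rather than silently misreading a future extension."""
        unknown = flags & ~KNOWN_FLAGS
        if unknown:
            raise RuntimeError(f"unknown flag bits set: {unknown:#x}")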

The most likely extensions are related to compression.  The current per-data compression format is snappy, from Google.  It is fast, but has relatively poor compression performance.  At some point, I’d like to switch to Zstandard, which has even better runtime performance, much better compression performance, and supports incremental dictionary manipulation.  I have actually integrated it in an experimental manner into the C++ writer and reader, and the effort was trivial; however, the other languages that I support, Python and TypeScript, are more challenging.  With snappy, there are operating-system-provided packages that work just fine in Debian and Ubuntu, but not so for Zstandard.  Bazel has rules that support pulling in pip packages for Python and npm packages for TypeScript, but neither of those mechanisms has very straightforward support for the recursive WORKSPACE workarounds I am using now.  For now, it is easiest to just stick with snappy.
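
For reference, per-data snappy handling on the writer side can be as small as the sketch below, using the python-snappy package (python3-snappy in Debian/Ubuntu).  The keep-the-original-when-larger heuristic is my illustration, not necessarily the actual mjlib policy.

    from typing import Tuple

    import snappy  # python-snappy on pip, python3-snappy in Debian/Ubuntu

    def maybe_compress(payload: bytes) -> Tuple[bytes, bool]:
        """Return (data, compressed?) for a data-block payload, keeping
        the original when snappy does not actually shrink it."""
        compressed = snappy.compress(payload)
        if len(compressed) < len(payload):
            return compressed, True
        return payload, False

    def decompress(payload: bytes, compressed: bool) -> bytes:
        """Invert maybe_compress based on the block's compression flag."""
        return snappy.decompress(payload) if compressed else payload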

Next

Now that we have the data structures out of the way, I’ll move on to the tools that use them!