Tag Archives: rpi

mjbots quad A0: October 2019 Roadmap

My last video gave an overview of what I’ve accomplished over the past year.  Now, let me talk about what I’m planning to work on going forward:

I intend to divide my efforts into two parallel tracks.  The first is to demonstrate increased capabilities and continue learning with the existing quad A0, and second is to design and manufacture the next revision of all its major components.

New capabilities and learning

The first, and most important capability I want to develop is an improved gait and locomotion system.  While the moteus servos in the quad A0 are capable of high rate compliant control, the gait engine that I’m using now is still basically the same one that I made for the HerkuleX servos 5 years ago.  It just commands open loop positions to each of the servos and uses no feedback from the platform at all.  This severely limits what the robot can do.  For instance, if the terrain is not level, legs will drag on the ground or it will not walk at all.  The maximum speed is relatively slow and achieving it requires careful tuning of servo-level gains.  While it is more robust than nearly any other open loop 4 legged walker while standing up, even small disturbances can cause it to fall over.

At a minimum, I’d like to switch to controlling the position and force of the endpoint of each leg and switch to a gait engine that takes into account the actual position of each joint during the gait cycle.  That should enable it to maintain good ground contact with all feet that are intended to be in a stance position.  Now, often feet will skip around, as the servo gains are all relatively high to account for the lack of feedback.

The next level would be to take into account an IMU to ensure balanced feet lift offs.  While I have an IMU in this configuration, it currently isn’t usable due to my improvised attempts to improve the overall system update rate.  I’ll need to update some of the hardware to get an IMU again before I work on this part.

New hardware revisions

I have learned a lot with the moteus controller, servo, and the quad A0 over the last year.  I’d like to take that learning and update the components to achieve:

  • Manufacturability: Many of the sub components now are very labor intensive to produce or assemble.
  • Cost: I have in many cases made things much more expensive than they needed to be in order to shave a bit of time off the prototyping process.  Also, I have enough confidence now to get larger volumes of parts, which further reduces cost.
  • Capability: There are many easy areas where the system performance can be greatly improved.

My overall goals are to make a system that is capable and affordable — to provide an ecosystem of parts and systems for researchers and hobbyists to use.  That means keeping everything open source, and providing access to all subcomponents and designs while still being able to provide a complete system that is capable out of the box.

Here’s a quick rundown of the various areas I’m tackling:

Moteus controller

My next revision of this controller will be a slightly bigger change, in that the form factor, the microcontroller, and the connectors used will all be altered.  I’m planning on switching to an STM32G4 for the controller, using daisy-chained connectors for power in addition to communications, switching to FD-CAN for communications, and reworking the form factor to enable all the primary connectors to exit from one edge of the servo.

Also, I’ll take this opportunity to design for automated end of line test.  That means I will leave enough probe points such that a test fixture can validate the correct operation of a board without intervention.

I don’t anticipate the performance to change a whole lot, although I may increase the allowed input voltage.  Even so this will still be a very capable board, with 3 phase current sensing, >=50A peak phase current, >= 400W peak power and a switching and internal control loop rate of >=40kHz.

Moteus servo

The first round of gearbox servos went through many revisions to achieve something that was functional.  That said, while functional, they are fragile, and require about a full man day of assembly time per servo.  I’m working to get pre-production volumes of around 100 units for all the components such that no post machining will be required and the servos can just be assembled together.

While I’m at it, I intend on switching everything to metal from 3d printed plastic, and reducing the overall dimensions.  It is likely the mass will increase a little from the current 410g, but I’m hoping that it won’t be that much compared to the increased strength and rigidity.

Junction board / raspberry pi 3 hat / pre-charge board

I intend to rework the architecture behind the power and data distribution internal to the quad A0.  Currently there is a raspberry pi3 hat, which merely powers the raspberry pi and provides an RS485 interface.  Then a “junction board” splits out the RS485 bus into four separate connectors as well as splitting out power to all four legs and housing the IMU.  Finally, a separate “pre-charge” board ensured that the servos can have power applied safely.

I intend on moving all data distribution and the IMU onto the board that mounts on the host computer.  Thus it will have power in, an IMU, and 4 (or maybe more) CAN bus outputs.  Then, a separate power distribution board will include the pre-charge circuit as well as connectors for power for all four legs.  Simultaneously, I plan on adding an actual power switch, so that I don’t have to keep pulling the battery in and out to cut power.

Chassis

The quad A0 chassis is a single monolithic 3D printed block.  Actually attaching legs to this is nigh impossible.  It takes a long time, and not all of the joining screws can actually be used because of interference with other pieces.  Also, using the Ryobi style batteries results in a lot of dead space due to the mounting stud.  I am planning on switching this chassis to be printed in multiple components that are assembled together with screws and threaded inserts and switch to a battery form factor that is more compact.

That should also allow for mounting the primary computer inside the chassis, and still leave enough room for a bigger computer, like an NVIDIA Jetson nano.

Legs

I’ve already had one failure in the current iteration of the leg that indicates some redesign is necessary.  Also, the two primary brackets that connect the shoulder to the upper leg, and upper leg to lower leg should be made from aluminum instead of being 3d printed.  Those brackets had to be really big and printed in odd orientations in order to be strong enough, and they could be a lot lighter, more convenient, and more cost effective in metal.

Looking forward

Thanks for all of your positive feedback so far, and I look forward to more exciting times executing these plans!

Improving the moteus update rate, part 4

In part 1, part 2, and part 3, I looked at what was limiting the update rate of the moteus controller when built into a quadruped configuration and how to improve that.  Now, it is time for the final demonstration!

That video was shot with a 150Hz overall update rate.  The plot shows the commanded and actual position of the three joints in the front right leg, although not all to the same vertical scale.  Updating the servos themselves only used about 3.5ms per cycle, but the gait logic used another 1-1.5ms, which made hitting 200Hz not super reliable, thus running at 150Hz.

Future work

I would still like to be able to perform 1kHz full system updates.  These experiments have let me come up with a plan that I think will achieve that with plenty of margin from the servo side in the next revision.

  • Switch the controllers to use FD-CAN:  I had initially not used CAN as the communication mechanism because I didn’t want to be limited to 1Mbps and 8 byte frames.  However, recent STM32 controllers come with FD-CAN support, which allows up to 64 byte frames and 8Mbps.  The hardware FD-CAN receiver implements CRCs for free, should be more reliable, and manages some amount of pre-filtering and processing, which should further reduce the turnaround time of querying a device.
  • Integrate with the host computer over SPI: While I was able to make serial work by busy loop polling on a dedicated CPU, the SPI bus has an even higher possible bitrate and even if its kernel driver is just as problematic, it can still be polled in the same way.
  • Operate 4 separate busses:  This will be done by having probably 2 STM32’s on the host computer daughterboard, each managing two busses.  This way each leg will have its own CAN bus.

 

Improving the moteus update rate, part 3

Back in part 1 and part 2, I looked at problems that limited the rate at which the host computer could command the full quadruped and some of the solutions.  Now, in part 3, I’ll cover more of the solution.

More solution steps

Previously, I switched to using PREEMPT_RT, switched bridging strategies, and optimized the turnaround of the individual servo.  Now, I’ll move on to optimizing the host software.

4. Host C++ software micro-optimizations

The primary contributor in the host software to the overall update rate is the time it takes to turn around from receiving a reply from one servo, to sending the next command.  I first did some easy micro-optimizations which came up in profiling.

  • error_code: My implementation of error_code with strings attached was doing lots of string manipulation even when no one asked for it.  Fixing that saved a fair amount of time throughout.
  • Memory allocation: There were a few sites in the code that generated packets where a persistent buffer could be used each time, instead of having to allocate a fresh one.

5. boost::asio

The host software was using boost::asio to interact with the serial port.  It is high performance for what it does, allowing multiple external operations to happen in the same thread, but necessarily relies on an epoll loop and non-blocking write operations.  These aren’t particularly fast in the linux kernel, and the best turnaround time on the rpi I could achieve with asio was around 80us.

I implemented a standalone proof of concept which just uses a single thread to read and write to the serial port in a blocking manner.  Doing that allowed me to get the turnaround down to around 30us.

6. Protocol design

The register protocol that is used for high rate control had one opcode for setting a single register, and another for setting multiple consecutive ones.  The multiple consecutive one requires an additional byte to identify how many registers are set.  The same thing is true for queries.

I shaved a byte off of the common case of both by allowing writes and reads of up to 3 registers to encode their length in the primary opcode.

With that change, the full query packet to each servo was 23 bytes, and the full reply packet was 28 bytes.

7. Linux serial driver latencies

With the blocking thread approach from step #5 and that thread set to real-time priority, the average turnaround was indeed 30us.  However, when run in the full control software (which does other processing as well), occasionally latencies would be in the several millisecond range.  Also, since the PL011 on the raspberry pi only has a 16 byte FIFO, reading any frames larger than that was unreliable, as the kernel didn’t always get around to servicing things fast enough.  Even with sub-16 byte frames, and a blocking reader, the kernel would still delay reads by a millisecond or so sometimes.  I believe this is because the PL011 only provides interrupt notification on even 4 byte boundaries of its FIFO, and otherwise relies on the kernel to poll it.

Well, I can poll too, so I fixed this by just disabling the kernel’s serial driver, opening up “/dev/mem” and polling the IO memory of the controller manually in a busy loop from a RT thread.  This let me get turnarounds down to 4us, and also let me receive packets of arbitrary length without loss.  See https://github.com/mjbots/mjmech/blob/master/mech/rpi3_raw_uart.h and https://github.com/mjbots/mjmech/blob/master/mech/rpi3_raw_uart.cc

8. Linux scheduling latencies

With the above changes, the serial port was doing fine, but linux still had problems scheduling the primary process to run with less than 1ms precision, even when it was RT priority.  To solve that, and make the serial thread a bit better performing too, I used the “isolcpus” feature of linux to exclude 2 of the 4 raspberry pi processors from normal scheduling.  Then the main thread of the application got processor 3, and the serial thread got processor 4.  With those changes, the time required to poll the full set of 12 servos is rock solid.  Here’s a plot showing the cycle time required to poll a full set of telemetry data  (but not command them, which adds a bit more time per query but doesn’t affect the variability).

20190919-improved-latency-jitter.png

Next steps

And in the next and final post of this series, I’ll demonstrate the final result, showing all of these changes are integrated into the primary control software.

 

Improving the moteus update rate, part 2

Back in part 1, I looked at the driving factors that limited the update rate of the full quadruped.  Now in part 2, I’ll cover the first half of the solution.

Background

To begin with, there were two major paths that I could take based around the network topology.  In one path, I would remove the active bridging capability from the junction board, and rely on the Raspberry Pi to drive all the servos directly, and in the other the active bridge would stick around.  There were a number of key disadvantages to both approaches:

Passive bridge: In this model, the raspberry pi has no choice but to rapidly turn around 12 separate queries and responses.  There is no hope for parallelization.

Active bridge: Here, the junction board’s STM32 can offload the multiple queries.  However, there are two big downsides.  The first are that the data must flow across 2 separate 485 busses.  The second, and possibly more problematic, is that it only works out better if the junction board can stream a large amount of data consecutively to the raspberry pi.  In my previous experiments, I had run into what I believe was a kernel bug that killed the serial port until a power cycle upon receive overruns.  Debugging that could easily be a large project.  Implementing the active bridge would also be a lot of work, as I don’t currently have a protocol client that runs on the STM32.

My initial back of the envelope calculations, surprisingly, indicated that both approaches could potentially get up to 250 or 300Hz with sufficient margin to still do other things.  I had expected that parallelization would win the day, but it turned out that duplicating the data across both busses had the potential to completely negate that advantage.

Solution steps (first half)

1. RT_PREEMPT

The first thing I did was to switch to using a RT_PREEMPT enabled kernel with the governors set to performance.  This by itself reduced the Raspberry Pi’s reply to query turnaround from 200us to 100us.  I wanted to at this time also upgrade to the 4.19 kernel, as I hoped that it would have some fixes for high serial rates.  However, it doesn’t look like anyone has a RT enabled 4.19 kernel that supports USB and ethernet, both of which are moderately useful on a board with not many other interfaces.

2. “Passive” bridging

In lieu of spinning a new board right away, I instead modified the firmware of the junction board to remove almost all functionality and instead just busy loop shoving individual bytes around between the interfaces.  It is fast enough to just barely be able to achieve this at 3Mbit if interrupts are off.  The downside is that the IMU won’t be usable.  To fix that, I’ll have to spin a new board that just passively wires all the busses together and dangles the STM32 off the bus like any other node and hope that the star topology won’t matter over such short distances.

Doing this removed the double transmission penalty, as well as the additional 90us latency in the junction board on all transfers.  This, combined with RT_PREEMPT, got the overall cycle time down to about 6727us, or ~150Hz.

3. Controller turnaround time

Next, I started in on the moteus firmware, in order to improve the turnaround time between when a query is sent, and when the corresponding reply is generated.  This resulted in many optimizations throughout the code.  The first set were made to the primary control ISR, which runs at 30kHz currently.  Tiny improvements here make everything else run much faster.

  • Constructors: The ARM gcc toolsuite, regardless of the optimization level, seems to implement a constructor of an all zero structure as a call to memset.  This is true even if the structure is a small number of bytes.  There were several places in the firmware where a structure was zeroed out by calling its default constructor.  Switching that to a “clear” method which manually assigned all the fields drastically reduced the time spent there.
  • Appropriate types: When calculating a smoothed velocity, the filter was using an int64_t as an intermediate variable, which is not very efficient on an ARM.  Nothing more than an int32_t was actually needed.
  • fmod: I was wrapping between zero and two pi by using fmod.  The cases where wrapping occurs were never more than one or two phases off, so just dividing and truncating was much faster with no appreciable loss of precision.
  • Pre-computing some variables: Some of the calculations done in the ISR were repeated unnecessarily, so now they are on re-computed upon a configuration change.
  • Make debug output optional: The moteus board has a high rate RS422 debug output, which is only rarely used.  I added a configuration knob to turn it off when not in use.
  • SPI overhead: I was using the ST HAL API to access the position sensor over SPI.  I still do that for setup, but now just twiddle the raw registers to do the actual transfer, which saves a little bit of overhead in the HAL.
  • SPI frequency: The AS5047P has a nominal SPI frequency limit of 10MHz.  I asked the HAL for 10MHz, but the closest it could actually do was 6MHz.  I decided to brave some flakiness and upped it to 12MHz to shave another microsecond off.

After these changes, the ISR uses around 7-8us when stopped, and around 13us per iteration when in position control mode, or around 40% of the CPU budget.

Next I made more optimizations to the software which runs in the main loop:

  • StaticFunction: I was using a home-grown solution to a bounded size type-erased function callback that wasn’t particularly fast.  I switched it out for the SG14 inplace_function from github, modified to allow shrinking to a smaller type.  That sped up everything in the main loop by a fair amount.
  • Protocol parsing: A number of steps in the protocol parsing and emitting eventually delegated to a call to “memcpy” into a particular type.  I broke some abstractions so that all that parsing and emitting eventually compiles down to just loads and stores.
  • Reply encoding: The RS485 register protocol allows clients to query multiple consecutive registers.  As a shortcut, the server was responding to those with a series of single register replies.  I fixed that by implementing sending multiple replies at once to shave off a few bytes and some processing.

I experimented with using the STM32F446’s built in CRC unit, however it only implements a fixed CRC-32 variant.  So updating that would have required reflashing all my bootloaders.  Using that is probably best taken by updating to an STM32F7 or STM32G4 which have more configurable CRC units.  I also tried hand assembling an optimized version of the CCITT16 checksum I was using, but was only able to achieve the same 8 cycles per byte that boost already had.

With all of these changes, the average turnaround time for a single servo was down from 140us to in the 70-80us range for a full control update.

Next steps

Coming up, in part 3, I’ll cover the remainder of the steps I took to improve the overall update rate.

 

Improving the moteus update rate, part 1

The moteus brushless controller I’ve developed for the force controlled quadruped uses an RS485 based command-response communication protocol.  To complete a full control cycle, the controlling computer needs to send new commands to each servo and read the current state back from each of them.  While I designed the system to be capable of high rate all-system updates, my initial implementation took a lot of shortcuts.  The result being that for all my testing so far, the outgoing update rate has been 100Hz, but state read back from the servos has been more at like 10Hz.  Here I’ll cover my work to get that rate both symmetric, and higher.

In this first post, I’ll cover the existing design and how that drives the update rate limitations.

Individual contributors

There are many pieces that chain together to determine the overall cycle time.  Here is my best estimate of each.

RS485 bitrate

The RS485 protocol that I’m using right now runs at 3,000,000 baud half duplex.  That means it can push about 300k bytes per second in one direction or the other.  While the STM32 in the moteus has UARTs capable of going faster than that, control computers that can manage much faster than 3Mbit are rare, so without switching to another transport like ethernet, this is about as good as it will get.

This means that at a minimum, there is a latency associated with all transmissions associated with the amount of data, which is roughly bytes * 10 / 3000000.

Servo turnaround

The RS485 protocol moteus uses allows for unidirectional or bidirectional commands.  In past experiments, all the control commands were sent in a group as unidirectional commands, then the state was queried in a series of separate command-reply sequences.  The firmware of the moteus servo currently takes around 140us from when a command finishes transmission and the corresponding reply is started.  The ideal turnaround for a bare servo is then (txbytes * 10 / 3000000) + 140us + (rxbytes * 10 / 3000000).

IMU junction board

The current quadruped has a network topology that looks like:

imu-junction-block-diagram

The junction board is an STM32F4 processor that performs active bridging across the RS485 networks and also contains an IMU.  This topology was chosen so that the junction board could query both halves of the quadruped simultaneously, then send a single result back to the host computer.  However, that has not been implemented yet, thus all the junction board does is further increase the latency of a single command.  As implemented now, it adds about 90us of latency, plus the time required to transmit the command and reply packets a second time.  That makes the latency for a single command and reply now: 2 * (txbytes * 10 / 3000000) + 110us + 90us + 2 * (rxbytes * 10 / 3000000)

Raspberry pi command transmission

As mentioned, the current system first sends new commands to the servos, then updates their state.  When sending the new commands, the existing implementation makes a separate system call to initiate each servos output packet.  Sometimes the linux kernel groups those together into a single outgoing frame on the wire, but more often than not those commands ended up being separated by 120us of white space.  That adds 12 * 120us of additional latency to an overall update frame.  So, 12 * 120us = 1440us

Raspberry pi reply to query turnaround

During the phase when all 12 of the servos are being queried, after each query, the raspberry pi needs to receive the response then formulate and send another query.  This currently takes around 200us from when the reply finishes transmission until when the next query hits the wire.  This is some combination of hardware latency, kernel driver latency, and application latency.  It sums up to 200us * 12 = 2400us

Packet framing

The RS485 protocol used for moteus has some header and framing bytes, that are an overhead on every single command or response.  This is currently:

  • Leadin Framing: 2 bytes
  • Source ID: 1 byte
  • Destination ID: 1 byte
  • Payload Size: 1 byte for small things
  • Checksum: 2 bytes

That works out to a 7 byte overhead, which in the current formulation applies 12x for the command phase, and 48 times for the query phase.  12x for the raspberry pi sending, 12x for the junction board sending, and 24x for the combined receive side.  That makes a total of (12 + 48) * 7 = 420 bytes * 10 / 3000000 = 1400us

Data encoding

In the current control mode of the servo, a number of different parameters are typically updated every control cycle:

  • Target angle
  • Target velocity
  • Maximum torque
  • Feedforward torque
  • Proportional control constant
  • Not to exceed angle (only used during open loop startup)

The servo protocol allows each of these values to be encoded on the wire as either a 4 byte floating point value, or as a fixed point signed integer of either 4, 2, or 1 bytes.  The current implementation sends all 6 of these values every time as 4 byte floats.  Additionally two bytes are required to denote which parameters are being sent.  That works out to: ((6 parameters * 4 byte float + 2) * 12 servos * 2 for junction board * 10) / 3000000 = 2080us

The receive side returns the following:

  • Current angle
  • Current velocity
  • Current torque
  • Voltage
  • Temperature
  • Fault code

And in the current implementation all of those are either sent as a 4 byte float, or a 4 byte integer.  That makes ((6 parameters * 4 bytes + 2) * 12 servos * 2 for junction board * 10) / 3000000 = 2080us

Overall result

I put together a spreadsheet that let me tweak each of the individual parameters and see how that affected the overall update rate of the system.

I made a dedicated test program and used the oscilloscope to monitor a cycle and roughly verified these results:

overall-latency

Thus, with a full command and query cycle, an update rate of about 80Hz can be achieved with the current system.

Next up, working to make this much better.

 

Pocket NC Raspberry Pi Wifi Bridge

The primary UI the Pocket NC presents is a web interface accessible over a virtual USB based ethernet port.  I wanted to be able to run mine not immediately near an ethernet jack, but also didn’t want to have to tote a laptop over every time to check on it.  I had plenty of raspberry pi’s lying around, so rigged one up as a wifi bridge.

First, I found a random case to print from thingiverse, the TurboPi:

turbopi

Then I gave it a fixed IP address on my wifi network and set up IP forwarding with NAT.

iptables -A POSTROUTING -o eth1 -j MASQUERADE

Then, I saved that config:

iptables-save > /etc/iptables.ipv4.nat

And restored them in /etc/rc.local

iptables-restore < /etc/iptables.ipv4.nat

Finally, I enabled forwarding in /etc/sysctl.conf

net.ipv4.ip_forward=1

Finally, I added a route on my desktop with the raspberry pi’s address as the gateway.  Now I can transparently access the PocketNC from anywhere!

dsc_0154
Raspberry Pi forwarding network traffic to the Pocket NC

rpi3 interface board

Now that I have a chassis that can walk a little bit, I need to get the onboard computer working.  To do that, I needed to update the raspberry pi 3 daughterboard I built for the previous turret for the new bus voltage and communication format.

The rpi3’s UART is incapable of controlling the direction line on the RS485 transceiver, so I added a small STM32 micro in line to control the transmit / receive direction.  It adds a little bit of latency, but testing the firmware I was able to get it down to only a byte’s worth or so.

Here’s a quick render:

rpi3_hat_r2

And a shot of the actual board:

dsc_1703

Unfortunately, I managed to omit a necessary pullup on the enable line of the primary buck converter.  To get things moving along, I blue-wired it under the microscope:

dsc_1710dsc_1707

And here’s a shot of it installed on the rpi3 b+:

dsc_1716

Sources:

bazel for gstreamer – plan

After OpenCV, the other major dependency the mjmech software has, which is necessary to complete the raspberry pi 3b+ bazel build setup, was gstreamer. Unlike the previous dependencies, this one is a doozy — gstreamer has an enormous transitive dependency set. Additionally, we needed to use its ffmpeg wrappers, which brings in more dependencies.

In this post, I’ll just try to map out the dependencies that we ended up actually needing, so that they can be tackled one by one.

gstreamer core

The core gstreamer base, actually has a minimal dependency set, only really glib. glib in turn has a small dependency set, just libffi, pcre, and zlib.

gst-plugins-base

Most of the functionality within the gstreamer ecosystem is contained within “plugins”, i.e. modules of software which implement a constrained interface and can be installed into media pipelines. All plugins use the common routines defined within gst-plugins-base, and a few low level plugins are defined here. This begins the dependency explosion, as some of these base plugins start to call into non-trivial pieces of the free desktop graphical stack. Direct dependencies here include libx11, libxv, from x.org, and pango.

In turn those packages have a large fanout. libx11 and libxv depend upon videoproto, libxext, xorgproto, libxcb, libxau, libxdcmp, xtrans, and xcb-proto. pango depends upon freetype, harfbuzz, fribidi, cairo, bzip2, freetype, pixman, fontconfig, util-linux, and expat.

gst-plugins-good and gst-plugins-bad

mjmech doesn’t rely on any of the plugins in gst-plugins-good, so nothing was necessary there! We do use some from gst-plugins-bad, but those have no additional dependencies.

gst-plugins-ugly

Here, mjmech uses the x264 plugin, which in turn depends on the x264 external library.

gst-libav

These are the plugins which wrap ffmpeg (nee libav, nee ffmpeg). ffmpeg in turn depends additional on nasm to assemble for x86.

Next steps

Now that we’ve enumerated all the dependent packages, we can start to work on packaging them. First up are those necessary for gstreamer core, so libffi, pcre, and zlib.

bazel for opencv

The next level of difficulty in bazel-ifying packages for mjmech was opencv.

First, for the impatient, Apache 2.0 licensed sources are available on github: https://github.com/mjbots/bazel_deps/tree/master/tools/workspace/opencv

OpenCV’s native build system consists of nearly 200 cmake files with over 20,000 total lines of code, plus assorted helper scripts and prototype files which are substituted into.  Fortunately, I didn’t need to support the full complexity of the opencv build system.  Things I didn’t bother to touch:

  • Autodetecting the location of any packages:  Since this is embedded within a bazel project, all the dependencies I was going to use were already bazel-ified.
  • GPU support:  I didn’t bother with CUDA, OpenCL, or anything else GPU, as the GPU on the raspberry pi3 has only a minimally supported OpenCL stack, at best.
  • Examples: I had no real need to build any of the sample or demonstration applications.
  • Modules: I only needed, for now, a small fraction of the total opencv module set.

Intended result

OpenCV is broken up into numerous “modules”, each of which is largely independent and implements a relatively self-contained piece of functionality.  Each module can depend upon other opencv modules and other external packages.  I wanted to reach a point with the bazel configuration where each module could be described at a relatively high level, which didn’t seem too implausible since the module structure is relatively constant across modules.

Think something like:

cvmodule(
  name = "core",
)

cvmodule(
  name = "imgproc",
  deps = [":core"],
)

# ... etc

Bazel implementation

To achieve this in bazel requires three phases: first, a repository rule to actually download the opencv tarball, second something written in Starlark which could implement a cvmodule like rule, and finally a BUILD file equivalent to call it and enumerate the opencv modules I cared about.

The repository rule was relatively straightforward.  It just uses the standard bazel mechanisms to download a tarball and template expand a known BUILD file into place.

The Starlark file is where all the real work lies:

First, we just define a common set of preprocessor defines which will be applied to all opencv translation units:

_OPENCV_COPTS = [
    "-D_USE_MATH_DEFINES",
    "-D__OPENCV_BUILD=1",
    "-D__STDC_CONSTANT_MACROS",
    "-D__STDC_FORMAT_MACROS",
    "-D__STDC_LIMIT_MACROS",
    "-I$(GENDIR)/external/opencv/private/",
]

The only real interesting piece there is the $(GENDIR) fragment, which causes bazel to substitute in the path to the generated output directory. Since a number of the headers that we will be using later on are generated as part of the bazel build process, we need to ensure that the compiler can actually find them.  The alternate formulation which doesn’t require knowledge of GENDIR is to list it in the includes attribute.  That however, exposes the headers to all downstream modules, which we don’t want.

Next, we enumerate the set of processor specific options that we will support:

_KNOWN_OPTS = [
    ("neon", "armeabihf"),
    ("vfpv3", "armeabihf"),
    ("avx", "x86_64"),
    ("avx2", "x86_64"),
    ("avx512_skx", "x86_64"),
    # ...

OpenCV has support for a comprehensive set of special instruction set extensions for a variety of different processors. They can either be compiled in and always used, or dispatched at runtime, although these bazel rules ignore runtime dispatching entirely. Some of the generated headers need to know the full possible set of options for a given architecture, thus this list here.

Module definition macros

Now we can start getting into the module definition Starlark macros.  To handle some generated headers that are common to every opencv module, I ended up creating an opencv_base macro that is module independent.  It needs to be called from the BUILD file, to generate a number of the private headers, but exposes no logically public labels.  It does require a passed in “configuration” dictionary (which will also be passed in to the primary module generation macro).

CONFIG = {
    "modules" : [
        "aruco",
        "calib3d",
        "core",
        "imgcodecs",
        "imgproc",
    ],
# ...

The big annoyance here is that you have to enumerate the set of modules twice.  Once in that config, and again in the set of invocations to the module creation macro.  The reason is the generated opencv_modules.hpp file, which provides preprocessor defines describing which modules are included in the build.

Next is the implementation of the module generation macro, which I ended up calling opencv_module.  It needs just a few arguments for the modules I ended up converting:

  • name – should be self-evident
  • config – the same configuration dictionary from above
  • excludes – source files to skip compiling from this module
  • dispatched_files – the list of files in this module which contain processor specific specializations
  • deps – a set of bazel labels describing the dependency set

This macro goes about generating the appropriate .simd_declarations.hpp and .simd.hpp files, a non-functional stub OpenCL implementation, and then goes on to define the actual bazel cc_library.

Resulting BUILD file

At this point, the BUILD file can glue all these things together. It ended up being not too far away from my initial ideal:

opencv_base(config=CONFIG)

opencv_module(
    name = "core",
    config = CONFIG,
    dispatched_files = {
        "stat" : ["sse4_2", "avx"],
        "mathfuncs_core" : ["sse2", "avx", "avx2"],
    },
    deps = ["@eigen"],
)

opencv_module(
    name = "imgproc",
    config = CONFIG,
    dispatched_files = {
        "accum" : ["sse2", "avx", "neon"],
    },
    deps = [":core"],
)

# ...

Download

As mentioned at the start, all the source are available on github:

bazel-ifying simple autoconf packages

This is part N in a series describing how I created the bazel infrastructure to build all the third party packages for mjmech.  Previously we have:

We left off with the first, very simple packages configured to build with bazel.  In this installment we will tackle those that require at least minimal configuration, i.e. those that have some files which are normally generated as part of the build process.

snappy / template_file

snappy (https://github.com/google/snappy), a file compression library, is largely a pure C++ project, but it does have a single generated header file.  There are a few possible options within bazel to handle this.  The simplest would be to use a macro, although that has complications with respect to namespacing and label resolution.  Here, I created a custom rule, called “template_file”.

As seen in the source on github the interface is relatively simple:

template_file(
  src,
  is_executable,
  substitutions,
  substitution_list,
)

There are two possible forms for passing in substitutions, either a string_dict when using “substitutions”, or a list of KEY=VALUE strings in “substitution_list”.  The two forms are present because starlark can make it awkward to deal with dictionaries, so often working with a list is much more convenient.

The snappy build rule uses this to generate snappy-stubs-public.h:

template_file(
    name = "snappy-stubs-public.h",
    src = "snappy-stubs-public.h.in",
    substitutions = {
        "${HAVE_STDINT_H_01}": "1",
        "${HAVE_STDDEF_H_01}": "1",
        "${HAVE_SYS_UIO_H_01}": "1",
        "${SNAPPY_MAJOR}": "1",
        "${SNAPPY_MINOR}": "1",
        "${SNAPPY_PATCHLEVEL}": "7",
    },
)

log4cpp / autoconf

Next up in the difficulty level is log4cpp, which uses autoconf as its native build system.  autoconf typically has a config.h.in file, defined in a semi-standardized format, mostly comprised of lines like:

/* Define to 1 if you have the  header file. */
#undef HAVE_STRING_H

The pre-processing stage turns each of those lines into either an actual #define, or just a a comment saying that it continues to be undefined.

That is substantive enough that the simple built-in bazel template rule won’t cut it though.  For this, we split the work into two pieces, 1) a python helper script to do the actual formatting and 2) a starlark bazel rule which defaults in some common autoconf defines.  For the bazel_deps repository, the bazel autoconf rule has basically every autoconf define that is shared by more than one project in it, to reduce duplication and make it more likely that all the dependencies are configured consistently.

With the two of them together, the resulting BUILD file defines a cc_library as normal, but it depends upon the new autoconf rule like found in the log4cpp instance:

autoconf_config(
    name = "include/log4cpp/config.h",
    src = "include/config.h.in",
    package = "log4cpp",
    version = "1.1.3",
    defines = autoconf_standard_defines + [
        "DISABLE_SMTP",
        "DISABLE_REMOTE_SYSLOG",
    ],
    prefix = "LOG4CPP_",
)