Improving the moteus update rate, part 2

Back in part 1, I looked at the driving factors that limited the update rate of the full quadruped.  Now in part 2, I’ll cover the first half of the solution.


To begin with, there were two major paths that I could take based around the network topology.  In one path, I would remove the active bridging capability from the junction board, and rely on the Raspberry Pi to drive all the servos directly, and in the other the active bridge would stick around.  There were a number of key disadvantages to both approaches:

Passive bridge: In this model, the Raspberry Pi has no choice but to rapidly turn around 12 separate queries and responses.  There is no hope for parallelization.

Active bridge: Here, the junction board’s STM32 can offload the multiple queries.  However, there are two big downsides.  The first is that the data must flow across 2 separate 485 busses.  The second, and possibly more problematic, is that it only works out better if the junction board can stream a large amount of data consecutively to the Raspberry Pi.  In my previous experiments, I had run into what I believe was a kernel bug that, upon receive overruns, killed the serial port until a power cycle.  Debugging that could easily be a large project.  Implementing the active bridge would also be a lot of work, as I don’t currently have a protocol client that runs on the STM32.

My initial back of the envelope calculations, surprisingly, indicated that both approaches could potentially get up to 250 or 300Hz with sufficient margin to still do other things.  I had expected that parallelization would win the day, but it turned out that duplicating the data across both busses had the potential to completely negate that advantage.

Solution steps (first half)


The first thing I did was to switch to an RT_PREEMPT enabled kernel with the CPU frequency governors set to performance.  This by itself reduced the Raspberry Pi’s reply to query turnaround from 200us to 100us.  At the same time, I also wanted to upgrade to the 4.19 kernel, as I hoped that it would have some fixes for high serial rates.  However, it doesn’t look like anyone has an RT enabled 4.19 kernel that supports USB and ethernet, both of which are moderately useful on a board with not many other interfaces.

2. “Passive” bridging

In lieu of spinning a new board right away, I instead modified the firmware of the junction board to remove almost all functionality and instead just busy loop shoving individual bytes around between the interfaces.  It is fast enough to just barely be able to achieve this at 3Mbit if interrupts are off.  The downside is that the IMU won’t be usable.  To fix that, I’ll have to spin a new board that just passively wires all the busses together and dangles the STM32 off the bus like any other node and hope that the star topology won’t matter over such short distances.

Doing this removed the double transmission penalty, as well as the additional 90us latency in the junction board on all transfers.  This, combined with RT_PREEMPT, got the overall cycle time down to about 6727us, or ~150Hz.

3. Controller turnaround time

Next, I started in on the moteus firmware, in order to improve the turnaround time between when a query is sent, and when the corresponding reply is generated.  This resulted in many optimizations throughout the code.  The first set were made to the primary control ISR, which runs at 30kHz currently.  Tiny improvements here make everything else run much faster.

  • Constructors: The ARM gcc toolsuite, regardless of the optimization level, seems to implement a constructor of an all zero structure as a call to memset.  This is true even if the structure is a small number of bytes.  There were several places in the firmware where a structure was zeroed out by calling its default constructor.  Switching that to a “clear” method which manually assigned all the fields drastically reduced the time spent there.
  • Appropriate types: When calculating a smoothed velocity, the filter was using an int64_t as an intermediate variable, which is not very efficient on an ARM.  Nothing more than an int32_t was actually needed.
  • fmod: I was wrapping between zero and two pi by using fmod.  The cases where wrapping occurs were never more than one or two phases off, so just dividing and truncating was much faster with no appreciable loss of precision.
  • Pre-computing some variables: Some of the calculations done in the ISR were repeated unnecessarily, so now they are only re-computed upon a configuration change.
  • Make debug output optional: The moteus board has a high rate RS422 debug output, which is only rarely used.  I added a configuration knob to turn it off when not in use.
  • SPI overhead: I was using the ST HAL API to access the position sensor over SPI.  I still do that for setup, but now just twiddle the raw registers to do the actual transfer, which saves a little bit of overhead in the HAL.
  • SPI frequency: The AS5047P has a nominal SPI frequency limit of 10MHz.  I asked the HAL for 10MHz, but the closest it could actually do was 6MHz.  I decided to brave some flakiness and upped it to 12MHz to shave another microsecond off.
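
As a rough illustration of the fmod item above, here is the arithmetic of wrapping by dividing and truncating, next to an fmod based reference.  This is only a sketch in Python to show the math; the firmware itself is not shown in this post, and the function names here are made up for the example.

```python
import math

TWO_PI = 2.0 * math.pi


def wrap_fmod(angle):
    """Reference version: wrap an angle into [0, 2*pi) using fmod."""
    wrapped = math.fmod(angle, TWO_PI)
    if wrapped < 0.0:
        wrapped += TWO_PI
    return wrapped


def wrap_truncate(angle):
    """Wrap by dividing, truncating toward zero, and removing whole turns."""
    turns = int(angle / TWO_PI)  # int() truncates toward zero
    wrapped = angle - turns * TWO_PI
    if wrapped < 0.0:
        wrapped += TWO_PI
    return wrapped


# Inputs that are at most a couple of turns away from the target range.
for sample in (0.5, 7.0, -1.0, 13.0):
    print(sample, wrap_fmod(sample), wrap_truncate(sample))
```

For inputs that are only a turn or two out of range, the two versions agree to within floating point rounding, which matches the "no appreciable loss of precision" observation.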

After these changes, the ISR uses around 7-8us when stopped, and around 13us per iteration when in position control mode, or around 40% of the CPU budget.
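
The CPU budget figure follows directly from the ISR rate.  A small sketch of that arithmetic, using the 30kHz rate and the timings quoted above (the helper name is just for illustration):

```python
ISR_RATE_HZ = 30_000  # control ISR rate quoted in the text


def isr_cpu_fraction(isr_time_us, rate_hz=ISR_RATE_HZ):
    """Fraction of each ISR period that is spent inside the ISR."""
    period_us = 1e6 / rate_hz  # about 33.3us per iteration at 30kHz
    return isr_time_us / period_us


for label, time_us in (("stopped", 8.0), ("position control", 13.0)):
    print(f"{label}: {isr_cpu_fraction(time_us):.0%} of the CPU")
```

At 13us per iteration that comes to roughly 39%, which is the "around 40%" above.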

Next I made more optimizations to the software which runs in the main loop:

  • StaticFunction: I was using a home-grown solution to a bounded size type-erased function callback that wasn’t particularly fast.  I switched it out for the SG14 inplace_function from github, modified to allow shrinking to a smaller type.  That sped up everything in the main loop by a fair amount.
  • Protocol parsing: A number of steps in the protocol parsing and emitting eventually delegated to a call to “memcpy” into a particular type.  I broke some abstractions so that all that parsing and emitting eventually compiles down to just loads and stores.
  • Reply encoding: The RS485 register protocol allows clients to query multiple consecutive registers.  As a shortcut, the server was responding to those with a series of single register replies.  I fixed that by implementing sending multiple replies at once to shave off a few bytes and some processing.

I experimented with using the STM32F446’s built in CRC unit, however it only implements a fixed CRC-32 variant.  Switching to that would have required reflashing all my bootloaders.  Hardware CRC is probably best pursued by updating to an STM32F7 or STM32G4, which have more configurable CRC units.  I also tried hand assembling an optimized version of the CCITT16 checksum I was using, but was only able to achieve the same 8 cycles per byte that boost already had.
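
For reference, here is what a 16 bit CCITT style checksum looks like when written out bit by bit.  The post does not say which CCITT variant is in use, so the parameters below (polynomial 0x1021, initial value 0xFFFF, no bit reflection, no final XOR) are an assumption, and this Python sketch is about clarity rather than speed.

```python
def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """Bitwise CRC-16 with polynomial 0x1021 (assumed variant, see above)."""
    for byte in data:
        crc ^= byte << 8  # feed the next byte into the top of the register
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


# Standard check string for this parameter set.
print(hex(crc16_ccitt(b"123456789")))  # 0x29b1
```

A table driven or hand optimized version computes the same result, just with fewer operations per byte.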

With all of these changes, the average turnaround time for a single servo was down from 140us to the 70-80us range for a full control update.

Next steps

Coming up, in part 3, I’ll cover the remainder of the steps I took to improve the overall update rate.


Improving the moteus update rate, part 1

The moteus brushless controller I’ve developed for the force controlled quadruped uses an RS485 based command-response communication protocol.  To complete a full control cycle, the controlling computer needs to send new commands to each servo and read the current state back from each of them.  While I designed the system to be capable of high rate all-system updates, my initial implementation took a lot of shortcuts.  As a result, for all my testing so far, the outgoing update rate has been 100Hz, but state read back from the servos has been more like 10Hz.  Here I’ll cover my work to get that rate both symmetric, and higher.

In this first post, I’ll cover the existing design and how that drives the update rate limitations.

Individual contributors

There are many pieces that chain together to determine the overall cycle time.  Here is my best estimate of each.

RS485 bitrate

The RS485 protocol that I’m using right now runs at 3,000,000 baud half duplex.  That means it can push about 300k bytes per second in one direction or the other.  While the STM32 in the moteus has UARTs capable of going faster than that, control computers that can manage much faster than 3Mbit are rare, so without switching to another transport like ethernet, this is about as good as it will get.

This means that, at a minimum, every transmission incurs a latency proportional to the amount of data, which is roughly bytes * 10 / 3000000 seconds.

Servo turnaround

The RS485 protocol moteus uses allows for unidirectional or bidirectional commands.  In past experiments, all the control commands were sent in a group as unidirectional commands, then the state was queried in a series of separate command-reply sequences.  The firmware of the moteus servo currently takes around 140us from when a command finishes transmission until the corresponding reply is started.  The ideal turnaround for a bare servo is then (txbytes * 10 / 3000000) + 140us + (rxbytes * 10 / 3000000).

IMU junction board

The current quadruped has a network topology that looks like:


The junction board is an STM32F4 processor that performs active bridging across the RS485 networks and also contains an IMU.  This topology was chosen so that the junction board could query both halves of the quadruped simultaneously, then send a single result back to the host computer.  However, that has not been implemented yet, thus all the junction board does is further increase the latency of a single command.  As implemented now, it adds about 90us of latency, plus the time required to transmit the command and reply packets a second time.  That makes the latency for a single command and reply now: 2 * (txbytes * 10 / 3000000) + 140us + 90us + 2 * (rxbytes * 10 / 3000000).
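
The latency formulas so far can be written out as a short sketch.  The 10 bits per byte figure comes from the formulas above (consistent with one start bit, 8 data bits, and one stop bit, which is an assumption about the framing); the servo turnaround is left as a parameter that defaults to the 140us quoted for the bare servo, and the 33 byte packet size is purely illustrative.

```python
BAUD = 3_000_000
BITS_PER_BYTE = 10  # assumed: start bit + 8 data bits + stop bit


def wire_time_us(num_bytes):
    """Time on the wire for num_bytes at 3Mbit, in microseconds."""
    return num_bytes * BITS_PER_BYTE * 1e6 / BAUD


def bare_servo_latency_us(tx_bytes, rx_bytes, servo_turnaround_us=140.0):
    """Command plus reply for a servo wired directly to the host."""
    return wire_time_us(tx_bytes) + servo_turnaround_us + wire_time_us(rx_bytes)


def bridged_latency_us(tx_bytes, rx_bytes, servo_turnaround_us=140.0,
                       bridge_latency_us=90.0):
    """Same exchange through the junction board: every byte crosses two busses."""
    return (2 * wire_time_us(tx_bytes) + servo_turnaround_us
            + bridge_latency_us + 2 * wire_time_us(rx_bytes))


# Illustrative packet size only.
print(bare_servo_latency_us(33, 33), bridged_latency_us(33, 33))
```

Even for modest packets, the second trip across the wire costs more than the junction board's own 90us.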

Raspberry pi command transmission

As mentioned, the current system first sends new commands to the servos, then updates their state.  When sending the new commands, the existing implementation makes a separate system call to initiate each servo’s output packet.  Sometimes the linux kernel groups those together into a single outgoing frame on the wire, but more often than not those commands ended up being separated by 120us of white space.  That adds 12 * 120us of additional latency to an overall update frame.  So, 12 * 120us = 1440us

Raspberry pi reply to query turnaround

During the phase when all 12 of the servos are being queried, after each query, the raspberry pi needs to receive the response then formulate and send another query.  This currently takes around 200us from when the reply finishes transmission until when the next query hits the wire.  This is some combination of hardware latency, kernel driver latency, and application latency.  It sums up to 200us * 12 = 2400us

Packet framing

The RS485 protocol used for moteus has some header and framing bytes, that are an overhead on every single command or response.  This is currently:

  • Leadin Framing: 2 bytes
  • Source ID: 1 byte
  • Destination ID: 1 byte
  • Payload Size: 1 byte for small things
  • Checksum: 2 bytes

That works out to a 7 byte overhead, which in the current formulation applies 12 times for the command phase, and 48 times for the query phase: 12 for the Raspberry Pi sending, 12 for the junction board sending, and 24 for the combined receive side.  That makes a total of (12 + 48) * 7 = 420 bytes, which takes 420 * 10 / 3000000 s = 1400us.
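
The framing arithmetic, spelled out as a small sketch (the dictionary keys are just labels for the fields listed above):

```python
BAUD = 3_000_000
BITS_PER_BYTE = 10

FRAMING_BYTES = {
    "leadin": 2,
    "source_id": 1,
    "destination_id": 1,
    "payload_size": 1,  # 1 byte for small payloads
    "checksum": 2,
}

overhead_per_packet = sum(FRAMING_BYTES.values())  # 7 bytes
packets_per_cycle = 12 + 48                        # command phase + query phase
framing_bytes = overhead_per_packet * packets_per_cycle
framing_time_us = framing_bytes * BITS_PER_BYTE * 1e6 / BAUD

print(overhead_per_packet, framing_bytes, framing_time_us)
```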

Data encoding

In the current control mode of the servo, a number of different parameters are typically updated every control cycle:

  • Target angle
  • Target velocity
  • Maximum torque
  • Feedforward torque
  • Proportional control constant
  • Not to exceed angle (only used during open loop startup)

The servo protocol allows each of these values to be encoded on the wire as either a 4 byte floating point value, or as a fixed point signed integer of either 4, 2, or 1 bytes.  The current implementation sends all 6 of these values every time as 4 byte floats.  Additionally two bytes are required to denote which parameters are being sent.  That works out to: ((6 parameters * 4 byte float + 2) * 12 servos * 2 for junction board * 10) / 3000000 = 2080us

The receive side returns the following:

  • Current angle
  • Current velocity
  • Current torque
  • Voltage
  • Temperature
  • Fault code

And in the current implementation all of those are either sent as a 4 byte float, or a 4 byte integer.  That makes ((6 parameters * 4 bytes + 2) * 12 servos * 2 for junction board * 10) / 3000000 = 2080us
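
Both the command and reply figures come from the same calculation, sketched below.  The second call shows what the payload time would be if every value were sent as a 2 byte fixed point integer instead; that case is hypothetical, and it assumes the 2 bytes that denote which parameters are present stay the same size.

```python
BAUD = 3_000_000
BITS_PER_BYTE = 10


def payload_time_us(num_params, bytes_per_param, selector_bytes=2,
                    servos=12, bus_hops=2):
    """Wire time per cycle for one direction of parameter data."""
    bytes_per_servo = num_params * bytes_per_param + selector_bytes
    total_bytes = bytes_per_servo * servos * bus_hops
    return total_bytes * BITS_PER_BYTE * 1e6 / BAUD


print(payload_time_us(6, 4))  # current: everything as 4 byte values
print(payload_time_us(6, 2))  # hypothetical: everything as 2 byte integers
```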

Overall result

I put together a spreadsheet that let me tweak each of the individual parameters and see how that affected the overall update rate of the system.

I made a dedicated test program and used the oscilloscope to monitor a cycle and roughly verified these results:


Thus, with a full command and query cycle, an update rate of about 80Hz can be achieved with the current system.
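
As a sanity check, simply adding up the individual contributors from the sections above lands in the same place.  This is a reconstruction from the numbers in this post, not the spreadsheet itself, so it may group or omit terms differently.

```python
SERVOS = 12

contributions_us = {
    "servo turnaround": SERVOS * 140,
    "junction board latency": SERVOS * 90,
    "command transmission gaps": SERVOS * 120,
    "Raspberry Pi reply to query turnaround": SERVOS * 200,
    "packet framing": 1400,
    "command data": 2080,
    "reply data": 2080,
}

total_us = sum(contributions_us.values())
rate_hz = 1e6 / total_us

for name, value in contributions_us.items():
    print(f"{name:40s} {value:6d}us")
print(f"total {total_us}us, about {rate_hz:.0f}Hz")
```

The total comes to a bit over 12ms per cycle, or a little above 80Hz.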

Next up, working to make this much better.


Revisiting machining the sun gear holder

The very first sun gear holder I machined myself was something of an artistic feat.  Each operation was re-run many times, and as a result the part was largely a one-off.  The final part properties were not really indicative of the final program.  The next step in my learning adventure was to up my Pocket NC game and get to a single reproducible program that would emit a part that I could use, then be able to run it over and over again.

The biggest problem I had in making this happen was pull out problems that manifested intermittently, but persistently.  When machining the part in a single operation, the mill needed to be able to reach all the way to, and slightly past the center of rotation of the B axis of the machine.  With the Z travel of the Pocket NC, that means that you need a stickout of almost 27mm or sometimes even more.  Standard length tools can kind of do that, but they don’t result in much of the shank being in the collet.  All my testing with them resulted in occasional pull out at some point during the hour or two of machining.

I tried getting a Kyocera 2 inch, 2 flute end mill, which has a minimum stickout of about 34mm.  To run it without mind numbing chatter required dialing the feeds and speeds so far back that the part would take twice as long to complete.  That hardly seemed worth it.

Next, I stepped back and decided to try a 2 operation approach using the Sherline 4 jaw chuck.  This has the advantage that the part is always kept well within the Z travel of the mill, so that standard length end mills can be used, but the disadvantage that you have to manually flip the part over and try to keep things registered.  I don’t have great, or really any, metrology that would allow me to measure the resulting concentricity errors, so I was really trying to avoid using this approach.  However, I was kind of at a loss at this point, so I decided to give it a go.

For this configuration, I used a 1.5″ round stock cut to approximately 1″ long.  The first operation used a Datron 3mm end mill to do nearly everything on the back side, with a final chamfer mill pass to break the edges.  The second operation used a Datron 2mm end mill to do everything there, with the chamfer mill once again to do the countersinks and break edges.

The parts that come off here are usable, and about what I expected to achieve without heroics in terms of final accuracy from a Pocket NC.  All the diameters and dimensions are around at most a thousandth off from what I intended.  The walls aren’t as vertical as I would like, but they are serviceable.

Final part
My various experiments, single op in back, multi-op in the front

Granted, I probably won’t be using these parts for much of anything going forward, but it was a great learning experience in making the Pocket NC do what I wanted.  Next in this adventure is probably machining a planet input, which I foresee continuing to use in future iterations of the gearbox.

Improved lighting for Pocket NC

Now that I’m making a lot of videos of machining with my Pocket NC, it was getting annoying setting up lighting for each one.  Thus I rigged up some LED strips in the interior of the enclosure.  Now I can shoot 60fps video any time of the night without having to set up external lighting!  Here’s to hoping a chip doesn’t short it out.

The strip enters through a small drilled hole
The power switch is taped to the side


It lights up!

Even when the case is closed!

Pocket NC 4 jaw chuck workholding

Workholding on the Pocket NC is still, well, a work in progress for me, and it is for many people.  There aren’t a lot of off the shelf solutions.  The machine does come with a mini-vise, which can hold a surprising amount, but it has some limitations.  For one, it isn’t referenced to the axis of rotation of the B axis.  For another, anything held in it can often be far away from the furthest Z travel available, resulting in the need to use extended reach tooling.

Enter, once again, Ed Kramer @ekramer3 on IG, who came up with a solution consisting of strictly off the shelf parts that results in a 4 jaw chuck being placed about 1.75in off the surface of the B axis plate.  This can hold round stock out to 70mm.

In case Instagram forgets itself, the relevant parts are:

A few pictures of my setup:



And here is a short video of it cutting:

Finally, if it is of use to anyone, this is my F360 file containing the fixture.  Be warned, it is hard-coded to my V2-50’s B-table offset as calibrated by PocketNC.

First quadruped jump!

To demonstrate the dynamic capability of the full rotation quadruped, I figured I would start by doing some full machine jump tests to a relatively low height, just to show that it was capable.

Thus, I rigged up an open loop script which squatted a small fraction of the available distance, and then powered up at a relatively small fraction of the available maximum speed.  I don’t have the telemetry yet to extrapolate how high this will be able to go at maximum, but I think it should be a fair amount higher.  For now, I want to do some more instrumentation and walking testing (and have more spares) before I manage to break things by jumping really high.

Improved Pocket NC installation

After having used my PocketNC V2-50 for a while just sitting on top of the air compressor, I decided to try and improve its installation a bit.  For one, when the compressor kicked on or off, it would impart a significant vibration to the whole assembly.  Also, I needed a place to hold stock, tools, and intermediate parts.  Here’s a picture of my new setup.


The table isn’t particularly rigid, but at least it is now decoupled from the compressor.  The wire shelving below satisfies my storage requirements for now.

First walking on the full rotation quad

Last time, I had finished physically assembling all the motors for the updated quadruped with legs that can rotate freely 360 degrees.  After the long summer break, I powered up and configured all the servos.  Then, after setting up the gait engine for the new configuration (for which there are still a few TODOs when the lateral shoulder offset is non-trivial as in this configuration), I was able to achieve some amount of walking.  Here is one of the first videos I took, without much in the way of tuning or work.  The control is a little wobbly still, but so far there are no signs of any mechanical failures as with the older design.


I am away until mid-August

If this space is looking stale, that’s because it is!  I am traveling until the middle of August.  Look forward to more robot and Pocket NC updates then!

Full rotation quadruped build continued

The next step in (re-)building the quadruped with a full rotation leg was getting all the motors ready.  I had to first install reinforcing rings on 6 of the motors:

My epoxying station
4x gearboxes with reinforcing rings installed

Then, I needed to lengthen the power leads on 3 of the motors to serve as the lower leg joint.

Motors with longer power leads

Then I had to assemble all the new legs:

Upper leg joints mounted
Lower leg joints mounted
All three remaining legs built

I mounted them all to the chassis:

All the legs!

And then re-installed the battery stud and “resting” feet:


Next up will be actually powering them and getting it to walk!