Until recently, robots were mostly confined to fixed locations along assembly lines. As a result, they enjoyed relatively stable connectivity, with wired Ethernet connections or robust local Wi-Fi networks at their disposal. Today, however, more and more robots are moving beyond the confines of the factory, working within increasingly expansive footprints and varied, complex environments. This migration has introduced countless new possibilities for robotic applications; however, it has also made the question of connectivity far more challenging.
In our experience, network connectivity is simultaneously the most overlooked and most asked-about topic among our customers: it typically isn't until their robots are in the field and encountering problems that the question of connectivity rears its ugly head. To help mitigate this, we've developed and implemented tests designed to determine the boundary conditions for our product, and to make sure we're handling different connectivity conditions in the best ways possible.
In this two-part article, we’ll be walking you through this process in hopes that it might help you navigate your own connectivity quandaries, and help your business avoid one of the most common disruptions in robot deployment.
The fundamental goal of our testing process is to evaluate product performance under different connectivity conditions. In turn, this enables us to determine the optimal network and data buffering/storage settings to configure for each of those conditions.
This is not a question of whether a specific network environment is viable for operating on the Formant platform. Instead, it is an investigation into the extreme conditions under which the Formant edge software will be run, and the configurations necessary to survive those boundary conditions. In the simplest of terms, we consider our edge software to be adequate when it is:
Once baseline adequacy is established, more complex, granular questions must be considered. For each unique robotic application, our team must thoroughly examine how the product performs under a plethora of different connectivity conditions, especially those that would be considered "degraded." For example: What happens when connectivity is lost? What happens when connectivity is lost for extended periods of time? What happens when "spotty" or weakened connectivity is encountered? How much data ought to be buffered in this specific use case and network environment? (The sketch below illustrates the back-of-the-envelope arithmetic behind that last question.)
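As a rough illustration, the required on-disk buffer can be approximated as the ingest rate multiplied by the worst-case outage duration, with some headroom for retries. All numbers below are illustrative assumptions, not measurements from our tests:

# Hypothetical buffer sizing: ingest rate x worst-case outage x safety margin.
INGEST_BYTES_PER_SEC=$((2 * 1024 * 1024))   # assume ~2 MiB/s of ingested data
OUTAGE_SEC=$((30 * 60))                     # assume a 30-minute worst-case outage
SAFETY_FACTOR=2                             # headroom for retries and overhead
BUFFER_MIB=$((INGEST_BYTES_PER_SEC * OUTAGE_SEC * SAFETY_FACTOR / 1024 / 1024))
echo "Suggested on-disk buffer: ${BUFFER_MIB} MiB"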
To answer these questions, we want to clarify the following underlying uncertainties:
Many of our customers, understandably, also have limited budgets for bandwidth. Hence, another goal of this experiment is to relay these metrics back to the user, so that they are empowered to make informed financial decisions based on that data. The sketch below shows the kind of estimate such metrics enable.
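The figures here are illustrative assumptions, not Formant measurements or pricing:

# Rough monthly cellular data estimate from an average measured upload rate.
AVG_TX_KILOBITS_PER_SEC=256    # assume a 256 kbps average upload rate
COST_PER_GB_USD=8              # assume $8 per GB of cellular data
GB_PER_MONTH=$((AVG_TX_KILOBITS_PER_SEC * 3600 * 24 * 30 / 8 / 1024 / 1024))
echo "~${GB_PER_MONTH} GB/month, ~\$$((GB_PER_MONTH * COST_PER_GB_USD))/month"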
It's important to note that many of these considerations are made necessary by certain qualities inherent to data streaming. Some of the features that necessitate such testing include:
These features, among others, make it highly important to consider and test all of the variables associated with connectivity. Specific considerations, and their position in the hierarchy of significance, will of course depend upon the unique characteristics of each application and network environment.
We measure agent performance while replaying robot data in a loop within a constrained network environment. We use an open source tool, Comcast, to simulate varying levels of packet loss and latency. Our agent is instrumented to collect detailed metrics on its performance. These metrics are described in the following section.
We've gathered a large number of metrics and made them available internally. However, this section will focus on uploader and network metrics, which require more detailed contextualization than something like CPU or memory usage.
Instance Configuration: 8 vCPU, 16 GiB memory
Operating System: Ubuntu 18.04
We used a custom dataset with 3+ image streams and 40+ numeric streams, ingested at a frequency of 5 Hz. This is more data than most of our customers would send.
Tests are performed at three packet loss levels (0%, 1%, and 3%) and, for each packet loss level, four latency levels (20, 100, 180, and 260 ms).
Test script below for reference:
#!/bin/bash
# Sweep packet loss and latency levels; comcast needs root privileges to
# modify traffic-control rules, so run this script as root.
for packet_loss in 0 1 3
do
    for latency in 20 100 180 260
    do
        sleep 60
        echo "latency: $latency packet_loss: $packet_loss"
        echo "test start time: $(date +%s)"
        # Clear any leftover shaping rules and restart the agent for a clean run
        comcast --device=ens5 --stop
        sudo systemctl stop formant-agent
        sudo systemctl start formant-agent
        # Apply the simulated latency and packet loss to HTTPS traffic only
        comcast --device=ens5 --latency=$latency \
            --packet-loss=$packet_loss% --target-port=443
        sleep 600
        echo "test end time: $(date +%s)"
        comcast --device=ens5 --stop
        echo "====================================================================="
    done
done
The KPI (Disk Steady State) indicates whether the agent is able to keep up with the data being ingested (Y) or whether on-disk item storage usage grows (N) for the duration of the test. As a reminder, tests are performed at three packet loss levels (0%, 1%, and 3%) and four latency levels (20, 100, 180, and 260 ms) for each packet loss level.
Latency (ms) | Packet Loss (%) | Disk Steady State
------------ | --------------- | -----------------
20           | 0               | Y
100          | 0               | Y
180          | 0               | N
260          | 0               | N
20           | 1               | Y
100          | 1               | Y
180          | 1               | N
260          | 1               | N
20           | 3               | Y
100          | 3               | N
180          | 3               | N
260          | 3               | N
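For context, here is a minimal sketch of how disk steady state can be checked: sample the size of the agent's on-disk buffer over time and see whether it plateaus or keeps growing. The buffer path below is hypothetical; substitute the directory your agent actually writes to.

# Sample the on-disk buffer size every 30 seconds; a steadily growing
# number means the agent is not keeping up with ingestion (steady state = N).
BUFFER_DIR=/var/lib/formant/buffer   # hypothetical path; adjust for your setup
while true; do
    echo "$(date +%s) buffer_kib=$(du -sk "$BUFFER_DIR" | cut -f1)"
    sleep 30
done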
Even for network-optimized systems, on-disk storage usage will obviously grow as latency and packet loss increase. The key questions are: What are the most significant bottlenecks in the system that can be improved? Can we slow the on-disk growth by reducing unnecessary overhead?
To answer these questions, we analyzed a number of metrics on our internal dashboards, two of which are noted below.
TCP connection establishment, the number of short-lived connections, and the close rate need to be evaluated
Inefficient use of TCP connections can have a negative performance impact at many layers. For example, we end up performing more TLS handshakes, which increases bandwidth and CPU utilization. In healthy network conditions, we should be reusing TCP connections across Cloud Objectstore asset requests, and we want to ensure this is happening. Our internal graphs showed that at 180 ms of latency, we were no longer able to maintain a steady upload state.
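One way to sanity-check connection reuse from the edge device is to watch the set of established connections to port 443; constant turnover of ephemeral ports suggests short-lived connections rather than reuse. A minimal sketch using the standard ss utility:

# Log the count of established TCP connections to port 443 once per second.
# A stable count with stable peer ports suggests connections are being reused.
while true; do
    count=$(ss -tn state established '( dport = :443 )' | tail -n +2 | wc -l)
    echo "$(date +%s) established_443=$count"
    sleep 1
done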
The TX/RX bandwidth utilization ratio should be high
This experiment mostly uploads data, and hence the RX bandwidth should be relatively low. Higher-than-expected RX bandwidth utilization could indicate unnecessary traffic that needs to be analyzed for further optimization. The sketch below shows one way to sample this ratio.
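This is a minimal sketch using standard Linux interface counters; the interface name ens5 matches the test script above.

# Measure TX and RX byte deltas on ens5 over a 60-second window.
IFACE=ens5
rx0=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
tx0=$(cat /sys/class/net/$IFACE/statistics/tx_bytes)
sleep 60
rx1=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
tx1=$(cat /sys/class/net/$IFACE/statistics/tx_bytes)
echo "TX: $(( (tx1 - tx0) / 1024 )) KiB, RX: $(( (rx1 - rx0) / 1024 )) KiB"
# For an upload-heavy workload this ratio should be well above 1.
echo "TX/RX ratio: $(( (tx1 - tx0) / (rx1 - rx0) ))"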
This network-constrained testing environment provides us with rich feedback on exactly how our agent performs under varied network conditions, in the form of dependent and independent variables. We're able to measure the latency to Formant APIs, bandwidth usage, and packet loss. We can observe how the agent buffers messages locally when it has trouble communicating with the outside world, and how it handles redelivery attempts if and when network conditions improve. The dependent variables largely revolve around whether or not the agent successfully transmits data to the cloud.
While testing robots under differing network conditions did expose some opportunities for optimization in our edge software, overall, we were extremely happy with the fault tolerance, durability, and reliability improvements we were able to make. These observations and improvements are explained in detail in Part 2 of this blog.