Testing for streaming robot data in network-constrained environments, part 1

Until recently, robots were mostly confined to fixed locations along assembly lines. As a result, they enjoyed relatively stable connectivity, with wired Ethernet connections or robust local Wi-Fi networks at their disposal. Today, however, more and more robots are moving beyond the confines of the factory, working across increasingly expansive footprints and varied, complex environments. This migration has introduced countless new possibilities for robotic applications, but it has also made the question of connectivity far more challenging.

In our experience, network connectivity is simultaneously the most overlooked and the most asked-about topic among our customers: it typically isn't until after their robots are in the field and encountering problems that the question of connectivity rears its ugly head. To help mitigate this, we've developed and implemented tests designed to determine the boundary conditions for our product and to make sure we're handling different connectivity conditions as well as possible.

In this two-part article, we’ll be walking you through this process in hopes that it might help you navigate your own connectivity quandaries, and help your business avoid one of the most common disruptions in robot deployment.

Test goals and determinants 

The fundamental goal of our testing process is to evaluate product performance under different connectivity conditions. This, in turn, enables us to determine the optimal network and data buffering/storage settings to consider and configure for each set of conditions.

This is not a question of whether a specific network environment is viable for operating on the Formant platform. Instead, it is an investigation into the extreme conditions under which the Formant edge software will run, and the configurations necessary to survive those boundary conditions. In the simplest terms, we consider our edge software adequate when it is:

  • Fault tolerant - Under extreme network conditions, the Formant agent does not crash (e.g., when bandwidth goes up, down, or even to zero)
  • Durable - Does not cause data loss (e.g., data is stored automatically in case of network degradation)
  • Reliable - Continues to transmit data even when optimal conditions aren’t met 
  • Fair in its use of edge resources - To whatever extent possible, data streaming should not interfere with the normal functioning of the robot by exceeding its expected resource consumption.

Once baseline adequacy is established, more complex, granular questions must be considered. For each unique robotic application, our team must thoroughly examine how the product performs under a wide range of connectivity conditions, especially those that would be considered "degraded". For example: What happens when connectivity is lost? What happens when connectivity is lost for extended periods of time? What happens when "spotty" or weakened connectivity is encountered? How much data ought to be buffered for a specific use case and network environment?

To answer these questions, we want to resolve the following underlying uncertainties:

  • Given a dataset, what are the minimum network conditions required to upload in a steady state without data loss?
  • What relationships exist between packet loss, latency, and upload performance?

Many of our customers, understandably, also have a limited budget for bandwidth. A secondary goal of this experiment, then, is to relay these metrics back to users so that they can make informed financial decisions based on the data.

It's important to note that many of these considerations stem from qualities inherent to data streaming on the platform. The features that necessitate such testing include:

  • The automatic streaming of data on a continuous or triggered basis
  • In the case of an unstable network, Formant allows the device to store, or buffer, data until network conditions improve
[Diagram: Formant robot data platform architecture]

These features, along with others, make it important to consider and test the variables associated with connectivity. Which considerations matter most will, of course, depend on the characteristics of each specific application and network environment.

Test framework & metrics of interest

We measure agent performance while replaying robot data in a loop within a constrained network environment. We use an open source tool, Comcast, to simulate varying levels of packet loss and latency. Our agent is instrumented to collect detailed metrics on its performance. These metrics are described in the following section.
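For context, Comcast applies its network shaping on Linux via tc/netem (port targeting is handled with firewall rules), so it's easy to sanity-check that a constraint is actually in effect before a run. Below is a minimal sketch, assuming the same ens5 interface used in the test script further below and a placeholder HTTPS endpoint:

# Apply a constraint (hypothetical values), then verify it took effect
comcast --device=ens5 --latency=100 --packet-loss=1% --target-port=443

# Inspect the netem qdisc Comcast installs via tc
tc qdisc show dev ens5

# Only port 443 traffic is targeted, so check the TCP connect time to an
# HTTPS endpoint (placeholder hostname); it should reflect the added latency
curl -o /dev/null -s -w 'connect: %{time_connect}s\n' https://example.com

# Remove the constraint when done
comcast --device=ens5 --stop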

We’ve gathered many metrics and made them available internally. This section, however, will focus on uploader and network metrics, which require more context than something like CPU or memory usage.

  • Network Metrics
    • Transmit Bandwidth: The amount of bandwidth consumed on the "default route interface" in the egress direction. Care should be taken when using this metric to understand agent-specific bandwidth usage, as it measures traffic from ALL processes on the interface. The EC2 instance we use in our experiments is not running any additional software that would consume a measurable amount of bandwidth, so this metric is a fairly accurate measurement of agent-consumed bandwidth (a simple way to sample these counters is sketched after this list).
    • Receive Bandwidth: Same as Transmit Bandwidth but in the ingress direction.
    • Connection Establishment Rate: The rate at which TCP connections are successfully established to upstream services.
    • Connection Failed Rate: The rate at which TCP connections to upstream services fail to complete a 3-way handshake.
    • Connection Attempt Rate: The rate at which TCP connections to upstream services are attempted.
    • Connection Close Rate: The rate at which TCP connections to upstream services are closed.
  • Uploader Metrics
    • Uploader Buffer Fill Percent: Heatmap of the uploader's in-memory buffer usage on a percentage basis. If this buffer becomes full, the agent persists data to the item store on disk. This measurement shows us how close we are to transitioning into an unsteady upload state.
    • Uploader Item Store Usage: Percentage of on-disk buffer consumption. The on-disk buffer is used when uploads fail, or the in-memory buffer is full. If the Item Store usage is 0% we are in a steady upload state.
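As a rough illustration of how the bandwidth metrics can be derived outside the agent, the sketch below samples the interface byte counters in /proc/net/dev over a fixed interval. This assumes a Linux host and the ens5 interface from our test environment; the agent's own instrumentation is more detailed, but the idea is the same.

IFACE=ens5      # default route interface in our test environment
INTERVAL=10     # seconds between samples

rx1=$(awk -v i="$IFACE" '$1 == i":" {print $2}'  /proc/net/dev)
tx1=$(awk -v i="$IFACE" '$1 == i":" {print $10}' /proc/net/dev)
sleep "$INTERVAL"
rx2=$(awk -v i="$IFACE" '$1 == i":" {print $2}'  /proc/net/dev)
tx2=$(awk -v i="$IFACE" '$1 == i":" {print $10}' /proc/net/dev)

# Byte rates over the interval; note these cover ALL processes on the interface
echo "receive:  $(( (rx2 - rx1) / INTERVAL )) bytes/s"
echo "transmit: $(( (tx2 - tx1) / INTERVAL )) bytes/s"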

Test environment

Instance Configuration: 8 vCPU, 16 GiB memory
Operating System: Ubuntu 18.04

Test data

We used a custom dataset with 3+ image streams and 40+ numeric streams ingested at 5 Hz. This is more data than most of our customers would send.
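For rough intuition about the load this represents: assuming, hypothetically, on the order of 100 KB per compressed image frame, the image streams alone work out to roughly 3 × 5 Hz × 100 KB ≈ 1.5 MB/s (about 12 Mbit/s) of ingest, before counting the numeric streams and protocol overhead.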

Test code

Tests are performed at three packet loss levels (0%, 1%, and 3%). For each packet loss level, the script sweeps latency from 20 ms to 260 ms in 40 ms steps; results are reported below for four of those latency levels (20, 100, 180, and 260 ms).

Test script below for reference:

#!/bin/bash
# Sweep each packet loss level across the latency range, restarting the agent for each run
for packet_loss in 0 1 3
do
  for latency in 20 60 100 140 180 220 260
  do
    sleep 60
    echo "latency: ${latency}ms packet_loss: ${packet_loss}%"
    echo "test start time: $(date +%s)"
    # Clear any leftover network constraint and restart the agent for a clean run
    comcast --device=ens5 --stop
    sudo systemctl stop formant-agent
    sudo systemctl start formant-agent
    # Apply the constraint to traffic destined for port 443
    comcast --device=ens5 --latency=$latency \
      --packet-loss="${packet_loss}%" --target-port=443
    sleep 600
    echo "test end time: $(date +%s)"
    comcast --device=ens5 --stop
    echo "====================================================================="
  done
done

Test results

The KPI (Disk Steady State) indicates whether the agent is able to keep up with the data being ingested (Y) or whether on-disk item store usage grows for the duration of the test (N). As a reminder, results are reported at three packet loss levels (0%, 1%, and 3%) and, for each, at four latency levels (20, 100, 180, and 260 ms). A simple way to check steady state from outside the agent is sketched after the table below.

Latency (ms)    Packet Loss (%)    Disk Steady State
20              0                  Y
100             0                  Y
180             0                  N
260             0                  N
20              1                  Y
100             1                  Y
180             1                  N
260             1                  N
20              3                  Y
100             3                  N
180             3                  N
260             3                  N
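As referenced above, a run can be classified by watching whether the on-disk item store keeps growing during the test window. A minimal sketch, assuming a hypothetical item-store directory (the actual path depends on your agent configuration):

ITEM_STORE=/var/lib/formant/item-store   # hypothetical path; adjust to your setup
SAMPLES=10

prev=$(du -sk "$ITEM_STORE" | awk '{print $1}')
growing=0
for i in $(seq "$SAMPLES")
do
  sleep 60
  cur=$(du -sk "$ITEM_STORE" | awk '{print $1}')
  [ "$cur" -gt "$prev" ] && growing=$((growing + 1))
  prev=$cur
done

# If the store grows for most samples, the agent is not in a steady upload state
echo "samples where on-disk usage grew: $growing / $SAMPLES"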

Even for network-optimized systems, on-disk storage usage will inevitably grow as latency and packet loss increase. The key questions are: What are the most significant bottlenecks in the system that can be improved? Can we slow on-disk growth by reducing unnecessary overhead?

To answer these questions, we analyzed a number of metrics on our internal dashboards, two of which are discussed below.

TCP connection establishment, the number of short-lived connections, and the close rate need to be evaluated

Inefficient use of TCP connections can hurt performance at many layers. For example, we end up performing more TLS handshakes, which increases bandwidth and CPU utilization. Under healthy network conditions, we should be reusing TCP connections across Cloud Objectstore asset requests, and we want to ensure this is happening. As the graph below shows, at 180 ms of latency we are no longer able to maintain a steady upload state.

[Chart: Uploader Asset Buffer Fill Percent (Average)]
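One way to spot connection churn from outside the agent is to sample the kernel's socket tables during a run. The sketch below assumes a Linux host with iproute2's ss tool and that uploads go to remote port 443; a steadily rising TIME-WAIT count alongside few established sockets suggests short-lived connections rather than reuse.

# Sample socket counts toward port 443 once per second for a minute
for i in $(seq 60)
do
  established=$(ss -tn state established '( dport = :443 )' | tail -n +2 | wc -l)
  time_wait=$(ss -tn state time-wait '( dport = :443 )' | tail -n +2 | wc -l)
  echo "$(date +%s) established=$established time_wait=$time_wait"
  sleep 1
done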

The TX/RX bandwidth utilization ratio should be high

This experiment mostly uploads data, so RX bandwidth should be relatively low. Higher-than-expected RX bandwidth utilization could indicate unnecessary traffic that warrants analysis and further optimization.

[Chart: Receive Bandwidth]
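To sanity-check this ratio over a whole test window, the cumulative interface counters can be snapshotted before and after a run (again assuming the ens5 interface; the counters include traffic from all processes on the host):

# Snapshot counters before the test
rx_start=$(awk '$1 == "ens5:" {print $2}'  /proc/net/dev)
tx_start=$(awk '$1 == "ens5:" {print $10}' /proc/net/dev)

# ... run the test ...

# Snapshot again afterwards and report the TX:RX byte ratio
rx_end=$(awk '$1 == "ens5:" {print $2}'  /proc/net/dev)
tx_end=$(awk '$1 == "ens5:" {print $10}' /proc/net/dev)
awk -v tx=$((tx_end - tx_start)) -v rx=$((rx_end - rx_start)) \
  'BEGIN { printf "TX/RX ratio: %.1f\n", tx / (rx ? rx : 1) }'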

We will go over more observations and improvements in Part 2 of this blog.

Conclusion

This network-constrained testing environment provides rich feedback on exactly how our agent performs under varied network conditions, framed in terms of independent and dependent variables. We control the latency to Formant APIs and the level of packet loss, and measure the resulting bandwidth usage. We can observe how the agent locally buffers messages when it has trouble communicating with the outside world, and how it handles redelivery if and when network conditions improve. The dependent variables largely revolve around whether or not the agent successfully transmits data to the cloud.

While testing robots under differing network conditions did expose some opportunities for optimization in our edge software, overall, we were extremely happy with the fault tolerance, durability and reliability improvements we were able to make.

The observations and improvements are explained in detail in part 2 of this blog.
