In Part 1 of this blog series, we explained the goals, determinants, and a summary of the results from the network-constrained tests we ran on streaming robot data. In this post, we walk through the detailed observations and the improvements we made based on the outcome of those tests.
TCP Connection Churn
We instrument all of our HTTP clients using go-conntrack. This library provides us with TCP dialer metrics (connections attempted, failed, established, closed). The graphs below showcasing these metrics are from the 0% packet loss 20ms latency experiment.
TCP connections are being established and closed at a rate of ~40/sec. This is unexpected as we should be re-using TCP connections across cloud object storage requests, especially in this healthy network experiment. An inefficient use of TCP connections can have a negative performance impact at many layers. For example, we end up performing more TLS handshakes which increases bandwidth and CPU utilization. The TLS and TCP handshakes lead to increased upload latency, causing our in-memory buffer to fill at a faster rate. If this buffer overflows we end up writing data to disk.
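One way to check for this kind of churn without extra libraries is Go's standard net/http/httptrace package, which reports whether each request was served on a reused connection. The sketch below is illustrative only and is not how our uploader is instrumented (that uses go-conntrack). The local test server, the countReused helper, and the request count are all hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
	"strings"
)

// countReused sends n sequential PUT requests to a local test server and
// returns how many of them were served on a reused TCP connection.
// Hypothetical helper for illustration; keepAlive toggles connection reuse.
func countReused(n int, keepAlive bool) (int, error) {
	// Stand-in for cloud object storage: drains the upload and replies 200.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.Copy(io.Discard, r.Body)
	}))
	defer srv.Close()

	tr := &http.Transport{DisableKeepAlives: !keepAlive}
	defer tr.CloseIdleConnections()
	client := &http.Client{Transport: tr}

	reused := 0
	trace := &httptrace.ClientTrace{
		// GotConn fires once per request, when a connection has been obtained.
		GotConn: func(info httptrace.GotConnInfo) {
			if info.Reused {
				reused++
			}
		},
	}

	for i := 0; i < n; i++ {
		req, err := http.NewRequest(http.MethodPut, srv.URL, strings.NewReader("payload"))
		if err != nil {
			return reused, err
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
		resp, err := client.Do(req)
		if err != nil {
			return reused, err
		}
		// Drain and close the body so the connection can return to the idle pool.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	return reused, nil
}

func main() {
	for _, keepAlive := range []bool{true, false} {
		reused, err := countReused(5, keepAlive)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("keepAlive=%v: %d of 5 requests reused a connection\n", keepAlive, reused)
	}
}
```

With keep-alives enabled, every request after the first should report a reused connection. A persistently low reuse count on a healthy network points at the kind of churn described above.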
Long Fat Network (LFN) Performance
By increasing the network latency in our experiments we are increasing the bandwidth-delay product. As the amount of unacknowledged data at any given time increases we rely on the TCP receive and congestion windows to be sufficient in size. If these windows are too small, we spend too much time waiting for acknowledgements, unable to utilize all of the available bandwidth.
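To make the bandwidth-delay product concrete, here is a small worked calculation. The 100 Mbit/s link speed is a hypothetical figure chosen for illustration, not a measurement from our experiments, and the sketch assumes the experiment latencies can be treated as round-trip times.

```go
package main

import (
	"fmt"
	"time"
)

// bdpBytes returns the bandwidth-delay product in bytes: the amount of
// unacknowledged data that must be in flight to keep the link full.
func bdpBytes(bitsPerSecond float64, rtt time.Duration) float64 {
	return bitsPerSecond / 8 * rtt.Seconds()
}

func main() {
	const linkBitsPerSecond = 100e6 // hypothetical 100 Mbit/s link

	for _, rtt := range []time.Duration{20 * time.Millisecond, 180 * time.Millisecond} {
		fmt.Printf("RTT %v: about %.0f bytes must be in flight to fill the link\n",
			rtt, bdpBytes(linkBitsPerSecond, rtt))
	}
}
```

Under these assumptions, going from 20ms to 180ms multiplies the required window by nine. If the receive or congestion window is smaller than that figure, the sender stalls waiting for acknowledgements.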
When a TCP connection is established, the TCP congestion window starts out small and grows over time (assuming a stable network). Combining this with what we observed in the previous section (TCP Connection Churn), we can assume that lots of short-lived connections hurt our overall TCP performance: each new connection pays the time it takes to be established and then starts again with a small congestion window.
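A rough way to see the cost of restarting the congestion window is to count the round trips an idealized slow start needs before the window is large enough to fill the link. This is a simplified model, not a description of any particular TCP stack: it assumes the window doubles every round trip, no loss occurs, and the initial window is 10 segments of 1460 bytes (a common default). The target window sizes are hypothetical values for a 100 Mbit/s link.

```go
package main

import "fmt"

// slowStartRounds returns how many round trips an idealized slow start
// needs before the congestion window reaches targetBytes, assuming the
// window doubles every round trip and no loss occurs.
func slowStartRounds(initialWindowBytes, targetBytes int) int {
	if initialWindowBytes <= 0 {
		return 0 // guard against a loop that never terminates
	}
	rounds := 0
	for w := initialWindowBytes; w < targetBytes; w *= 2 {
		rounds++
	}
	return rounds
}

func main() {
	const initialWindow = 10 * 1460 // assumed: 10 segments of 1460 bytes

	// Hypothetical window sizes needed to fill a 100 Mbit/s link.
	targets := []struct{ rttMillis, bytes int }{
		{20, 250_000},
		{180, 2_250_000},
	}
	for _, t := range targets {
		rounds := slowStartRounds(initialWindow, t.bytes)
		fmt.Printf("RTT %d ms: %d round trips (~%d ms) before the window covers %d bytes\n",
			t.rttMillis, rounds, rounds*t.rttMillis, t.bytes)
	}
}
```

A connection that is closed and re-opened pays this ramp-up again, on top of the TCP and TLS handshakes, which is why churn hurts more as latency grows.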
Starting at 180ms of latency we are no longer able to maintain a steady upload state as observed via the Uploader Item Store Usage and Uploader Cloud Object Storage Asset Buffer Fill Percent graphs. These graphs are copied below and are from the 0% packet loss 180ms experiment.
TX/RX Bandwidth Utilization Ratio
The ratio between transmit and receive bandwidth is roughly 3:1. This is unexpected as we are only uploading data. While some return traffic is expected (ACKs, handshakes, etc.), the vast majority of data should be flowing in the transmit direction. The graphs below are from the 0% packet loss 20ms experiment.
Setting a higher value for HTTP Transport MaxIdleConnsPerHost
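// MaxIdleConnsPerHost, if non-zero, controls the maximum idle
// (keep-alive) connections to keep per-host. If zero,
// DefaultMaxIdleConnsPerHost is used.
MaxIdleConnsPerHost int

// DefaultMaxIdleConnsPerHost is the default value of Transport's
// MaxIdleConnsPerHost.
const DefaultMaxIdleConnsPerHost = 2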
Go’s HTTP transport handles connection pooling. It has a knob, MaxIdleConnsPerHost, for the maximum number of idle connections it will keep open to a given host, and the default value for this knob is 2. Our uploader is highly concurrent: if you have 4 cores on your system, you end up with 8 goroutines (green threads) uploading data to cloud object storage. When a goroutine has finished uploading an object, the underlying connection becomes idle. If the number of idle connections to the host exceeds 2, the transport closes the excess ones. This leads to connection churn, added latency, and poor TCP performance. Increasing this value above the number of uploader goroutines allows us to keep connections to cloud object storage open for longer, greatly improving upload performance.
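As a minimal sketch of the change described above, the snippet below clones Go's default transport and raises MaxIdleConnsPerHost above the number of uploader goroutines. The newUploadTransport helper and the 2x headroom are hypothetical choices for illustration, not our exact production settings.

```go
package main

import (
	"fmt"
	"net/http"
	"runtime"
)

// newUploadTransport returns a copy of Go's default transport whose idle
// pool can hold more connections than there are uploader goroutines.
// Hypothetical helper; the 2x headroom is an illustrative choice.
func newUploadTransport(uploaders int) *http.Transport {
	tr := http.DefaultTransport.(*http.Transport).Clone()
	tr.MaxIdleConnsPerHost = uploaders * 2

	// MaxIdleConns caps idle connections across all hosts (0 means no limit),
	// so keep it from undercutting the per-host setting.
	if tr.MaxIdleConns != 0 && tr.MaxIdleConns < tr.MaxIdleConnsPerHost {
		tr.MaxIdleConns = tr.MaxIdleConnsPerHost
	}
	return tr
}

func main() {
	uploaders := 2 * runtime.NumCPU() // mirrors the 4 cores, 8 goroutines example
	tr := newUploadTransport(uploaders)
	client := &http.Client{Transport: tr}
	_ = client // use this client for all uploads so they share one pool

	fmt.Printf("uploaders=%d MaxIdleConnsPerHost=%d MaxIdleConns=%d\n",
		uploaders, tr.MaxIdleConnsPerHost, tr.MaxIdleConns)
}
```

The pool only helps if every uploader goroutine shares the same client (or at least the same transport), and if response bodies are fully read and closed so connections can return to the pool.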
Reduced TCP Connection Churn
Reduced CPU utilization
Reducing connection churn greatly reduced the TLS handshake rate, which lowered CPU utilization.
Reduced RX Bandwidth Utilization
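TLS handshakes have large payloads in the receive direction (the server sends its certificate chain), so fewer handshakes means less RX traffic.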
Improved resilience to latency & packet loss
Having a test environment to simulate different network conditions has been an essential tool for us to measure and improve the reliability of our robotic software solutions. What we at Formant can’t stress enough, however, is that this process should take place long before you’ve actually deployed machines in the field to serve customers. Just like any other form of quality assurance, network constraint testing serves the essential role of identifying problems (and either resolving or preparing for them), before customers do.
Like most testing, this is not a one-and-done exercise. We plan to extend it to real physical devices with cellular connectivity and run them through real-world network conditions to see whether the virtual tests we ran missed any real-world edge cases. We will then have a mix of virtual and physical devices all running in a simulated network environment over long periods of time, adding to the core of our automated testing framework.