Observability, in the context of distributed systems, is what allows us to measure, introspect, and understand the behavior of a system at runtime.
Observability can be used by engineers to diagnose failures and iterate on the system’s design, or by the system itself, to support, for example, automated recovery.
The world of web services has benefitted from decades of investment in observability tooling. Hundreds of open-source projects and SaaS products provide monitoring, logging, and distributed tracing to help us understand the behavior of increasingly complex web applications.
But in the robotics domain, the breadth and depth of our observability tooling lag far behind. For robotics applications to achieve the same high availability we’ve come to expect from web services, we need better observability tools built for robotics.
Observability in an unbounded system
One challenge for observability in robotics is that the physical world is a continuous, high-dimensional domain, and failure modes often span the hardware/software boundary.
Consider some of the scenarios we’ve heard about from customers:
- Electromechanical failure: A servo motor overheats due to a software bug in an embedded controller.
- Localization failure: A mobile robot fails to localize due to changes in an office floor plan.
- Computer vision failure: An object classifier fails due to unforeseen lighting conditions.
- Sensor calibration failure: A pose estimator fails due to changes in camera extrinsics after a collision.
- Unmodeled human interaction: A sidewalk delivery robot is blocked by a group of protestors.
- Unmodeled environmental interaction: An inventory scanning robot is unable to navigate past a spill in the juice aisle.
In all of these scenarios, observability tools designed for web applications fall short.
Text logs alone are not enough, nor are scalar metrics. Debugging requires context about the physical environment, the deployed hardware, and multiple distributed software components. Finding the root cause of an issue often requires high-resolution data rather than summaries and histograms. To compound the problem, the network access needed to centralize this data may be scarce.
Failure modes in robotics are difficult to predict, difficult to observe, and, as a result, expensive and time-consuming to recover from. They require a new class of observability tools.
A different mode of observability
Debugging issues in a robotics application requires a different type of observability. Text logs and metrics are a good start. But context also comes from data types with a visual or geometric interpretation: maps, point clouds, geolocation, poses, video, odometry.
Debugging in robotics is often highly visual.
And while sensor data tells part of the story, the associated metadata is equally important. An observability solution for robotics must support multiple dimensions of high-cardinality metadata (software version, hardware version, experiment ID, customer ID, site ID, etc.) and provide first-class support for one of the most important dimensions of all: time.
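To make the idea of metadata dimensions plus time a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Record` schema, the `query` helper, and the metadata keys are hypothetical names, not the data model of any particular product. It simply shows each observation carrying a timestamp and a set of metadata tags, and a lookup that filters on both.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass(frozen=True)
class Record:
    """One observation from a robot (hypothetical schema for illustration)."""
    timestamp_ns: int   # capture time in nanoseconds
    topic: str          # e.g. "pose" or "camera/front"
    payload: Any        # the sensor message itself
    metadata: Dict[str, str] = field(default_factory=dict)  # e.g. site_id, software_version


def query(records: Iterable[Record], start_ns: int, end_ns: int, **dims: str) -> List[Record]:
    """Return records with start_ns <= timestamp < end_ns whose metadata
    matches every given dimension, sorted by timestamp."""
    hits = [
        r for r in records
        if start_ns <= r.timestamp_ns < end_ns
        and all(r.metadata.get(key) == value for key, value in dims.items())
    ]
    return sorted(hits, key=lambda r: r.timestamp_ns)


if __name__ == "__main__":
    records = [
        Record(3_000, "pose", (1.0, 2.0), {"site_id": "warehouse-a", "software_version": "1.4.2"}),
        Record(1_000, "pose", (0.0, 0.0), {"site_id": "warehouse-a", "software_version": "1.4.2"}),
        Record(2_000, "pose", (0.5, 1.0), {"site_id": "warehouse-b", "software_version": "1.4.1"}),
    ]
    for r in query(records, 0, 5_000, site_id="warehouse-a"):
        print(r.timestamp_ns, r.topic, r.payload)
```

A real system would index these dimensions rather than scan a list, but the shape of the question stays the same: which data, from which robots and versions, over which window of time.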
In addition to these requirements, we can also add:
- Integrations with common robotics protocols and platforms
- Robustness in the presence of unreliable network connectivity
- Visualization and proper semantic interpretation of robotics data types
- Ingestion and playback of historical data
- Anomaly detection
- API extensibility
- Compatibility with the broader observability ecosystem
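As one example from the list above, robustness to unreliable connectivity is often approached with a store-and-forward buffer on the robot. The Python sketch below is only an illustration under stated assumptions: `StoreAndForward` and the `send` callback are hypothetical names, the buffer lives in memory, and the oldest data is dropped when the buffer is full. A production system would likely persist to disk and choose its own drop policy.

```python
from collections import deque
from typing import Callable, Deque, Generic, TypeVar

T = TypeVar("T")


class StoreAndForward(Generic[T]):
    """Buffers items while the network is down and forwards them in order
    once it is back (hypothetical helper for illustration)."""

    def __init__(self, send: Callable[[T], bool], capacity: int = 1000) -> None:
        # `send` is an assumed callback: it returns True if the item was
        # delivered and False if the link is currently unavailable.
        self._send = send
        # A bounded deque silently drops the oldest item when it is full.
        self._pending: Deque[T] = deque(maxlen=capacity)

    def publish(self, item: T) -> None:
        """Queue an item, then try to deliver everything that is pending."""
        self._pending.append(item)
        self.flush()

    def flush(self) -> int:
        """Deliver pending items oldest-first; stop at the first failure.
        Returns the number of items delivered."""
        sent = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)
```

Calling `flush()` periodically, or whenever connectivity returns, drains the backlog in its original order, so timestamps on the receiving side still tell a coherent story.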
A brighter, more transparent future
Observability in robotics is nascent, but developing quickly as the industry expands into new applications. If you are having trouble debugging your robots, giving context to your sensor data, or retracing the steps that led to a failure, we’d love to have a conversation.