Beyond autonomy: ROS in solution architecture

Introduction

All right, good morning everyone. My name is Ian Sherman, and I am the head of software at a company called Formant. If you haven’t heard of us, please take a minute to come say hi. There are a few of us here. Anyway, I’m really glad to be here, because it was a hell of a trip across the Pacific, and I wanted to start by asking: why did so many of us come from so far, and why are we here? I think for some of us, it’s to learn ROS. Maybe we’re new to it. For some of us, it’s to get inspired. I feel that way. Maybe it’s to stay in the loop with things going on; I’ve learned about so many great projects and companies over the past few days. Or to meet new people. But most universally, I think it’s because we care about ROS. For some of us, that care springs from professional self-interest, like a company that depends on it. But for many in the room, there’s a personal dimension too. I know I feel that way, but there are folks here who, relative to me, have poured way more years of their lives into ROS, which is something really extraordinary, and I think it’s kind of tangible. I think this is a really invested community. So as a group of people that care about ROS, we have this opportunity once a year to get together, and in addition to celebrating, we get to kinda ask tough questions and advocate for the things we think will make ROS succeed, which is how I’d like to use my time today. It’s gonna be maybe a little bit unorthodox, but hopefully interesting. My hope is simply to share some thoughts, as just a humble ROS practitioner, about some facets of ROS that I increasingly think are key to its long-term success, and that I still think are generally under-discussed, although there has been some evidence to the contrary this week.

So, my presentation is in three sections. First, a personal story, so you know kinda where I’m coming from. That’ll be illustrated by a favorite book of mine. Then, some thoughts that’ll be mostly text, and sort of thinking, and a couple of experiments that will be hands-on, nerdy.

Personal history

So, first some personal history. I was introduced to ROS in 2011 as a software engineer at a company called Bot and Dolly. We used industrial robots for motion control and visual effects in film, in advertising, and in entertainment. And for me, ROS was an incredibly exciting tool to discover. It opened up the physical world, as a medium that could be controlled by software, in a way that Arduinos and microcontrollers never did. And when I discovered ROS, I felt like I had walked into something really grand, like a cathedral. I felt like anything was possible.

Bot and Dolly

So at Bot and Dolly we ran Orocos, and KDL, and MoveIt, and camera calibration packages. We built bridges between ROS and all the software tools that we loved, many of them outside of the robotics domain. As an example of how these things came together, we had a visual effects supervisor ask if we could shoot repeatable, hand-held camera moves. Within a week or so, we had found ROS drivers for a face based motion capture system, and adapted a hand-eye calibration routine from the PR2, to have an expert camera person teach repeatable camera motion to a robot. Once you have that capability, visual effects supervisors can do all sorts of amazing things with it. So it was quite messy, in more ways than one. This was a table that may look familiar to some of you in the early stages of your efforts. At the time we were running ROS Diamondback on Ubuntu 10.04 with an RT-Preempt patched kernel. When we had to commission a new control machine, we would clone the hard drive and, I’m ashamed to admit, write the version number on it with a Sharpie and use that as the base for our next machine. This was our CI process: that’s me at a computer typing “ssh, git pull, rosmake.” But none of those things were ROS problems. ROS was letting me do amazing things that I didn’t think were possible, even if there were some gaps in the floorboards.

My first ROSCon (in 2012)

So I attended the first ROSCon in 2012, met this guy, in that amazing ROS dress shirt, which I think should be resurrected. And I loved it so much that I bargained with my boss to send me to ROSCon again in 2013. I’m really glad I did, because the organizers really outdid themselves with the swag that year. This is my favorite piece of conference swag I’ve ever, ever gotten. And then we got some big news.

Acquisition by Google

Late in 2013 our company was acquired by, duh-duh-duh, Google, as part of a robotics initiative that most of you are probably familiar with. So, inside Google, or Alphabet technically, we were told to think really big. We had the resources of the richest company in the world at our disposal. And we had some big egos there too. But I was quickly disabused by my colleagues of my opinion that ROS was something great, and I was told repeatedly that it wasn’t the cathedral that I thought it was; it was more like a barn. A tool built by researchers, for researchers. After all, what was it? Just IPC and a collection of sort of half-maintained algorithms. Never mind that years into this initiative, new engineers would join and ask “Why aren’t we using ROS?” and “Why can’t I do this extremely basic thing that I’ve been able to do in ROS forever?” But it was exciting. That’s kind of what it felt like. It was the kind of place where no technical challenge was too big. There was no shortage of talent, and product would, you know, kind of figure itself out eventually. But during my time at Google, ROS receded into the distance, and I sort of lost sight of that cathedral, and saw it as a humble structure outside of the castle gates. As a side note, it’s only with some distance that I think about what a missed opportunity it was to sort of co-invest and accelerate ROS development.

Experience with ROS after leaving Google

It did recede into the distance for me for a few years, but when I emerged from the lovely gated community in 2017, I was reminded of a few things. One, I was thrilled to get to play in the public, open source ecosystem again. I felt all-powerful. There were a lot of technologies, especially in the cloud world, where the speed of development in the time that I was at Google felt, quite honestly, startling. And I had forgotten what a delight it is to be untethered from the requirements of a hundred-million-line code base, or whatever it is now inside Google. And I found that in robotics, ROS was still at the center of the software ecosystem, just as I remembered. But I had gotten a taste of some things at Google that ROS didn’t have, some of which have come up this week and some of which haven’t: systems to manage software packaging, deployment, virtualization, and configuration management; horizontally scalable build, test, and simulation infrastructure; tools for batch and stream processing of huge amounts of sensor data; systems for keeping track of fleets of hardware; patterns and practices for API evolution and versioning; et cetera, et cetera, et cetera. All of which, it turns out, are very useful when you’re building robotics applications. And simultaneously, as I was looking for my next project, I was talking to fellow engineers and founders in robotics, many of whom are in the room, and I noticed that things felt a little different than when I had gone into the castle. Many folks were now bringing their second or third robotics product to market, and they had some huge convictions. They weren’t so interested in founding a robotics company because they knew how to build robots. Many were asking instead, “What are people willing to pay for?”, whether it’s done by a robot or not.
And I noticed less interest in building custom hardware, and more interest in picking problems where robot capabilities existed, or nearly existed, but hadn’t yet been applied to a domain. I noticed more focus on reliability, scaling, SLAs, robot operations, and all those layers wrapping autonomous capability that are required to run a robot business.

Formant begins

And that’s why what I chose to do after I left Google was to join and help start a company called Formant that’s focused on giving robotics companies the infrastructure they need to run, scale, and operate real-world fleets. It turns out those tools are more common than you might think, and harder to build than you might think. And we find that most people we talk to have their hands full with the 15 other disciplines required to build a successful robotics product. Through that work, I started hiring and working with people who had been deeply embedded in that fast-moving cloud technology world over the last five to ten years. Some of them are sitting down here. And I was just really impressed with the tools and the thinking that they brought to bear on these problems that I was hearing roboticists start to articulate and complain about. There were really just so many tools that could be applied to this field that the roboticists were not thinking about. And I would say this is the point at which I started thinking about ROS not as a cathedral or a barn, but as an amazing tool that’s really good at developing autonomy, especially single-agent autonomy, and also as a component that, in any commercially viable application, must be embedded in a larger solution architecture, which is what this talk is about.

Why it matters to consider ROS as part of a larger solution architecture

Okay, so part two is where I get to share some thoughts on kinda what that means to me, and why it matters to consider ROS as part of a larger solution architecture. I will share some opinions, but most of them are probably wrong. I want to kind of hear where I’m wrong, and just encourage us to think about it. More words, less pictures. Okay, so ROS is undeniably great for building single-agent autonomy, and ROS 2 has ambitions far beyond that. But if we think about this problem for a second, this is the domain of sensor drivers, sensor integration, perception pipelines, control loops in the, say, 1 to 100 hertz range: software components that are trafficking in robotics primitives. But is this enough for a robot application? Well, depending on the application, there might be one or more layers of coordination or federation across multiple agents, and we’re starting to see some talks about how ROS 2 might enable that. I think it’s sort of unknown territory for us. It looks like it should be possible on paper, but I’m really curious to see how that plays out. In practice, we hear from companies like Fetch, where coordination is happening outside of the ROS layer. Then we have this category of concerns that we can call business logic, or application logic.

So if we think back to Morgan’s example from the hospital yesterday, I’m not even thinking about collision avoidance and shared navigation, shared maps. I’m thinking about the layer of a robot product that tells me which patient in the hospital just ordered the coffee, whether they’re even allowed to order the coffee, and which credit card’s gonna get charged for that coffee, and that somehow interacts with our autonomy to achieve that business goal. So, whether or not this bleeds into the autonomy stack, I think, depends on the application. I was thinking about, well, user interface. Is this business logic? Should user interface be a capability of the ROS ecosystem? At first I think, no. It’s a great candidate for sort of clean separation from the autonomy stack, and there are much more mature technologies and ecosystems out there to help us with that. But what if the user interface includes an e-stop, right? So it can quickly get very fuzzy depending on the application. And then we come to all this other stuff too. And I think you’re seeing more participants at ROSCon this year who are focused on supporting companies in solving all of these problems.

What happens when we start thinking about ROS as a subsystem

So my question is, which of these concerns do you at your companies plan to implement in ROS? And if the answer is not all of them, which I suspect it is, then your ROS system is gonna be participating in some sort of solution architecture. So here’s a quote I saw recently on the internet. (“Always consider your design a subsystem.” – Jade Bloom) So what happens if we start thinking and talking about ROS as a subsystem? Here are some questions that come to mind for me. How might ROS be observed by external systems? How might it be driven, or configured, in code by external systems? How might ROS artifacts be consumed by external systems? How might ROS describe its interface to them? How is a ROS application packaged? And don’t get me wrong, it’s not that I don’t see us grappling with this at all. I think back, for example, to Andre from VxWorks’s presentation yesterday, where it’s pretty clear that the work he’s doing is possible because of the design decisions in ROS 2 to change the way they were packaging ROS by limiting external dependencies.

So it’s not that people aren’t thinking about these, I just wanna think about them more. I wanna show you a couple potential opportunities. So another question to ask is, okay, who at your company is building the rest of that necessary solution architecture? Are they people that know ROS? What kind of background are they coming from? What should they need to know about ROS to do their job well? Who is this person on your team when you think of them? And how do they react to the current interfaces that ROS exposes? Do those interfaces seem rational to them? Do they feel productive when they are on your engineering team, working with you as a ROS expert?

Here’s another one. What patterns exist for robotics application solutions architecture? So I’m not talking about ROS best practices, but at a higher level, what do we do when ROS is just part of the solution? Where would you go to learn these today? I certainly wish there was a resource, or a bigger catalog, more conversation about how we’re solving all of these problems that surround the local autonomy problem. I’m quite excited about the robot operations working group, which Julian announced in his talk yesterday. I think that is one piece of the puzzle, but not all of it.

The intended responsibilities of ROS in a solution architecture

Here’s another question. I’ve been sort of dancing around it. In a solution architecture, what do we want the responsibilities of ROS to be? Is ROS a robot operating system, where we’re talking about kind of robot in the singular, or is it a robot application operating system? We’re seeing people use ROS to tackle applications we don’t even think of as robotics applications. Is this wise? I have my own hunch about this, and again I’m hoping to talk to people about this, which is that I think the ROS layer in an application architecture should end as close to the sensors and actuators as the application allows. And I think the reason for that is that we’ll move faster, and cheaper, and more reliably by leaning on the developer communities outside of robotics that are bigger and better resourced to solve any problems that we don’t need ROS to solve. So if it requires sensor fusion and zero-copy image transport, by all means let’s do it in ROS. But if it’s anywhere above that in the stack, let’s do it with tools that have a trillion dollars of market capitalization behind them instead.

So, you know, we love our tools. It’s the thing that made me miss ROS, and the community, when I was not using it for a few years. But let’s not push them into areas where they don’t belong just because we know the tooling, or because the tooling’s incompatible with other technology ecosystems and so it’s easier to just kinda push it further and further out. We have so much opportunity to leverage the thinking of our better-resourced neighbors in backend distributed systems, IoT, and other related domains, especially if we can start thinking about ROS as just one component in a more holistic system.

Prototypes to demonstrate how ROS might be observed by external systems

Okay, so this is part three. I thought I would share a couple examples of little prototypes I built to try to illustrate what I’m getting at here. And I’m focused in these two prototypes really on one of the questions, which is how might ROS be observed by external systems? So if we’re thinking of it as a component. So the toy example I’m gonna use is just a toy warehousing and logistics application. Let’s assume we have four mobile manipulators in a warehouse picking items from an order manifest and transporting them to some sort of packaging station where they will be put into a box to get shipped out. And so our application depends on these robots. I think the application is…

The business owner doesn’t think of the robot as the application; the business owner thinks of eCommerce as the application. And so that’s more than just the robots. It includes cloud services that accept an order and identify the right warehouse to ship from, and a warehouse management system that is queuing and prioritizing orders for fulfillment, perhaps, and interfacing with the robots. So let’s assume we have ROS running on the robot, and we have HTTP running above the robot. And we want to ask the question, why is it taking so long to fulfill an order? We may be notified by some monitoring if we’re a pretty mature organization, or maybe someone’s just shouting at us in the warehouse. And the technical question is, how do we reason about requests that are flowing across the boundary between ROS and non-ROS in this application, where ROS is just one component? One technology that has evolved to meet this need in the world of backend distributed systems is distributed tracing. If you’re not familiar, the idea is that we can trace the lifetime of a request in a heterogeneous distributed system by propagating context along with the request every time it crosses a boundary. And the boundary could be a process boundary, a network boundary, could be a thread or a function call, or a nodelet, a green thread, whatever you want it to be. That context is propagated, effectively, by passing along the ID of the caller. Then step two is that we emit a span when a unit of work is done during the request, which means in practice that we fire off a UDP packet that records a start and end time for that unit of work, and could possibly include additional annotations about what happened during that unit of work.

And then step three is that we assemble these spans into a trace tree, or a DAG, and then we can query the tree to understand the timing of the request, so how long the request spent in each part of the subsystem, but also the structure of where the request was forwarded, and what services it spent time in. And again, services here could mean ROS nodes, that could mean functions. And this works across both sort of RPC paradigms, and pub sub paradigms. And in practice there’s a lot of great tooling out there that does this for you.
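The three steps above can be sketched in a few lines of Python. This is a minimal illustration only, not the patch shown in the talk: the `Span` type, the helper names, the in-memory `COLLECTED` list (standing in for the UDP packet sent to a collector), and the service names are all assumptions made for the example.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # the caller's span_id, None at the root
    name: str
    start: float
    end: float = 0.0
    tags: Dict[str, str] = field(default_factory=dict)


# Stand-in for a trace collector; a real tracer would send spans over the network.
COLLECTED: List[Span] = []


def start_span(name: str, context: Optional[Dict[str, str]] = None) -> Span:
    """Step 1: continue the caller's trace if a context was propagated."""
    trace_id = context["trace_id"] if context else uuid.uuid4().hex
    parent_id = context["span_id"] if context else None
    return Span(trace_id, uuid.uuid4().hex[:16], parent_id, name, time.time())


def context_of(span: Span) -> Dict[str, str]:
    """The context that travels with the request across a boundary."""
    return {"trace_id": span.trace_id, "span_id": span.span_id}


def finish_span(span: Span) -> None:
    """Step 2: record the end time and emit the span."""
    span.end = time.time()
    COLLECTED.append(span)


def build_tree(spans: List[Span]) -> Dict[Optional[str], List[Span]]:
    """Step 3: group spans by parent so the trace can be walked as a tree."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    return children


def render(children, parent_id=None, depth=0) -> List[str]:
    """Return the tree as indented lines, ordered by start time."""
    lines = []
    for span in sorted(children.get(parent_id, []), key=lambda s: s.start):
        lines.append("  " * depth + span.name)
        lines.extend(render(children, span.span_id, depth + 1))
    return lines


# Hypothetical request path: a non-ROS order service calling into a ROS node.
def pick_item(context):
    span = start_span("robot_1/pick_item", context)
    finish_span(span)


def fetch_item(context):
    span = start_span("robot_1/fetch_item", context)
    pick_item(context_of(span))
    finish_span(span)


def fulfill_order() -> str:
    span = start_span("warehouse/fulfill_order")
    fetch_item(context_of(span))
    finish_span(span)
    return span.trace_id


if __name__ == "__main__":
    trace_id = fulfill_order()
    spans = [s for s in COLLECTED if s.trace_id == trace_id]
    print("\n".join(render(build_tree(spans))))
```

Note that the only thing each callee needs from its caller is the small context dictionary, which is why the same idea works whether the boundary is an HTTP request, a ROS service call, or a plain function call.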

Demo: distributed tracing

So in this demo, I patched rospy to propagate distributed tracing headers using an emerging convention around this called OpenTracing. So I’m gonna start my toy warehouse application, and I’m gonna visit Jaeger, which is an open source trace collector and visualizer, and we’ll see kind of what we can see here. Okay, so we’ll look at some historical traces that I ingested this morning, or rather that were automatically ingested from the warehouse demo by virtue of the instrumentation. For those of you coming from the backend microservices world, this will probably look very familiar, but for those of you that aren’t, I’ll try to orient you. Basically, we can filter down the traces that we have based on what service we care about, and we can look back for traces that include that service. And if I click into a trace, this one came in two hours ago, I get this really nice timing breakdown of how long this request to fulfill an order spent in every part of the stack. That includes parts of the stack that are outside ROS: all the databases you’ve heard of come with tracing instrumentation, all of the programming languages you’ve heard of come with tracing instrumentation that’ll let you annotate a function as a tracing boundary, and similarly for other RPC frameworks and things like that. But now we also get visibility into what’s happening in ROS.

So, we can see here, I’m just gonna walk over here. We can see here that we called a service “fetch item” on this robot one node. It took eight seconds to move to the goal, and then 26 seconds to pick the item. And of that 26 seconds, it spent, you know, four seconds in motion planning, eight seconds in trajectory execution, a few milliseconds closing its gripper. And so we can start to reason about, okay, where would I get the biggest wins by doing some work in parallel? And kind of, where is this request flowing? If I click into something, I can actually see an example of some additional information that might’ve been attached to a span. So here’s an example of the order that actually came in to the warehouse manager, and we’ll see that there are two item IDs attached to this order. And in the case that there are any errors, like this one down here, we see that spans are associated with the error message; in this case, perception couldn’t locate the item. What excites me about this is that we’ve traced this request, one, in a way that I haven’t seen in the ROS ecosystem to date, not to say that no one has done it. But more importantly, in a way that allows us to trace requests across that non-ROS to ROS boundary. So this is the world of distributed tracing.
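One way to reason about a breakdown like this is to compute each span’s self time, meaning its duration minus the time accounted for by its children. The sketch below is an illustration, not part of the demo. The span names are hypothetical, the durations are the ones quoted above, the gripper close is quoted only as “a few milliseconds” so 0.005 s is a placeholder, and the 34-second total for the service call is assumed to be the sum of its two children. It also assumes children run one after another.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_s: float


def self_times(spans: List[Span]) -> Dict[str, float]:
    """Duration of each span minus the summed duration of its direct children.

    Assumes children execute sequentially; overlapping children would make
    the result an underestimate.
    """
    child_total: Dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_s
            )
    return {
        span.name: span.duration_s - child_total.get(span.span_id, 0.0)
        for span in spans
    }


# Durations from the trace described above, with the assumptions noted in the text.
SPANS = [
    Span("a", None, "fetch_item", 34.0),          # assumed: 8 s + 26 s
    Span("b", "a", "move_to_goal", 8.0),
    Span("c", "a", "pick_item", 26.0),
    Span("d", "c", "motion_planning", 4.0),
    Span("e", "c", "trajectory_execution", 8.0),
    Span("f", "c", "close_gripper", 0.005),       # placeholder for "a few milliseconds"
]

if __name__ == "__main__":
    for name, seconds in sorted(self_times(SPANS).items(), key=lambda kv: -kv[1]):
        print(f"{name:22s} {seconds:7.3f} s")
```

Under these assumptions, roughly 14 of the 26 seconds in the pick are not covered by any child span, which is exactly the kind of gap a trace view makes visible.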

Are the components of my application healthy and online?

Okay, so the other experiment that I want to look at is a pretty simple question. It’s just “are the components of my application healthy and online?” And in the ROS ecosystem, the tools that I’m aware of to answer that question are tools like ROS diagnostics and Robot Monitor. These definitely serve an important role, and what I’m gonna show you is not intended to replace them, but rather to complement them. But there are, you know, a lot of systems in the world that need to do this job of pulling together hierarchical health information from very heterogeneous distributed systems. And a pattern that has emerged as a powerful one in the microservices world is to expose some health of a component at an HTTP endpoint in a standard format, and then scrape it. So we’re gonna look at an example here, again of a very small patch to rospy that just does some instrumentation of the middleware. I think this is pretty much the entirety of the code in the patch. If we provide our ROS node a metrics port environment variable, then it’ll start a lightweight HTTP server on that port where it will report health metrics. And then we can use, again, Python libraries to collect metrics and attach dimensions to them, like which topic maybe we’re writing messages on, or what the type of the message is. And then we increment or change these metrics while we’re passing data around the middleware.
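The pattern described here can be sketched with nothing but the Python standard library. This is not the rospy patch from the talk: the `METRICS_PORT` variable name, the `ros_messages_sent_total` metric name, and the `record_publish` hook are assumptions for illustration, and the output follows the Prometheus text exposition format as one example of a “standard format”. A real patch would more likely use an existing metrics client library than hand-roll the server.

```python
import os
import threading
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

_lock = threading.Lock()
_messages_sent = Counter()  # keyed by (topic, message type)


def record_publish(topic: str, msg_type: str, count: int = 1) -> None:
    """Hypothetical hook the middleware would call each time it publishes."""
    with _lock:
        _messages_sent[(topic, msg_type)] += count


def render_metrics() -> str:
    """Render the counters in Prometheus text format, one line per label set."""
    lines = ["# TYPE ros_messages_sent_total counter"]
    with _lock:
        for (topic, msg_type), value in sorted(_messages_sent.items()):
            lines.append(
                f'ros_messages_sent_total{{topic="{topic}",type="{msg_type}"}} {value}'
            )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep scrapes out of the node's own log output


def maybe_start_metrics_server(host: str = "0.0.0.0"):
    """Start the endpoint only if METRICS_PORT (an assumed name) is set."""
    port = os.environ.get("METRICS_PORT")
    if port is None:
        return None
    server = HTTPServer((host, int(port)), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    maybe_start_metrics_server()
    record_publish("/odom", "nav_msgs/Odometry")
    print(render_metrics(), end="")
```

Making the server opt-in through an environment variable means an uninstrumented deployment pays nothing, and the labels on each counter are what later let a dashboard slice the data by topic or message type.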

Okay, so we’re gonna look at what happens when you expose metrics in this way, which is sort of why they adopted it in the cloud world. We scrape them, again with great open source tooling, and we have a super mature and powerful set of tools to start visualizing this data. So this is a tool called Grafana that works extremely well out of the box with Prometheus. And we start to see metrics about some things that may or may not be of interest to us in the ROS middleware, but things like how many service calls are flowing into and out of each node, how many messages might be flowing in, how many bytes were sent by topic or by node, how many service executions there were, and the average execution time of those. It’s hard to sort of explain without letting you get hands-on with it, but the user experience here is really unrivaled. It’s an extremely powerful tool, and this is the result of me, as a pretty inexperienced Grafana user, just writing the queries that I could figure out how to write. But if you want to see what a real dashboard looks like, we’ll look at the metrics that were coming from our host computer at the same time as ROS was running. And again, these are libraries that have been very heavily optimized to run on cloud VMs, where it’s in the interest of the developers to save every CPU cycle they can. So the overhead is less than you might think, but I basically get a ton of instrumentation out of the box, more than I would ever need, about network traffic, broken down by physical interface, by port, by process. CPU and memory information, disk space information, there’s a lot here. And I wanted to show one more thing in the ROS dashboard, which is that I have this ability to expose my ROS nodes as dimensions, which means that I can look at data just from those dimensions.

Conclusion

So you kinda get your hands a little bit dirty with this stuff, and you don’t want to give it up. And I think that if we start to consider as a community how we might play nice with this ecosystem, we might save ourselves a lot of work, and we might be better set up to deliver robotics applications in sort of the holistic sense of the word. Okay so I think if you take anything away from this talk, besides the pretty pictures, my hope is that we can talk more about the solutions architectures required to address all the problems of a real world robotics product, and I strongly think it extends beyond the boundaries of ROS. I think that we need to evolve ROS to behave as a component, rather than a monolith, and I think that we should borrow rather than build the tools for building highly available, scalable applications on top of autonomous capability.

And just to stretch my metaphor to its breaking point, and end with one more nice picture, I’m not thinking of ROS as a cathedral. I think ideally it’s more of a business in a thriving commercial district, and we’re selling one high-quality part that can be combined with others. A couple of acknowledgements: to my teammates, who have been great thought partners in this domain, and also to the author of this beautiful book that I grew up with.
