Software at Scale 15 - Ben Sigelman: CEO, Lightstep
Ben Sigelman is the CEO and Co-Founder of Lightstep, a DevOps observability platform. He was the co-creator of Dapper - Google’s distributed tracing system and Monarch - an in-memory time-series database for metrics. Finally, he’s also the co-creator of the OpenTelemetry and OpenTracing standards.
We spent this episode discussing Dapper and Monarch - their design, rollout, and lessons learned in practice.
[Intro] [00:00]: Welcome to Software At Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host Utsav Shah and thank you for listening.
Utsav Shah: Hey Ben, welcome to another episode of the Software At Scale Podcast. Could you tell our listeners about your story? There's so much in your background that is interesting to me, right from starting off at Google to creating LightStep.
Sure. Thanks for having me, I'm excited to be here. I don't know whether my background is interesting or not, to me it's kind of boring, but yeah. I graduated from college right in the thick of the dot-com bust, in the sort of 2003 era, and I was very fortunate to get an offer to work at Google at the time. And when I went over there, they actually put me on some stuff in the ad system that was incredibly boring, to be honest with you. Of course, ads make a lot of money at Google, but this particular part of the ad system wasn't making any money, so it was kind of boring, not very lucrative for Google, and I didn't like it very much. And the way I got into Dapper and distributed tracing was actually incredibly arbitrary, but it's a funny story. They had this one-time event, I don't know what they called it, but it was this program where they would take everyone who opted in and look at a bunch of different dimensions, like how long you've been at a school, what office you worked in, what languages you worked in, where you were in the org chart, that kind of stuff. I think they had 10 dimensions. And then they found the person who had also opted into this program who was literally the furthest from you in this 10-dimensional space, and then set up a half-hour meeting with no agenda, and that was it. So I was working on this stuff in ads that was, as I was saying, totally pointless. And they paired me up with this woman named Sharon Perl, who was a very distinguished researcher who had come over from Digital Equipment's research lab when it kind of fizzled out after the merger in the late nineties. She and some of the other old guard at Google were doing all the really cool systems stuff. And she asked me what I was doing, and I said, "I don't want to talk about it, what are you doing?" And then she went through this list of really interesting systems projects.
One of them was kind of like a predecessor to S3; it was like a blob store. There was some NLP thing she was working on, and then in this list was this prototype of a distributed tracing system called Dapper that had never really seen the light of day. It was just kind of an idea, and she described it to me. I just thought it sounded incredibly useful and really fun, and my manager at the time had 150 direct reports. I'm not exaggerating, it was more than a hundred, so he had no idea what I was doing, obviously. How could he? And so I just started working on it, basically.
Utsav Shah: Without switching teams or anything?
Well, Google famously had this 20% program, so it was kind of that type of thing, but I really liked it and I thought it was quite valuable, actually. Then I moved to New York for personal reasons and I just started working on Dapper full-time. My manager Yorick also had like a hundred direct reports.
So he also had no idea what I was doing, and I got it to the point where it was in production and actually solving problems pretty quickly, just because it was... well, I can get into that if you want, why it was possible to do that. And I got hooked on that stuff and I really haven't looked back. That was early 2005, and now, sixteen years later, I'm still basically working in that same overall space: how do you observe complex distributed systems, and what kind of improvements can you make to the software engineering process if you are able to observe them effectively? After working on Dapper for a while, I just wanted to do something different. So I went over and did a couple of systems projects that really didn't work and are not well known because they were failures. I'm happy to talk about those too, if you want, but I eventually found my way over to Monarch. I started to create a multitenant, high-availability time series database, basically. In terms of the open source world, probably the closest parallel would be M3 or something like that. I ended up working on that for about three or four years and then left Google and started a social media company that was, as a product, a complete failure. About a year into it, I realized that it was never going to work and abandoned the product, but I realized I enjoyed being an entrepreneur. I would say pivot is the wrong word, because pivot implies that you keep one foot in the same place. I just started playing a different sport, but with the same investors, and that's actually LightStep. LightStep was founded as a social media company in 2013, and a year and a half in, I completely changed what I was doing and added some co-founders at that point. And here we are six years after that, and I'm still working on building stuff and really enjoying it.
Utsav Shah: Yeah. I think that is a super interesting background. [05:00] The first question on that is, was that the era when Larry Page or whoever decided that there's no need for managers, and that's why they just fired all of them?
I don't think it was so much that they fired them, but they would just hire a lot of engineers and not hire managers to go along with it.
Utsav Shah: Yeah.
I think there was this idea that management was bad in some capacity, and I understand where they're coming from. I definitely don't agree. I think good management is actually one of the most incredibly supportive things you can possibly have in an organization. But I think that a lot of the people who had come to believe that were just coming from really bad management. Certainly bad management is worse than no management, but good management is better than all of it.
Utsav Shah: The other thing that was interesting about Google at the time was that they were growing so quickly that if you didn't like what you were doing, you only had to wait a couple of months and some new person would take over.
So that papers over a lot of issues. I think there was a belief that Google had solved the management problem through software or something like that.
That was another thing: there was a belief that by writing internal software systems to do a lot of the blocking and tackling that managers might do (and they certainly had tech leads, who serve a managerial purpose just for dividing work up), they had solved that issue. But there's a law of large numbers; even though Google has been very successful, at some point they had to grow slower. And when that started to happen, the need for managers became much more obvious. Sure enough, at this point, I don't know what the ratio is, but I'm sure it's not 150 to one anymore, so they realized that they needed to correct that. It was liberating in the sense that you could do whatever you wanted, but I think it was pretty disorganized and not very efficient.
Utsav Shah: Yeah. Could you talk about the architecture? You said you worked on some part of the ad system that wasn't particularly interesting. From my understanding, and I could be completely wrong about this, there was one monolith, the Google Web Server, GWS, and not that many services around it. Is that roughly accurate? Because I'm also thinking, why did Dapper make sense if it's just one large server? But I guess that's clearly wrong.
Yeah, I don't think that's correct. Certainly if you go back far enough, to 1999 or something, it was probably true, but by the time I showed up, we didn't call them microservices, but they absolutely were. And I would say again that the best, probably the only good reason to adopt microservices goes back to management. It's difficult to get more than 15 or 20 engineers to work on anything efficiently in a single code base that's deployed as a single unit; it's just difficult to do that from a release engineering standpoint. So microservices serve a purpose from a software development management standpoint, where you can create a unit of deployment. But microservices at Google were much more about horizontal scaling, and that was a necessary thing. They had throughput that required that kind of horizontal scaling, so they definitely had them. I remember when we turned Dapper on in production, we'd never really been able to visualize it before, but a cache miss in Google web search certainly went through GWS, which you're referring to, the web server at the top of the stack. But by the time it got down to the bottom of the stack, between the front end load balancers and the final thing that would actually look through some index on disk, it was 10 or 20 levels of depth to get down. So yeah, it was definitely quite distributed, and there was also huge fan-out. Oftentimes a parent would have parallelized requests to 30 or 50 or, in some cases, a hundred children that held different parts of the index. And so the tail latency issues were really scary and stuff like that. So yeah, it was quite distributed, especially on the web search side, early on. There were other parts of the system that weren't: in the ad system, the front end that merchants would actually use was basically a database and a Java web server. So it wasn't everything.
But for the high throughput, low latency stuff, it was pretty distributed from early on.
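To make the fan-out point concrete, here is a rough back-of-the-envelope sketch of why wide fan-out makes tail latency "scary". It assumes each child request is independently slow with some small probability; the 1% figure and the function name `prob_any_slow` are illustrative assumptions, not numbers from the conversation.

```python
def prob_any_slow(fanout: int, p_slow: float) -> float:
    """Probability that at least one of `fanout` independent child
    requests is slow, given each is slow with probability `p_slow`.

    The parent has to wait for every child, so one slow child is enough
    to make the whole request slow.
    """
    return 1.0 - (1.0 - p_slow) ** fanout


if __name__ == "__main__":
    # Fan-out widths taken from the discussion; the 1% slow rate is assumed.
    for fanout in (1, 30, 50, 100):
        print(f"fan-out {fanout:>3}: {prob_any_slow(fanout, 0.01):.1%} of parents see a slow child")
```

Under these assumptions, a slow response that is rare for any single child becomes the common case for a parent that fans out to a hundred of them.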
Utsav Shah: Interesting. And just out of curiosity, did Google prioritize consistency or availability? Because of that large fan-out, I'm assuming availability, and it just dropped data coming from a few shards if they were too slow or something, but yeah.
Yeah. I don't think there's one answer to that, but Jeff Dean did a talk at Berkeley in like 2010 or 2012, and the slides are online. It was a really good talk where he discussed a lot of the techniques that they would use, depending on the situation, to deal with tail latency. And I understand you're referring to the CAP theorem there, but another trade-off that I think we had to wrestle with a lot was basically just cost or efficiency versus latency. [10:00] We would often end up with something that was more expensive in order to put a tighter bound on latencies. So if you had three copies of some service, you'd send the request to two of them in parallel and just take the first one that came back, in order to manage high-latency outliers and things like that. But I don't think there's a single answer from an availability versus consistency standpoint. It really depends on, I guess, the business requirements.
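The "send it to two replicas and take the first answer" technique can be sketched in a few lines. This is a minimal illustration using Python's `asyncio`, not Google's implementation; the `replica` coroutine, its names, and its delays are hypothetical stand-ins for real RPC calls.

```python
import asyncio


async def replica(name: str, delay_s: float) -> str:
    """Hypothetical stand-in for an RPC to one copy of a service."""
    await asyncio.sleep(delay_s)
    return name


async def hedged(*calls):
    """Run the same request against several replicas in parallel and
    return whichever response arrives first, cancelling the rest."""
    tasks = [asyncio.create_task(call) for call in calls]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # Wait for the cancellations to settle so no task is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).result()


def demo() -> str:
    # One replica is having a bad moment (200 ms); the other is healthy (10 ms).
    return asyncio.run(hedged(replica("replica-a", 0.2), replica("replica-b", 0.01)))
```

Calling `demo()` returns the name of the faster replica. The cost side of the trade-off is visible here too: every request does roughly twice the work in exchange for a tighter bound on latency.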
Utsav Shah: Yeah. I've seen the Tail at Scale stuff, so I think that might be what you're referring to. That's interesting. You turned on Dapper in 2005, is what you said, so what was the immediate engineering impact? Engineers at Google were your customers, so what was their reaction, and did you see some immediate changes based on releasing it and showing it to people?
That's a great question. One of the most interesting things about Dapper is that when we first got it out there in the world, well, at Google, it was definitely not something where everyone was like, "oh my God, this is incredible," and telling their office mates about it; it was nothing like that. In fact, I would basically go and find a tech lead for, you name it, Gmail, web search, anything that was operating at scale and had a lot of services, and I would kind of beg them to meet me. Then I would show up in their office. The UI admittedly was terrible, but it was still good enough to be useful. I would show them some traces and they would always be like, "wow, this is actually really interesting, I didn't know this." They would often explore it with me and we'd find something that was troublesome and novel to them. So, you know, they would get something that was interesting to them, and sometimes they would go in and fix that issue. We had our own dashboards to track activity, and it really didn't get a lot of use. It did generate a lot of value, in the sense that we were able to highlight the outliers, understand where that latency was coming from, and make some substantial optimizations. But it was very much a special purpose tool used by experts doing performance analysis in the steady state.
That was really what it was primarily used for initially, and there are some technical reasons for why that was the case. But if you think of it from a product standpoint, the issue is that we weren't integrated into the tools that people were already using. And that is still the number one problem with the sophisticated side of the observability spectrum: the insights that are generated are genuinely useful, and even self-explanatory when you put them in front of someone, but people simply are not going to find them themselves unless it's integrated into the tools that they're already using. I think it's still the number one barrier to value in observability, that it's not integrated into the kind of daily habit tools, whatever those may be. At some point we did make a change. Josh MacDonald, who actually still works with me at LightStep and who was working on Dapper in 2005 as well, eventually made a change to Stubby, which is essentially the internal name for gRPC, and particularly to this library called RequestZ, which was used to look at active requests going through the process. The change was basically just to cordon off the requests that had Dapper tracing set to true. So you could go to any process where people were already using this RequestZ thing all the time to see requests going through their service. It kept a cache of slow requests from the last hour or whatever at different latencies, and we had a little table of requests that had Dapper traces, where you could click on the link and go directly to the trace. Then it was in something people were already using, and the number of people that used Dapper, I don't remember exactly, but it must have been like a 20x improvement when we released that; it was a huge change. And the lesson is that Dapper didn't get any more powerful when we did that. It just got a lot easier to access.
And so it's all about being in the context of the workflow. There were some people, like Jonathan [surname unclear], an incredibly smart person, much smarter than I am, that's for sure, who ended up really pressing us to build kind of a bulk data API to run MapReduce and things like that over the Dapper data. He was in charge of something called TeraGoogle, which was actually the largest part of Google's index, but also the least frequently accessed. It's a very complicated system. I won't go into the way that it worked, just because we don't have time and I don't know if I'm allowed to talk about it, but suffice it to say it was really complicated. He did some fascinating work to understand the critical path of the system using that bulk data, and made some really substantial improvements as a result of it. So there were people like that who made these big improvements, but there's a big difference between delivering quote-unquote business value to Google, usually in the form of latency reductions, and having a lot of daily activity. Daily activity really didn't come until we integrated into these everyday tools. [15:00] And I think that was one of the most important lessons from the Dapper stuff: cool technology really is not enough to get retention from engineers who are busy doing other things.
Utsav Shah: That's super interesting. I think I heard the term TeraGoogle maybe five years ago when I interned there, and I think I've finally learned what it meant. I'm sure I forgot about it in like three months. That's interesting. And RequestZ, it seems like a front end for visualizing a context, or Google's context in a sense; is that an accurate way of phrasing it? And why did engineers use RequestZ? That's something I'm curious about now.
Well, for different things. RPCZ contained RequestZ, but RequestZ was the part that we're really talking about, and what was particularly nice was what it allowed you to do. I guess it was basically just a table; that's all that you saw. The table would have a row for every RPC method that you had in your Stubby service, your gRPC service, so each row is a different method. Okay, fine, that's simple enough. And then the columns were basically different latency buckets. So you'd have requests that took less than 10 microseconds, less than a hundred microseconds, less than one millisecond, less than 10, et cetera, and it would go all the way up to, I don't know, things that took longer than 10 seconds. And you could examine a very detailed kind of micro log of what took place during that request. So you could think of it as just a little snippet of logs that pertained to that request and only that request. And then, as I was saying, if the thing was Dapper-traced, you could then link off to the distributed version of it and see the full context. The thing that was particularly powerful, though, is that it had one special column for requests that were still in flight, which would be the ones taking a really long time.
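The table described here (one row per RPC method, one column per latency bucket, a short micro log kept per sample) can be approximated with a small data structure. This is only a sketch of the idea; the class name `LatencyTable`, the bucket labels, and the "keep the last few samples" policy are assumptions, not details of the internal tool.

```python
import bisect
from collections import defaultdict

# Upper bounds of each column in microseconds: <10us, <100us, <1ms, ... <10s.
BOUNDS_US = [10, 100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000]
LABELS = ["<10us", "<100us", "<1ms", "<10ms", "<100ms", "<1s", "<10s", ">=10s"]


class LatencyTable:
    """Rows are RPC methods, columns are latency buckets. Each cell keeps
    the most recent `keep` samples (keep >= 1) with their micro logs."""

    def __init__(self, keep: int = 3):
        self.keep = keep
        self.rows = defaultdict(lambda: [[] for _ in LABELS])

    def record(self, method: str, latency_us: int, micro_log: list) -> None:
        # bisect_right finds the first bound strictly greater than the latency,
        # so exactly 10us lands in "<100us" and 10s or more lands in ">=10s".
        column = bisect.bisect_right(BOUNDS_US, latency_us)
        cell = self.rows[method][column]
        cell.append((latency_us, micro_log))
        del cell[:-self.keep]  # drop all but the newest `keep` samples

    def bucket(self, method: str, label: str) -> list:
        return self.rows[method][LABELS.index(label)]
```

For example, after `table.record("Search", 250, ["parsed query", "read shard 7"])`, the sample appears under `table.bucket("Search", "<1ms")`. A column for in-flight requests would need the server to register a request when it starts rather than when it finishes, which this sketch leaves out.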
So what would happen is you could have a request that was stuck and you were trying to debug it in a live incident. You could inspect the logs just for requests that were stuck, and it was often that there was a mutex lock that was under contention and requests were stuck waiting on it. You could go and see that that exact thing had happened. And there was a lot of really pretty clever stuff they did in the implementation to defer the evaluation of any of the strings in the logs until someone was looking at them. So you could afford to be extremely verbose with the logging on this thing, and you only evaluated the logs when an actual human being was sitting there, hitting refresh in the browser, looking at it. With most logging frameworks, that's not generally how it works. So there was a lot of pretty complicated reference counting and things like that. But it also meant that if you were having an issue, you could figure out that you were blocked on this Bigtable tablet server, that it was this particular mutex lock that was contended, and that it was being contended by these other transactions, which you could look at to figure out where the contention came from. To be able to pivot like that in real time was pretty powerful, and then having that linked into Dapper to understand the context was also pretty powerful. I don't know if that makes sense, but it was one of those tools that I really haven't seen elsewhere. I think OpenCensus and zPages have some of that functionality, but it doesn't really make sense unless everything is using that little micro logging framework. And I just haven't seen that outside of Google or Google open source, so I still miss that. It was a really useful piece of technology.
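The "defer string evaluation until a human looks" trick can be illustrated with a tiny in-memory logger. This is a hedged sketch of the general technique, not the internal framework; `MicroLog` and `Expensive` are hypothetical names, and real implementations have to be careful about keeping the logged arguments alive, which is presumably where the reference counting came in.

```python
class MicroLog:
    """Per-request log that stores format strings and raw arguments,
    and only builds the final strings when someone asks to read them."""

    def __init__(self):
        self._entries = []

    def log(self, fmt: str, *args) -> None:
        # Cheap path: no formatting, no str() calls on the arguments.
        self._entries.append((fmt, args))

    def render(self) -> list:
        # Expensive path: runs only when a person views the request.
        return [fmt % args for fmt, args in self._entries]


class Expensive:
    """Argument whose string form is costly; counts how often it is built."""

    calls = 0

    def __str__(self) -> str:
        Expensive.calls += 1
        return "lock-holder-details"
```

With this shape, `log.log("waiting on %s", Expensive())` does no formatting work at all, and `Expensive.calls` only increases when `log.render()` is called. Python's standard `logging` module defers `%` formatting in a similar spirit, although it formats when a handler emits the record rather than when a reader opens a page.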
Utsav Shah: No, I think that's amazing given this was so long ago. It makes me think, taking a step back: maybe five or ten years ago, internal tools at Google were probably better than the development experience externally, right? You had so much stuff for free. You talk about Blaze and all of these different tools. But things have evolved a lot recently. It seems like there are so many startups coming up with different things, and even Datadog and so on are fairly mature now, so you get a lot of stuff for free from them. Would you say that the development environment externally is now probably better than anything that Google can offer, in the sense that you get the holistic experience? Or do you think there are still things, like RequestZ as you said, some functionality that's just missing because of the lack of consistency, where you have to integrate a million different things and maintain this Rube Goldberg set of integrations to get a similar development experience?
Yeah, that's a good question. It's really hard to compare the inside and outside of Google experience. It's not that Google was all better; a lot of stuff at Google was actually really annoying. I was just talking to someone about this yesterday, but the trouble with Google was that everything had to operate at Google scale, and there's this idea, [20:00] which is totally false in my mind, that things that operate at higher scale are better. They're usually not; there's a natural trade-off between the scale that something can operate at and the feature set. And so a lot of the stuff we had at Google was actually pretty feature poor compared to what you can use right now in open source. The only thing it really had going for it is that it scaled incredibly well. The exceptions were mostly areas where having a monorepo, with almost no inconsistencies in terms of the way things are built, gives you some leverage. And of course there are a lot of examples of that. RequestZ is a fine example; Dapper is actually a fine example, too.
The instrumentation for Dapper, to get that thing most of the way there, was a couple thousand lines of code for all of Google. Whereas, just look at the scope of OpenTelemetry or something to get a sense of how much effort is going to be required to get that sort of thing to happen in the broader ecosystem. So they had this lever around consistency. A lot of the tooling at Google, it's not that it was bad, it was very scalable, but it didn't have a lot of the features that we would expect from tooling outside. I'd also say that in my seven years at Google working on infrastructure and observability, I never had a designer on staff and I barely ever had a PM, and it really showed. Having worked now with really talented designers, not just ones who can make UIs look nice, but who really think about design with a capital D and stuff like that, it's just a completely different ball game in terms of how discoverable some of the value and the feature set are. So the Google technology often lacked that sort of polish, and I think there are many different vendors out in the world right now that have built things that are much easier for an inexperienced user to consume. Even if the technology is equivalent, I think the user experience is not. So that's another area where I think what we had at Google is, unfortunately, actually a pretty far cry from what you can get now just by buying SaaS.
Utsav Shah: Yeah. Well, I've thought about how, if people ran infrastructure teams like product teams, as if you're trying to sell each piece of your infrastructure to potential buyers, you would create a better product, because you'd have to think about user experience and about customers actually getting value. Trying to make that happen is not the easiest thing in the world, but that makes sense. When you released Dapper, did you have any sampling at all? I'm assuming yes, since it wouldn't have just worked otherwise?
Yeah. Sampling is an interesting topic. There are a lot of places you can perform sampling in a tracing system, and Dapper performed it in almost every one of them. But yeah, Dapper actually had pretty aggressive sampling. We started with one in 1024, so that was the base sampling rate in Dapper, and then we realized that even after that cut of one in a thousand, centralizing the data was expensive. Initially we wrote the data just to local disk, where we wrote log files, and we deployed a daemon that ran on every host at Google. By the way, if you ever want to jump through some hoops, try to deploy a new piece of software that runs as root on every machine; I can tell you that was a real nightmare from a process standpoint. But anyway, the daemon would sit there, scan the log files, and basically do a binary search anytime someone was looking for a trace. That's the thing that I started with, and that was honestly a terrible way to build that system, really bad. So eventually we moved to a model where we would try to centralize that data somewhere, for all the reasons you might imagine, but it turned out that the network costs of centralizing that data, even after the one in 1000 cut, were substantial, and the storage costs were also really substantial. So we did another one in 10 on top of the one in 1000. So we were doing roughly one in 10,000 sampling, randomly, before we got to the central store that was used for things like MapReduces and stuff like that. And it pretty much means you can never use Dapper for all sorts of applications. For web search it was fine, because, I don't remember the number, but it was on the order of a million queries per second, so fine. But take something like Google Checkout, where people are actually buying stuff.
It's of course intrinsically a much lower throughput service, but the transactions are actually more valuable, so you're getting cut both ways. We didn't have a dynamic sampling mechanism for that when I was working there. People could adjust the sampling rates themselves, but they usually didn't. So the technology was really not that useful except for the high throughput services where that sampling wasn't a complete deal breaker. I think with LightStep and with other systems that have been written in the last couple of years, there's a recognition that sampling really serves a couple of purposes. One is to protect the system from itself. You don't want to have [25:00] an observer effect and actually create latency through tracing. With Dapper, we had that issue because we wrote to local disk; we were basically entering the kernel, at least on disk flush, and for hosts that were doing a lot of disk activity, we could actually create latency with high sampling percentages. But there's no need to do that; you can just flush the stuff over the network. And that was 2005; networks are a lot faster now, so you can actually get the data out of the process without sampling in almost all situations. There are probably outlier cases here or there where that's not true, but overall there is no issue with flushing all the data out of the process. And then you just need to decide how much you're willing to spend on network and how much you're willing to spend on storage, and that's a whole other set of constraints. The other thing that I have recognized is that long-term storage is quite cheap. The wide-area networking costs over the lifetime of that data end up being almost as expensive as, or in many cases more expensive than, storing it for a year. So if you can find some way to push the storage closer to the application itself, even if it's just in the same physical building or availability zone, that's a pretty big win as well.
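The arithmetic behind "cut both ways" is worth spelling out. This sketch combines the two sampling stages mentioned (one in 1024 at the source, then one in 10 before the central store) and compares a high throughput service with a low throughput one. The million queries per second is the order of magnitude quoted above; the 10 requests per second for the low throughput service is purely an assumed figure for illustration.

```python
from fractions import Fraction

BASE_RATE = Fraction(1, 1024)   # sampling decision made at the source
CENTRAL_RATE = Fraction(1, 10)  # second cut before the central store
OVERALL_RATE = BASE_RATE * CENTRAL_RATE  # 1/10240, "roughly one in 10,000"


def traces_per_second(requests_per_second: int) -> float:
    """Expected number of traces per second reaching the central store."""
    return float(requests_per_second * OVERALL_RATE)


def seconds_between_traces(requests_per_second: int) -> float:
    """Expected wait between two centrally stored traces."""
    return float(1 / (requests_per_second * OVERALL_RATE))


if __name__ == "__main__":
    print(traces_per_second(1_000_000))   # high throughput: about 98 traces per second
    print(seconds_between_traces(10))     # assumed 10 req/s: 1024 seconds, about 17 minutes
```

At a million requests per second there is still plenty of data after sampling. At an assumed 10 requests per second, a uniform one-in-10,240 sample yields a stored trace only about every 17 minutes, which is why a fixed random rate is a poor fit for low-volume, high-value transactions.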
So a lot of the work that we've done at LightStep is actually trying to take advantage of some of that. We're just trying to be on the right side of those cost curves in terms of where we actually do the high throughput work, where we do the sampling, and stuff like that.
Utsav Shah: This reminds me of how Monarch is designed. I was just reading the paper before this, and it's the same idea, where you're trying to flush to something that's in the local data center or the local availability zone. And then finally, when you query, you're getting so much less data that you can do that once and ask questions of multiple regions. Is that roughly accurate?
Yeah, I think there are definitely some similarities in Monarch. I have to say, if we could go back in time, I would have pushed back harder on some of the requirements that were put on us. I don't think we did the wrong thing given the requirements that were handed down, but there was the requirement that we depend on almost nothing except for physical VMs. "Physical VM" is kind of a misnomer, but the fact that we weren't allowed to take advantage of Google's other infrastructure beyond just the scheduling system and the kernel and things like that really limited what we could do. And then when you also pile on some other requirements around performance and availability, you're kind of forced to store everything in memory, and we did. The paper goes on to talk about the number of tasks, which are basically virtual machines, that Monarch consumes in the steady state. If I remember correctly, the paper's number is like 250,000 VMs in the steady state, and that is just an extraordinarily expensive system right there. A VM of course is not the same size as a physical machine, but it's a lot, and that's not even counting the VMs that are being used for durable storage and long-term storage of the data, wherever they're putting that stuff in Google's longer-term storage systems. I mean, it's just a tremendously expensive system, and that's not a good thing, and I'm not convinced that's the right approach. With some of the work we've been doing lately at LightStep, we basically had to write our own time series database from scratch, and rather than trying to re-implement what we did with Monarch, I think a lot of the lessons we've learned are that there are ways to do it that are far more efficient without really paying a penalty in terms of performance.
And, yeah, I remember that we felt like we had no choice but to do everything in memory. There are some similar systems, like Facebook's Gorilla system, which I think also ended up making the same decision at about the same time. Maybe it was because flash wasn't quite a commodity at that point, and so we felt like it was either physical spinning disk or memory. Now of course there are some interesting options, but that was expensive. I don't know, it's a cool system and it's very powerful, but awfully expensive.
Utsav Shah: Yeah. So just for listeners, Monarch is a monitoring system. It provides roughly the same end interface as something like Prometheus, but it's designed in a very different way internally. Even for the user, I think the configuration system is different, but it serves kind of the same purpose. It was designed to replace Borgmon, the original monitoring system, which engineers had to deploy for themselves, whereas Monarch was like a SaaS service, in the sense that you just had to add your metrics and things would work automatically. Is that a good summary?
I think that's exactly right about what Monarch is. The Prometheus thing is a little funny though. Prometheus is architecturally much more similar to Borgmon than to Monarch. Borgmon had a lot of issues, which I think Prometheus has improved upon, that were self-inflicted. Like, to actually use Borgmon to monitor your system, [30:00] you had to use not one, but five different domain-specific languages, totally unique to Google, all of.
Utsav Shah: PSM.
Yeah. All of which were totally arcane, if you want my honest opinion, and had lots of gotchas. For instance, sorry, this is a rant, but if you wanted to do arithmetic, which of course is something you'll want to do when you're writing queries, you could use the minus operator, no surprise. But if you had variables that you were subtracting and you didn't separate them with spaces, well, it allowed hyphens in variable names, and it would just silently fail. It would just substitute a zero for that expression, and crazy stuff like that. And of course, since it was a handwritten DSL that wasn't particularly well documented or maintained, there really wasn't staffing to improve that. There was definitely a period at Google where it was kind of common, anytime you ran into a new problem, to write some kind of language. Some of these languages in the Borgmon universe were pretty small, to be fair. But the point I'm making is that they each had their own grammar and their own rules, and most people basically just copy-pasted someone else's Borgmon configuration in order to hit their launch criteria. So there wasn't a lot of thought and care being given to writing maintainable code, and it definitely is code. If I remember correctly, Gmail's configurations, which, you know, admittedly were generated programmatically, but those Borgmon configurations were like 50,000 lines of code, and it was totally inscrutable.
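The hyphen gotcha is easy to reproduce in a toy expression evaluator. To be clear, this is not the actual grammar of the monitoring DSL being discussed; it is a made-up, minimal evaluator with two assumed rules that mirror the description: identifiers may contain hyphens, and unknown variables silently evaluate to zero. The variable names `errors` and `retries` are hypothetical.

```python
import re

# Assumed toy grammar: identifiers may contain '-', which is what makes
# "errors-retries" lex as ONE variable name rather than a subtraction.
TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_-]*|\d+|[-+])")


def evaluate(expr: str, env: dict) -> int:
    """Evaluate a flat +/- expression over integers and variables.
    Unknown variables silently become 0 (the dangerous part)."""
    total, sign = 0, 1
    for tok in TOKEN.findall(expr):
        if tok in ("+", "-"):
            sign = 1 if tok == "+" else -1
        elif tok.isdigit():
            total += sign * int(tok)
        else:
            total += sign * env.get(tok, 0)  # missing name -> 0, no error
    return total


ENV = {"errors": 10, "retries": 3}
```

With this environment, `evaluate("errors - retries", ENV)` gives 7, while `evaluate("errors-retries", ENV)` gives 0 without raising anything, because the whole string is treated as one unknown variable. Raising an error on unknown names, or disallowing hyphens in identifiers, would each be enough to surface the mistake.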
So there was a lot of frustration about that kind of stuff. Whereas Prometheus, I think, is far more sensible. I could critique this or that, but it generally makes sense. I get it. I don't want to sound overly critical; this is not a good or a bad thing, it's just recognizing that every system is designed for a certain set of problems. But Prometheus did inherit one of the most problematic things from Borgmon, which is that it wasn't really designed for distributed query evaluation. You can kind of do it, but you have to manually shard the thing yourself. And that's a very difficult thing to maintain, to do all the rebalancing and things like that. And I think the initial effort at Google, it wasn't Prometheus, but it was almost like Prometheus: let's fix Borgmon by building a new system that has the same scaling characteristics, but has one language, not five; a better language; improvements to this or that; a better internal time series store; things like that. But it was still basically the same architecture. And my recollection is that this guy, Alan Donovan, who is another person who's a lot smarter than I am, a really clever person, was working on this stuff at the time. And I think his observation was: if we're saying that Borgmon has tons of issues, how could it be the right thing to have the block diagram be exactly the same, but have each block just be better? Shouldn't we be thinking about this a little more holistically and really examine the problems that people are having? And I think when we did that, we realized that the number one problem, the one causing a lot of the other weird things people were doing, was the fact that they had to manually shard and balance this thing, and that distributed query evaluation was kind of a hack.
So the thing that made Monarch so interesting and also so difficult was that it really was horizontally scalable, and users did not need to worry about where their data was being balanced. It was also multitenant from day one, which allowed a central team to run it for all of Google, instead of trying to repeat that effort with every team in their own little cluster.
And it made the design much harder, but ultimately I think more robust. I'm not knowledgeable enough about Prometheus to know how much effort would have to go into making it really do that. I've seen Thanos has added some functionality like this. But with query evaluation, really pushing that down and making sure that you do as much of the aggregation as you can at the lowest level, and then bringing things back up, there are a lot of subtleties that I think we felt we had to build into the design pretty early. The thing we were really trying to escape from with Borgmon was a design that made it difficult to do distributed query evaluation and difficult to handle really large datasets that don't fit in a single node, because that was the underlying pain point in Borgmon that led to a lot of other pain points.
Utsav Shah: That is super interesting. The name Alan Donovan, I think I've seen it in Bazel git logs. He wrote, like, Starlark for Go, and I think I might be in that Google group.
So yes, that's right. I think he wrote one of the official-looking Go programming language books. He has some languages background; really nice person, very intelligent guy. But I credit him with sort of forcing us to step back and really think about what problem we were solving with Monarch. And yeah, that was really fun though, I loved building that system. That was probably my happiest time at Google, the summer that we were prototyping that. It was just an amazing team that moved very quickly. It was a lot of fun.
Utsav Shah [35:00]: Can you talk more about the distributed query evaluation? I don't fully get why it's problematic. So let me explain it to you and you can tell me where I'm wrong. So what you're saying is that with Borgmon, query evaluation mostly happened at the higher layers, whereas... I guess if you could just explain it to me, because I don't fully grasp it. What exactly is the difference in architecture?
Yeah. I wasn't being clear, so that totally makes sense. Let's take a simple query. You want to understand the ratio of your error rate to your total request rate across your application, and you want to group it by RPC method. So let's assume that the amount of data that you have for all of these time series is large. To put this in context, some of Gmail's metrics were distribution-valued, so the actual value type was a histogram, and a single metric with all the cardinality turned into 250 million time series in the steady state. So that's a very high-cardinality surface area that we were trying to aggregate around. And the problem you have when you're trying to do that sort of query, that error ratio query, is that it's a join. So you have two different queries, you have an error rate query and a count query, and then you have to create buckets for each of these RPC methods, just doing a group by, and then within each of those, you have to do a bunch of math. One option is to basically have a query evaluator at the top of the stack that just talks to all of the leaf nodes. In Monarch, we called them leaves. And you would say, okay, give me all the data you have for this particular metric, and they would stream the data back to you and you do the math. It turns out the data size is large enough that if you do that, your query evaluation time moves into the tens of seconds, or minutes in some cases, which was kind of a non-starter. So instead, what you'd like to do is say, okay, fine, we'll compute this at the leaves. But the problem is, and this is the most important point: if you're grouping by RPC method, there's absolutely no guarantee at all, in fact it's just not true, that all of the data for one RPC method is going to be on one server or another.
So each of the servers, each of the Monarch leaves, is going to have some portion of the data, so what you want to do is compute what we would call a partial aggregation. So everyone computes their part of this particular query: they each make the RPC method buckets, and then they pass those partial results up the stack to the mixer level, where now you've done the aggregation, so the data size is pretty small over the wire.
And then you complete the aggregation now that you have all the data, get the final numbers for both the error and the count, and then at the last step at the top, you join the two things, divide them, and you're finished. So the most important thing to understand is that it's not possible for the lower-level nodes, where the horizontal scaling has happened, to compute the final number, because they don't have enough data to do it. So they have to have some way of communicating partial results back to the top of the stack. In that example, it's not that difficult, but in terms of the full query plan for an arbitrary query in the language that we were designing, there's a lot of subtlety and complexity to how those different types of queries can and cannot be pushed down to be evaluated at the leaves. And if you ever end up in a situation where you need to pull all of the data up into the mixer level, the whole thing totally falls apart from a performance standpoint, and oftentimes even from a feasibility standpoint: you end up OOMing that thing if you're not careful. So there's a lot of streaming evaluation and pushback on channels, so you don't flood the thing that's getting this huge fan-in from all the children and stuff. So this guy, John Banning, another person much smarter than I am, designed that thing and worked on it for years to kind of optimize it. And yeah, it was a really interesting piece of technology, but it was just quite subtle. And I think if we hadn't designed it for that initially, it's just hard to adapt a query model that isn't designed to create these partial aggregates. It was hard for me to imagine how you would shim that in after the fact, because the way you approach the computation is that you have to be able to kind of truncate the computation and send it as a partial computation up the stack instead of as a set of query results.
And I'm not saying it's impossible, I'm assuming it gets pretty ugly. So that's the thing that I was referring to. Does that make sense?
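The partial aggregation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Monarch's implementation: the sample data, the function names (`leaf_partial`, `mixer_merge`, `error_ratio`) and the two-level layout are all hypothetical. Each leaf groups its own slice of the data by RPC method and returns partial sums; the mixer merges those small partial results and only then performs the join and the division.

```python
from collections import defaultdict

# Hypothetical data: each leaf holds an arbitrary slice of the samples,
# so no single leaf sees all the data for any one RPC method.
# Each sample is (rpc_method, error_count, request_count).
LEAVES = [
    [("Get", 1, 100), ("Put", 0, 50)],
    [("Get", 3, 200), ("List", 2, 40)],
    [("Put", 5, 50), ("List", 0, 60)],
]

def leaf_partial(samples):
    """Runs on a leaf: group by RPC method and sum locally."""
    partial = defaultdict(lambda: [0, 0])
    for method, errors, requests in samples:
        partial[method][0] += errors
        partial[method][1] += requests
    return {method: tuple(sums) for method, sums in partial.items()}

def mixer_merge(partials):
    """Runs on the mixer: combine the small partial results from every leaf."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for method, (errors, requests) in partial.items():
            merged[method][0] += errors
            merged[method][1] += requests
    return {method: tuple(sums) for method, sums in merged.items()}

def error_ratio(merged):
    """Final step at the top: join errors with counts and divide."""
    return {m: e / r for m, (e, r) in merged.items() if r}

partials = [leaf_partial(samples) for samples in LEAVES]
totals = mixer_merge(partials)
ratios = error_ratio(totals)
```

Note that no leaf could have produced the final ratio for "Get" on its own, and averaging per-leaf ratios would give the wrong answer; only the sums can be safely pushed down, with the division deferred to the top.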
Utsav Shah: Are you saying that the query language itself also needs to be designed with this thing in mind? Or is that mostly just the way you shard it out and make your query plan?
We really tried not to put constraints on the query language because of this. There were certain types of joins that were really hard, like a full relational join. It was deeper than just this problem, but it wasn't designed to be a relational [40:00] database, and those things are really quite difficult to implement. So the joins are where I think we had to kind of draw the line on some of the functionality. But for a lot of other stuff, the query language has some very powerful capabilities, especially dealing with the time dimension, but it was also much more limited. It wasn't as powerful as a lot of SQL-like languages for doing general-purpose computation, stuff like that. So it was very much designed for time series data, but I would say it seems to have been general enough to represent a wide variety of operational use cases. It just wouldn't take the place of BigQuery or something like that for general-purpose computation.
Utsav Shah: Yeah. And one thing I think listeners should note is that you might think that, users are not running that many queries. How does it matter? But like a lot of people write alerts and like monitors and they're basically super complicated and they have to be evaluated a lot. And I'd imagine that a lot of your load was from these alerts that have to be continuously evaluated to make sure, and it's super critical that they fire quickly. You don't want there to be like a production outage and there's a slow down due to the monitoring system.
Certainly the case!
Utsav Shah: So it's a really interesting problem. I think Dropbox tried to build their own, and I think they rolled it out successfully, and they just put in cardinality limits. You can't have queries with super high cardinality, but there are very few other limits, because I think cardinality explosion is where a large source of problems is. Is that accurate?
Yeah, that's definitely accurate. So that's another rant, if you don't mind me going in that direction for a minute. This is one of the things I did not realize when I was working on Monarch, but I've come to believe that there really are two types of monitoring data, or telemetry, I guess you could call it. There is statistical data, which we usually call metrics, and there's transactional data, which we usually call traces, or sometimes logs or structured events. And those are the two flavors of data.
The Achilles heel for the structured events, the traces, the transactional data, is that at high throughput, just retaining it for a long time is really expensive. There's so much of it when you're processing a lot of transactions. It's big data, it's expensive. I don't mean big data in the buzzword sense, but it's a lot of data. And then on the metrics side, you can handle the high throughput very naturally, because the only thing that changes is the value of the counter. It just goes up, but it doesn't actually make the metric data larger; it just changes the numbers if you have high throughput. So the Achilles heel for metrics data is high cardinality. I actually wrote some stuff about this on Twitter a few weeks ago. I can send it to you afterwards if you want to include it in the article. But the thing that's so frustrating is that cardinality is necessary. It's totally inevitable that you're going to want to include some tags in order to understand and isolate symptoms of interest, and I think metrics should be used to isolate symptoms of interest that connect the health of some part of your system to the business. At some abstract level, that's really what you're supposed to be doing with metrics. And then I think because that's the tool that people use, it just becomes the hammer for every nail that they see. And you just try to use cardinality to address every aspect of observability, which is a complete disaster from a cost standpoint and from a user experience standpoint, and I will try to elaborate on that. So let's go back to this example of RPCs. It's totally fair, and smart actually, to have a tag on your RPC metrics for the method, because you want to distinguish, say, writes from reads. Totally fine, because those are different things you might want to independently measure from a health standpoint. But then you might say, traffic spiked or latency spiked, and I want to understand why.
And so if you're using a metric system, the only tool you really have to understand that variance or that change is to do a group by on some other tag and hope to see that one tag value explains that blip.
And so you now have a lead to go and follow that tag value where it leads you, whether it's a host name or whatever, and I understand the appeal of that, but there are two problems. One, you add a couple more tags and the combinatorial explosion is immediate, and you're suddenly spending a ton of money. Whether it's on Prometheus or a vendor, it's really expensive. And then I think the even more pressing issue is that in the code, you only have access to things that are locally available. So you have access to your own host name and things like that. But it's often the case, in fact, the numbers I've seen, which ring true for me, are that 70-75% of incidents in production are caused by an intentional change, like a deployment or a config push elsewhere in the stack. That version change or config push is not going to be in your local tags. If you're in service B [45:00] at the bottom of the stack, and service A at the top of the stack pushes a new version that's flooding you with traffic, service A's version is not going to be available for grouping and filtering anyway. So you're paying a lot of money to have all this cardinality and you can't even group by the thing that would explain it. Now, in the transaction data and the tracing, you absolutely do have that. The traces flow through both services. These health metrics are linked to the transactions via hosts, via service names, via method names, all sorts of stuff like that. There are ways to pivot over to the tracing data programmatically in an observability solution, and in the transaction data, the high cardinality is not an issue.
You can do an analysis of thousands of traces in real time, and actually understand that the thing that changed is that before, service A, above you, was on version five and now it's on version six, and that explains the difference in the health metric you started with. But using cardinality as a way to do observability is a big mistake. Sorry, using metric cardinality as the way to do observability is a big mistake.
I think that metrics should be used to understand health and nothing more. And then you have to be using observability tooling that's smart enough to pivot over to the transaction data, where it's both cheaper and more effective to understand these sorts of systemic changes that lead to these health changes in the first place. So at Google, I think we were actually way behind where we are now, honestly, with LightStep, in terms of how we would pivot from time series data over to transaction data and back again. But that's really the essence of it, because you can't do high throughput with one and you can't do high cardinality with the other, so you have to be able to use tools that pivot from one to the next intrinsically. And I think that's the thing that allows you to kind of get out of that cardinality trap more than anything else. It's not setting a cardinality limit, it's just not needing it to be in your metric data in the first place. I think that's really the solution that we'll find ourselves pursuing over the next couple of years.
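To make the pivot from a health metric to transaction data concrete, here is a minimal Python sketch. It assumes each trace has already been flattened into a dictionary of tags collected from every service it passed through; the tag names, the `shares` and `biggest_shifts` helpers and the sample data are hypothetical and do not describe how LightStep or Dapper works. The idea is simply to compare how often each tag value appears in traces before and after a blip, and to rank the values whose share changed the most.

```python
from collections import Counter

def shares(traces):
    """Fraction of traces that carry each (tag, value) pair."""
    counts = Counter()
    for tags in traces:
        counts.update(tags.items())
    total = len(traces) or 1  # avoid dividing by zero on an empty window
    return {pair: count / total for pair, count in counts.items()}

def biggest_shifts(before, after, top=3):
    """Rank (tag, value) pairs by how much their share changed."""
    b, a = shares(before), shares(after)
    diffs = {pair: a.get(pair, 0.0) - b.get(pair, 0.0) for pair in set(a) | set(b)}
    # Largest absolute change first; ties broken by the pair itself.
    return sorted(diffs.items(), key=lambda item: (-abs(item[1]), item[0]))[:top]

# Hypothetical traces seen by service B. The upstream version tag comes from
# service A's spans, which service B's own metrics could never carry.
hosts = ("h1", "h2", "h1", "h2")
before = [{"service_a.version": "5", "service_b.host": h} for h in hosts]
after = [{"service_a.version": "6", "service_b.host": h} for h in hosts]

shifts = biggest_shifts(before, after)
```

In this toy data the host distribution is unchanged, so the ranking surfaces the upstream version change rather than anything local to service B.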
Utsav Shah: Well, I think that makes total sense. And maybe just super quick, to explain to listeners, why is high cardinality bad? I think the answer is that in a time series database, when you have a data point with a different tag value, you have to basically store it as a different row or a different column, and that's what causes the explosion.
Yeah, that's basically it. And high cardinality isn't bad; high-cardinality metrics are bad. I think the issue is that in a time series database, you can basically think of it as a huge spreadsheet where each row is a different time series. And it turns out that creating a new row is a lot more expensive than creating a new cell.
Utsav Shah: So you're just incrementing like a number in that cell?
Yeah, or adding another data point to an existing time series is far cheaper than creating a new time series. The issue with high cardinality sounds so esoteric, but it ends up being an issue that your chief financial officer can start caring about. Because you can add one piece of code, literally one line of instrumentation, that says, I've got this metric that tracks requests, I'm going to add customer ID and host name, and that just explodes, and then every single new value that passes through that line of code is going to create a new time series. And it's not vendors being evil, by the way; they have COGS, they have to pay for it. But if you write a piece of code that incurs 10 million time series in some TSDB somewhere, that's going to be expensive no matter what TSDB you are using. Some are more expensive than others, but it's still just different flavors of expensive. I think ideally the developer can add that code, and a platform team can write some kind of control that says, I never want cardinality for any metric to exceed X, and then you can do some kind of top-K thing to retain the high-frequency data and aggregate the rest. I think that's the kind of long-term vision I have for how cardinality should work. But really, it would be great to use event data, tracing data, for high cardinality, where you don't pay a penalty at all. It just literally doesn't matter in terms of the cost of the solution. And then stick to metrics for high-throughput data where you need precise answers about critical symptoms.
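A rough Python sketch of both points above: how a couple of extra tags multiply the number of time series, and what a top-K cardinality control might look like. The tag cardinalities, the customer names, the `series_count` and `cap_cardinality` helpers and the `"other"` bucket are all assumptions made for illustration; no particular TSDB or vendor works exactly this way.

```python
from collections import Counter

def series_count(tag_cardinalities):
    """Worst-case number of time series: the product of distinct values per tag."""
    total = 1
    for distinct_values in tag_cardinalities.values():
        total *= distinct_values
    return total

def cap_cardinality(counts, k):
    """Keep the k busiest tag values and fold everything else into 'other'.

    Assumes no real tag value is literally named 'other'.
    """
    kept = dict(Counter(counts).most_common(k))
    rest = sum(value for key, value in counts.items() if key not in kept)
    if rest:
        kept["other"] = rest
    return kept

# Hypothetical cardinalities: one extra tag turns 10 thousand series into 10 million.
without_customer = series_count({"method": 20, "host": 500})
with_customer = series_count({"method": 20, "host": 500, "customer_id": 1000})

# Hypothetical per-customer request counts, capped at the two busiest customers.
requests = {"cust-1": 900, "cust-2": 50, "cust-3": 30, "cust-4": 20}
capped = cap_cardinality(requests, k=2)
```

Here `capped` keeps separate series for the two busiest customers and one aggregate series for the rest, so the total request count is preserved while the number of rows stays bounded.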
Utsav Shah: And then the flipside of that is, why is cardinality not a problem for tracing? How do you store tracing data in a way that cardinality isn't an issue at all?
I think it depends on how you index it. So cardinality is an issue with the index in the TSDB.
Utsav Shah: Yes.
In the tracing database, it's actually interesting; there are many different ways of doing it. Some people have column stores. I don't work on any of them, so I don't want to misspeak, but I'm pretty confident that their underlying database is a column store, where you have different trade-offs in that type of world. What LightStep does is too complicated for me to explain right now, but we have our own way of managing cardinality in the trace data. So it's not that it's free or something, but it's not an issue like it is for a time series database. So I think that different people have addressed it in different ways, [50:00] but I don't feel it's a very satisfying answer to your question. In a time series database, because of the locality constraints you have around time series, adding cardinality just has a fixed cost that's relatively high. I think that's probably the best way I can explain it.
Utsav Shah: Okay. Is the index in tracing data just the trace ID, or a hash basically, or a service, which is a relatively straightforward index?
There are a lot of different ways to do it. That's why I'm kind of hedging on this, because I'm trying to be precise, I like being precise, and there's actually quite a diversity of ways that it's done right now. So I can't specify how it's done everywhere. In the Dapper ecosystem, we did have a few indices that we special-cased, and then if you wanted anything that wasn't one of those indices, you had to write a MapReduce, which is super high latency. But other systems have no limits on cardinality; others will index up to N values of every key and no more. I mean, there are a lot of different approaches to it.
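One of the approaches mentioned above, indexing up to N values of every key and no more, could be sketched as follows. This is purely illustrative: the `BoundedTagIndex` class, its first-come policy for choosing which values get indexed, and the convention that `None` means "not indexed, fall back to a slower scan" are assumptions, not a description of Dapper or any specific tracing system.

```python
from collections import defaultdict

class BoundedTagIndex:
    """Indexes at most max_values_per_key distinct values for each tag key."""

    def __init__(self, max_values_per_key):
        self.max_values = max_values_per_key
        self.index = defaultdict(dict)  # key -> value -> list of trace IDs

    def add(self, trace_id, tags):
        for key, value in tags.items():
            values = self.index[key]
            if value in values:
                values[value].append(trace_id)
            elif len(values) < self.max_values:
                values[value] = [trace_id]
            # Otherwise the value is left unindexed; the trace itself is still
            # stored elsewhere and can only be found by a slower scan.

    def lookup(self, key, value):
        """Return indexed trace IDs, or None if this value is not indexed."""
        return self.index.get(key, {}).get(value)

idx = BoundedTagIndex(max_values_per_key=2)
idx.add("t1", {"customer": "a"})
idx.add("t2", {"customer": "b"})
idx.add("t3", {"customer": "c"})  # third distinct value: not indexed
idx.add("t4", {"customer": "a"})
```

With this policy the index size per key is bounded regardless of how many distinct values arrive, at the cost of slow lookups for the values that missed the cut.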
Utsav Shah: That makes sense. And I can read up more on how Jaeger works, but when you search for a trace ID, you get that fast, and everything else is kind of not that fast, so that makes sense to me. I think there was a lot of good information. I feel I've learned a lot about how all of this infrastructure works, for sure. Do you have anything that you want to add on top of this? From all of this stuff you've learned, you left Google in 2012, what was the one thing that you still use, or some information that you still remember from that time? One design principle perhaps, or just one way of thinking about things?
That's a good question. I kind of referred to it, but not as precisely. One thing that Jeff Dean said a few times, which really did stick with me, was this idea that you really can't design a system that's appropriate for more than like three or four orders of magnitude of scale, "appropriate" being the keyword. And this goes back to the idea that systems at Google were not better because they were more scalable. One thing I liked, and it's not just about Google, I think most companies are like this, is that the in-house technology at Google came with pretty accurate advertising for what it was good for and what it wasn't good for. And there's no shame in saying, yeah, this database is good until you hit this scale, and this one's terrible if you go below that, because it doesn't have all these features that you'd expect, or what have you. My experience outside of Google, whether it's open source software or vendor software, is that people are understandably reluctant to describe the scale that their system is appropriate for. And it's a great question to ask, actually. If you're talking to someone about something that they're really excited about and they're trying to pitch you on it, just say, so tell me, what's too much scale for this, and what's too little scale for it? And if people can't come up with an actual answer to that, I think that's a bit of a red flag in my book. That was something that really stuck with me after Google, and I think it applies to any technology that you're building. It's also good to be humble about that. You can remind yourself that if you built some really scalable thing, it probably doesn't do something that a less scalable thing could do. Just try to think about fitness for purpose; that's maybe the thing I feel comes up over and over and over again, in anything from engineering, to product, almost to marketing. To think about what market segment is this really appropriate for?
Where would the scale that we're targeting live in the marketplace? So I think it's relevant at that level too.
Utsav Shah: Yeah. It reminds me of this meme, MongoDB is like web scale.
What does that mean? No comment.
Utsav Shah: The web is so different for so many different things; there is no concept of web scale, it just sounds fancy. My current company uses MongoDB, it seems to work so far, it's probably fine. And I have so many more questions. I want to ask you about OpenTelemetry and OpenTracing and all of these things, and LightStep. But I think it would be nice if we do that in a follow-up session. This was great, and I feel like I learned a lot, so thank you so much for being a guest.