
Software at Scale 18 - Alexander Gallego: CEO, Vectorized

Alexander Gallego is the founder and CEO of Vectorized. Vectorized offers a product called Redpanda, an Apache Kafka-compatible event streaming platform that’s significantly faster and easier to operate than Kafka. We talk about the increasing ubiquity of streaming platforms, what they’re used for, why Kafka is slow, and how to safely and effectively build a replacement.

Previously, Alex was a Principal Software Engineer at Akamai and the creator of the Concord Framework, a distributed stream processing engine built in C++ on top of Apache Mesos.

Apple Podcasts | Spotify | Google Podcasts


Highlights

7:00 - Who uses streaming platforms, and why? Why would someone use Kafka?

12:30 - What would be the reason to use Kafka over Amazon SQS or Google PubSub?

17:00 - What makes Kafka slow? The story behind Redpanda. We talk about memory efficiency in Redpanda, which is better optimized for machines with many cores.

34:00 - Other optimizations in RedPanda

39:00 - WASM programming within the streaming engine, almost as if Kafka was an AWS Lambda processor.

43:00 - How to convince potential customers to switch from Kafka to Redpanda?

48:00 - What is the release process for Redpanda? How do they ensure that a new version isn’t broken?

52:00 - What have we learnt about the state of Kafka and the use of streaming tools?

Transcript

Utsav [00:00]: Welcome, Alex, to an episode of the Software at Scale podcast. Alex is the CEO and co-founder of Vectorized, which is a company that provides a product called Redpanda. You can correct me if I'm wrong, but Redpanda is basically a Kafka replacement that is a hundred times faster, or at least significantly faster, than Kafka itself. I'm super fascinated to learn about why you're building it; I understand the motivation behind building Redpanda, but I want to hear what got you into it and what you learned in the process. Thank you for being here.

Alexander “Alex” Gallego: Yeah, thanks for having me. A pleasure being here. This is always so fun; I get a chance to talk about the mechanical implementation. To a large extent, to this day, and this thing is pretty large now, I still get to do a little bit of code review of the business, and it's always fun to get into the details. Yeah, I've been in streaming for a really long time, just data streaming, for like 12 years now, and the way I got into it was through a startup in New York. I was doing my PhD in crypto. I dropped out and went to work for this guy on a startup called Yieldmo. The name probably doesn't mean anything to you, but it was an ad tech company that competed against Google in the mobile market space, and that's really how I got introduced to it. The name of the game back then was to use Kafka and Apache Storm and ZooKeeper, and honestly it was really hard to debug. So I think I experienced the entire life cycle of Kafka, from maybe the 0.7 or 0.8 release back in 2011 or something like that, all the way until my previous job at Akamai, where I was a principal engineer and I was just measuring latency and throughput. So I've sort of seen the whole evolution of Kafka, back when it required ZooKeeper and all of these things. I guess the history of how we got here is: at first we were optimizing ads, then we were using Storm and so on, and then Mesos started to come about. And I was like, oh, Mesos is really cool, that's what the future of streaming is going to be built on.

You were going to have an orchestrator and then something that was going to schedule the containers. In retrospect, now that Apache Mesos got archived, it got pushed to the Apache Attic a couple of weeks ago, we chose the wrong technology. I think it was the right choice at the time: Kubernetes was barely working on a couple of hundred nodes and Mesos was proven at scale, so it just seemed like the right choice. What we focused on, though, is streaming. Streaming is a technology that helps you extract the value of the now, or in general deal with time-sensitive data. A trade, fraud detection, Uber Eats are all really good examples of things that need to happen relatively quickly, like in the now, and so streaming systems are the kind of technology designed to help you deal with that kind of complexity. So what Concord did was say, hey, we really liked the Apache Storm ideas. Back then, when it was a [Inaudible 3:17], Storm was really slow and it was really hard to debug this thing called Nimbus and the supervisors and everything, with stack traces across multiple languages, [Inaudible 3:27], Clojure and Java. And I was like, I need to figure out three standard libraries to debug this thing. And so we wrote Concord in C++ on top of Mesos, and that sat squarely on the compute side. Streaming is really where storage and compute come together, and then at the end of it you do something useful. You say, hey, this credit card transaction was a fraudulent claim. That's what I did for a long time. And, you know, long story short, I had been using Kafka really as a storage medium. I personally couldn't get enough performance out of it with the right safety mechanics. So in 2017, I did an experiment where I took two optimized edge computers, literally with a wire back to back between them. So no rack latency, nothing. It's just an SFP+ wire connected back to back between these two computers, and I measured it. Let me start up a Concord server, I think maybe 2.4 or 2.1 at the time, a Concord server and a Concord client, and measure what it can drive this hardware to for 10 minutes, both in latency and throughput. Then let me tear that down and write a C++ program that bypasses the kernel and bypasses the page cache at the storage level too, and see what the hardware is actually capable of. I just wanted to understand what the gap is. Where does this accidental complexity come from? How much performance are we leaving on the table?

[5:00] The first implementation was a 34X tail-latency performance improvement, and I was just floored. I took two weeks and I was comparing the bytes of the data on disk, just making sure that the experiment actually worked. And honestly, that was the experiment that got me thinking, for a long time, that hardware is so fundamentally different from how hardware was a decade or more ago, when Kafka and Pulsar and all these other streaming technologies were invented. If you look at it, the Linux scheduler's block and IO algorithms, basically the thing where you send data to a file and Linux organizes it safely for optimal writes and reads, are fundamentally different. They were designed for effectively millisecond-level latencies, and the new disks are designed for microsecond-level latencies. So there's a huge gap in performance. Not to mention that now you can rent 220 cores on a VM on Google. You can rent a terabyte of RAM.

So the question is, what could you do differently with this new [inaudible 6:11]? It's so different. It's like a totally different computing paradigm. And I know that [Inaudible 6:19] has coined the term "there's no free lunch": you basically have to architect something from scratch for the new bottleneck, and the new bottleneck is the CPU. The latencies of [Inaudible 6:30] are so good, and the same thing with the network and all these other peripheral devices, that the new bottleneck is actually the coordination of work across the 220-core machine. The future is not cores getting faster; the future is getting more and more cores. And so the bottlenecks are in the coordination of work in the CPUs. And so we rewrote this thing in C++, and that's maybe a really long-winded way of saying how we got here.

Utsav: That is fascinating. So maybe you can explain a little bit about where someone would deploy Kafka initially? You mentioned fraud, and that makes sense to me; you have a lot of data and you need to stream it. But the example that was, I guess, a little surprising was Uber Eats. How would you put Kafka into Uber Eats? Where would that fit in the pipeline?

Alex: Great question. Let me actually give you a general sense and then we can talk about that case. Event streaming has been a thing for a really long time; people have been doing it in trading systems, but in the modern stack it's called event streaming. And what is an event? An event is what happens when you contextualize data. So you have some data, let's say a yellow t-shirt, or like a green sweater, the one I'm wearing today. That's just data; it doesn't mean anything. But now if I say I bought this green t-shirt with my Visa credit card, and I bought it from, let's say, a Korean seller through www.amazon.com, and I start to add all of this context, that makes an event. There's a lot of richness there. Implicitly, there's also a lot of time attached to that transaction. If I buy it today, there's this immutability about these facts. And so event streaming is this new way of thinking about your architecture as immutable, contextualized pieces of data that flow through your systems.

And so in the case of Uber Eats, for example: when I go into my Uber Eats app and I select my favorite Thai restaurant, and I say, hey, get me number 47, I can't pronounce it, but I know it's the item I always get from the Thai restaurant around the corner, it's like Chinese broccoli. And so it's immutable that I paid $10 for it. It is immutable that the restaurant got the order 30 seconds later. And so you start to produce this chain of events, and you can reason about your business logic as, effectively, a function application over a series of immutable events. It's kind of like functional programming at the architectural level.

And why is that powerful? That's powerful because you can now actually understand how you make decisions. And so, to go back to the case of fraud detection, it's really useful, and not in making the decision. You can just create a little microservice in Node or in Python, it doesn't matter, and you just say, hey, if whatever is on the credit card is above $10,000, it's probably fraudulent for buying Thai food. That's not the interesting part. The interesting part is that it's been recorded in an ordered fashion, so that you can always make sense of that data and you can always retrieve it.

[10:01]: So there are these properties that Kafka brought to the architecture: one, durability, this data actually lives on disk; two, it's highly available, if you crash one computer the data is going to live on the other two computers, so you can always get it back; and three, it's replayable, if the compute crashes, you can resume the compute from the previous iterator. And so I think those were the properties that the enterprise architects started to understand and see. They're like, oh, it's not just for finance and trading, it works across almost any industry. Today we have customers, even us, and we're a relatively young company, in oil and gas, measuring the jitter on oil and gas pipelines, where you have these little Raspberry Pi-looking things. And the point of this Kafka pipeline, which was later replaced with Redpanda, was just to measure how much jitter there is on the pipeline and whether we should turn it off. It's really cool.
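To make those three properties concrete, here is a minimal, hypothetical sketch of an append-only, offset-addressed log (not Kafka's or Redpanda's actual storage code): producers append records and get back an integer offset, and a consumer that crashes can resume replay from the last offset it remembers.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Minimal in-memory append-only log. A real broker persists segments to disk
// and replicates them to other machines; this only illustrates the model.
class EventLog {
public:
    // Append a record and return its offset (its position in the log).
    uint64_t append(std::string record) {
        records_.push_back(std::move(record));
        return records_.size() - 1;
    }

    // Replay every record at or after `offset`. A crashed consumer resumes by
    // remembering the last offset it finished processing.
    template <typename Fn>
    uint64_t replay_from(uint64_t offset, Fn&& fn) const {
        for (uint64_t o = offset; o < records_.size(); ++o) {
            fn(o, records_[o]);
        }
        return records_.size(); // the next offset to resume from
    }

private:
    std::vector<std::string> records_;
};

int main() {
    EventLog log;
    log.append(R"({"item":"green sweater","card":"visa"})");
    log.append(R"({"item":"thai #47","paid_usd":10})");

    // A consumer that had processed up to offset 1 resumes from offset 1.
    uint64_t next = 1;
    next = log.replay_from(next, [](uint64_t o, const std::string& r) {
        std::cout << "offset " << o << ": " << r << "\n";
    });
    std::cout << "next offset to read: " << next << "\n";
}
```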

We've seen it in healthcare, where people are actually analyzing patient records and patient data, and they want to use new technologies like Spark ML or TensorFlow and connect them to real-time streaming. For COVID, for example, we're talking with a hospital in Texas; they wanted to measure their COVID vaccines in real time and alert on things for all sorts of suppliers. We've seen people in the food industry, and in sports betting, which is huge outside of the United States. To me, it feels like streaming is at the stage where databases were in the seventies. Before that, people were writing to flat files, and that was the database: every customer gets a flat file, you read it, and every time you need to change it, you just rewrite the entire customer data. That's kind of a pseudo database, but then databases gave users a higher level of abstraction and a modeling technique. And to some extent, that's what Kafka has done for the developer. It's like: use this pluggable system, with those three properties, as the new way to model your infrastructure as an immutable sequence of events that you can reproduce, that you can consume, that is highly available. So I think those were the benefits of switching to an architecture like this.

Utsav:

Those customers are super interesting to hear about, and that makes sense, like IoT and all of that. So maybe one more question that would come to a lot of people's minds is: at what point should you stop using something like SQS? It seems like it provides a lot of that similar functionality, just that it'll be much more expensive and you don't get to see the bare-bones stuff, but Amazon supports it for you. So why do customers stop using SQS or something like that and start using Kafka or Redpanda directly?

Alex: Yeah. So here's the history. Kafka is 11 years old now, and the value to developers of Kafka is not Kafka the system, but the millions of lines of code that they didn't have to write to connect to other downstream systems. The value is in the ecosystem; the value is not in the system. And so when you think about that, let's think about Kafka as two parts: Kafka the API and Kafka the system. People have a really challenging time operating Kafka the system with ZooKeeper. I know there might be some listeners that are thrilled, like, oh, KIP-500 was released, and we could talk about what KRaft means and the ZooKeeper removal later. But anyway, if you look at Kafka, it's two things: the API and the system. The reason someone would want to use the Kafka API, which by the way Pulsar has also started supporting, it's not just Redpanda, it's really a trend in data streaming systems, is that you can take Spark ML and TensorFlow and [inaudible 00:14:04] and all of these databases, and it just plugs right in, and you didn't write a single line of code.

Alex: You start to think about these systems as Lego pieces. Of course, for your business logic you have to write the code, that's what people get paid for, but for all of these other databases, all of these downstream systems, whether you're sending data to Datadog for alerting, or Google BigQuery or Amazon Redshift, or [Inaudible 00:14:32], or any of these databases, there's already this huge ecosystem of code that works, that you don't have to maintain because there are people already maintaining it, and so you're just plugging into this ecosystem. So I would say the largest motivation for someone to move away from a non-Kafka-API system, whether that's Google Cloud Pub/Sub or Azure Event Hubs, and there's like a hundred of them, is the ecosystem.

[15:00] And realizing that, very quickly, it's about the community. I think Redpanda, for example, makes the Kafka community bigger and better. We start to expand the uses of the Kafka API into embedded use cases. For example, there's this security appliance company that actually embeds Redpanda, because we're C++ with a super small footprint; they embed us as a process to do intrusion detection. So every time they see an intrusion into the network, they just write a bunch of data to disk, but it's through the Kafka API, and in the cloud it's the same code; it's just that one local Redpanda instance is the collector. And so I think for people considering other systems, whether it's SQS or [Inaudible 00:15:46] or Pub/Sub or Azure Event Hubs: first of all, there are specific trade-offs that you need to dig into in detail, but at the architectural level, plugging into this ecosystem of the Kafka API is so important. You get to leverage the last 10 years of work that you didn't have to do, and it takes you five seconds to connect to these other systems.
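As a small illustration of the "value is in the ecosystem" point: because Redpanda speaks the Kafka protocol, an existing Kafka client can be pointed at it without code changes. Below is a hedged sketch using librdkafka's C++ client; the broker address, topic name, and payload are assumptions for the example, not anything from the conversation.

```cpp
#include <iostream>
#include <string>
#include <librdkafka/rdkafkacpp.h>

int main() {
    std::string errstr;

    // Point any standard Kafka client at a Redpanda broker; nothing else changes.
    RdKafka::Conf* conf = RdKafka::Conf::create(RdKafka::Conf::CONF_GLOBAL);
    if (conf->set("bootstrap.servers", "localhost:9092", errstr) != RdKafka::Conf::CONF_OK) {
        std::cerr << errstr << "\n";
        return 1;
    }

    RdKafka::Producer* producer = RdKafka::Producer::create(conf, errstr);
    if (!producer) {
        std::cerr << "failed to create producer: " << errstr << "\n";
        return 1;
    }

    std::string payload = R"({"order":"thai #47","paid_usd":10})";
    RdKafka::ErrorCode err = producer->produce(
        "orders",                          // topic name (assumed for the example)
        RdKafka::Topic::PARTITION_UA,      // let the partitioner choose
        RdKafka::Producer::RK_MSG_COPY,    // copy the payload into the library
        const_cast<char*>(payload.data()), payload.size(),
        nullptr, 0,                        // no key
        0,                                 // 0 timestamp: use the current time
        nullptr);                          // no opaque pointer
    if (err != RdKafka::ERR_NO_ERROR) {
        std::cerr << "produce failed: " << RdKafka::err2str(err) << "\n";
    }

    producer->flush(5000); // wait up to 5s for delivery callbacks
    delete producer;
    delete conf;
}
```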

Utsav:

That is really fascinating. I guess large companies have like a million things they want to integrate with: open source things and all of these new databases, like Materialize, all of that. So Kafka is kind of like the REST API, in a sense.

Alex:

I think it's become the new network, to some extent. I mean, people joke about this. Think about it: if you had an appliance that could keep up with the throughput and latency of your network, but gave you auditability, access control, and replayability, why not? I think some of our more cutting-edge users are using Redpanda as the new network, and they needed the performance that Redpanda brought to the Kafka API ecosystem to enable that kind of use case, where every message gets sent to Redpanda. It could keep up, it could saturate hardware, and now they get this tracing and auditability; they can go back in time. So you're right, it's almost like the new REST API for microservices.

Utsav: Yeah. What is it about Kafka that makes it slow? From an outsider's perspective, it seems like when a code base gets more and more features contributed by hundreds of people over a long time span, there are just so many ifs and elses and checks, and this and that, that tend to bloat the API surface and also slow things down. And then somebody tries to profile and improve things incrementally. But could you maybe walk me through what you have learned by looking at the code base, and why you think it's slow? One thing you mentioned was that you do kernel bypass and skip all of the overhead there, but is there anything inherent about Kafka itself that makes it really slow?

Alex: Yeah. So it's slow comparatively speaking, and we spent 400 hours benchmarking before we came out, so I have a lot of detail on this particular subject. Let me step back and think. An expert could probably tune Kafka to get much better performance than most people, but most people don't have 400 hours to benchmark different settings of Kafka. Kafka is multimodal in performance, and I can dig into that a little bit. But assume that you're an expert and that you're going to spend the time to tune Kafka for your particular workload, which, by the way, changes depending on throughput. The performance characteristics of running sustained workloads on Kafka actually vary, and therefore your threading model varies: the number of threads for your network, the number of threads for your disk, the number of threads for your background workloads, and the amount of memory. I think this [Inaudible 00:18:51] of tuning Kafka is really the most daunting task for an engineer, because it is impossible, in my opinion, to ask an engineer who doesn't know the internals of Kafka, unless they go and read the code, to understand what the relationship is between my IO threads and my disk threads and my background workloads, and how much memory I should reserve for this versus how much memory I should reserve for that. There are all of these trade-offs that do matter as soon as you start to hit some form of saturation.

So let me give the details on the parts where we improve performance, which is specifically the tail latency, and why that matters for a messaging system, and the throughput. By and large, Kafka can't drive hardware to similar throughput as Redpanda. Redpanda always drives at least as much as Kafka, and in some cases, which we highlight in the blog [Inaudible 00:19:51], we're a little better, let's say 30% or 40% better. The actual improvement in performance is in the tail of the latency distribution.

[20:00] Why does that matter? I'm just going to focus on what I think Redpanda brings to the market rather than the negatives of Kafka, because I think we are built on the shoulders of Kafka. If Kafka didn't exist, we wouldn't have gotten to learn the improvements or understand the nuances of, oh, maybe I should do this a little differently. So on the tail-latency performance improvement: latency, and I've said this a few times, is the sum of all your bad decisions. That's just what happens at the user level. When you send a request to the microservice that you wrote, you're just like, oh, should I have used a different data structure? There's no cache locality, etc. And so what we focused on is: how do we give people predictable tail latency? And it turns out that for all of our users, predictable tail latency often results in something like a 5X hardware reduction. So let me make concrete all of this performance improvement, where we are better and how that materializes for users. We paid a lot of attention to detail. That means we spent a ton of engineering time and effort and money on benchmarking and the test suite, making sure that once you get to a particular latency, it doesn't spike around. It's stable, because you need that kind of predictability. Let me give you a mental model of how you could achieve really good average latency and terrible tail latencies.

Let's say that you have a terabyte of heap and you just write to memory, and every 10 minutes you flush a terabyte. So every 10 minutes you get one request that is like five minutes long, because you have to flush the terabyte to disk, and otherwise the system looks good. What people need to understand is that you start to hit those tail-latency spikes that Kafka has the more messages you put into the system, and because it is a messaging system, many of your users are therefore going to experience the tail latency. So we said, how can we improve this for users? And so, in March, we said, let's rethink this from scratch, and that really had a fundamental impact. Now we don't use a lot of the Linux kernel facilities. There are global locks that happen in the Linux kernel when you touch a global object, for example the page cache. And I actually think it's the right decision for the page cache to be global, because if you look at the code, there are a ton of edge cases and things that we had to optimize for just to make sure that it even worked, and then a lot more to make sure that it worked fast. So it was a lot of engineering effort that we didn't know was going to pay off, to be honest, and it happened to pay off. We just believed that we could do better with modern hardware. And so we don't have these global locks at the low level on the Linux kernel objects, and because we don't use the global resources, we've partitioned the memory across every individual core. So memory allocations are local. You don't have this massive global garbage collection that has to traverse terabyte heaps; you have these localized little memory arenas.

It's kind of like taking a 96-core computer and creating a mental model of 96 little computers inside that 96-core computer, and then structuring the Kafka API on top of that. Because, again, remember that the new bottleneck in computing is the CPU, and it's about rethinking the architecture to maximize and really extract the value out of hardware. My philosophy is that the hardware is so capable, the software should be able to drive the hardware at saturation at all points. If you're not driving hardware to saturation on throughput, then you should be driving it at basically the lowest latency that you can. And these things need to be predictable, because when you build an application, you don't want to say, oh, let me think about my tail latency for this and that; most of the time I need five computers, but the other 10% of the time I need 150 computers, so let's take an average of 70 or 75 computers. It's really hard to think about building applications when your underlying infrastructure is not predictable, and so that's really a big improvement. And then the last improvement on the Kafka API was that we only expose safe settings. We use Raft as the replication model, and I think that was a big improvement on the state of the art of streaming. Look at the actual implementations: Kafka's ISR replication model; Pulsar, which I think is primary-backup with some optimizations; versus our Raft implementation. We didn't invent our own protocol.

[25:00] So there's a mathematical proof of the replication protocol. But you also understand, as a programmer: oh, this is what I'm used to, this is what it means to have three or five replicas of the data. So it's all of this context. That was a long-winded answer, but you asked about such a critical thing that I had to be very specific, just to make sure I don't leave room for ambiguity, or at least try not to.
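To picture the "96 little computers inside one computer" idea in a very reduced form, here is a hypothetical C++ sketch of a shared-nothing, thread-per-core layout: each worker owns its slice of memory and a mailbox, and touching another core's data happens only through an explicit message. A real engine pins threads to cores and uses lock-free queues; the mutex-protected mailbox below just keeps the sketch short.

```cpp
#include <algorithm>
#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

// One "shard" per core: it owns its memory arena and a mailbox of tasks.
// Nothing is shared implicitly; cross-shard work is an explicit message.
struct Shard {
    std::vector<char> arena;                    // core-local memory slice
    std::deque<std::function<void()>> mailbox;  // cross-core requests
    std::mutex mbox_lock;                       // real engines use lock-free SPSC queues
};

int main() {
    const unsigned ncores = std::max(2u, std::thread::hardware_concurrency());
    std::vector<Shard> shards(ncores);
    for (auto& s : shards) s.arena.resize(1 << 20); // pretend: 1 MiB per core

    // Explicitly ask shard 1 to touch *its* data instead of touching it here.
    auto submit_to = [&](unsigned shard, std::function<void()> fn) {
        std::lock_guard<std::mutex> g(shards[shard].mbox_lock);
        shards[shard].mailbox.push_back(std::move(fn));
    };
    submit_to(1, [&] { shards[1].arena[0] = 42; });

    std::vector<std::thread> workers;
    for (unsigned core = 0; core < ncores; ++core) {
        workers.emplace_back([&, core] {
            // Each worker drains only its own mailbox and its own arena.
            std::lock_guard<std::mutex> g(shards[core].mbox_lock);
            while (!shards[core].mailbox.empty()) {
                shards[core].mailbox.front()();
                shards[core].mailbox.pop_front();
            }
        });
    }
    for (auto& w : workers) w.join();
    std::cout << "shard 1 arena[0] = " << int(shards[1].arena[0]) << "\n";
}
```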

Utsav: Yeah. Can you explain why it is important to partition memory per core? What happens when you don't? One thing you mentioned was the garbage collection that has to go through everything. What exactly is wrong with that? Can you elaborate on that?

Alex: Yeah. So there's nothing wrong, and everything works. It's the trade-offs that we want to optimize for: basically, making it cost-efficient for people to actually use data streaming. To me, it feels like streaming is in that weird space [Inaudible 00:25:57] a few years ago, where there's all this money being put into it, but very few people actually get value out of it. Why is this thing so expensive to run, and how do we bring it to the masses so that it is not so massively expensive? Basically, anyone that has run another streaming system that is in the [Inaudible 00:26:17] always has to over-provision, because they just don't understand the performance characteristics of the system.

So let me talk about the memory partitioning. For modern computers, the new trend is an increase in core count; the clock frequency of the CPU is not going to improve. Here's the tricky part, where it gets very detailed: even a single CPU still got faster, even if the clock frequency didn't improve. You're just like, how is this possible? It improved through very low-level things like instruction prefetching and pipelined execution; there are all of these tricks at the lowest level of instruction execution. Even though the clock frequency of the CPU wasn't getting faster, that made for something like a 2X performance improvement, or maybe 3X, over the last 10 years. But now the larger trend in computing is getting more core counts. My desktop has 64 physical cores; it's like the Ryzen 3900 series.

In the data center, there's also this weird trend, which I actually don't think the industry has settled on, where even on a single motherboard you have two sockets. When you have two sockets, you have this thing called NUMA memory access and NUMA domains, which means every socket has local memory that it can access and allocate with low latency, as if it were one computer, but it can also leverage remote memory from the other socket's memory. And so when you rent a cloud computer, you would want to understand what kind of hardware it is. To some extent you're paying for that virtualization, and most people are running in the cloud these days. So why is pinning the memory to a particular thread important? It matters because, like I mentioned, latency is the sum of all your bad decisions. And so what we did is we said, okay, let's take all of the memory for this particular machine, and I'll give you an opinionated view on it: if you're running this at really large scale, I'm going to say the optimal production setting is two gigabytes per core. That's what we recommend for Redpanda. You can run it on like 130 megabytes if you want to, for very low-volume use cases, but if you're really aiming to go ham on that hardware, those are the memory recommendations. So why is that important? When Redpanda starts up, it says: I'm going to start one pthread for every core, and that gives me, the programmer, a concurrency and parallelism model.

So within each core, when I'm writing code in C++, I code to it like it is a concurrent abstraction, but the parallelism is a free variable that gets executed on the physical hardware. Where the memory comes in is that we split the memory evenly across all the cores. So let's say you have a computer with 10 cores. We take all the memory, we sum it up, we subtract like 10%, and then we split it by 10. And then we do something even more interesting: we go and we ask the hardware, hey, hardware, tell me, for this core, what is the memory bank that belongs to it in this NUMA domain, in this CPU socket? What is the memory that belongs to this CPU socket? And the hardware is going to tell you, based on the motherboard configuration, this is the memory that belongs to this particular core. And then we tell the Linux kernel, hey, allocate this memory.

[30:00] And pin it on this particular core and lock it, so don't give it to anybody else. And then this thread preallocates that as a single byte array, and so now what you've done is you've eliminated all forms of implicit cross-core communication, because that thread will only allocate memory on that particular core, unless the programmer explicitly tells the computer to allocate memory on a remote core. And so it's a relatively onerous system to get your hands on, but you're effectively programming an actor model. So what does that mean for a user? Let me give you the real impact. We were working with a big Fortune 1000 company; they took a 35-node Kafka cluster and we brought it down to seven.

All of these little improvements matter, because at the end of the day, if you get a 5.5X performance improvement, a hardware cost reduction, a 1600% performance improvement, all of these things add up. There's a blog post we wrote a month ago where we talk about some of the other mechanical-sympathy techniques that we use to ensure that we give low latency to the Kafka API. So that was a long-winded way of explaining, at the lowest level, why it matters how you allocate memory. It all boils down to the things that we were optimizing for: saturating hardware, so streaming is affordable for a lot of people, and making it low latency, so that you enable new use cases like that oil and gas pipeline, for example. And yeah, that's one of the really deep [Inaudible 00:31:40]. I'm happy to compare our pool algorithms and how they're different, but that's how we think about building software.
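Here is a hedged sketch of the low-level mechanics described above: pin a thread to a core, ask which NUMA node that core belongs to, then allocate and lock memory from that node's local bank. It uses Linux's libnuma and pthread affinity APIs directly and is only an illustration of the idea, not Redpanda's allocator.

```cpp
// Build with: g++ -O2 numa_pin.cpp -lnuma -lpthread   (Linux only)
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <numa.h>       // numa_available, numa_node_of_cpu, numa_alloc_onnode
#include <pthread.h>
#include <sched.h>      // CPU_ZERO, CPU_SET
#include <sys/mman.h>   // mlock
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }

    const int core = 0;
    // 1. Pin this thread to one core, like a thread-per-core runtime does.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    // 2. Ask which memory bank (NUMA node) this core belongs to.
    int node = numa_node_of_cpu(core);

    // 3. Allocate this core's memory slice from its local node and lock it,
    //    so allocations on this thread stay NUMA-local (production would use
    //    something like 2 GiB per core; a small size keeps the example cheap).
    size_t bytes = 64UL << 20; // 64 MiB; mlock may need a raised RLIMIT_MEMLOCK
    void* arena = numa_alloc_onnode(bytes, node);
    if (arena && mlock(arena, bytes) == 0) {
        std::printf("core %d -> NUMA node %d, %zu bytes pinned locally\n",
                    core, node, bytes);
    }
    if (arena) numa_free(arena, bytes);
    return 0;
}
```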

Utsav: I wanted to know: what is the latency difference when you use memory from your own NUMA node versus when you try to access the remote memory? How much faster is it to just stay on your own core, I guess?

Alex: It's faster relative to the other. I think the right question to ask is: what is the latency of crossing that NUMA boundary in relation to the other things that you have to do on the computer? If you have one thing to do, which is that you just need to allocate more memory on that core, it'll be plenty fast. But if you're trying to saturate hardware, like when you're trying to do this with Kafka, then let me give you an order-of-magnitude comparison. It's a few microseconds to cross the boundary and allocate memory from another core; that's just from some experiments I did last year, crossing the NUMA boundary and allocating memory.

But let me put that in perspective with writing a single page to disk using NVMe with [Inaudible 00:32:55] bypass. You could write a page to an NVMe device, assuming no 3D XPoint technology, just regular NVMe on your laptop, in single- to double-digit microseconds. So when you say that a memory allocation is in the microsecond space, you're just like, well, that's really expensive in comparison with actually doing useful work, like writing this page to disk. I think it's really hard for humans to understand latency unless we work in the low-latency space or have an intuition for what the computer can actually do, or for the type of work that you're trying to do in that particular case. It's really hard to judge, but it adds up. If you have contention, then it gets into human-scale time, depending on how contended the resources are. Let's say you're trying to allocate from a particular memory bank on a remote NUMA node: if there is no memory, then you have to wait for a page fault and for stuff to get written to the disk. These things just add up. It's really hard to give intuition, but I think the better intuition is: compare it with the other useful things that you need to be doing. The useful thing about a computer is to do something useful for the business, like detecting fraud, or planning an Uber ride to your house, or doing all of these things. Crossing that boundary is really expensive in comparison with the things that you could actually be using the computer for.

Utsav: That makes sense to me. And is there any other fancy stuff you do, in terms of networking, for example? Because recently I've heard that even NICs and everything are getting extremely fast and there's a lot of overhead in software. Does the kernel not expose those things?

Alex: I think this is such an exciting time to be in computing. Let me tell you a couple of things that we do.

[35:00] So we actually expose some of the latency that the disk is propagating. Let's say you're writing to disk, and over time you start to learn that the device is getting busier. The latencies were one to eight microseconds to write a page, and now they're more like thirty to a hundred microseconds to write a page, because there's a little bit of contention, a little bit of queuing, and you start to learn that. At some point there are some thresholds that we get to monitor, because we don't use the Linux kernel page cache. So we get to monitor this latency and propagate those latencies to the application level, to the Raft replication model. Which is very cool when you co-design a replication model with the mechanical execution model, because it means that you're passing implementation details on purpose up to the application level, so you can do application-level optimizations.

One of those optimizations is reducing the number of flushes. In Raft, in order to acknowledge a write, you write the page and then you flush the data to disk. But we can do it with adaptive batching, so that we write a bunch and then we issue a single flush instead of, say, five flushes. That's one thing. The second thing is that this latency headroom gives you a new compute model. We added WebAssembly. We actually took the V8 engine, and there are two modes of execution for that V8 engine currently: one as a sidecar process and one inline in the process. Every core gets a V8 isolate, which is like the Node.js engine. Inside of a V8 isolate there is a thing called a V8 context; let me just go into the terminology for a second, because there are a lot of terms.

It means that you can attach a JavaScript file; inside is a V8 context, and in fact a context can execute multiple JavaScript files. So why does this matter? Because Redpanda becomes programmable storage for the user. Think of the Transformers, when Optimus Prime unites with the other robots and they make a bigger robot, that kind of thing. It's the fact that you can ship code to the storage engine and change the characteristics of the storage engine. So now you are streaming data, but because we're so fast, we're like, oh, now we have a latency budget, so we can do new things with that latency budget. We can introduce a compute model that allows the storage engine to do inline transformations of the data. So let's say you sent a JSON object and you want to remove the social security number for GDPR compliance or HIPAA compliance or whatever it is. Then you could just ship a JavaScript function and it will do an inline transformation of that, or it will just mask it, for performance, so you don't reallocate, you just write "xxx" over it and pass it along. But now you get to program what gets stored at the storage level. You're not ping-ponging your data between Redpanda and other systems; you're executing there, to some extent. You're really just raising the level of abstraction of the storage system and the streaming system to do things that you couldn't do before, like inline filtering, inline execution of masking, simple enrichments. Simple things that are actually really challenging to do outside of this model.

And so you insert the computation model where now you can ship code to the data and it executes inline. One of the more interesting things is actually exposing that WebAssembly engine, which is just V8, to our end users. So as an end user, in addition to the Kafka API, you have rpk, this command-line tool that we ship, and you do a wasm deploy and give it a JavaScript file. You tell it the input source and the output source, and that's it. The engine is in charge of executing that little JavaScript function for every message that goes in. So I think this is the kind of impact that being fast gives you: you now have computational efficiencies that allow you to do different things that you couldn't do before.
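What such an inline transform does per record is easy to picture even without the V8 machinery. Below is a hypothetical C++ sketch of the masking idea: the engine calls a user-supplied function for every record flowing from an input to an output topic, and the function overwrites a sensitive field in place with 'x' characters instead of reallocating. In Redpanda the user actually ships a JavaScript function via rpk; this only shows the shape of the per-record hook.

```cpp
#include <iostream>
#include <string>
#include <vector>

// A per-record transform hook: take one record, return the record to emit.
using transform_fn = std::string (*)(std::string);

// Naive masking of a `"ssn":"..."` field in a JSON-ish record: overwrite the
// value with 'x' in place rather than rebuilding the string.
std::string mask_ssn(std::string record) {
    const std::string key = "\"ssn\":\"";
    size_t start = record.find(key);
    if (start == std::string::npos) return record;
    start += key.size();
    size_t end = record.find('"', start);
    if (end == std::string::npos) end = record.size();
    for (size_t i = start; i < end; ++i) record[i] = 'x';
    return record;
}

// The "engine": apply the deployed transform to every record flowing from
// the input topic to the output topic.
std::vector<std::string> run_transform(const std::vector<std::string>& input,
                                       transform_fn fn) {
    std::vector<std::string> output;
    output.reserve(input.size());
    for (const auto& rec : input) output.push_back(fn(rec));
    return output;
}

int main() {
    std::vector<std::string> in = {
        R"({"name":"alex","ssn":"123-45-6789","order":"thai #47"})"
    };
    for (const auto& rec : run_transform(in, mask_ssn)) std::cout << rec << "\n";
    // Prints: {"name":"alex","ssn":"xxxxxxxxxxx","order":"thai #47"}
}
```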

Utsav: That's interesting. One thing you mentioned was HIPAA compliance, or getting rid of information. What are some use cases you can talk about publicly that you've seen and just would not expect? Where you were like, wow, I can't believe that is what this thing is being used for.

Alex: Yeah, let me think. Well, you know, one of them is IP credit scores. Why that's interesting is not as a single step, but as an aggregate of steps; it's really critical. So let me just try to frame it in a way that doesn't [Inaudible 00:39:50] the customers, but we have a massive, internet-scale customer.

[40:00] They are trying to give every one of their users profiling information that is anonymous, which is kind of wild. So every IP gets credit-score information. Let's say in America you have credit scores from like 250 to 800, or maybe 820. You give the same kind of number, like a credit score, to every IP and you monitor that over time, but now they can push that credit score [Inaudible 00:40:27] inside that IP, and then you can potentially make a trade on that. And so there are all of these weird market dynamics. Let me give you this example: let's say you watch the Twitter feed and you're just like, oh, what is the metadata that I can associate with this particular tweet, and can I make a trade on that? So it's a really wild thing to do. And then the last one that we're focusing on is this user who is actually extending the WebAssembly protocol, because it's all on GitHub. So you could literally download Redpanda and then swap our WebAssembly engine for your own WebAssembly engine. He's actually spinning up very tall computers that are tall both in terms of CPU, in terms of memory, and in terms of GPU. He has a [Inaudible 00:41:19] job running on the GPU, and every time a message comes in, it makes a local call to this sidecar process that is running this machine learning thing to say, you know, should I proceed or should I not proceed with this particular decision? Those are the things that we just didn't plan for. I was like, wow, when you expand the possibilities and give people a new computing platform, everyone will use it. It's actually not a competing platform; it enriches how people think about their data infrastructure, because Spark and Apache Flink all continue to work. The Kafka API continues to work; we're just simply expanding that with this WebAssembly engine. Yeah.

Utsav: I think it's fascinating. Let's say that you want to build a competitor to Google today; just the number of machines that they have for running search is very high. Not that you'd easily be able to build a competitor, but at least using something like this would make so much of your infrastructure cheaper that it becomes possible to do more things. That's the way I'm thinking about it.

Alex: Yeah. We're actually in talks with multiple database companies that we can't name, but what's interesting is that there are multiple. We are actually both their real-time engine and their historical data store. The Kafka API, and of course Redpanda, gives the programmer a way to address, by integer, every single message that has been written to the log. It gives you total addressability of the log, which is really powerful. Why? If you're a database, imagine that each one of these addresses is just like a page on a file system, an immutable page. So these database vendors are streaming data through Redpanda, and it gets written as a Kafka batch, and then we push that data to S3. We can transparently hydrate those pages and give them back. So they've actually built an index on top of their own index that allows them to treat a Kafka batch as, effectively, the underlying way to fetch pages. It's almost like a page-fault mechanism, and it can also ingest real-time traffic. And so that's another really fascinating thing that we didn't think of until we started building this.
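A rough sketch of why integer addressability composes so well with tiered storage: if each uploaded object covers a contiguous offset range, finding the "page" that holds a given offset is a binary search over the segments' base offsets, after which the object can be hydrated from S3. The segment layout and object naming below are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Each tiered-storage object covers a contiguous offset range starting at
// base_offset; objects are kept sorted by base_offset.
struct Segment {
    uint64_t base_offset;
    std::string object_key; // e.g. an S3 key; the naming is hypothetical
};

// Find the segment whose range contains `offset`: the last segment whose
// base_offset is <= offset. This is the "page lookup" for the log.
const Segment* segment_for(const std::vector<Segment>& segs, uint64_t offset) {
    auto it = std::upper_bound(
        segs.begin(), segs.end(), offset,
        [](uint64_t off, const Segment& s) { return off < s.base_offset; });
    if (it == segs.begin()) return nullptr; // offset precedes the first segment
    return &*(--it);
}

int main() {
    std::vector<Segment> segs = {
        {0,    "s3://bucket/topic/0-999.log"},
        {1000, "s3://bucket/topic/1000-4999.log"},
        {5000, "s3://bucket/topic/5000-.log"},
    };
    uint64_t offset = 1234;
    if (const Segment* s = segment_for(segs, offset)) {
        // A database built on top would now hydrate this object (like a page
        // fault) and read records forward until it reaches `offset`.
        std::cout << "offset " << offset << " -> " << s->object_key << "\n";
    }
}
```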

Utsav: Yeah. So then let me ask you, how do you get customers, right? You build this thing, it's amazing, and I'm assuming your customer is another principal engineer somewhere who's frustrated at how slow Kafka is, or at the operational costs. But my concern would be: how do we know that Redpanda is not going to lose data or something? Right now it's much easier because you have many more customers, but how do you bootstrap something like this, and how do you get people to trust your software and say, yes, I will replace Kafka with this?

Alex: Yeah. That's a really hard one to answer, because building trust is really hard as a company. I think that's one of the hardest things that we had to do in the beginning. Our earliest users were people that knew me as an individual engineer; they're friends of mine. I was like, I'm building this, would you be willing to give it a shot or try it? But that only lasts so long. So what we really had to do is actually test with Jepsen, which is basically a storage system hammer

[45:00] that just verifies that you're not bullshitting your customers. If you say you have a Raft implementation, then you'd better have a Raft implementation according to the spec. Kyle is a fantastic engineer, too. And so what we did is we were fortunate enough to hire Denis, who has been focused on correctness for a really long time. He came in and actually built an extended, internal Jepsen test suite, and we just tested it for like eight months.

It seems like a lot of progress, but you have to understand that we stayed quiet for two and a half years. We just didn't tell anyone, and in the meantime we were just testing the framework. The first Raft implementation, just to put it in perspective, took two and a half years to build into something that actually works and is scalable. This isn't overnight: it took two and a half years to build a Raft implementation, and now we get to build the future of streaming. And so the way we verify is really through running our extended Jepsen suite. We're also going to release, hopefully sometime later this year, an actual formal evaluation with external consultants. People trust us because the code is on GitHub. So it's not just a vendor saying they have all this cool stuff and underneath it's just a proxy for Kafka in Java. No, you can go look at it. A bunch of people have gone through the roughly 300,000 lines of C++ code, or maybe 200-something thousand. It's on GitHub and you can see the things that we do.

I invite all the listeners to go check it out and try it out, because you can verify these claims for yourself. We can't hide away from this; it's in the open. We use the [Inaudible 46:51] license, so everyone can download it. We only have one restriction, which says we are the only ones allowed to offer a hosted version of Redpanda. Other than that, you can run it. In fact, in four years it becomes Apache 2, so it's really only for the next four years, which I think is a good trade-off for us. But you get trust in multiple ways. One is that people know you and they're willing to take a bet on you, but that only lasts for your first customer or two. The other one is that you build a [Inaudible 47:24] empirically. That's an important thing: you prove the stuff that you're claiming.

It is on us to prove to the world that we're safe. The third one is that we didn't invent a replication protocol. ISR is a thing that Kafka invented; we just went with Raft. We said, we don't want to invent a new mechanism for replicating data; we want to simply write a mechanical execution of Raft that is super good. So it's relying on existing research, like Raft, and focusing on the things that we were good at, which was engineering and making things really fast. And then, over time, there was social proof: you get a couple of customers and they refer you, and then you start to push petabytes of traffic. I think it's a hundred and something terabytes of traffic per day with one of our customers. And at some point, if you have enough users, the system is always being exercised. I think we just stepped into that territory, where we have enough users that every part of the system is always being exercised. But still, we have to keep testing and make sure that every commit runs through the safety suites and we adhere to the claims.

Utsav: Yeah. Maybe you can expand a little bit on the release process. How do you know that a new version you ship isn't broken for customers?

Alex: Yeah. There is a lot to cover. So we have five different types of fault-injection frameworks. Five frameworks, not five kinds of test suites; five totally independent frameworks. One of them we call Punisher, and it's just random fault exploration; that's sort of the first level. Redpanda is always running on this one test cluster, and every 130 seconds a fatal fault is introduced into the system. Literally, something logs into the system, not manually but programmatically, and removes your data directory, and it has to recover from that. Or it [Inaudible 49:32] into the system and does a kill -9, or it sends an incorrect sequence of commands to create a topic, at the Kafka API level, and that's always running. And so that tells us that our system just doesn't [Inaudible 49:50] for any reason. The second thing is that we run a FUSE file system, so instead of writing to disk, we write to this virtual disk interface, and then we get to inject deterministic failures.

[50:08] The thing about fault injection is that it's when you combine three or five edge cases that things go south. I think through unit testing you can get pretty good coverage of the surface area, but it's when you combine, oh, let me get a slow disk and a faulty machine and a bad leader. And so we get to inject faults surgically: for this topic partition, we're going to inject a slow write and then we're going to accelerate the reads, and then you have to verify that there's a correct transfer of leadership.

There's another one, which is a programmatic fault-injection schedule, where we can terminate, we can delay, and we can do all these things. Then there's ducky, which is actually how Apache Kafka gets tested: there's this thing in Apache Kafka called ducktape, and it injects particular faults at the Kafka API level. So it's not just that we do fault injection internally for the safety of the system; at the user interface, users are actually getting the things that we say they are. And so we leverage the Kafka tests, because we're Kafka API compatible and they just work, to inject particular failures. We start off with three machines, we take librdkafka, we write like a gigabyte of data, and we crash the machines. We bring them back up and then we read the gigabyte of data back, and we start to complicate the experiments. So that's the fourth one, and I'm pretty sure I mentioned another one, but I know we have five. And so the release process is that every single commit to dev gets run for a really long time, and our CI is parallel. I think every merge to dev takes like five hours or something like that, but we parallelize it, so in human time it takes one hour; we just run like five different pipelines at a time. So that's how we do it; that's really the release process. It takes a long time, and of course, if something breaks, then we hold the release. And in addition to that, there are manual tests too, because there are things that we're still codifying into the chaos tests.
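For flavor, here is a deliberately simplified, hypothetical sketch of that first level of testing: a Punisher-style loop that, on a fixed cadence, picks a random fault (crash a broker, wipe a data directory, freeze a process) and then checks whether the cluster recovered. The shell commands, paths, and health check are assumptions, not Vectorized's actual tooling.

```cpp
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

int main() {
    // Each fault is just a shell command here; a real framework injects faults
    // programmatically and records exactly what it did and when.
    std::vector<const char*> faults = {
        "pkill -9 -f redpanda || true",                 // crash a broker
        "rm -rf /var/lib/redpanda/data/kafka || true",  // wipe a data dir (hypothetical path)
        "pkill -STOP -f redpanda || true",              // freeze a broker for a while
    };

    std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<size_t> pick(0, faults.size() - 1);

    for (int round = 0; round < 3; ++round) {
        const char* fault = faults[pick(rng)];
        std::cout << "[punisher] injecting: " << fault << "\n";
        std::system(fault);

        // Give the cluster time to notice and recover, then verify health.
        std::this_thread::sleep_for(std::chrono::seconds(130));
        std::system("pkill -CONT -f redpanda || true");          // undo a freeze if we did one
        int rc = std::system("rpk cluster info > /dev/null 2>&1"); // assumed health check
        std::cout << "[punisher] cluster " << (rc == 0 ? "recovered" : "NOT healthy") << "\n";
    }
}
```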

Utsav: I wonder if Kafka can use the frameworks that you've built. Maybe that will be an interesting day, when Kafka starts using some of that.

Alex: Some of the things are very specific to the company. To launch tests, we have an internal tool called veto, like Vectorized tools, after the name of the company. So we say veto, give me a cluster, and it just picks the binary from CI, deploys it into a cluster, and starts a particular test. It's specific to our setup. Otherwise, a lot of these tests, I think three out of five, are in the open; the things that are actually general, people can just go and look at, so that could help. But the other two, the ones that are more involved, are just internal.

Utsav: Okay. What is something surprising you've learned about the state of Kafka? You had an opinion when you started Vectorized - that streaming is the future and it's going to be used everywhere. And I'm sure you've learned a lot through talking to customers and actually deploying Redpanda out in the wild. So what is something surprising that you've learned?

Alex: I feel like I'm surprised almost every day. People are so interesting with the stuff that they're doing; it's very cool. Multiple surprises: there are business surprises, there are customer surprises. On the business side, I'm a principal engineer by trade; I was a CTO before, and this is the first time I'm being a CEO. It was really, I think, a lot of self-discovery: can I do this, and how the devil do I do it? So that's one. It was a lot of self-discovery because it started from a place of really wanting to scratch a personal itch. I wrote this storage engine; if you look at the commits, I think to date I'm still the largest committer in the repo, and I'm the CEO, and that's obviously just through history. It's because I wrote this storage engine, with a page cache bypass and the first allocator and the first compaction strategy, because I wanted this at a technical level. Then it turns out that when I started interviewing people, and we interviewed tens of companies, maybe less than a hundred but definitely more than 50, somewhere in between, people were having the same problems that I was having personally.

[55:02] This went from small companies to medium companies to large companies; everyone was just struggling with operationalizing Kafka. It takes a lot of expertise, or money, or both money and talent, which costs money, just to get the thing stable. And I was like, we could do better. And the fact that that was possible, even though Kafka has been around for 11 years, shocked me. I was like, wow, there's this huge opportunity: I'll make it simple. And you know what the interesting part about that is? The JavaScript and Python ecosystems love us, because they're used to running Node.js and NGINX and these little processes, I mean light in terms of footprint but sophisticated code, and they're like, oh, Redpanda does this really simple thing that I can run myself, and it's easy. I feel we sort of empower some of these developers that would never come over to the JVM and Kafka ecosystem, just because it's so complicated to productionize. It's easy to get started, let me be clear; it's hard to run it stable in production. And so that was a really surprising thing.

And then, at the customer level, the oil and gas use case was really revealing, I think. The embedded use case and the edge IoT use case blew me away, because I had been using Kafka as a centralized message hub where everything goes. I never thought of it as being able to power IoT-like things, or this intrusion detection system where they wanted a consistent API between their cluster and their local processes, and Redpanda was the perfect fit for that. I think in the modern world there are going to be a lot more users like that, and I think that's people pushing the boundaries. There have been a lot of surprising things, but those are good highlights.

Utsav: Have you been just surprised at the number of people who use Kafka more than anything else or were you like kind of expecting it?

Alex: I've been blown away. I think there are two parts to that. One, streaming is an immense market. I think Confluent just filed their S-1 recently; I think I saw that today. I think they're going to have a massively successful life as a public company. I wish them luck and success, because the bigger they are, the bigger we are too. We add to the ecosystem; I actually see us expanding the Kafka API to users and use cases that couldn't be done before. So I think the market is big in terms of its total size, but I also think it's big in that it's growing. And so there are a number of new users coming to us like, oh, I'm thinking about this business idea, how would you do it in real time? And here's what's interesting about that.

You and I put pressure on products, and that ultimately translates into things like Redpanda. Say I want to order food. My wife is pregnant and she wants Korean food tomorrow; she doesn't eat meat, but when she's pregnant, she wants to eat meat. And so I want to order that, and I want to be notified every five minutes: what's going on with the food, is it going to get here, blah, blah, blah. And so end users end up putting pressure on companies. Software doesn't run on category theory; it runs on real hardware, and ultimately it ends up in something looking like Redpanda. And so I think that's interesting.

There's a ton of new users who are getting asked by end users, like you and I, to effectively transform the enterprise into a real-time business and start to extract value out of the now, what's happening literally right now. And once they learn that they can do this, that Redpanda empowers them to do this, they never want to go back to batch. Streaming is also a strict superset of batch, without getting theoretical here. Once you can start to extract value out of what's happening in your business in real time, nobody ever wants to go back. So I think it's those two things: one, the market is large, and two, it is growing really fast. And so that was surprising to me.

Utsav: Cool. This is a final question, and you don't have to answer it; it's just a standard check. If Confluent's CEO came to you tomorrow and said, we just want to buy Redpanda, what would you think of that?

Alex: I don't know. I think, from a company perspective, I want to take the company public myself, and I think there's plenty of room; the pie is actually giant. I just think there hasn't been a lot of competition in this space. That's my view, at least.

[1:00:00] Yeah, and so I think we're making it better. Let me give you an example. The Apache web server dominated the HTTP standards for a long time. It almost didn't matter what the standard said; Apache web server was the thing, and whatever they implemented, that was what the standard implied. Then NGINX came about, and people were like, wait, hold on a second, there's this new thing. And it actually kicked off this new interest, I think, in trying to standardize and develop the protocol further.

I think it's similar; I think it's only natural for the same thing to happen to the Kafka API. Pulsar is also trying to bring in the Kafka API. We say this is our first-class citizen, and so I think there's room for multiple players who offer different things. We offer extremely low latency and safety, and we get to take advantage of the hardware. So I think people are really attracted to our offering from a technical perspective, especially for new use cases. And yeah, so I don't know. I think there's a chance for multiple players in that.

Utsav: Yeah, it's exciting. I think, like there's OpenTelemetry, there'll be an open streaming API or something; eventually there'll be a working group that will be folded into CNCF and all those things.

Alex: Exactly. 

Utsav: Well, thanks again, Alex, for being a guest. I think this was a lot of fun, and it was fascinating to me. I did not realize how many people need streaming and are using streaming. I guess the IoT use case just slipped my mind, but it makes so much sense. This was super informative, and thank you for being a guest.

Alex:

Thanks for having me. It was a pleasure.
