Software at Scale 37 - Building Zerodha with Kailash Nadh

  

Kailash Nadh is the CTO of Zerodha, India’s largest retail stockbroker. Zerodha powers a large volume of stock trades - ~15-20% of India’s daily volume, which is significantly more daily transactions than Robinhood handles.


The focus of this episode is the technology and mindset behind Zerodha - the key technology choices, challenges faced, and lessons learned while building the platform over several years. As described on the company’s tech blog, Zerodha has an unconventional approach to building software - open source centric, relatively few deadlines, an incessant focus on resolving technical debt, and extreme autonomy for the small but efficient technology team. We dig into these and learn about the inner workings of one of India’s premier fintech companies.


Highlights

[00:43]: Can you describe the Zerodha product? Could you also share any metrics that demonstrate the scale, like the number of transactions or number of users?

Zerodha is an online stockbroker. You can download one of the apps and sign up to buy and sell shares in the stock market, and invest. We have over 7 million customers, on any given day we have over 2 million concurrent users, and this week, we broke our record for the number of trades handled in a day - 14 million trades in a day, which represented over 20% of all Indian stock-trading activity.

[03:00] When a user opens the app at 9:15 in the morning to see trade activity and purchase a trade, what happens behind the scenes? Life of a Query, Zerodha Edition

[05:00] What exactly is the risk management system doing? Can you give an example of where it will block a trade?


The most critical check is a margin check - whether you have enough purchasing power (margin) in your account. With equities, it’s a simple linear check of whether you have enough, but for derivatives, the required margin itself has to be worked out. If you already have some futures and options in your account, the risk varies based on those pre-existing positions.
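As a rough illustration of the "simple linear check" for equities, here is a minimal sketch in Go. The `Order` type, the `HasSufficientMargin` function, and the use of integer paise are assumptions made for this example, not Zerodha's actual code. It ignores leverage, fees, and the model-based margin calculation that derivatives need.

```go
package main

import "fmt"

// Order is a hypothetical, simplified equity buy order.
type Order struct {
	Symbol     string
	Quantity   int64
	PricePaise int64 // price per share in paise, to avoid float rounding
}

// HasSufficientMargin reports whether the available balance covers the
// full order value. This is the linear case only: value = quantity * price.
func HasSufficientMargin(availablePaise int64, o Order) bool {
	required := o.Quantity * o.PricePaise
	return required <= availablePaise
}

func main() {
	o := Order{Symbol: "INFY", Quantity: 10, PricePaise: 150000}
	fmt.Println(HasSufficientMargin(2000000, o)) // true: 1,500,000 needed
	fmt.Println(HasSufficientMargin(1000000, o)) // false: order is blocked
}
```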

What does the reconciliation process look like with the exchange?

We have a joke in our engineering team that we’re just CSV engineers, since reconciliation in our industry happens via several CSV files that are distributed at the end of the trading day.

[08:40] Are you still using PostgreSQL for storing data?

We still use (abuse) PostgreSQL with hundreds of billions of rows of data, sharded several ways.

[09:40] In general, how has Zerodha evolved over time, from the v0 of the tech product to today?

From 2010 to 2013, there was no tech team, and Zerodha’s prime value add was a discount pricing model. We had vendor products that let users log in and trade, and the competition was on pricing. But they worked on 1/10,000th the scale that we operate on today, for a tiny fraction of the userbase. To give a sense of their maturity, they only worked on Internet Explorer 6.

So in late 2014, we built a reporting platform that replaced this vendor-based system. We kept on replacing systems and dependencies, and the last piece left is the OMS - the Order Management System. We’ve had a project to replace this OMS ongoing for 2.5 years and are currently running an internal beta, and once this is complete, we will have no external dependencies.

The first version of Kite, written in Python, came out in 2015. Then, we rewrote some of the services in Go. We now have a ton of services that do all sorts of things like document verification, KYC, payments, banking, integrations, trading, PNL, number crunching and analytics, visualizations, mutual funds, absolutely everything you can imagine.

[13:55] Why is it so tricky to rebuild an Order Management System?

There’s no spec out there to build an Order or a Risk Management System. A margin check is based on mathematical models that take a lot of different parameters into account.

We’re doing complex checks that are based on mathematical models that we’ve reverse-engineered after years of experience with the system, as well as developing deep domain knowledge in the area.

And once we build out the system, we cannot simply migrate away from the old system due to the high consequences of potential errors. So we need to test and migrate piecemeal from the system. 

[13:55] One thing you notice when using Zerodha is how fast it feels compared to standard web applications. This needs focus on both backend and frontend systems. To start with, how do you optimize your backends for speed?

When an application is slow (data takes more than a second to load), it’s perceptible, and can be annoying for users. So we're very particular about making everything as fast as possible, and we’ve set high benchmarks for ourselves. We set an upper limit of mean latency for users to be no more than 40 milliseconds, which seems to work well for us, given all the randomness from the internet. Then, all the code we write has to meet this benchmark.

In order to make this work, there’s no black magic, just common sense principles. For the core flow of the product, everything is retrieved from in-memory databases, and nothing touches disk in the hot path of a request.

Serialization is expensive. If you have a bunch of orders and you need to send those back, serializing and deserializing takes time. So when events take place, like a new order being placed, we serialize once and store the result in an in-memory database. And then when an HTTP request comes in from a user, instead of a database lookup and various transforms, the application reads directly from in-memory databases and writes it to the browser.
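The serialize-once idea can be sketched roughly as follows. This is an illustrative Go example, not Zerodha's implementation: a process-local map stands in for the in-memory database, and the `Order` and `OrderCache` types and the `user` query parameter are hypothetical. The JSON is produced when an order event arrives, and the HTTP handler only copies cached bytes to the response.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// Order is a hypothetical order record.
type Order struct {
	ID     string `json:"id"`
	Symbol string `json:"symbol"`
	Qty    int    `json:"qty"`
}

// OrderCache keeps, per user, the already-serialized JSON of their orders.
type OrderCache struct {
	mu     sync.RWMutex
	orders map[string][]Order
	body   map[string][]byte
}

func NewOrderCache() *OrderCache {
	return &OrderCache{orders: map[string][]Order{}, body: map[string][]byte{}}
}

// OnOrder runs on an order event: it serializes once, at write time.
func (c *OrderCache) OnOrder(userID string, o Order) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	updated := append(c.orders[userID], o)
	b, err := json.Marshal(updated)
	if err != nil {
		return err
	}
	c.orders[userID], c.body[userID] = updated, b
	return nil
}

// Body returns the cached bytes, or an empty JSON array for unknown users.
func (c *OrderCache) Body(userID string) []byte {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if b, ok := c.body[userID]; ok {
		return b
	}
	return []byte("[]")
}

// ServeHTTP does no marshalling or transforms on the read path.
func (c *OrderCache) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.Write(c.Body(r.URL.Query().Get("user")))
}

func main() {
	cache := NewOrderCache()
	if err := cache.OnOrder("u1", Order{ID: "1", Symbol: "INFY", Qty: 10}); err != nil {
		panic(err)
	}
	rec := httptest.NewRecorder()
	cache.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/orders?user=u1", nil))
	fmt.Println(rec.Body.String())
}
```

The trade-off in this sketch is that writes do more work (every event re-serializes the user's list) so that the far more frequent reads do almost none.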

Then, we have a few heuristics. For fetching really old reports that <2% of users use, it’s okay for those to be slow. Those will happen in separate paths so that they don’t block the more frequent kind of requests.
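One possible way to keep rare, slow requests from crowding out frequent ones is to cap how many of them run at once. The sketch below is an assumption about what "separate paths" could look like, not a description of Zerodha's setup, and `Limiter` is a hypothetical helper built on a buffered channel.

```go
package main

import "fmt"

// Limiter caps how many slow requests (for example, old reports) run at once.
type Limiter struct{ slots chan struct{} }

func NewLimiter(n int) *Limiter {
	return &Limiter{slots: make(chan struct{}, n)}
}

// TryAcquire claims a slot without blocking; false means the slow path is full.
func (l *Limiter) TryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a slot previously claimed by TryAcquire.
func (l *Limiter) Release() { <-l.slots }

func main() {
	reports := NewLimiter(2)
	fmt.Println(reports.TryAcquire(), reports.TryAcquire(), reports.TryAcquire()) // true true false
	reports.Release()
	fmt.Println(reports.TryAcquire()) // true
}
```

A handler for old reports would call `TryAcquire` first and queue or reject the request when it returns false, while hot-path handlers never touch the limiter.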

Finally, we’ve written all these services with Golang, which is fast out of the box, provides a reasonably good developer experience, and has good concurrency primitives. We’re careful with memory allocations and pool resources wherever applicable.
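As a small example of pooling resources to reduce allocations, the standard library's `sync.Pool` can reuse buffers across calls. The `FormatTick` function and its output format are made up for illustration, and the sketch assumes non-negative prices stored in paise.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so that each call does not have to
// allocate a fresh one.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// FormatTick renders a price tick using a pooled buffer.
// pricePaise is assumed to be non-negative.
func FormatTick(symbol string, pricePaise int64) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // a pooled buffer may hold old data
	defer bufPool.Put(buf) // return it for reuse
	fmt.Fprintf(buf, "%s:%d.%02d", symbol, pricePaise/100, pricePaise%100)
	return buf.String() // String copies, so the result is safe after Put
}

func main() {
	fmt.Println(FormatTick("INFY", 150025)) // INFY:1500.25
}
```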

[24:00] Zerodha also seems to have skipped the React world, by going with Flutter on mobile and Vue on the web. Can you speak to that decision for mobile apps?

We initially built the iOS app in React Native about 3-4 years ago. I’m not sure how things are today, but the application was fairly slow. The bread and butter of a trading application is a bunch of ticking numbers to show stock values, and we experienced a bunch of frame drops while trying to render those. You’d think that it’d be trivial to show that with low latency in 2017, but we were seeing 5-10 frames per second, and Indian smartphones weren’t extremely powerful then. We also ran into several library/dependency breakages.

We then randomly ran across Flutter, which was in pre-alpha, and not a lot had been written in it. Picking a bleeding-edge technology is very risky, and we wanted to evaluate the risk carefully, so we built out a full-blown prototype of the Kite app that had all the parts we thought would be bottlenecks, like web socket connections, updating numbers, list views, navigation, transitions. We learned Dart (which you need to use Flutter). Once we built out the prototype, it was clear that the performance and experience with Flutter were significantly better than with React Native. So we made that very early decision to ditch React Native and adopt Flutter.

We figured that even if Flutter got killed, we’d benefit from using Flutter for a few years and would eventually move to something else, and traded-off the risk. We launched the iOS application, fixed up issues, killed our Android application, and rolled out a Flutter version.

[28:30] How about not going with React on the website, and going with Vue?

We built our v1 with Angular, but we found it tricky to use for our small team, and there was the major version break fiasco. It was overly complicated even after months of use. We decided to ignore sunk costs and evaluate something new. Picking Vue over React was primarily a judgment call: the template system reminded us of Django, and it felt easier to work with compared to wrapping HTML in JSX and function calls.

[30:30] How do you verify that your systems are correct and consistent? Could you walk us through the time when commodity prices went negative, what happened with your systems?

When commodity prices went negative, we lost a bunch of money, just like many other brokers and institutions. Thankfully, the Indian exchanges shut down trading after a while, and I think the exchanges themselves weren’t equipped to handle negative commodity prices.

The nature of the stock market is that it’s extremely complex and unquantifiable, and the complexity comes from human psychology and nature which is hard to account for. A bit of news could come up that could shake up the market. There’s price volatility, but with India, also regulatory financial volatility. Regulations come and change how brokers work overnight. It’s all correct in spirit to improve things for Indian investors, but massive changes nonetheless. Some changes completely alter how broking works in India, all with a month’s notice. So change here is the only constant, and change management is complex, slow, and risky.

For the technical stuff, we do the standard unit tests and integration tests, but we do a ton of QA after. So after developers have validated changes, the application is handed to all the various domain experts across the company to test and QA changes. Their job is to try and break the system. This is very important as there are a lot of behavior changes that are tricky to quantify. For example, a regulation might come in that requires stock splits to be handled in a certain way, and it’s hard to back-test since it’s never been implemented that way before.

Due to the inherent complexity and rate of change, it’s not feasible to implement a stock market’s model in a test. So the combination of automated testing and manual QA by domain experts is our first step, after which we release an internal beta, and we slowly ramp up. Thankfully, this has worked for us.

[35:30] Release Cycles and Technical Decision Making at Zerodha

In terms of our release cycles, we move slowly and with care. If we feel that technical debt is mounting, we will pause feature development and address the debt. We’ve rewritten core systems several times when the benefits were apparent.

One of the really unique things about Zerodha is that technical decisions are entirely driven by technical folk. There are no business folk who come and say - don’t fix that system but add this new feature instead. There’s no pressure to ship features, we don’t commit to shipping features every quarter, and there are no absurd goals. We agree that critical bugs should be fixed in a timely way. We will implement features only if it makes sense. Sometimes a feature makes business sense but the system is not ready for it, and there might be a hacky way to implement it. We never add hacks; instead, we clean up the debt, make the system amenable to the feature, and only then add it.

This sounds slow, but it pays off in the long run. Because we’ve never let technical debt mount, and never compromised on hacks, we’ve been able to build things faster. After every refactor, the next set of features end up being implemented extremely quickly. Even with continuous regulatory changes, we’re always able to keep up and implement whatever’s necessary.

Ironically, we’ve shipped a lot of things fast and well by slowing down.

[38:00] What’s different about Zerodha that it allows its tech team full autonomy?

It’s common sense, really. The technology team does not have a full understanding of the business side of things, so it lets the business decide what to build. Likewise, the business does not really have context on technical debt, so it does not demand technical changes.

If you never pause to clear technical debt, it grows exponentially, and the system becomes a burden. I think if people had empathy to understand that concerns like tech debt were legitimate, software companies would likely be much more productive than they are now.

The other side of this is deadlines. Technical people find it extremely difficult to come up with deadlines due to the complexity of the space. Even for domain experts, a seemingly small task might take weeks, and a seemingly tricky task could take just hours. The core is to let technical people make these decisions.

[42:00] As a technical person, how do I know that there’s too much technical debt? What frameworks can I use to understand that I should probably invest in the foundation, and how do I develop that intuition?

Intuition is the unquantifiable summation of past experiences. The more experienced you are, the more you can develop your intuition. But there are simple metrics you can use to make these decisions if you’re a competent developer.

When you find it hard to collaborate or hard to ship new things, or there are consistent performance bottlenecks, these are simple, commonplace signs that something’s wrong. Another sign is realizing that if certain parts of your code were slightly more modular, you could have shipped the last 4-5 features faster. It’s contextual, but you know it when you see it. These difficulties, annoyances, and bottlenecks indicate debt and burden.

[45:00] As a final question, if I’m a software engineer looking for a job where they value technical quality, how would you suggest I evaluate from the outside? 

First and most importantly, make sure to cut through the hype, and don’t join a company only because others think it’s a good idea to do so. Look past your biases and evaluate in a data-driven way. The kind of software produced by the company is the best indicator of its culture. Also, try to find resources that serve as an indicator of culture and engineering practices.

The reality is that most software engineering is really, really boring work. And innovation generally comes in spurts. And once you innovate, you need to build it into a usable system, which involves a lot of boilerplate. So most software engineering is boring, and that realization only comes in with experience. Once you know that, you will be in a better position to make trade-offs, and you’ll look for companies with a better culture or other parameters that are important to you in your decision-making.