Software at Scale

Software at Scale 32 - Derrick Stolee: Principal Software Engineer, GitHub

Migrating the Windows codebase to Git, Bloom Filters in Git, and more

Sep 15, 2021

Derrick Stolee is a Principal Software Engineer at GitHub, where he focuses on the client experience of large Git repositories.

Subscribers might be aware that I’ve done some work on client-side Git in the past, so I was pretty excited for this episode. We discuss the Microsoft Windows and Office repositories’ migrations to Git, recent performance improvements to Git for large monorepos, and more.

Highlights

lightly edited

[06:00] Utsav: How and why did you transition from academia to software engineering?

Derrick Stolee: I was teaching and doing research at a high level and working with really great people. And I found myself not finding the time to do the work I was doing as a graduate student. I wasn't finding time to do the programming and do these really deep projects. I found that the only time I could find to do that was in the evenings and weekends because that's when other people weren't working, who could collaborate with me on their projects and move those projects forward. And then, I had a child and suddenly my evenings and weekends aren't available for that anymore.

And so the individual things I was doing just for myself and for, you know, that was more programming oriented, fell by the wayside. I'd found myself a lot less happy with that career. And so I decided, you know what, there are two approaches I could take here. One is I could spend the next year or two winding down my collaborations and spinning up more of this time to be working on my own during regular work hours. Or I could find another job and I was going to set out.

And, I lucked out that Microsoft has an office here in Raleigh, North Carolina, where we now live. This is where Azure DevOps was being built and they needed someone to help solve some graph problems. So it was really nice that it happened to work out that way. I know for a fact that they took a chance on me because of their particular need. I didn't have significant professional experience in the industry.

[21:00] Utsav: What drove the decision to migrate Windows to Git?

Derrick Stolee: The Windows repository moving to Git was a big project driven by Brian Harry, who was the CVP of Azure DevOps at the time. Previously, Windows used this source control system called Source Depot, which was a fork of Perforce. No one knew how to use this version control system until they got there and learned on the job, and that caused some friction in terms of onboarding people.

But also, if you have people working in the Windows codebase for a long time, they only learn this version control system. They don't know Git and they don't know what everyone else is using. And so they feel like they're falling behind, and they're not speaking the same language when they talk to somebody else who's working with commonly used version control tools. So they saw this as a way to not only update the way their source control works to a more modern tool but specifically allow this more free exchange of ideas and understanding.

The Windows Git repository is going to be big and have some little tweaks here and there, but at the end of the day, you're just running Git commands and you can go look at Stack Overflow to answer your questions, as opposed to needing to talk to specific people within the Windows organization about how to use this version control tool.

Transcript

Utsav Shah: Welcome to another episode of the Software at Scale Podcast. Joining me today is Derrick Stolee, who is a principal software engineer at GitHub. Previously, he was a principal software engineer at Microsoft, and he has a Ph.D. in Mathematics and Computer Science from the University of Nebraska. Welcome.

Derrick Stolee: Thanks, happy to be here.

Utsav Shah: So a lot of the work that you do on Git, from my understanding, is similar to the work you did in your Ph.D. around graph theory. So maybe you can walk through what got you interested in graphs and math in general?

Derrick Stolee: My love of graph theory came from my first algorithms class in college, my sophomore year, just doing simple things like path-finding algorithms. I got so excited about it that I started clicking around Wikipedia constantly; I read every single article I could find on graph theory. So I learned about the four-color theorem, and I learned about different things like cliques, and all sorts of different graphs, the Petersen graph, and I just kept on discovering more. I thought, this is interesting to me, it works well with the way my brain works, and I could just model these things while [unclear 01:32]. And I kept on doing more, for instance graph theory and combinatorics in my junior year for my math major, and it was like, I want to pursue this. Instead of going into software, as I had planned with my undergraduate degree, I decided to pursue a Ph.D., first in math, then I moved over to the joint math and CS program, and worked on very theoretical math problems. But I would always pair that with the fact that I had this programming and algorithmic background.

So I was solving pure math problems using programming and creating these computational experiments; the thing I called it was computational combinatorics. I would write these algorithms to help me solve problems that were hard to reason about because the cases just became too complicated to hold in your head. But you could quickly write a program that, over the course of a day of computation, discovers lots of small examples that can either answer the question for you or at least give you a more intuitive understanding of the problem you're trying to solve. That was my specialty as I was working in academia.

Utsav Shah: You hear a lot about proofs that are computer-assisted today. Could you walk us through that? I'm guessing listeners are not math experts. So why is that becoming a thing, and can you walk through your thesis in super layman's terms: what did you do?

Derrick Stolee: There are two very different things you can mean when you say you have an automated proof. There are things like Coq, which are completely automated formal logic proofs: you specify all the different axioms and the different things you know to be true, and the statement you want to prove, and it constructs the sequence of proof steps. What I was focused more on was taking a combinatorial problem, for instance, do graphs with certain substructures exist, and trying to discover those examples using an algorithm that was finely tuned to solve those things. One problem was called uniquely K_r-saturated graphs. A K_r is essentially a set of r vertices where every single pair is adjacent to each other, and to be saturated means I don't have one inside my graph, but if I add any missing edge, I'll get one. And then the uniquely part was, I'll get exactly one. Now we're at this fine line of, do these things even exist, and can I find some interesting examples? And so you can just [unclear 04:03] generate every graph of a certain size, but that blows up in size.

And so you end up where you can get maybe to 12 vertices; every graph of up to 12 vertices or so you can just enumerate and test. But to get beyond that and find the interesting examples, you have to zoom in on the search space to focus on the examples you're looking for. And so I wrote an algorithm that said, well, I know I'm not going to have every edge, so let's fix one pair and say, this isn't an edge. And then we find r minus two other vertices and put all the other edges in, and that's the one unique completion of that missing edge. And then let's continue building in that way, by building up all the possible ways you can create those substructures, because they need to exist, as opposed to just generating random little bits. That focused the search space enough that we could get to 20 or 21 vertices and see these interesting shapes show up. From those examples, we found some infinite families and then used regular old-school math to prove that these families were infinite, once we had those small examples to start from.
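To make the definition concrete, here is a minimal brute-force sketch in Python that tests whether a small graph is uniquely K_r-saturated. It is only a checker for tiny graphs, not the tuned search described above, and the function names and the adjacency-set representation are assumptions made for illustration.

```python
from itertools import combinations


def count_cliques(adj, r):
    """Count r-vertex subsets of the graph in which every pair is adjacent."""
    return sum(
        all(v in adj[u] for u, v in combinations(subset, 2))
        for subset in combinations(sorted(adj), r)
    )


def is_uniquely_kr_saturated(adj, r):
    """True if the graph has no K_r and each missing edge would create exactly one."""
    if count_cliques(adj, r) != 0:
        return False
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue  # edge already present
        # Temporarily add the missing edge and count the cliques that now exist.
        adj[u].add(v)
        adj[v].add(u)
        created = count_cliques(adj, r)
        adj[u].remove(v)
        adj[v].remove(u)
        if created != 1:
            return False
    return True


def cycle(n):
    """Adjacency sets for the cycle on n vertices (n >= 3)."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


print(is_uniquely_kr_saturated(cycle(5), 3))
print(is_uniquely_kr_saturated(cycle(4), 3))
```

The 5-cycle passes for r = 3 because any two non-adjacent vertices share exactly one neighbor, while the 4-cycle fails because opposite vertices share two. Checking every r-subset is exactly the kind of enumeration that stops scaling after a handful of vertices.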

Utsav Shah: That makes a lot of sense. How might someone use this in a computer science way? When would I need to use this, let's say not in my day job, but in general: what computer science problems would I solve with something like that?

Derrick Stolee: It's always risky asking a mathematician what the applications of the theoretical work are. But I find it applies whenever you see yourself dealing with a finite problem, and you want to know in what different ways this data can appear. Is it possible with some constraints? A lot of the things I was running into were similar to problems like integer programming. Trying to find solutions to an integer program is a very general thing, and having those types of tools in your back pocket to solve these problems is extremely beneficial. It also helps to know that integer programming is still NP-hard. So if you have the right data shape, it can take an exponential amount of time to work, even though there are a lot of tools that solve most cases, when your data isn't particularly structured to have that exponential blowup. So knowing where those data shapes can arise and how to take a different approach can be beneficial.

Utsav Shah: And you've had a fairly diverse career after this. I'm curious, what was the transition from doing this stuff to Git or developer tools? How did that end up happening?

Derrick Stolee: I was lucky enough that after my Ph.D. was complete, I landed a tenure-track job in a math and computer science department, where I was teaching and doing research at a high level and working with great people. I had the best possible combinatorics group I could ask for, doing interesting stuff, working with graduate students. And I found myself not finding the time to do the work I was doing as a graduate student. I wasn't finding time to do the programming and the deep projects I wanted. I had a lot of interesting math projects, I was collaborating with a lot of people, I was doing a lot of teaching. But I was finding that the only time I could find to do that was in the evenings and on weekends, because that's when other people weren't working, who could collaborate with me on their projects and move those projects forward. And then I had a child, and suddenly my evenings and weekends weren't available for that anymore. And so the individual things I was doing just for myself, the things that were more programming oriented, fell by the wayside, and I found myself a lot less happy with that career. And so I decided, there are two approaches I could take here: one is I could spend the next year or two winding down my collaborations and spinning up more of this time to be working on my own during regular work hours, or I could find another job.

And I was going to set out, but let's face it, my spouse is also an academic, and she had an opportunity to move to a new institution, and that happened to be soon after I made this decision. And so I said, great, let's not do the two-body problem anymore. You take this job, and we move right in between semesters, during the Christmas break. And I said, I will go and try to find a programming job; hopefully, someone will be interested. And I lucked out that Microsoft has an office here in Raleigh, North Carolina, where we now live, and it happened to be the place where what is now known as Azure DevOps was being built. They needed someone to help solve some graph theory problems in the Git space. So it was nice that it happened to work out that way, and I know for a fact that they took a chance on me because of their particular need. I didn't have significant professional experience in the industry. I just said, I did academics, so I'm smart, and I did programming as part of my job, but it was always by myself. So I came with a lot of humility, saying, I know I'm going to have to learn to work with a team in a professional setting. I did teamwork in undergrad, but it's been a while.

So I just came in trying to learn as much as I could, as quickly as I could, and to contribute in the very specific area they wanted me to go into. It turns out the area they needed was to revamp the way Azure Repos computed Git commit history, which is a graph theory problem. The interesting thing is that the previous solution did everything in SQL: when you created a new commit, it would say, what is your parent, let me take its commit history out of SQL, and then add this new commit, and then put that back into SQL. It took essentially a SQL table of commit IDs and squashed it into a VARBINARY(MAX) column of this table, which ended up growing quadratically. And also, if you had a merge commit, it would have to take both parents and merge their histories, in a way that never matched what git log was saying. So it was technically interesting that they were able to do this at all with SQL before I came by.
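The quadratic growth can be seen with a little arithmetic. The sketch below is a simplified model, assuming a purely linear history where each commit stores a full copy of its own history; it is not the actual Azure Repos schema, and the function names are made up for illustration.

```python
def ids_stored_with_full_history(n_commits):
    """Total commit IDs stored if commit i keeps a copy of all i commits in its history."""
    return sum(range(1, n_commits + 1))  # equals n * (n + 1) // 2


def ids_stored_with_parent_pointers(n_commits):
    """Total commit IDs stored if each commit only records its single parent."""
    return max(n_commits - 1, 0)


for n in (1_000, 100_000):
    full = ids_stored_with_full_history(n)
    pointers = ids_stored_with_parent_pointers(n)
    print(n, full, pointers)
```

Under this model, multiplying the number of commits by 100 multiplies the stored IDs by roughly 10,000, whereas storing only parent pointers grows linearly.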

But we needed to have the graph data structure available. We needed to compute dynamically by walking commits and finding out how these things connect, which led to creating a serialized commit-graph, which had that topological relationship encoded concisely in a data file. That file would be read into memory, and very quickly we could operate on it and do things like topological sorting. We could do interesting file history operations on that instead of the database, and by deleting these database entries that were growing quadratically, we saved something like 83 gigabytes, just on the one server that was hosting the Azure DevOps code. So it was great to see that come to fruition.

Utsav Shah: First of all, that's such an inspiring story, that you could get into this and that they gave you a chance as well. Did you reach out to a manager? Did you apply online? I'm just curious how that ended up working.

Derrick Stolee: I do need to say I had a lot of luck and privilege going into this, because I applied and waited a month and didn't hear anything. I had applied to this same group and said, here's my cover letter, and I heard nothing. But I have a friend from undergrad who was one of the first people I knew to work at Microsoft. I knew he worked on Visual Studio, the client editor, and I said, well, this thing that's now Azure DevOps was called Visual Studio Online at the time; do you know anybody from this Visual Studio Online group? I've applied there and haven't heard anything; I'd love it if you could get my resume to the top of the list. And it turns out that he had worked with somebody who had done the Git integration in Visual Studio, who happened to be located at this office, and who then got my name on the top of the pile. That got me to the point where I was having a conversation with the person who would be my skip-level manager, who honestly had a conversation with me to try to suss out, am I going to be a good team player?

There's not a good history of PhDs working well with engineers, probably because they just want to do their academic work and work in their own space. I remember one particular question was like, sometimes we ship software, and before we do that, we all get together, and everyone spends an entire day trying to find bugs, and then we spend a couple of weeks trying to fix them; they call it a bug bash. Is that something you're interested in doing? I said, I'm 100% wanting to be a good citizen and a good team member; I am up for that. If that's what it takes to be a good software engineer, I will do it. I could sense the hesitation and the trepidation about looking at me more closely, but overall, once I got into the interview, it went well. They were still doing blackboard interviews at that time, and it felt unfair because my phone screen interview was a problem I had assigned my C programming students as homework. So it's like, sure, you want to ask me this; I have a little bit of experience doing problems like this. So I was eager to show up and prove myself. I know I made some very junior mistakes at the beginning, just learning what it's like to work on a team. What's it like to check in a change and complete that pull request at 5 pm, and then go get in your car and go home, and realize when you are out there that you had a problem, and you've caused the build to go red? Oh, no, don't do that. So I made those mistakes, but I only needed to learn from them once.

Utsav Shah: That's amazing. Going to your second point around [inaudible 14:17], Git commit history and storing all of that in SQL: we had to deal with an extremely similar problem, because we maintained a custom CI server, and we tried doing Git [inaudible 14:26] and implementing that on our own, and that did not turn out well. So maybe you can walk listeners through why that is so tricky. Why is it so tricky to say, is this commit before another commit, is it after another commit, what's the parent of this commit? What's going on, I guess?

Derrick Stolee: Yes, the thing to keep in mind is that each commit has a list of parents, one parent or multiple parents in the case of a merge, and that just tells you what happened immediately before this. But if you have to go back weeks or months, you're going to be traversing hundreds or thousands of commits, and these merge commits are branching. So not only are we going deep in time, where the first-parent history is all the pull requests that have merged in that time, but imagine that you're also traversing all of the commits that were in the topic branches of those merges. So you go both deep and wide when you're doing this search. And by default, Git stores all of these commits as plain-text objects in its object database. You look one up by its commit SHA, then you go find that location in a pack file, you decompress it, you parse the text to find out the different information, what's its author date, committer date, what are its parents, and then you go find them again, and keep iterating through that. It's a very expensive operation at this number of commits, and especially when the answer is no, it's not reachable, you have to walk every single commit that is reachable before you can say no.
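As a rough illustration of the parsing step, here is a minimal Python sketch that pulls the parents and other header fields out of the decompressed text of a commit object. The object IDs and the author are made up, and the sketch ignores details such as multi-line headers; it is not how Git itself is implemented.

```python
def parse_commit(raw):
    """Parse the text of a Git commit object into a small dict."""
    header, _, message = raw.partition("\n\n")
    commit = {"parents": [], "message": message}
    for line in header.split("\n"):
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)  # a merge has several parent lines
        elif key in ("tree", "author", "committer"):
            commit[key] = value
    return commit


# Made-up object IDs, for illustration only.
EXAMPLE = (
    "tree " + "1" * 40 + "\n"
    "parent " + "2" * 40 + "\n"
    "parent " + "3" * 40 + "\n"
    "author A U Thor <author@example.com> 1631700000 -0400\n"
    "committer A U Thor <author@example.com> 1631700000 -0400\n"
    "\n"
    "Merge topic branch\n"
)

commit = parse_commit(EXAMPLE)
print(len(commit["parents"]), commit["author"])
```

Every step of a history walk repeats this lookup, decompression, and parse for each parent, which is why long walks over plain commit objects are expensive.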

Both of those things cause significant delays in trying to answer these questions, which was part of the reason for the commit-graph file. Again, it started when I was doing Azure DevOps server work, but it's now a Git client feature. First, it avoids going to the pack file and loading this plain-text document that you have to decompress and parse, by just saying, I've got well-structured information that tells me where in the commit-graph file the next one is. So I don't have to store the whole object ID; I just have a little four-byte integer saying my parent is this one in this table of data, and you can jump quickly between them. The other benefit is that we can store extra data that is not native to the commit object itself, specifically something called the generation number. The generation number says: if I don't have any parents, my generation number is one, so I'm at level one.

But if I have parents, my number is one larger than the maximum of my parents. So if my one parent is one, I'm two, and then three; if I merge, and my parents are four and five, I'm going to be six. What that allows me to do is that if I see two commits, and one is generation number 10 and one is 11, then the one with generation number 10 can't reach the one with 11, because that would mean an edge goes in the wrong direction. It also means that if I'm looking for the one with 11, and I started at 20, I can stop when I hit commits at generation 10. So this gives us extra ways of visiting fewer commits to answer these questions.
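The generation-number idea can be sketched in a few lines of Python. This is an illustrative model over an in-memory dictionary of parents, with made-up commit names; it is not the commit-graph file format or Git's actual traversal code.

```python
def generation_numbers(parents):
    """parents maps commit -> list of parents. Returns commit -> generation number."""
    gen = {}
    for start in parents:
        stack = [start]
        while stack:
            c = stack[-1]
            missing = [p for p in parents[c] if p not in gen]
            if missing:
                stack.extend(missing)  # compute parents first
            else:
                stack.pop()
                # No parents: generation 1. Otherwise: one more than the largest parent.
                gen[c] = 1 + max((gen[p] for p in parents[c]), default=0)
    return gen


def can_reach(parents, gen, start, target):
    """Return (reachable, commits_visited) for a walk from start toward target."""
    seen = set()
    stack = [start]
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        if c == target:
            return True, len(seen)
        for p in parents[c]:
            # A commit below the target's generation can never reach it, so skip it.
            if gen[p] >= gen[target]:
                stack.append(p)
    return False, len(seen)


# A chain a <- b <- c <- d <- e, a side commit x on top of a, and a merge m of e and x.
PARENTS = {
    "a": [], "b": ["a"], "c": ["b"], "d": ["c"], "e": ["d"],
    "x": ["a"], "m": ["e", "x"],
}
GEN = generation_numbers(PARENTS)
print(GEN["m"], can_reach(PARENTS, GEN, "x", "e"))
```

In this example the question "can x reach e?" is answered after visiting a single commit, because x's only parent has a generation below e's and is never walked.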

Utsav Shah: So maybe a basic question: why does the system care about what the parents of a commit are? Why does that end up mattering so much?

Derrick Stolee: Yes, it matters for a lot of reasons. One is if you just want to go through the history of what changes have happened to your repository, specifically file history. The way to get them in order is not to say, give me all the commits that changed this, and then sort them by date, because the commit date can be completely manufactured. And maybe something that was committed later merged earlier than something else. By understanding those relationships of where the parents are, you can realize, this thing was committed earlier, but it landed in the default branch later, and I can see that by the way the commits are structured through these parent relationships. A lot of the problems we see with people saying, where did my change go, or what happened here, are because somebody did a weird merge. You can only find that out by doing some interesting things with git log to say, this merge caused a problem and caused your file history to get mixed up, or somebody resolved the merge incorrectly and somebody's change got erased, and you need to use these parent relationships to discover that.

Utsav Shah: Should everybody just be using rebase versus merge? What's your opinion?

Derrick Stolee: My opinion is that you should use rebase to make sure that the commits you are trying to get reviewed by your coworkers are as clear as possible. Present a story, tell me that your commits are good, tell me in the commit messages just why you're trying to do this one small change, and how the sequence of commits creates a beautiful story that tells me how I get from point A to point B. And then you merge it into your branch with everyone else's, and then those commits are locked; you can't change them anymore. You do not rebase them. You do not edit them. Now they're locked in, and the benefit of doing that is, well, I can present this best story that not only is good for the people who are reviewing it at the moment, but also when I go back in history and ask, why did I change it that way? You've got all the reasoning right there. But then you can also do things like run git log --first-parent to just show which pull requests were merged into this branch. And that's it; I don't see people's individual commits. I see this one was merged, this one was merged, this one was merged, and I can see the sequence of those events, and that's the most valuable thing to see.
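A small Python model shows what following only first parents means. The history below is hypothetical, and the function is a sketch of the idea behind git log --first-parent, not Git's implementation.

```python
def first_parent_history(parents, tip):
    """Follow only the first parent of each commit, starting at tip."""
    history = []
    commit = tip
    while commit is not None:
        history.append(commit)
        commit = parents[commit][0] if parents[commit] else None
    return history


# Hypothetical history: two topic branches merged into the main branch.
# For a merge, the first parent is the previous tip of the branch merged into.
PARENTS = {
    "root": [],
    "t1": ["root"], "t2": ["t1"],      # first topic branch
    "merge1": ["root", "t2"],          # pull request 1
    "u1": ["merge1"],                  # second topic branch
    "merge2": ["merge1", "u1"],        # pull request 2
}

print(first_parent_history(PARENTS, "merge2"))
```

The topic commits t1, t2, and u1 never appear in the result; only the sequence of merges on the main branch does.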

Utsav Shah: Interesting. A lot of GitHub workflows just squash all of your commits into one, which I think is the default, or at least a lot of people use that. Any opinions on that? I know the Git workflow for development keeps the separate commits and then merges all of them. Do you have an opinion on that?

Derrick Stolee: Squash merges can be beneficial. The thing to keep in mind is that they're typically beneficial for people who don't know how to do an interactive rebase. Their topic branch looks like a lot of random commits that don't make a lot of sense. They're just, I tried this, and then I had a break, so I fixed a bug, and I kept on going forward, I'm responding to feedback, and that's what it looks like. If those commits aren't going to be helpful to you in the future to diagnose what's going on, and you'd rather just say, this pull request is the unit of change, then the squash merge is fine. The thing I find problematic is that new users also don't realize that they need to change their branch to be based on that squash merge before they continue working. Otherwise, they'll bring in those commits again, and their pull request will look very strange. So there are some unnatural bits to using squash merge that require people to say, let me just start over from the main branch again to do my next work. And if you don't remember to do that, it's confusing.

Utsav Shah: Yes, that makes a lot of sense. So going back to your story: you started working on improving Git interactions in Azure DevOps. When did the whole idea of moving the Windows repository to Git begin, and how did that evolve?

Derrick Stolee: Well, the biggest thing is that the Windows repository moving to Git was decided before I came. It was a big project driven by Brian Harry, who was the CVP of Azure DevOps at the time. Windows was using this source control system called Source Depot, which was a literal fork of Perforce. No one knew how to use it until they got there and learned on the job, and that caused some friction in terms of, well, onboarding people is difficult. But also, if you have people working in the Windows codebase for a long time, they only learn this version control system. They don't know what everyone else is using, and so they feel like they're falling behind. And they're not speaking the same language when they talk to somebody else who's working in the version control that most people are using these days. So they saw this as a way to not only update their source control to a more modern tool, but specifically to Git, because it allowed a more free exchange of ideas and understanding. It's going to be a monorepo, it's going to be big, it's going to have some little tweaks here and there.

But at the end of the day, you're just running Git commands, and you can go look at Stack Overflow for how to solve your Git questions, as opposed to needing to talk to specific people within the Windows organization about how to use this tool. So that, as far as I understand, was a big part of the motivation to get it working. When I joined the team, we were in the swing of, let's make sure that our Git implementation scales. The thing that's special about Azure DevOps is that it doesn't use the core Git codebase; it has a complete reimplementation of the server side of Git in C#. So it was rebuilding a lot of things just to be able to do the core features, but in a way that worked in its deployment environment, and it had done a pretty good job of handling scale. But the Linux repo was still a challenge to host. At that time, it had half a million commits, maybe 700,000 commits, and its number of files is rather small. We were struggling, especially with the commit history being so deep, but also even with the [inaudible 24:24] DevOps repo, with maybe 200 or 300 engineers working on it in their daily work, which was moving at a pace that was difficult to keep up with. So those scale targets were things we were dealing with daily and working to improve, and we could see that improvement in our daily lives as we moved forward.

Utsav Shah: So how do you tackle the problem? You're on this team now, and you know that you want to improve the scale of this, because 2000 developers are going to be using this repository, you have 200 or 300 people now, and it's already not perfect. My first impression is you sit down and start profiling code to understand what's going wrong. What did you all do?

Derrick Stolee: You're right about the profiler. We had a tool, I forget what it's called, that would run on every 10th request, selected at random. It would run a .NET profiler and save those traces into a place where we could download them. And so we could say, you know what, Git commit history is slow. And now that we've written it in C#, as opposed to SQL, it's C#'s fault. Let's go see what's going on there and see if we can identify the hotspots; you go pull a few of those traces down and see what's identified. A lot of it was chasing that: I made this change, let's make sure that the timings are an improvement. I see some outliers over here that are still problematic, so we find those traces and identify the core parts to change. Some of the changes were more philosophical: we need to change data structures, we need to introduce things like generation numbers, we need to introduce things like Bloom filters for file history, in order to speed that up, because we're spending too much time parsing commits and trees.
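For readers unfamiliar with Bloom filters, here is a minimal Python sketch of the general technique applied to file history: each commit gets a small filter of the paths it changed, so a history walk can skip commits whose filter says "definitely not". The sizes, the SHA-256-based hashing, and the helper names are assumptions for illustration; this is not the filter layout or hash function Git or Azure DevOps actually uses.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def commits_to_inspect(changed_paths_by_commit, path):
    """Return the commits whose filter says the path may have changed."""
    result = []
    for commit, paths in changed_paths_by_commit.items():
        bloom = BloomFilter()
        for p in paths:
            bloom.add(p)
        if bloom.might_contain(path):
            result.append(commit)  # only these need their trees parsed and diffed
    return result


# Hypothetical history: which paths each commit changed.
HISTORY = {"c1": ["a.txt"], "c2": ["b.txt"], "c3": ["a.txt", "c.txt"]}
print(commits_to_inspect(HISTORY, "a.txt"))
```

A real system would compute each filter once and store it next to the commit data rather than rebuilding it per query. Commits that changed the path are always reported; a commit that did not may occasionally be reported too, and is then ruled out by the normal, more expensive tree comparison.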

Once we were that far, it was time to essentially say, let's assess whether or not we can handle the Windows repo. I think that would have been January or February 2017. My team was tasked with doing scale testing in production. They had the full Azure DevOps server ready to go with the Windows source code in it. It didn't have developers using it, it was a copy of the Windows source code, but they were using that same server for work item tracking; they had already transitioned that work item tracking to Azure Boards. And they said, go and see if you can make this fall over in production; that's the only way to tell if it's going to work or not. So a few of us got together and created a bunch of things to use the REST API. We were pretty confident that the Git operations were going to work, because we had a caching layer in front of the server that was going to avoid that. So we just went with the idea of, let's go through the REST API and make a few changes, create a pull request and merge it, and go through that cycle.

We started by measuring how often developers would do that, for instance, in Azure DevOps, and then scaled it up to see where it would go, and we crashed the job agents because we found a bottleneck. It turns out that we were using libgit2 to do merges, and that required going into native code, because it's a C library, and we couldn't have too many of those running, because they each took a gig of memory. Once this native code was running out of memory, things were crashing, so we ended up having to put a limit on how many could run. But that was the only fallout, and we could then say, we're ready to bring it on and start transitioning people over. When users are in the product and they think certain things are rough or difficult, we can address them. But right now, they're not going to cause a server problem, so let's bring it on. I think it was a few months later that they started bringing developers from Source Depot into Git.
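Putting a cap on concurrent memory-hungry operations is commonly done with a semaphore. The Python sketch below illustrates that general technique; the class name and the fake merge function are hypothetical, and the actual Azure DevOps mechanism is not described in the conversation.

```python
import threading
import time


class MergeLimiter:
    """Allow at most max_concurrent merges to run at the same time."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._running = 0
        self.peak = 0  # highest number of merges observed running at once

    def run(self, merge_fn, *args):
        with self._slots:  # blocks while all slots are taken
            with self._lock:
                self._running += 1
                self.peak = max(self.peak, self._running)
            try:
                return merge_fn(*args)
            finally:
                with self._lock:
                    self._running -= 1


def fake_merge(request_id):
    """Stand-in for an expensive native merge."""
    time.sleep(0.01)
    return request_id


limiter = MergeLimiter(max_concurrent=2)
results = []
threads = [
    threading.Thread(target=lambda i=i: results.append(limiter.run(fake_merge, i)))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results), limiter.peak)
```

All eight requests complete, but no more than two merges are ever in flight, which bounds peak memory at the cost of some queueing.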

Utsav Shah: So it sounds like there was some server work to make sure that the server doesn't crash. But the majority of the work that you had to focus on was on the Git client side. Does that sound accurate?

Derrick Stolee: Before my time, and in parallel with my time, was the creation of what's now called VFS for Git. It was GVFS at the time; we realized, don't let engineers name things, they won't do it well. So we renamed it to VFS for Git: it's a virtual file system for Git. A lot of [inaudible 28:44] because Source Depot, the version that Windows was using, had a virtualized file system in it to allow people to download only the portion of the working tree that they needed. They could build whatever part they were in, and it would dynamically discover what files you needed to run that build. So we did the same thing on the Git side, which was, let's modify the Git client in some slight ways, using our fork of Git, to think that all the files are there. And then when a file is [inaudible 29:26], we hook into that through a file system event, which communicates to the .NET process that says, you want that file, and it goes and downloads it from the Git server, puts it on disk, and tells you what its contents are, and now you can read it. So it's dynamically downloading objects.

This required a version of the protocol that we call the GVFS protocol, which is essentially an early version of what's now called Git partial clone. It says, you can go get the commits and trees, and that's what you need to be able to do most of your work. But when you need the file contents, the blob of a file, we can download that as necessary and populate it on your disk. The different thing is the virtualized part, the idea that if you just run ls at the root directory, it looks like all the files are there. That causes some problems if you're not used to it. For instance, if you open VS Code in the root of your Windows source code, it will populate everything, because VS Code starts crawling and trying to figure out what's there so it can do searching and indexing. But the Windows developers were used to this; they already had this as a problem.
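The "download the blob only when it is first needed" idea can be modeled in a few lines of Python. This is a conceptual sketch with hypothetical names (LazyBlobStore, fetch_blob) and a dictionary standing in for the server; it does not reflect the GVFS protocol's wire format or Git's partial clone implementation.

```python
class LazyBlobStore:
    """Holds blob contents locally and fetches a blob only when it is first read."""

    def __init__(self, fetch_blob):
        self._fetch_blob = fetch_blob  # hypothetical callable: blob_id -> bytes
        self._local = {}
        self.downloads = 0

    def read(self, blob_id):
        if blob_id not in self._local:
            # First access: go to the server, then keep the contents on "disk".
            self._local[blob_id] = self._fetch_blob(blob_id)
            self.downloads += 1
        return self._local[blob_id]


# A dictionary standing in for the remote server's blobs.
REMOTE = {"blob-1": b"int main(void) { return 0; }\n", "blob-2": b"# README\n"}

store = LazyBlobStore(REMOTE.__getitem__)
store.read("blob-1")
store.read("blob-1")  # served locally, no second download
print(store.downloads)
```

A tool that reads every file, such as an indexer or a recursive search, would call read on every blob and defeat the point, which is the problem described with editors crawling the whole working directory.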

So they were used to using tools that didn't do that. But we found that out when we started saying, VFS for Git is this thing that Windows is using, maybe you could use it too. People said, well, this was working great, then I opened VS Code, or I ran grep, or some other tool came in and decided to scan everything, and now I'm slow again, because I have absolutely every file in my monorepo in my working directory for real. That led to some concerns that VFS for Git wasn't necessarily the best way to go. But the GVFS protocol specifically solved a lot of the scale issues, because we could stick another layer of servers closely located to the developers. For instance, if you've got a lab of build machines, let's stick one of these cache servers in there. Then the build machines all fetch from there, and you have high throughput and low latency, and they don't have to bug the origin server for anything but the refs. You do the same thing around the developers. That solved a lot of our scale problems, because you don't have these thundering herds of machines coming in and asking for all the data at once.
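The on-demand download idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CacheServer` and `LazyWorkingTree` are hypothetical names invented for this sketch, not part of VFS for Git or the GVFS protocol. The one Git-specific fact used is how Git computes a blob's object id.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Object id Git assigns to a blob: SHA-1 of a 'blob <size>' header, a NUL byte, then the content."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


class CacheServer:
    """Hypothetical stand-in for a Git server or a nearby cache server."""

    def __init__(self, blobs):
        self._blobs = dict(blobs)  # blob id -> content
        self.requests = 0          # how many blob downloads were served

    def fetch_blob(self, oid):
        self.requests += 1
        return self._blobs[oid]


class LazyWorkingTree:
    """Knows every path up front (tree data) but downloads a blob only on first read."""

    def __init__(self, server, tree):
        self.server = server
        self.tree = dict(tree)  # path -> blob id, available without any file content
        self.local = {}         # blob id -> content, filled on demand

    def list_paths(self):
        # Listing needs only tree data, so it triggers no download.
        return sorted(self.tree)

    def read(self, path):
        oid = self.tree[path]
        if oid not in self.local:
            self.local[oid] = self.server.fetch_blob(oid)
        return self.local[oid]


files = {"README.md": b"hello\n", "src/main.c": b"int main(void) { return 0; }\n"}
server = CacheServer({git_blob_id(c): c for c in files.values()})
repo = LazyWorkingTree(server, {p: git_blob_id(c) for p, c in files.items()})

print(repo.list_paths(), server.requests)  # every path is visible, nothing downloaded yet
print(repo.read("src/main.c"), server.requests)
```

A tool that reads every path (an indexer, grep) would call `read` on everything and pull the whole repository down, which is the VS Code problem described above.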

Utsav Shah: We had a super similar concept of repository mirrors that would listen to some change stream, and every time anything changed in a region, it would run an update on all the servers. So it's remarkable how similar the problems we're thinking about are. One thing that I was thinking about: VFS for Git makes sense, but what's the origin of the FS monitor story? For listeners, FS monitor is the file system monitor in Git that decides whether files have changed or not without running [inaudible 32:08] that lists every single file. How did that come about?

Derrick Stolee: There are two sides to the story. One is that as we were building all these custom features for VFS for Git, we were doing it inside the microsoft/git fork on GitHub, working in the open. So you can see all the changes we're making; it's all GPL. But we were making changes quickly, and we weren't contributing them upstream to core Git. Because of the way VFS for Git works, we have this process that's always running, watching the file system and getting all of its events, so it made sense to say, well, we can speed up certain Git operations, because we don't need to go looking for things. We don't want to run a bunch of lstat calls, because that would trigger the download of objects. So we ask that process to tell us what files have been updated and what's new, and that created the idea of what's now called FS monitor. The people who had built that tool for VFS for Git contributed a version of it upstream that used Facebook's Watchman tool through a hook.

So it created this hook called the FS monitor hook. It would say, tell me what's been updated since the last time I checked, and Watchman, or whatever tool is on the other side, would say, here's the small list of files that have been modified. You don't have to go walking all of the hundreds of thousands of files, because you just changed these [inaudible 0:33:34]. And the Git command could store that and be fast at things like git status or git add. So that was contributed mostly out of the goodness of their hearts: this worked well in VFS for Git, we think it can work well for other people in regular Git, so here we go, contributing it and getting it in. It became much more important to us in particular when we started supporting the Office monorepo, because they had a similar situation where they were moving from their version of Source Depot into Git, and they thought VFS for Git was just going to work.
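The "what changed since the last time I checked" exchange can be sketched as a token-based query. This is a toy model, not Watchman's or Git's actual interface: `ChangeWatcher`, `changes_since`, and `paths_to_check` are hypothetical names, and the token here is simply a position in an event log.

```python
class ChangeWatcher:
    """Toy stand-in for a long-running file system watcher (hypothetical API)."""

    def __init__(self):
        self._log = []  # changed paths, in the order the events arrived

    def record_change(self, path):
        self._log.append(path)

    def changes_since(self, token):
        """Return (new_token, sorted unique paths changed after `token`)."""
        return len(self._log), sorted(set(self._log[token:]))


def paths_to_check(watcher, last_token, all_paths):
    """Paths a status-like command must examine, plus the token to store for next time."""
    if last_token is None:
        # No previous token: fall back to walking every path once.
        new_token, _ = watcher.changes_since(0)
        return new_token, sorted(all_paths)
    # Otherwise only the reported paths need a closer look.
    return watcher.changes_since(last_token)


watcher = ChangeWatcher()
all_paths = [f"src/file_{i}.c" for i in range(1000)]

token, first = paths_to_check(watcher, None, all_paths)
watcher.record_change("src/file_7.c")
watcher.record_change("src/file_42.c")
token, second = paths_to_check(watcher, token, all_paths)
print(len(first), second)
```

The first call examines all 1,000 paths; the second examines only the two that were reported as changed, which is the saving described above.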

The issue is that Office also builds tools for iOS and macOS, so they have developers who are on macOS. The team had started building a similar file system virtualization for macOS using kernel extensions, and was very far along in the process when Apple said, we're deprecating kernel extensions, you can't do that anymore. If you're someone like Dropbox, go use this thing; if you're someone else, use this other thing. We tried both of those things, and neither of them worked in this scenario: they're either too slow, or they're not consistent enough. For instance, say you're Dropbox and you want to populate files dynamically as people ask for them. The way that Dropbox or OneDrive now does that, the operating system may decide, I'm going to delete this content because the disk is getting too full; you don't need it, because you can just get it from the remote again. That inconsistency was something we couldn't handle, because we needed to know that content, once downloaded, was there. So we were at a crossroads, not knowing where to go. But then we decided, let's try an alternative approach: let's look at how the Office monorepo is different from the Windows monorepo.

And it turns out that they had a very componentized build system, where if you wanted to build Word, you knew what you needed to build Word. You didn't need the Excel code, you didn't need the PowerPoint code; you needed the Word code and some common bits for all the clients of Microsoft Office. And this was ingrained in their project system. So if you know that in advance, could you just tell Git, these are the files I need to do my work and to do my build? That's what they were doing in their version of Source Depot: they weren't using a virtualized file system, they were just enlisting in the projects they cared about. So when some of them were moving to Git with VFS for Git, they were confused: why do I see so many directories? I don't need them. So we decided to make a new way of taking all the good bits from VFS for Git, like the GVFS protocol that allowed us to do the reduced downloads, but instead of a virtualized file system, to use sparse checkout. It's a Git feature that lets you tell Git, only give me the files within these directories and ignore everything outside. That gives us the same benefits of working with a smaller working directory rather than the whole thing, without needing this virtualized file system. But now we need that file system monitor hook that we added earlier.
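As a rough sketch of "only give me the files within these directories", the function below decides whether a path belongs to a directory-based selection. It follows the rule used by Git's cone mode as this editor understands it (everything under a chosen directory, plus the files sitting directly in its ancestor directories and in the root); the function name `in_cone` and the example paths are made up for illustration.

```python
from pathlib import PurePosixPath


def in_cone(path, cone_dirs):
    """True if `path` would be present on disk for the chosen directories."""
    p = PurePosixPath(path)
    parent = p.parent
    if parent == PurePosixPath("."):
        return True  # files at the repository root are always kept
    for d in cone_dirs:
        cone = PurePosixPath(d)
        # Anything at any depth under a chosen directory is included.
        if cone in p.parents:
            return True
        # Files directly inside an ancestor of a chosen directory are included too.
        if parent in cone.parents:
            return True
    return False


repo_paths = [
    "README.md",
    "word/src/editor.c",
    "excel/src/grid.c",
    "powerpoint/src/slides.c",
    "common/BUILD",
    "common/util/strings.c",
    "common/graphics/draw.c",
]
wanted = ["word", "common/util"]
print([p for p in repo_paths if in_cone(p, wanted)])
```

With `word` and `common/util` selected, the Excel and PowerPoint sources and `common/graphics` stay off disk, which is the effect the Office developers were asking for.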

Because if I still have 200,000 files on my disk and I edit a dozen, I don't want to walk all 200,000 to find those dozen. So the file system monitor became top of mind for us, particularly because we want to support Windows developers, and Windows process creation is expensive, especially compared to Linux, where process creation is super fast. So having a hook run, which then does some shell script stuff to communicate with another process and come back: just that process creation, even if the hook didn't have to do anything, was expensive enough to say we should remove the hook from this equation. Also, there are some things that Watchman does that we don't like and that aren't specific enough to Git, so let's make a version of the file system monitor that is built into Git. That's what my colleague Jeff Hostetler is working on right now. It's getting reviewed in the core Git client right now, and it's available in Git for Windows if you want to try it, because the Git for Windows maintainer is also on my team, so we were able to get an early version in there. But we want to make sure this is available to all Git users. There's an implementation for Windows and macOS, and it's possible to build one for Linux; we just haven't included it in this first version. Our target is to remove that overhead. I know that you at Dropbox had a blog post where you got a huge speedup just by replacing the Perl script hook with a Rust hook, is that correct?

Utsav Shah: With a Go hook, not a Rust hook, yes, but eventually we replaced it with a Rust one.

Derrick Stolee: Excellent. And you also made some contributions to help make this hook system a little bit better and have fewer bugs.

Utsav Shah: I think yes, one or two bugs, and it took me a few months of digging to figure out what exactly was going wrong. It turned out there's this one environment variable which you added to skip process creation. So we just had to make sure to set GIT_FORCE_UNTRACKED_CACHE, which you or somebody else added. We just forced that environment variable to be true to make sure we write the cache every time you run git status, so subsequent git statuses are not slow, and things worked out great. We just ended up shipping a wrapper that turned on the environment variable, and things worked amazingly well. That was so long ago. How long does this process creation take on Windows? I guess that's one question I've had for you for a while: why did we skip writing that cache? Do you know what was slow about creating processes on Windows?

Derrick Stolee: Well, I know that there are a bunch of permission things that Windows does. It has many checks about whether you can create a process of this kind and what elevation privileges exactly you have. A lot of things like that have built up because Windows is very much about maintaining backward compatibility with a lot of these security sorts of things. I don't know all the details, but I do know that it's something on the order of 100 milliseconds, so it's not something to scoff at. It's also something that Git for Windows in particular has difficulty with, because it has to go through a bunch of translation layers to take this tool that was built for a Unix environment, with dependencies on things like shell, Python, and Perl, and make sure that it can work in that environment. That is an extra cost that Git for Windows needs to pay over even a normal Windows process.

Utsav Shah: Yes, that makes a lot of sense. Maybe some numbers, I don't know how much you can share: how big were the Windows and Office monorepos when you decided to move from Source Depot to Git? What are we talking about here?

Derrick Stolee: The biggest number we think about is how many files there are: if I didn't do anything, I just checked out the default branch and asked, how many files are there? I believe the Windows repository was somewhere around 3 million, and uncompressed that was something like 300 gigabytes, just those 3 million files taking up that much space. I don't know what the full size is for the Office monorepo, but it is 2 million files at HEAD. So it's definitely a large project. They did their homework in terms of removing large binaries from the repository, so they're not big because of that; Git LFS isn't going to be the solution for them. They have mostly source code and small files, and those are not the reason for their growth. The reason for their growth is that they have so many files, and so many developers moving that code around, adding commits, and collaborating, that it's just going to get big no matter what you do. At one point, the Windows monorepo had 110 million Git objects, and I think over 12 million of those were commits, partly because they had some build machinery that would commit 40 times during its build. So they reined that in, and we decided to do a history cutting and start from scratch. Now it's not growing nearly as quickly, but it's still a very similar size, so they've got more runway.

Utsav Shah: Yes, maybe just for comparison for listeners: the numbers I remember from 2018 are that the biggest open-source repositories that had people contributing to Git were ones like Chromium. I remember Chromium being roughly 300,000 files, and there were a couple of Chromium engineers contributing to Git performance. So this is a full order of magnitude bigger than that, 3 million files. I don't think there are a lot of people moving such a large repository around, especially with that kind of history, with 12 million objects; it's just a lot. What was the reaction of the open-source community, the maintainers of Git, when you decided to help out? Did you have a conversation to start with, or were they just super excited when you reached out on the mailing list? What happened?

Derrick Stolee: For full context, I switched over to working on the client side and contributing to upstream Git after VFS for Git was announced and released as open-source software. So I can only gauge what I saw from people afterward, and from people I've come to know since then. But the general reaction was, yes, it's great that you can do this, but if you had contributed it to Git, everyone would benefit. Part of it was that the initial plan was never to open source it; the goal was to make this work for Windows, and if that's the only group that ever used it, that was a success. And it turned out that 'because we can host the Windows source code, we can handle your source code' was kind of a marketing point for Azure Repos, and that was a big push to put this out in the world. But then you also have to say, well, it also needs this custom thing that's only on Azure Repos, and we created it with our own opinions, which wouldn't be up to snuff with the Git project.

And so things like FS monitor and partial clone are direct contributions from Microsoft engineers at the time, who were saying, here's a way to contribute the ideas that made VFS for Git work to Git. That was an ongoing effort to bring that back, but it kind of started after the fact: hey, we are going to contribute these ideas, but at first we needed to ship something. So we shipped something without working with the community. But I think that over the last few years, especially with the way we've shifted our strategy to do sparse-checkout things with the Office monorepo, we've been much more able to align: the things we want to build, we can build for upstream Git first, and then we can benefit from them, and we don't have to build them twice. And we don't have to do something special that's only for our internal teams, where once they learn that thing, it's different from what everyone else is doing, and we have that same problem again. So right now, the things that Office is depending on are sparse checkout and, yes, the GVFS protocol, but to them you can just call it partial clone, and it's going to be the same from their perspective. In fact, the way we've integrated it for them is that we've gone underneath the partial clone machinery from upstream Git and just taught it to speak the GVFS protocol. So we're much more aligned, and because we know things are working for Office, upstream Git is much better suited to handle this kind of scale.

Utsav Shah: That makes a ton of sense, and given that, it seems like the community wanted you to contribute these features back. That's just so refreshing when you want to help out. I don't know if you've heard those stories where people were trying to contribute to Git; Facebook has this famous story of trying to contribute to Git a long time ago, not being successful, and choosing to go with Mercurial. I'm happy to see that finally we could add all of these nice things to Git.

Derrick Stolee: And I should give credit to the maintainer, Junio Hamano, and people who are now my colleagues at GitHub, like Peff (Jeff King), and also other Git contributors at companies like Google, who took time out of their day to help us learn what it's like to be a Git contributor, and not just an open-source contributor, because merging pull requests on GitHub is a completely different thing from working on the Git mailing list and contributing patch sets via email. So we had to learn how to do that, and also, the level of quality expected is so high. So how could we navigate that space as new contributors who have a lot of ideas and are motivated to do this good work? We needed to get over the hump of getting into this community and establishing ourselves as good citizens trying to do the right thing.

Utsav Shah: And maybe one more selfish question from my side. One thing that I think Git could use is some kind of plugin system. Today, if somebody checks PII into the main branch of our repository, from my understanding, it's extremely hard to get rid of it without doing a full rewrite. I'm thinking of some kind of plugin for companies where they can rewrite stuff or hide stuff on servers. Does GitHub have something like that?

Derrick Stolee: I'm not aware of anything on the GitHub or Microsoft side for that. We generally try to avoid it with pre-receive hooks, so that when you push, we'll reject it for some reason if we can. Otherwise, it's on you to clean up the data. Part of that is because we want to make sure that we are maintaining repositories that are still valid, that are not going to be missing objects. I know that Google's source control tool, Gerrit, has a way to obliterate these objects, and I'm not exactly sure how it works when Git clients are fetching and cloning and they say, I don't have this object; the client will complain, and I don't know how they get around that. With the distributed nature of Git, it's hard to say that the Git project should take on something like that, because it centralizes things to such a degree that you have to say, yes, you didn't send me all the objects you said you were going to, but I'll trust you anyway. That trust boundary is something that Git is cautious about violating.

Utsav Shah: Yes, that makes sense. And now to the non-selfish questions: maybe you can walk listeners through why Git needs Bloom filters internally?

Derrick Stolee: Sure. So let's think about commit history, specifically when, say, you're in a Java repo, a repo that uses the Java programming language, and your directory structure mimics your namespace. So to get to your code, you go down five directories before you find your code file. In Git that's represented as: I have my commit, then I have my root tree, which describes the root of my working directory, and then for each of those directories I have another tree object, and another tree object, and then finally my file. So when we want to do a history query, say, what commits have changed this file, I go to my first commit and say, let's compare it to its parent. I go to the root trees: they're different. Okay, let me open them up, find out which tree object each has at that first portion of the path, and see if those are different. They're different, so let me keep going. You go all the way down these five levels, and you've opened up 10 trees in this diff to parse these things, and if those trees are big, that's expensive to do.
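The walk described above can be sketched with a toy object store. This is a simplified model rather than Git's implementation: trees are plain dictionaries, object ids are SHA-1 hashes of a JSON dump rather than Git's real tree encoding, and `write_tree` and `path_changed` are hypothetical helper names.

```python
import hashlib
import json


def put(store, obj):
    """Store a blob (bytes) or a tree (dict of name -> id) under a content-derived id."""
    data = obj if isinstance(obj, bytes) else json.dumps(obj, sort_keys=True).encode()
    oid = hashlib.sha1(data).hexdigest()
    store[oid] = obj
    return oid


def write_tree(store, files):
    """Build nested tree objects from {path: content} and return the root tree id."""
    entries, subdirs = {}, {}
    for path, content in files.items():
        head, _, rest = path.partition("/")
        if rest:
            subdirs.setdefault(head, {})[rest] = content
        else:
            entries[head] = put(store, content)
    for name, sub in subdirs.items():
        entries[name] = write_tree(store, sub)
    return put(store, entries)


def path_changed(store, root_a, root_b, path):
    """Return (changed, trees_opened) for `path` between two root trees."""
    a, b, opened = root_a, root_b, 0
    for name in path.split("/"):
        if a == b:
            return False, opened  # identical subtrees: nothing below can differ
        if a is None or b is None:
            return True, opened   # the path exists on one side only
        opened += 2               # parse one tree from each side
        a, b = store[a].get(name), store[b].get(name)
    return a != b, opened


store = {}
base = {
    "com/example/app/core/File.java": b"class File {}",
    "com/example/app/core/Other.java": b"class Other { /* v1 */ }",
}
parent = write_tree(store, base)
child = write_tree(store, {**base, "com/example/app/core/Other.java": b"class Other { /* v2 */ }"})

print(path_changed(store, parent, child, "com/example/app/core/File.java"))
```

Here a sibling file changed, so every tree on the way down differs, and ten trees are parsed only to learn that `File.java` itself is identical in both commits.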

And at the end, you might find out, wait a minute, the blobs are identical way down here, but I had to do all that work to find that out. Now multiply that by a million. To find out that this file was changed 10 times in a history of a million commits, you have to do a ton of work to parse all of those trees. So the Bloom filters come in as a way to say, can we guarantee, in most cases, that a commit did not change that path? We expect that most commits did not change the path you're looking for. So what we do is inject it into the commit-graph file, because that gives us a quick way to index: if I'm at a commit at some position in the commit-graph file, I know where its Bloom filter data is. The Bloom filter stores which paths were changed by that commit, and a Bloom filter is what's called a probabilistic data structure. So it doesn't list those paths, which would be expensive. If I actually listed every single path that changed at every commit, I would have this sort of quadratic growth again, and my data would be in the gigabytes, even for a small repo.

But with the Bloom filter, I only need 10 bits per path, so it's compact. The thing we sacrifice is that sometimes it says yes for a path when the answer is no. But the critical thing is that if it says no, you can be sure it's no. Its false-positive rate is about 2% at the compression settings we're using. So if I think about the history of my million commits, for 98% of them this Bloom filter will say no, it didn't change. So I can immediately go to the next parent and say, this commit isn't important, let's move on, without parsing any trees. For 2% of them, I still have to go and parse the trees, and for the 10 that actually changed the file, the filter will say yes, so I'll parse them and get the right answer. But we've significantly reduced the amount of work we had to do to answer that query. And that's important when you're in these big monorepos, because you have so many commits that didn't touch the file, and you need to be able to rule them out quickly.
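A minimal sketch of a per-commit changed-path Bloom filter follows. The 10 bits per path comes from the conversation; the seven hash positions, the SHA-256 based hashing, and the names `ChangedPathFilter` and `history_of` are assumptions of this sketch and do not reproduce Git's on-disk commit-graph format.

```python
import hashlib


class ChangedPathFilter:
    """Bloom filter over the paths one commit changed (illustrative only)."""

    def __init__(self, paths, bits_per_entry=10, num_hashes=7):
        self.num_bits = max(8, bits_per_entry * len(paths))
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array
        for path in paths:
            for pos in self._positions(path):
                self.bits |= 1 << pos

    def _positions(self, path):
        # Derive several bit positions from two halves of one digest (double hashing).
        digest = hashlib.sha256(path.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def maybe_changed(self, path):
        """False means definitely unchanged; True means changed or a false positive."""
        return all((self.bits >> pos) & 1 for pos in self._positions(path))


def history_of(path, commits):
    """commits: iterable of (commit_id, filter, changed_paths). Returns (hits, diffs_run)."""
    hits, diffs_run = [], 0
    for commit_id, filt, changed_paths in commits:
        if not filt.maybe_changed(path):
            continue  # a definite "no": skip the expensive tree diff
        diffs_run += 1
        if path in changed_paths:  # stand-in for actually parsing and diffing trees
            hits.append(commit_id)
    return hits, diffs_run


commits = []
for i in range(1000):
    changed = {f"docs/page_{i}.md", f"lib/module_{i}.py"}
    if i % 100 == 0:
        changed.add("src/Target.java")
    commits.append((f"c{i}", ChangedPathFilter(changed), changed))

hits, diffs_run = history_of("src/Target.java", commits)
print(len(hits), diffs_run)
```

Every commit that really touched the path is found, because a Bloom filter never answers "no" for an element it contains; the only cost of a false positive is one unnecessary diff.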

Utsav Shah: At what point, or at what number of files in a repository? The file-size thing you mentioned, you can just use LFS for that, and that should solve a lot of problems; it's the number of files that's the problem. At what number of files do I have to start thinking, okay, I want to use these Git features like sparse checkout and the commit-graph and so on? Have you noticed a tipping point like that?

Derrick Stolee: Yes, there are some tipping points, but it's all about whether you can take advantage of the different features. To start, I can tell you that if you have a recent version of Git, say from the last year, you can go to whatever repository you want and run git maintenance start. Just do that in every [inaudible 52:48] of moderate size, and that's going to enable background maintenance. It's going to turn off auto GC, because it's going to run maintenance on a regular schedule. It'll do things like fetch for you in the background, so that when you run git fetch, it just updates the refs and it's really fast, and it also keeps your commit-graph up to date. Now, by default, that doesn't contain the Bloom filters, because the Bloom filters are extra data and most clients don't need them; you're not doing the deep queries that you need to do at web scale, like the GitHub server does. The GitHub server does generate those Bloom filters, so when you do a file history query on GitHub, it's fast. But it does give you the commit-graph, so you can do things like git log --graph fast.

For the topological sorting Git has to do for that, it can use the generation numbers to be quick, as opposed to before, when it would take six seconds just to show 10 commits, because it had to walk all of them. So now you get that for free. Whatever size your repo is, you can just run that command and you're good to go; you only have to think about it once, and your repository is going to be in good shape for a long time. The next level, I would say, is: can I reduce the amount of data I download during my clones and fetches? That's what partial clone is good for, and the kind I prefer is blobless clones. So you run git clone --filter=blob:none. I know it's complicated, but it's what we have, and it just says, filter out all the blobs and give me only the commits and trees that are reachable from the refs. Then when I do a checkout, or when I do a history query, I'll download the blobs I need on demand. So don't get on a plane, try to do checkouts and things, and expect it to work; that's the one thing you have to understand. But as long as you have a network connection relatively frequently, you can operate as if it's a normal Git repo, and that can make your fetch and clone times fast and your disk usage a lot smaller.

So that's kind of the next level of boosting your scale, and it works a lot like LFS. LFS says, I'm only going to pull down these big LFS objects when you do a checkout, but it uses a different mechanism to do that; here, you've got your regular Git blobs. And then the next level is: okay, I'm only getting the blobs I need, but can I use even fewer? This is the idea of using sparse checkout to scope your working directory down. I like to say that beyond 100,000 files is where you can start thinking about using it; I start seeing Git chug when you get to 100,000 or 200,000 files. So if you can max out at that level, preferably less, that would be great, and sparse checkout is a way to do that. The issue we're seeing right now is that you need a connection between your build system and sparse checkout, to say, hey, I work in this part of the code, what files do I need?

Now, if that's relatively stable, and you can identify that, you know what, all the web services are in this directory, and that's all I care about, and all the client code is over there, and I don't need it, then a static sparse checkout will work. You can just run git sparse-checkout set with whatever directories you need, and you're good to go. The issue is if you want to be precise and say, I'm only going to get this one project I need, but it depends on these other directories, and those dependencies might change, and their dependencies might change. That's when you need to build that connection. Office has a tool they call Scoper that connects their project dependency system to sparse checkout and helps them do that automatically. But if your dependencies are relatively stable, you can manually run git sparse-checkout. That's going to greatly reduce the size of your working directory, which means Git is doing less when it runs checkout, and that can help out.
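The three levels mentioned in this answer (background maintenance, blobless partial clone, sparse checkout) map onto three Git commands. The sketch below only builds the argument lists, so it runs without Git or a network; the helper names and the example URL and directories are placeholders, and whether cone mode has to be enabled explicitly depends on the Git version in use.

```python
import shlex


def maintenance_start():
    """Level 1: enable scheduled background maintenance for the current repository."""
    return ["git", "maintenance", "start"]


def blobless_clone(url, directory=None):
    """Level 2: clone commits and trees now, download blobs on demand later."""
    cmd = ["git", "clone", "--filter=blob:none", url]
    if directory is not None:
        cmd.append(directory)
    return cmd


def sparse_checkout_set(directories):
    """Level 3: restrict the working directory to the listed directories."""
    return ["git", "sparse-checkout", "set", *directories]


for cmd in (
    blobless_clone("https://example.com/big/monorepo.git", "monorepo"),
    sparse_checkout_set(["services/web", "common"]),
    maintenance_start(),
):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from inside the relevant directory; keeping them as lists avoids shell quoting problems with unusual directory names.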

Utsav Shah: That's a great incentive for developers to keep their code clean and modular, so you're not checking out the world, and eventually it's going to help you in all these different ways. Maybe for a final question here: what are you working on right now? What should we be excited about in the next few versions of Git?

Derrick Stolee: I've been working on a project this whole calendar year, and I'm not going to be done with it until the calendar year is done, called the sparse index. It's related to sparse checkout, but it's about dealing with the index file. The index file is, if you go into your Git repository, at .git/index. That file is a copy of what Git thinks should be at HEAD and also what it thinks is in your working directory. So when it does a git status, it has walked all those files and recorded, this is the last time it was modified, or when I expected it was modified. For any difference between the index and what's actually in your working tree, Git needs to do some work to sync them up. Normally this is fast; the index is not that big. But when you have millions of files, every single file at HEAD has an entry in the index. Even worse, if you have a sparse checkout, even if you have only 100,000 of those 2 million files in your working directory, the index itself has 2 million entries in it; most of them are just marked with what's called the skip-worktree bit, which says, don't write this to disk. So for the Office monorepo, this file is 180 megabytes, which means that every single git status needs to read 180 megabytes from disk, and with FS monitor going on, it has to rewrite it to disk to store the latest token from FS monitor.

So it takes five seconds to run a git status, even though it didn't say much; you just have to load this thing up and write it back down. So the sparse index says: because we're using sparse checkout in a specific way called cone mode, which is directory-based, not file-path-based, once I get to a certain directory, I know that none of the files inside it matter. So let's store that directory and its tree object in the index instead. It's a kind of placeholder that says, I could recover all the data and all the files that would be in this directory by parsing trees, but I don't want it in my index; there's no reason for that. I'm not manipulating those files when I run git add, and I'm not manipulating them when I do git commit. Even if I do a git checkout, I don't care; I just want to replace that tree with whatever the commit I'm checking out thinks the tree should be. It doesn't matter for the work I'm doing. For a typical developer in the Office monorepo, this reduces the index size to 10 megabytes. So it's a huge shrinking of the size, and it's unlocking so much potential in terms of performance. Our git status times are now 300 milliseconds on Windows; on Linux and Mac, which are also platforms we support for the Office monorepo, it's even faster.
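The placeholder idea can be sketched as a function that takes a full, flat list of index paths and collapses every directory outside the selected cone into a single directory entry. This is an illustration of the concept only: `sparse_entries` is a hypothetical name, entries here are bare path strings, and a real index entry also carries an object id, mode, and stat data.

```python
def sparse_entries(paths, cone_dirs):
    """Collapse paths outside the cone into one 'dir/' entry per out-of-cone directory."""
    cones = [tuple(d.split("/")) for d in cone_dirs]

    def inside(parts):
        # The directory is a selected directory or lies below one.
        return any(parts[:len(c)] == c for c in cones)

    def ancestor(parts):
        # The directory is a proper ancestor of a selected directory.
        return any(len(parts) < len(c) and c[:len(parts)] == parts for c in cones)

    entries, seen = [], set()
    for path in sorted(paths):
        parts = tuple(path.split("/"))
        entry = path  # default: keep the file entry as it is
        for depth in range(1, len(parts)):
            prefix = parts[:depth]
            if inside(prefix):
                break  # the rest of this path is inside the cone
            if not ancestor(prefix):
                entry = "/".join(prefix) + "/"  # the whole directory becomes one entry
                break
        if entry not in seen:
            seen.add(entry)
            entries.append(entry)
    return entries


full_index = ["README.md", "word/a.c", "word/b.c"]
full_index += [f"excel/sheet_{i}.c" for i in range(1000)]
full_index += [f"ppt/slide_{i}.c" for i in range(1000)]

sparse = sparse_entries(full_index, ["word"])
print(len(full_index), len(sparse), sparse)
```

In this toy case 2,003 entries shrink to 5, which mirrors in miniature the drop from 180 megabytes to about 10 megabytes described above.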

So that's what I'm working on. The issue here is that there are a lot of things in Git that care about the index, and they treat the index as a flat array of entries and always expect those to be file names. So all these things around the Git codebase need to be updated to say, well, what happens if I have a directory here? What should I do? All of the ideas of what the sparse index format is have already been released in two versions of Git, and there are also some protections that say, well, if I have a sparse index on disk but I'm in a command that hasn't been integrated, let me parse those trees to expand it to a full index before I continue. And then at the end, I'll write a sparse index instead of writing a full index. What we've been going through is integrating these other commands. We've got things like status, add, commit, and checkout; those are all integrated. We've got more on the way, like merge, cherry-pick, and rebase. These all need different special care to make them work, but it's unlocking this idea that when you're in the Office monorepo after this is done, and you're working on a small slice of the repo, it's going to feel like a small repo.

And that is going to feel awesome. I'm just so excited for developers to be able to explore that. We have a few more integrations we want to get in there so that we can release it and feel confident that users are going to be happy. The issue is that expanding to a full index is more expensive than just reading the 180 megabytes from disk: if I already have it in the full format, reading it is faster than having to parse trees. So we want to make sure that we have enough integrations that most scenarios users run are a lot faster, and only a few that they use occasionally get a little slower. Once we have that, we can be very confident that developers are going to be excited about the experience.

Utsav Shah: That sounds amazing. The index already has so many features, like the split index and the shared index. I still remember that Vim understands when you're trying to read a Git index and just shows it to you in the right format, and this is great. Do you think at some point, if you had all the time and a team of 100 people, you'd want to rewrite Git so that it was aware of all of these different features, and layered so that the different commands did not have to think about these different operations, since they'd get a presented view of the index rather than having to deal with all of these things individually?

Derrick Stolee: I think the index, because it's a sorted list of files, and people want to do things like replace a few entries or scan them in a certain order, would benefit from being replaced by some sort of database; even just SQLite would be enough. People have brought that idea up, but because this idea of a flat array of in-memory entries is so ingrained in the Git code base, that's just not possible. Doing the work to layer an API on top that allows compatibility between the flat layer and something like SQL is just not feasible; we would disrupt users, it would probably never get done, and it would just cause bugs. So I don't think that's a realistic thing to do, but I think if we were to redesign it from scratch, and we weren't in a rush to get something out fast, we would be able to take that approach. For instance, even with the sparse index, if I update one file, I have to rewrite the whole index; that's something I'll still have to do, it's just that it's smaller now. If I had something like a database, we could just replace that entry in the database, and that would be a better operation, but it's just not built for that right now.
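To make the "replace one entry instead of rewriting the whole file" point concrete, here is a purely hypothetical sketch of index entries kept in SQLite. Git does not work this way, as the answer above makes clear; the table layout and the helper names `update_entry` and `entries_under` are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE index_entries ("
    " path TEXT PRIMARY KEY,"   # the primary key keeps paths in sorted order
    " oid TEXT NOT NULL,"
    " mtime INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO index_entries (path, oid, mtime) VALUES (?, ?, ?)",
    [
        ("README.md", "oid-readme-v1", 100),
        ("src/main.c", "oid-main-v1", 100),
        ("src/util.c", "oid-util-v1", 100),
    ],
)


def update_entry(conn, path, oid, mtime):
    """Replace a single entry; no other row is rewritten."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE index_entries SET oid = ?, mtime = ? WHERE path = ?",
            (oid, mtime, path),
        )


def entries_under(conn, prefix):
    """Scan, in sorted order, the entries whose path starts with `prefix`."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)  # smallest string past the prefix range
    rows = conn.execute(
        "SELECT path, oid FROM index_entries WHERE path >= ? AND path < ? ORDER BY path",
        (prefix, upper),
    )
    return rows.fetchall()


update_entry(conn, "src/main.c", "oid-main-v2", 200)
print(entries_under(conn, "src/"))
```

The update touches one row, and the range query gives the ordered scan of a directory that the flat sorted array provides today.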

Utsav Shah: Okay. And if you had one thing that you would change about Git's architecture, like the code architecture, what would you change?

Derrick Stolee: I think there are some areas where we could add some pluggability, which would be great. The code structure is flat; most of the files are just C files in the root directory, and it'd be nice if they were componentized a little better, with API layers that we could operate against. Then we could do things like swap out how refs are stored more easily, or swap out how the objects are stored, and have that be less coupled to a lot of the things across the built-ins and elsewhere. But I think the Git project is extremely successful for its rather humble beginnings. It started as Linus Torvalds creating a version control system for the Linux kernel in a couple of weekends, or however long he took a break to do that. And then people just got excited about it and started contributing to it. You can tell, looking at the commit messages from 2005 and 2006, that this was the Wild West: people were fast in replacing code and building new things. It didn't take very long, though; definitely by 2010 or 2011, the Git code base was much more solid in its form and composition. And the expectations for contributors to write good commit messages and make small changes here and there had already been established by that time, a decade ago. So Git is solid software at this point, and it's very mature, so making big, drastic changes is hard to do. But I'm not going to fault it for that at all. It's good to be able to operate slowly and methodically when you build and improve something that's used by millions of people. You've just got to treat it with the respect and care it deserves.

Utsav Shah: If you think of software today, you run into bugs in so many different things, but Git is something that I think pretty much all developers use, probably more than anything else, and you don't even think of Git as having bugs. You think, okay, I messed up using Git; you don't think, well, Git did something interesting. And if it turned out that Git had all sorts of bugs that people ran into, I don't even know what their experience would be like. They'd just get frustrated and stop programming or something. But yes, well, thank you for being a guest. I think I learned a lot on the show, and I hope listeners appreciate that as well.

Derrick Stolee: Thank you so much, it was great to have these chats. I'm always happy to talk about Git, especially at scale. It's been a thing I've been focusing on for the last five years, and I'm happy to share the love.

Utsav Shah: I might ask you for another episode in a year or so, once sparse indexes are out.

Derrick Stolee: Excellent. Yeah, I'm sure we'll have lots of features and directions.
