Ep. 97
Monday, January 19, 2026

Reliability, Scalability, and Maintainability - Designing Data-Intensive Applications by Kleppmann

Book Covered

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

by Martin Kleppmann

Book links are affiliate links. We earn from qualifying purchases.

Author

Martin Kleppmann

Hosts

Carter Morgan, Host
Nathan Toups, Host

Transcript

This transcript was auto-generated by our recording software and may contain errors.

Carter Morgan (00:00)

There's a lot of value in depth of knowledge and knowing a particular area of your field really well. There is also tremendous value as a software engineer in your breadth of knowledge and just being exposed to all of these concepts.

Hey there, you're listening to Book Overflow, the podcast for software engineers by software engineers, where every week we read one of the best technical books in the world in an effort to improve our craft. I'm Carter Morgan. I'm joined here as always by my co-host, Nathan Toups. How you doing, Nathan?

Nathan Toups (00:31)

Doing great. Hey, everybody.

Carter Morgan (00:33)

Well, thanks for tuning in, everyone. As always, like, comment, subscribe. If you're on the YouTube video, if you're on the Spotify video, anywhere, we'd really like to get a comment or a like or a subscribe. Share the podcast with your friends and coworkers; it really helps the podcast. And you can also book time with us on the Leland platform if you'd like to get some one-on-one career coaching with Nathan or me. And you can also join our Discord. We started the Book Overflow Discord. There's a link to it in the comments right now. That's been actually really fun, to

have lots of fans trickle in and to start more of a conversation.

Nathan Toups (01:04)

Yeah, we've got, I think, 19 seats left for what I'm calling the alpha tester's role. You'll be enshrined for all of eternity as one of the first 100 folks in the Discord. I think we'll also do a beta tester, so that'll be the second tier of early adopters. After that, I have no idea what we're going to do, but we'll have some fun perks for joining the server early, so come hang out.

Carter Morgan (01:30)

Yeah, this is a pretext to the custom crypto coin Nathan and I are going to launch and then do a rug pull on all of you. So stay tuned for that. Does the rug pull work if you announce it in advance?

Nathan Toups (01:35)

There you go. Yeah, yeah, I'm gonna.

You know, I think the SEC really appreciates it, you know.

Carter Morgan (01:47)

Yeah,

I'm truly college football poisoned. In my brain, that's the SEC, like the Southeastern Conference. What are you talking about? And I'm like, oh, I know what you're talking about. Well, we're not here to joke about the Securities and Exchange Commission. We are here to cover what is easily the most requested book of all time on this podcast, and that is Designing Data-Intensive Applications by Martin Kleppmann.

Nathan Toups (01:54)

You

Carter Morgan (02:14)

We're excited to tackle this. We had been holding off on this for a while because there's a, there's a part two, not a part two, a second edition coming out. And this book is old. What'd you say, Nathan? 2017 is when this was written.

Nathan Toups (02:27)

I think it came out,

yeah, came out in 2017.

Carter Morgan (02:29)

Yeah. So we've been hoping to do the second edition, but it kept getting delayed and delayed. And, you know, we thought it's time. We got to read this. We got to tackle it, and we're really excited to do it. So this is a first for Book Overflow. This is going to be our first four-parter. So tune in over the next four episodes while we cover Designing Data-Intensive Applications. And it was funny, Nathan, you shared on the Discord.

Nathan Toups (02:47)

Yeah.

Carter Morgan (02:56)

I thought this was a really fun trip down memory lane. You shared the original Reddit post I made on the Georgia Tech subreddit, looking for a cohost for a maniac's idea of a podcast: to read a new software engineering book each week. And I thought that was fun.

Nathan Toups (03:10)

Yeah, yeah, so that's

some of the perks and the extras of being on the Discord. First of all, you can ask us questions, and we're, you know, small enough of a podcast that we'll actually answer right now. But secondly, yeah, there are little Easter eggs. Yeah, I like to search periodically to see if anybody ever mentions the podcast or has questions and things. And I was just on Reddit looking and I was like, oh yeah, look at that. That's the original post that Carter made. It was pretty cool.

Carter Morgan (03:19)

Yeah,

Well,

I was looking at it and I was trying to see, like, how far has the podcast strayed from the original vision? And the answer is not very far. It's actually pretty close. But I was bringing that up because one thing I mentioned in that post is, I said, look, this isn't a book report. We're not going to do a faithful dissection or retelling of everything we read. Instead, I wanted it to feel like two coworkers chatting over lunch. And we're just going to talk about kind of the most interesting ideas and

I'm just saying that as a disclaimer, because if you're tuning into these episodes thinking this is the authoritative retelling of Designing Data-Intensive Applications, that if you listen to this podcast you won't need to read the book: that's never been the aim of the podcast. We could not possibly do this book justice over the course of four episodes. You'd probably need 50 episodes to fully discuss everything in this book. We're just going to talk about the meat

of it, at least what we interpret as the meat, the most interesting stuff. And we're going to try to do it justice, because this is a legendary book and we're so excited to talk about it. I guess I'll introduce the book and the author for anyone who's not as familiar with Designing Data-Intensive Applications. It's written by Martin Kleppmann. The author introduction is: Martin Kleppmann is an associate professor at the University of Cambridge, where he works on distributed systems and local-first collaboration software.

Before academia, he was in the trenches. He co-founded Rapportive, which was acquired by LinkedIn in 2012, where he worked on large-scale data infrastructure. He's also one of the people behind Automerge, an open source library for building collaborative applications. The book introduction is: Data is at the center of many challenges in system design today. Difficult issues need to be figured out, such as scalability, consistency, reliability, efficiency, and maintainability.

In addition, we have an overwhelming variety of tools, including relational databases, NoSQL datastores, stream or batch processors, and message brokers. What are the right choices for your application? How do you make sense of all these buzzwords? In this practical and comprehensive guide, author Martin Kleppmann helps you navigate this diverse landscape by examining the pros and cons of various technologies for processing and storing data. Software keeps changing, but the fundamental principles remain the same. With this book, software engineers and architects will learn how to apply those ideas in practice and how to make full use of data in modern applications.

And you're going to hear, over this podcast, I'm going to use the two pronunciations of "data" interchangeably without any reason. So buckle up. Nathan, we have just finished part one of Designing Data-Intensive Applications. The book is divided into several parts. I don't know how many parts, to be totally honest. And we chose part one because it fit our cadence. Is it four parts? Oh, we might just cover it in four.

Nathan Toups (06:15)

I think it's actually in four parts, though. I don't think all the parts are alike;

some of them I think are gonna be a bigger stretch than others, and we'll see. We'll see what happens.

Carter Morgan (06:21)

Yeah, well part

one was almost exactly a quarter of the book. So it worked out great. So we read part one. Nathan, give me your takes on part one.

Nathan Toups (06:31)

Yeah, so I read this like it was almost like Martin Kleppmann was writing a letter to me of things I need to think about. This is about the right level of depth and breadth that I like in a book like this. Because he spends a lot of time being like, okay, well, we can't go deep into the inner workings of a B-tree, but he'll give you enough of a, here's how it operates. Here's the efficiencies. Here's the inefficiencies. Here's the trade-offs.

Carter Morgan (06:46)

Yeah, yeah.

Nathan Toups (07:01)

I love the structure of this book. I felt that the pacing was really easy for me to wrap my head around, because there's a lot of trust I had in Kleppmann, in that he would declare a through line. Then he would give these examples of why something exists, the trade-offs, what people were trying to do with something,

where it kind of fell short, or where something's strong suit is. And then he would kind of weave this in. I don't know, it felt very narrative, and that's not common, especially for something that's this ambitious. It's a very ambitious book. I think it easily could have fallen on its face, and it doesn't. And I understand why people have asked us to read this book. Number one, it's a very large book and you're like, should I even read this book?

Carter Morgan (07:32)

Right.

Nathan Toups (07:49)

But the other part is, does it still hold up, right? It's written in 2017. Is it still worth reading? And I'll give you a little spoiler, which is like, yes it is, with a big asterisk that says, but there are some missing pieces. The world has moved on in some of these debates and some of these arguments. And I'm actually pretty excited to explore this today with you, Carter. And so I'd love to hear your general thoughts.

Carter Morgan (08:15)

Yeah, I completely agree on the pacing. That's something I hadn't even thought about until you mentioned it, but yeah, it is fantastically written. We have read books in the past where it's a little like, okay, you're going too deep, or you're spending too much time on this, or we're in chapter nine now, but you're mentioning stuff that was in chapter three, like we didn't read chapter three. No problem here at all with any of that. I mean, I'll say with this,

I am listening to the audiobook of this. I believe you're doing the audiobook as well, Nathan.

Nathan Toups (08:47)

Yes, and I'm using the Alex Hormozi method, which I know might make some people's eyes roll, because he's like a big sales and marketing kind of person. He actually recommends, if you're gonna listen to audiobooks, to also have the physical book that you can kind of follow; he actually recommends doing both at the same time. I did not do that. Because he says it helps with retention.

Carter Morgan (08:57)

Ha

interesting.

Nathan Toups (09:09)

But I primarily listened to the audiobook. I think I listened to like a three-and-a-half-hour chunk of it in one sitting, because I had to drive from San Jose down to Ojochal, where I live these days in Costa Rica. So I had a big chunk of it by myself, just driving a very monotonous drive. Beautiful, but monotonous. So yeah, I've listened to it in big chunks like that.

Carter Morgan (09:16)

well.

Nathan Toups (09:33)

But I went back, because I will tell you, if you don't read the, I mean, sorry, read the Kindle book or have the physical copy, there's a lot of stuff to miss. There are very deep graphical representations of some of this stuff. I did well with the audiobook, though, because a lot of this is review for me. I knew a lot of these concepts and have run into managing this as a platform engineer and an architect.

Carter Morgan (09:47)

Right.

Right.

Nathan Toups (09:59)

If you're reading, if you're hearing some of these concepts for the first time, I think the audiobook would be incredibly difficult. I'll just put it that way. So what was your experience with the audiobook?

Carter Morgan (10:07)

Well, so I'll say that I don't think there's been a single concept that I'm hearing about for the very first time. Like talking about LSM-trees and B-trees, I've heard about that. We talk about different data storage schemas, like Avro. I'm like, okay, okay, I got that, right. The whole first section about defining reliable, scalable, maintainable applications, that very much felt like review to me as well.

I think you're right. I have not been able to read any of the actual book for this. I really like what you're saying, like, have them both at the same time. I've been listening to this... I bike to and from work, and so I've been listening to this on my bike commute. And there are times where I can sort of tell, like, dang it, I wish I could pause and look at that section again. But you know, we've got to record an episode every week, and so I haven't been able to go back and read and revisit as much as I'd like. But here's what I'll say.

We've talked about this in the podcast before. There's a lot of value in depth of knowledge and knowing a particular area of your field really well. There is also tremendous value as a software engineer in your breadth of knowledge and just being exposed to all of these concepts.

Even if that exposure is just you saying, I've heard that once before, I'm aware that this exists. I've been doing a lot of

interviews. I mentioned we've been trying to hire for a position, and we've brought a lot of candidates in, and my company keeps a pretty high talent bar; we would rather reject candidates than hire one we think won't really level us up as an engineering organization. And I'm just really shocked with a lot of candidates. Maybe not shocked, but you can tell that a lot of candidates kind of get into these interviews, especially system design interviews, and they're like, I didn't know I was supposed to know any of this, right? Or they don't even know where to begin,

because their life is just open up the code base, make your changes, and then submit the PR, and that's it. And so I think for engineers like that, listening to this audiobook, even if a lot of it is a little in one ear and out the other, just being made aware that there's a whole world out there and so many of these concepts, just having any inkling of understanding of what's going on at the lowest level of your

data, because that's what this is all about. Part one is called Foundations of Data Systems. It's really, really helpful. And so I'm not starting from zero. I've been familiar with a lot of these concepts, but I wish I could say I'm an expert after listening to this audiobook, you know, but obviously I'm not.

Nathan Toups (12:33)

Yeah.

I think this is an important point. So I'll say a couple things. First of all, yes, listening to audiobooks is its own skill and talent. I think all of us can have our minds drift. We can do this when we're reading with our eyes too, right? And with reading with your eyes, you just kind of go back and go, you know what? I didn't really comprehend those last two paragraphs that I looked at.

Carter Morgan (13:05)

Right.

Nathan Toups (13:07)

With audiobooks, you have to be pretty aggressive, if you have the opportunity, to rewind or hit a bookmark. Sometimes I'll kind of do those things. Not always easy to do, especially if you're riding a bike or driving, things like that. What I will say is that, especially with this book, coming away with, hey, I didn't fully grasp everything in this one section, but I do remember that there was a trade-off when it came to

high transaction counts on disks. And so just kind of make that mental note and be like, hey, if this ever comes back up in the future, I'll just know, before I weigh in one way or the other when I hear somebody talking about it, I'll go back and read DDIA. I'll go look at that section. I'll go deep dive into understanding, like, what are the characteristics of fault tolerance for this one technology that we're looking at? And does that hit the risk profile that's okay for us, right? Like, this book is really kind of this higher order thinking of,

Carter Morgan (13:39)

Right.

Right.

Nathan Toups (14:02)

every one of these technologies has a strong suit and a weakness.

And you need to understand the business problem you're trying to solve, or the reality of the hardware available to you, or where you think you're going to go with scaling, and you need to think about this and say, okay, well actually this puts an undue risk on our business, because if this really weird edge case of fault tolerance comes out and it corrupts our data, we could lose millions of dollars, right? And that's the thing that nobody can just solve for you, right? When you're kind of looking into these

types of problems. And this is what I love about this book, is it kind of gives you a sampling of all these little real-world things, like, oh, this team at Facebook said that the schema evolution thing didn't work very well, and so they came up with a different approach, and you're like, oh, that's very clever. You know, I understand why they made that decision. I know we're gonna fanboy out and talk about these parts of the book. But yeah, it's

Carter Morgan (14:41)

Ha

I hope we can get Kleppmann on. I don't know if he does a lot of media or anything. Maybe he would, because he wants to promote the second edition. But we're devoting four episodes, Martin, to discussing your book. So we'd love for you to come on.

Nathan Toups (15:07)

yeah.

Great.

Hey, and I'm gonna put this up here. If the second edition does come out this year, because it has been put off a little bit, it was supposed to come out last year. If it does come out this year, I think it would be really cool to do a follow-up episode, because this is such an interesting thing, and I think it would be really cool for our audience to know: well, okay, I had the first edition sitting on my shelf for five years, and I never got around to reading it. Should I go buy the second edition? And we can weigh in. We can be like, yeah, you're really missing out on, you know,

Carter Morgan (15:30)

I think so, yeah.

Nathan Toups (15:46)

vector databases or whatever he's gonna cover in the new version. We'll see.

Carter Morgan (16:24)

Well, this first part might be the one most immune to second edition changes, because it's all about the foundations. I really enjoyed chapter one, which says, look, the whole point of understanding any of this is that you want to build reliable, scalable, and maintainable applications. And so he devotes a little bit of time to, okay, if we're going to talk about reliable, scalable, maintainable, let's define that. Let's define what reliability, scalability, and maintainability are.

As far as reliability goes, he points out something I thought was very interesting. He says faults do not equal failures. He says a failure is when your service stops working for the end user, but a fault is something that could potentially lead to a failure. And our job is to design fault-tolerant systems. You know, so a fault could be something physical. I mean, you know, a solar flare; we joke about solar flares all the time at work. Sometimes, you know, we're a startup, and so we can't.

When I was at the cloud provider I used to work at, if we had something like a deviation in the system that caused P99 latency to increase for five minutes or whatever, right? We would devote a significant amount of time to understanding what that was. We're getting much better at my current job at devoting time to that, but it's like 45 minutes. And after 45 minutes, we try to figure out, I guess here's what I'll say, we try to figure out

why the system reacted in the way it did, but sometimes we can't devote nearly as much time to figuring out what caused it to begin with. So we'll be like, okay, we were hit with a bunch of queries, right? We can't spend the time figuring out what exactly those queries were, or why we saw an increase at this moment, but we can figure out why the system, when subjected to those queries, was performing poorly. But anyhow, and so at a certain point, when we can't...

When we can't devote time to figuring out the why, or what caused it, we'll say it's a solar flare. Like, that's probably what happened. Some solar flare hit the system, and you know. But the whole point is that you should be designing systems that are fault-tolerant to reduce failures. So you should be designing things that are tolerant to solar flares. You know, malformed data in your database could be considered a fault. And if your whole system blows up when it encounters malformed data, that's a problem.

This is best exemplified by Netflix, the whole chaos engineering approach, right? Let's purposefully kill servers and see how the system responds. I would love to work at a place that was much more aggressive about things like that, but I've just never been able to work somewhere that is running those sorts of simulations, at least with any sort of frequency. So, I mean, I thought those were interesting points about reliability.

Nathan Toups (18:56)

Right.

Yeah, this gets back to something I really appreciate about various software communities. How you handle faults, how you do fault tolerance, really is cultural, right? It really has to do with the domain of the problem you're solving. For instance, I'll bring up Go, which again, I'm a fanboy of, but in Go...

Errors are just values. There is no try/catch. There is no exception flow control. There is a panic that is sort of existential, so I guess technically there is some of that, but you don't use that in your normal day-to-day life. One of the things that's interesting about this is that you say, okay, well, errors will happen. And it makes you think upfront about, how am I handling errors? Am I just wrapping the error and shooting it up the stack

and letting somebody else upstream deal with it? Or is it like the John Ousterhout method, where you define errors out of existence? Good fault-tolerant systems figure out what that right contract is. Can I build something that is resilient, where the errors actually aren't something that goes to the end user? It's just some part of the fault tolerance system

that kicks into place. Again, obviously there are trade-offs, but these are fun problems to think about. And I think this first section of the book also talks about things like, well, there are hardware faults, there are software errors, there are human errors, right? Like, if you have some manual part of your system, and I've seen this time and time again with the startups that I work with: somebody is the deploy manager who SSHes into some machine and triggers some magic job, and

they try to get somebody else in the company to do that, and they don't realize that step three is an undocumented step, right? Step one and two, everybody gets it. Step three is this kind of weird edge case that, you know, Sally always checked, and nobody realized that Sally checked it, and that's why her deploys were always perfect. And so there are all these ways that faults, failures, fault tolerance, and these things can be built into a system.

Yeah, you have to, again, this is what I love about this book, you have to think. He'll give you these little thought experiments, and then he'll be like, okay. Like, I thought the other section in this first chapter, on scalability, was really interesting, where he's basically talking about Twitter. And this is a classic, I would say this is a classic systems problem that will come up from time to time in your interviews:

Carter Morgan (21:33)

Yes.

Yes.

Okay.

Nathan Toups (21:59)

how do you aggregate a feed to everyone, right? And I loved it, because it was like, okay, well, I do some query and I look for everyone who I follow, and then I go and try to grab the latest piece of information, and then I shove all of these together into a timeline, and then display that timeline to the end user.

Carter Morgan (22:04)

Hmm.

Nathan Toups (22:21)

The problem is, this is like a one-to-many relationship. This is a very expensive join operation that happens in the database. And of course, he's using this as sort of a layup to talk about all these other topics in the book. And they realized that actually there's a much better way to do this by inverting it. When someone posts... people post rarely, but they read a lot, right? So you post rarely, yeah.

Carter Morgan (22:24)

Right.

Yeah, they said 12,000

tweets a second, write operations, 300,000 read operations a second.

Nathan Toups (22:50)

Yeah.

So for most people, who don't tweet that often and don't have that many followers, it makes way more sense for them to have this sort of precomputed, cached timeline that posts get pushed into, and it makes reads much cheaper. Except there's this other problem, which is: what happens when you're a super, super popular person, like you're Elon Musk on Twitter, or you're somebody else with millions and millions and millions of followers?

When you post, that actually breaks in the opposite direction, because all of a sudden your one tweet has to update millions of people's cached timelines.

And so they actually had to swing the pendulum back to the original pattern for a special subset, accounts above a certain follower count. And it was kind of cool to see: okay, here's this engineering problem, here's how we fixed it, but actually there's this edge case that is so existential that we have to go back and fix it a different way, and then we have this hybrid approach. And I think everybody who's seen cool software sees these things in reality, right? There's some sort of weird thing, and

you look at it and you're like, why is it this way? And they're like, oh, actually it makes sense given the constraints that I have.
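The hybrid Twitter fan-out the hosts describe can be sketched with in-memory maps standing in for the real follower graph and timeline caches (the book describes the approach; this toy code and its names are ours):

```go
package main

import "fmt"

// Feed is a toy hybrid fan-out: in-memory maps stand in for the
// real follower graph and per-user timeline caches.
type Feed struct {
	followers  map[string][]string // author -> follower ids
	timelines  map[string][]string // per-user precomputed timelines (fan-out on write)
	celebPosts map[string][]string // posts by high-follower accounts (fan-out on read)
	threshold  int                 // follower count above which we stop fanning out
}

// Post pushes a tweet into each follower's cached timeline, unless
// the author is over the celebrity threshold, in which case the
// tweet is stored once and merged at read time instead.
func (f *Feed) Post(author, tweet string) {
	fs := f.followers[author]
	if len(fs) > f.threshold {
		f.celebPosts[author] = append(f.celebPosts[author], tweet)
		return
	}
	for _, u := range fs {
		f.timelines[u] = append(f.timelines[u], tweet)
	}
}

// Read returns the user's cached timeline merged with any posts
// from celebrity accounts they follow.
func (f *Feed) Read(user string, follows []string) []string {
	tl := append([]string{}, f.timelines[user]...)
	for _, a := range follows {
		tl = append(tl, f.celebPosts[a]...)
	}
	return tl
}

func main() {
	f := &Feed{
		followers:  map[string][]string{"alice": {"bob"}, "celeb": {"bob", "carol", "dave"}},
		timelines:  map[string][]string{},
		celebPosts: map[string][]string{},
		threshold:  2,
	}
	f.Post("alice", "hi")  // fanned out to bob's timeline at write time
	f.Post("celeb", "big") // stored once; merged into timelines at read time
	fmt.Println(f.Read("bob", []string{"alice", "celeb"})) // [hi big]
}
```

The trade-off is exactly the one in the episode: ordinary accounts pay a small cost per post so that 300,000 reads a second stay cheap, while a handful of celebrity accounts fall back to doing the join at read time.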

Carter Morgan (24:04)

Well, when he talks about scalability, he says it's not binary, like a system scales or it doesn't. Systems scale in different ways, right? And writing a write-heavy service is very different from writing a read-heavy service, right? And also in terms of scalability, like, I don't know, stand up a server where all it has is a health check endpoint, right? And then put that on Kubernetes and boom, it scales. You can probably handle a million requests per second, right, if you do that.

Nathan Toups (24:08)

Great.

Carter Morgan (24:34)

But there's not much to it. He talks a lot about performance metrics, percentiles over averages, you know, so you've got your P50, and what's so dumb about this is that I know about P95 and P99 latency, right? And if you're familiar with the podcast, you know that Nathan mocked me because at my current place

we did not have OpenTelemetry set up, or any sort of metrics. He was like, what are you doing? How do you know if the system is stable? And so I got that all set up, and it looks great now. In fact, just recently I also got our MongoDB drivers exporting automated metrics, which has been great, because for the whole month of January my project has been performance improvements for the site. And we had,

we had this weird library that was serving as the application bottleneck. And so we got that removed, which is great. So that's no longer the bottleneck, but it's so funny, because now the bottleneck has moved to MongoDB. And so, I know, right. And so it's cool that this former library that had been such a pain for us, we finally solved that, but now I'm doing all this Mongo stuff. I got these Mongo driver metrics automatically exporting. But anyhow, so yeah, I have P95 and P99 latency set up, but I was monitoring average latency.

Nathan Toups (25:36)

Isn't that funny?

Carter Morgan (25:53)

And then this book pointed out P50 latency, median latency. I'm like, why do we not have a dashboard for that? You know, why don't we have a panel for that on our dashboard? So I got that set up and I was like, hey, our P50 latency is a lot better than our average latency, and in part that's because our P99 is just too high. But this is something I've been thinking about all month, because again, my whole job this month has just been getting latency down on the site. Amazon determined, and this is what Martin Kleppmann says in the book,

Amazon determined that every 100 milliseconds of increase in latency decreased sales by 1%. Insane, right? That's 100 milliseconds, right? One tenth of a second leads to a 1% decrease in sales. And so it's interesting, because again, I'm jealous of a company like Amazon that can afford to run those sorts of experiments and come up with that definitively. I don't have any proof, and I don't think I'll have any proof once this is all said and done, that by

Nathan Toups (26:28)

That's nuts. That is nuts.

Carter Morgan (26:52)

decreasing our latency, we saw this increase in this metric for the business or whatever. But I keep thinking about that. Like, I bet that holds true. I remember when we were at the 'Plex, the early Google engineers very strongly believed in the power of low latency. Yeah, and how it grows a business.

Nathan Toups (27:09)

This was a really interesting one too. So I was working at a bigger startup, for me, about 300 ICs, and we were seriously funded. And we had some mergers and acquisitions; there was a company that we had acquired out of Spain, actually. And I remember that one of the weird problems we had didn't show up in the dashboards, and this is the worst case scenario. We had all this tracking metric stuff, so the headers were actually pretty full. Like, way too full.

And what would happen was, if the headers got too full, some of the edge routers would actually either completely block the request or cut out some of the headers that were annotated in.

And this would disproportionately affect people who were part of the loyalty and rewards program. So the most valuable customers, the people who spent enough to actually want to be in loyalty and rewards, were the ones getting the worst performance, the worst user experience. And so we ended up having to make dashboards specifically for loyalty rewards. It was a very interesting metrics challenge, to kind of be like, how do we actually... I just had this conversation with another,

Carter Morgan (27:56)

interesting.

Yeah.

Nathan Toups (28:19)

a client I was talking to, actually: if you run into a problem that you can't ask the question of your data, that's a really good indicator that you need to beef up your metrics in some way. Like, sometimes you just have a question and you're like, I actually don't know how we measure that. That's one; it's also processes that scare you, right? So those are a couple of the things that come up. And this actually, I think, gets us into the next part, where he talks about maintainability.

Carter Morgan (28:31)

Yeah.

I just want to say one more thing about scalability, which is what he says about percentiles. So, right, P99 means one out of every 100 requests is slower than that number, and he talks about P99.9, which is one out of every thousand requests. He says that big companies have determined that P99.9 is about as far as you want to go. When you get to P99.99, one out of every 10,000 requests, that's where you're getting into things like

Nathan Toups (28:48)

yeah, go ahead. Yeah, go ahead.

⁓ yeah.

Carter Morgan (29:16)

solar flares, or just, you can't even really tell what's causing these requests to be extra long. But he says you might think P99.9 is excessive, one out of every 1,000 requests. The point he makes, though, and I believe this came from Amazon as well, is that your customers with the most data tend to be your most valuable customers, and those tend to be the sorts of requests that start showing up in P99.9. So you might think, one out of every 1,000, we can afford to lose it. And to be totally honest, at my company right now,

Nathan Toups (29:36)

Yeah, yeah, exactly.

Carter Morgan (29:46)

I'm not even touching P99.9. I'm focused entirely on P99; we're starting out, doing baby steps, right? But that is something to think about: those tail-end requests aren't necessarily random. They might be associated with your most valuable customers, and so you've got to make sure that you're taking care of them. But anyway, that does bring us... oh, sorry. Go ahead.
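A quick sketch of what those percentiles mean, in Python. The latency sample here is invented purely for illustration: mostly ~300 ms requests plus a small slow tail.

```python
import random

# Invented latency sample: 10,000 "normal" requests around 300ms,
# plus 100 slow outliers forming the tail.
random.seed(42)
latencies_ms = [random.gauss(300, 50) for _ in range(10_000)]
latencies_ms += [random.uniform(2_000, 6_000) for _ in range(100)]

def percentile(values, p):
    """Value below which roughly p percent of observations fall."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(len(ordered) * p / 100))
    return ordered[index]

# p99 = the latency that 1 in 100 requests exceeds; p99.9 = 1 in 1,000.
for p in (50, 99, 99.9):
    print(f"p{p}: {percentile(latencies_ms, p):.0f} ms")
```

The jump between p99 and p99.9 is exactly the tail being described here: a handful of slow requests dominate the upper percentiles while barely moving the median.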

Nathan Toups (29:59)

Yeah, that's interesting. You know, I've actually never measured P99.9 either. No, that's interesting. Okay, yeah, that makes sense. Well, I've never worked at a company that's that large, so I think P99 was actually us holding ourselves to a really high bar. Because again, right, what you're doing with this, and I think for the uninitiated, P99 is looking for outliers.

Carter Morgan (30:08)

really? We did it at the cloud provider. ⁓

Right.

Right, right.

Nathan Toups (30:28)

It's saying, hey, 99% of requests come in at or under this number. Whereas if you just look at the average latency of something, you're gonna get a number that's hopefully in a cozy spot, or you'll see something that looks really bad. So there's two things that'll happen with an average. Number one, it'll either lie to you and make things look better than they really are by hiding the edge cases, or

Carter Morgan (30:35)

Mm-hmm.

Nathan Toups (30:46)

you'll get some number for your average that looks really high, but really what's happening is that one outlier just happens to be a super crazy outlier. This is kind of like looking at median versus mean income in the United States, right? The haves versus the have-nots: the people who make really, really, really high incomes will completely distort what the average income is for the American household. And so that's why we call it...

Carter Morgan (31:11)

It's like if you look at

the average wealth for millennials: it was looking a lot higher than it should because of Zuckerberg. Like, him alone was distorting it that much.

Nathan Toups (31:18)

Yeah, exactly.

Right, right, there's this joke that if you're at a cocktail party and a billionaire walks into the room, the mean income goes up by hundreds of thousands of dollars, right?

Carter Morgan (31:32)

Is it... it's Wyoming?

It's Wyoming, I think, or Montana, I forget. But it's the only state in the union where the average income for black Americans is higher than the average income for white Americans, because Kanye West lives there and there are so few black people in that particular state. So anyhow, yeah, median versus mean, lots of funny things.
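The cocktail-party joke in code, with made-up numbers: one extreme value drags the mean while the median barely notices.

```python
from statistics import mean, median

# 100 partygoers each earning 60k... then a billionaire walks in.
incomes = [60_000] * 100
incomes.append(1_000_000_000)

print(f"mean:   {mean(incomes):,.0f}")    # roughly ten million
print(f"median: {median(incomes):,.0f}")  # still 60,000
```

This is the same reason tail latencies get reported as percentiles rather than averages: the median is robust to outliers, the mean is not.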

Nathan Toups (31:45)

⁓ my god.

Yeah.

Exactly. So I will say, and this comes up time and time again: having at least a freshman-level college understanding of statistics is actually very important in our jobs. And I would say the longer you put it off, the worse it will be, because if you want to start asking interesting questions and solving interesting problems in your company, you're going to have to get pretty good with

Carter Morgan (32:02)

Right. Yeah, yeah.

Nathan Toups (32:17)

how statistics work and how statistics can lie to us. Is the data actually giving us something meaningful? And it's non-trivial. And I will tell you that I've walked into environments where...

they are willing to lie to themselves with statistics so that it looks good to a manager, but they're not actually solving the real problem, because they aren't putting the right statistical rigor in place. And sometimes you kind of have to break it to them: hey, this isn't actually how we should be measuring ourselves. So, yeah, I think you can't meet your goals with scalability unless you're measuring things properly, and I think that's what this section really drove home, you know.

Carter Morgan (32:57)

Well,

and the thing with P99 is, you say, well, it's one out of every 100, but think about it. It's not crazy that each page on your website would make four requests to your backend, right? So if someone clicks around 20 pages over the course of a session, odds are they'll encounter one of those one-in-100 requests. And if your median latency is 300 milliseconds but your P99 is six seconds, that means at some point during the user experience, they're going to encounter a page that takes six seconds to load.

Right? Or some feature is going to take six seconds to load. So, you know, worth considering. I guess maintainability... go ahead.
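Putting numbers on that point (the pages-per-session and requests-per-page figures are just the ones used above, for illustration):

```python
# If 1 in 100 requests is slow (above p99), a session of
# 20 pages x 4 backend requests each is more likely than not
# to hit at least one slow request.
requests = 20 * 4
p_at_least_one_slow = 1 - 0.99 ** requests
print(f"{p_at_least_one_slow:.0%}")  # about 55%
```

So a "1 in 100" tail is something the typical user actually experiences, not a rounding error.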

Nathan Toups (33:25)

Right. Yep. And look, yeah,

that was the thing. Time and time again, I've found there's some cache invalidation path, you know, cache poisoning or something, that comes up, or we find there's some usage of a web page that ends up being this really expensive SQL query or something. It's a really good investigative tool, like, truly.

Carter Morgan (33:39)

Yeah.

And then there's maintainability, which we can just touch on. Maintainability is a lot more about the architecture of the code itself, although he does talk about things like monitoring, automation, and documentation. I mean, this is where something like Fundamentals of Software Architecture or A Philosophy of Software Design is going to talk a lot more about it. And again, you could devote a whole series of episodes to just maintainability, especially in the age of AI, because there's a lot... I was reading, what is it?

Nathan Toups (34:19)

Right.

Carter Morgan (34:22)

I'd never heard of this open source project before. It's called tldraw. I don't know if you've seen this.

Nathan Toups (34:27)

yeah, yeah, yeah, yeah.

There's some YouTubers that I like a lot, and I've actually been using tldraw for some diagramming that I've been doing for client work, so...

Carter Morgan (34:32)

Okay.

Well,

they just announced that they are automatically closing all new pull requests on the open source project because there's so much AI-generated code. And I think there are just users who are trying to... because that's been something people have said forever: contribute to open source. If you're having trouble finding a job, go contribute to open source. And so I'm sure there are some programs out there, or maybe just users themselves, you know, being savvy, opening up as many pull requests on open source projects as possible. And so

Nathan Toups (34:48)

Yuck. Yeah.

Carter Morgan (35:06)

You know, we're just in a very strange age, right? We're in an age where you can generate code much, much faster than you can review it. And I think that's funny, kind of like how I've been talking about it at my job: okay, we removed this application-level bottleneck, but now we have come to a new bottleneck, right? Which is Mongo. We're kind of at that point as a field, right? We're like, okay, the bottleneck for a long time had been writing code.

Nathan Toups (35:15)

Yeah.

Carter Morgan (35:32)

So you remove that, the code generation is faster these days, but what's the new bottleneck, right? And I think that there are a lot of bottlenecks.

Nathan Toups (35:32)

It's so true too, and it actually gets all the way back to some of the conversations we had about Fundamentals of Software Architecture, where we talked about connascence. When you find a bottleneck, you're actually finding an accidental connascence of process, right? Because what you don't realize is, oh, there's a dependency graph here. Once I cleared up this weird little thing, I realized that actually... and so it is interesting, like, how do I decouple systems?

Carter Morgan (35:57)

yeah.

Yes.

Nathan Toups (36:11)

Or sometimes the fact is that that just is the bottleneck, and now we have one variable set to deal with: MongoDB needs more resources, or we need to structure our database a certain way. Maybe that's the ending point. But sometimes you don't know how to tune a system until you've uncovered some of the blockages of process that are in place.

Carter Morgan (36:38)

Well,

so that's a lot about ⁓ scalability, reliability, and maintainability. Yes.

Nathan Toups (36:44)

I will bring up one thing with maintainability, because it's gonna come up, so I'm just gonna give a layup. He kind of mentions evolvability. So we'll get through chapter two and chapter three and chapter four; that's where we'll be for this episode. Chapter four is actually my favorite chapter. It has to do with...

Carter Morgan (36:52)

yes.

Nathan Toups (37:02)

encoding and decoding, serializing and deserializing data. And a big theme that comes up with this, because it ends up being a footgun for a lot of organizations, is the evolvability of schemas, the evolvability of a contract and an API and these other things. And so, again, I appreciate this book because he gives himself a layup. He's like, okay, here are the foundational principles with which we're gonna address everything else in this book. And he keeps coming back to them. And so I love that he'll

bring up scalability, bring up maintainability, and bring up reliability when we're talking about trade-offs for the rest of the text.

Carter Morgan (37:39)

Chapter two is all about data models and query languages. This is one where I started to feel like maybe he was doing his due diligence to kind of cover everything, but he gets into a lot of things where I'm like, I don't know if I'm ever gonna use this, right? I don't know if I'm ever gonna use CODASYL, right? And then he talked about all these different kinds of databases. And I think we're seeing these days, like, yeah,

I don't really see a lot of arguments for MySQL over Postgres, for example, right? Postgres is kind of eating the world. But he does talk about the convergence. This is the chapter where he lays out SQL versus NoSQL, right? And NoSQL has kind of evolved to mean "not only SQL." And I think it's really important to have an understanding of how these data models work. I mean, we are actually struggling with that at work.

We use Mongo as our database, and the early engineers who built the product, a lot of whom aren't with the company anymore, just treated Mongo as, like, yeah, it's a database, let's just query whatever we want. And then as we've been digging in and making these performance improvements, we're like, you know, some of these Mongo queries are really, really slow. And it's because Mongo isn't a magic sack you can just pull things out of, right? It has an actual physical structure

underneath, in how it organizes all this data. And so some things are really fast: anything that has an index on it, right, Mongo can query pretty quickly. If it doesn't have an index, all of a sudden you're talking about an item-by-item scan of the entire collection, right? And so there's a lot of value, with whatever database you choose, in having a basic understanding of how the data is organized underneath and

what queries you're gonna make against it, because that really influences the processing time it takes to return that data.
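A toy stand-in for what an index buys you. This is plain Python, not Mongo, and real databases typically use B-trees rather than a hash map, but the trade-off is the same: extra work on every write in exchange for skipping the item-by-item scan on reads.

```python
# 100,000 "documents" in a collection.
documents = [{"_id": i, "email": f"user{i}@example.com"} for i in range(100_000)]

def find_by_email_scan(email):
    """No index: item-by-item scan of the whole collection, O(n)."""
    return next((d for d in documents if d["email"] == email), None)

# An index is a lookup structure maintained alongside the data.
email_index = {d["email"]: d for d in documents}

def find_by_email_indexed(email):
    """With the index: one O(1) lookup instead of a full scan."""
    return email_index.get(email)

print(find_by_email_indexed("user99999@example.com")["_id"])  # 99999
```

In real Mongo you'd check which of these you're getting with `explain()`: a COLLSCAN plan is the scan case, an IXSCAN plan is the indexed case.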

Nathan Toups (39:44)

Document stores are really cool if you have the... and again, the book does a great job; it's probably the best explanation I've had of why I'd pick a document store. And when we're talking about document stores, we're really talking about MongoDB, we're talking about DynamoDB, which I don't think he brings up in the book, but there's a bunch of these, where what you have is basically some sort of primary-index, key-value kind of store

Carter Morgan (39:53)

Right.

Uh-huh.

Nathan Toups (40:12)

of a bunch of tree structures, right? That's really what a document database is. And if you need to start doing relational queries across it, you're gonna be in a lot of pain, because what relational databases do quite well is denormalize, I'm sorry, normalize, the data, where you can abstract out all these pieces, put all these joins together, and get a correct representation of the entire system

Carter Morgan (40:14)

Hmm.

Nathan Toups (40:39)

in a very, you know, really nice, clean way. It's really slow for the type of thing where maybe what I really care about is finding some chunk of data and then displaying that tree structure to the end user; that's where Mongo really shines. And I loved that the book really gives us this breakdown of one versus the other. It is funny, too, that the world has shifted, like...

There's this whole movement, I think since this book came out, called the "just use Postgres" movement, right? A lot of people are just like, just use Postgres, you don't need all these crazy databases. Not always true, though for your personal projects on the weekend it probably is true. Especially because Postgres has blurred the line, and all the major SQL databases have this now: Postgres has a JSON data type.

So you can actually do relational data plus tree-structured document store inside the same database. And that kind of blurs the lines. So if you have relational-ish data with documents that you want, you have all these options that you didn't have before. And yeah.
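A sketch of that hybrid pattern. Postgres would use a `jsonb` column and operators like `->>`; here Python's built-in sqlite3 (with SQLite's JSON functions) stands in so the example is runnable, and the table and values are invented.

```python
import sqlite3
import json

con = sqlite3.connect(":memory:")
# Relational columns plus one free-form document column.
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, profile TEXT)")
con.execute(
    "INSERT INTO users VALUES (?, ?, ?)",
    (1, "a@example.com", json.dumps({"theme": "dark", "tags": ["beta"]})),
)

# One query mixing a relational column with a field inside the document.
row = con.execute(
    "SELECT email, json_extract(profile, '$.theme') FROM users WHERE id = 1"
).fetchone()
print(row)  # ('a@example.com', 'dark')
```

The schema pins down the fields you query relationally (id, email) while everything else rides along in the document, which is exactly the "relational-ish data with documents" case.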

Carter Morgan (41:46)

And this is something

Alex Xu, in his System Design Interview book, talks about. He's actually doing back-of-the-envelope math, and I wish I saw more candidates doing that, because a lot of candidates will be like, well, we gotta use Mongo, we gotta go NoSQL because it scales. It's web scale. We joke about that all the time at work: Mongo, it's web scale. We say we don't know anything about Mongo, but we know that it's web scale.

And so a lot of candidates will be like, well, you got to do NoSQL because it's got to scale. But it's one of those things where it's like, okay, let's say you're doing a tweet, right? And let's say a tweet is 140 characters. So how much is each... is one character a byte?

I think so.

Nathan Toups (42:26)

Typically. If it's UTF-8, not necessarily; Go calls those runes, and they can be longer. If it's ASCII, yes: one byte is one character.
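Concretely, UTF-8 uses one to four bytes per code point, and only ASCII characters are guaranteed to be one byte:

```python
# Byte length per character under UTF-8.
for ch in ("a", "é", "€", "𝄞"):
    print(repr(ch), "->", len(ch.encode("utf-8")), "bytes")
```

So the 1-byte-per-character figure below is a deliberate simplification that holds for ASCII-only text.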

Carter Morgan (42:30)

Okay.

Right.

So a tweet can be 140 characters, and so that's 140 bytes. So then, how many tweets can you fit? All right, that means you can fit about seven tweets in a kilobyte, which means you can fit 7,000 tweets in a megabyte, which means, I mean, what is that, seven million tweets in a gigabyte, right? And so if you're kind of saying, like,

well, we got to do Mongo because Mongo scales, then it's kind of like, well, wait a minute. What if I did Postgres? I mean, it costs like five bucks a month to run a 10-gigabyte Postgres database on AWS, right? And so all of a sudden that's 70 million tweets I can store in my Postgres database. And so you kind of ask yourself, like, okay,

especially starting out, how fast are we going to get to 70 million tweets? And you need to be asking those sorts of questions, like how many writes do we expect per day, right? What kind of growth are we expecting to see? But if you're doing that back-of-the-envelope math and actually estimating, okay, how much data are we actually going to have, then it opens up some of those options for you, rather than just jumping immediately to, well, we need NoSQL because NoSQL scales.
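The back-of-the-envelope math as a quick script. Sizes use round 1,000-based units, and real storage would add indexes and row overhead, so treat these as order-of-magnitude figures:

```python
tweet_bytes = 140              # assuming 1 byte per character (ASCII)
KB, MB, GB = 10**3, 10**6, 10**9

print(KB // tweet_bytes)       # ~7 tweets per kilobyte
print(MB // tweet_bytes)       # ~7,000 per megabyte
print(GB // tweet_bytes)       # ~7 million per gigabyte
print(10 * GB // tweet_bytes)  # ~70 million in a 10 GB database
```

Pair that with an estimated write rate (tweets per day) and you can tell in one line how many years a cheap single-node database would last before "scales" even matters.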

Nathan Toups (43:45)

Great.

One framing that I thought was interesting: I think a lot of people will reach for something like Firebase or MongoDB because they want to, I like to call it, kick the can down the road on thinking about schemas, right? So it's like, oh man, the shape of our data is changing so quickly, it would just really be annoying to have to nail a schema down and do all these schema migrations, and I've had a bad experience with this in the past, and whatever, it's just a bunch of JSON blobs.

Carter Morgan (44:08)

Right.

Yeah. Yeah. ⁓

Nathan Toups (44:30)

I get that for quick iteration, but it runs into a lot of problems. And again, I'd never seen the framing like this, and maybe it's just because I'm a dunce, but he calls this schema-on-write versus schema-on-read. And I think this is, again, a good example of the trade-off.

So schema-on-write is what relational databases are. You are declaring your column types, your names; you really have to understand the schema for your system to interact with the database. There are all these expectations. It also means it's in your face: when you need to add a column or make some change or do some other thing, you have to think about the shape of this data before you get started. Versus schema-on-read, and schema-on-read is the document structure.

Carter Morgan (44:49)

Yeah.

Nathan Toups (45:14)

This is actually a lot like how APIs work, right? We might have a written promise that data from an API endpoint from Stripe is gonna be a certain way, but it can change over time. We might get some new fields, we might get some new changes, and your code is written in a way that says, okay, I'm gonna parse this out, and I will validate the data and make sure it fits the right data type, and I do that at read time. When I read the data, I'm inserting it into whatever structures I want.

Carter Morgan (45:17)

Mm-hmm.

Nathan Toups (45:41)

And I just hadn't thought about it like that: depending on the shape of the work you're doing,

this is the trade-off. You may have super flexible schemas where there are maybe only three fields you really care about and everything else is just kind of nice to have, and a super locked-down relational, schema-on-write structure really is in the way. Where, you know, maybe I care about the user ID and the number of times they've logged in and what groups they're in, right? And then everything else, like the tags they've given themselves and all these other extra annotations, if it's there,

that's great, and if it's not, I don't care. Well, okay, a document structure would be great for that. You don't have to normalize and do all this crazy stuff. You can just kind of let this data... So you basically get a data cache optimized for certain types of queries, and you can think of documents that way, right? You're not having to ask a question across all of these documents.

Carter Morgan (46:40)

Right. But I like that framing from you: sometimes you are just kicking the can down the road. And we've seen that... again, we've entered a world where generating code is a bit of a commodity these days. And so I've seen a lot of posts in the ExperiencedDevs subreddit lately, and I really feel for these people. These are the people Uncle Bob talks about with the flow state, right? These are people who

That was their favorite thing about programming. They just loved getting in the flow state and generating lots and lots of code. And for better or for worse, that doesn't exist anymore, or it's not nearly as valuable as it used to be. And so me, I'm enjoying this new era of large language models, because I've never derived a lot of enjoyment out of the actual act of writing code.

I derive a lot of enjoyment out of making things and seeing results, right? And coming up with solutions. And I really like that once I've kind of honed in on a solution, like, okay, this is what we want to do, then I submit the prompt to Claude Code and it takes care of what I want.

Nathan Toups (47:55)

This is so interesting. So I was actually, literally last night, having a conversation with a buddy of mine. There's this sort of divide where one camp is like, all the large language models are terrible, what are we doing to the world, and you're a fool if you'd use them. And then there's this other camp that seems like the large language model maxis, right? Like they're like...

Carter Morgan (48:18)

Yeah,

yeah.

Nathan Toups (48:19)

Yeah, like relishing in the fact that everyone's going to lose their jobs and that you can replace Facebook with one prompt on a weekend or something. And you're like, both of y'all are nuts. That's how I kind of look at it. But my buddy, he started as a Linux sysadmin and he's a phenomenal programmer; he's worked at a bunch of cool companies. He's taken this pragmatic middle spot. He loves actually writing code, he loves writing Rust. I consider him a deeply thoughtful programmer, someone who actually does enjoy the flow state.

Carter Morgan (48:28)

Yeah, I know.

Nathan Toups (48:48)

and he's also really enjoying using large language models. He actually just introduced me to a tool called Happy, which is a wrapper around Claude Code or Codex that lets it run where you can, like...

Carter Morgan (48:51)

Interesting.

Nathan Toups (49:03)

you can access it from your mobile device. So you can let it do its agentic stuff, and it's all personal and private, because he's very privacy-oriented. He actually brought this up in a really interesting way that I hadn't thought about. His analogy was: it's like when the CNC machine was invented, the industry around machinists freaked out, right? Because before CNC machines... and these are the ones that mill out parts and can do very technical work in an automated way...

Carter Morgan (49:24)

Mm-hmm.

Right, right.

Nathan Toups (49:33)

machinists were kind of repulsed by it, because they were just like, no, there's a craftsmanship, there's a tooling, there's a way that we do this. They enjoyed it; machinists were very well compensated; they did these things.

And yet, CNC did two things. Number one, it allowed people to get custom-machined stuff at a much lower barrier: if you can buy a CNC machine and spend a few weekends learning it, you can be good enough. It's not going to be as good as a machinist, but a machinist with a CNC mill is next-level excellent. They know the craftsmanship that goes into finishing these parts, and they might CNC it out and then go back with their machining tools and make it perfect. They may go do their extra stuff.

Carter Morgan (49:56)

Mm-hmm.

Right, right.

Nathan Toups (50:13)

And I had never thought about this idea that precision plus scale is what these large language models are unlocking for professionals like us, where certain parts of my job literally had such a cognitive load to them that I just didn't even know how to get started. And then letting the CNC mill go off and do a thing, and me going off to work on some other stuff, and then coming back as a machinist and

Carter Morgan (50:19)

Right, right.

Nathan Toups (50:39)

cleaning it up and doing some stuff that the CNC mill can't do on its own. I think this is actually a really good analogy; it's the best I've heard from anybody. And it's cool too, because, Carter, you and I, I think...

I'm not quite where my friend the Rust programmer is; he's the one who gets into deep flow state and really cares about, you know, low-level data structures. You're very product-focused; you love the output, the intersection of humans and this. And I'm somewhere in between: I actually like the producty stuff, but I also love getting into flow state. And I feel like, with all three of us, it's rare for me to see you and him and me all kind of having these aha moments that there's a there there.

Carter Morgan (50:56)

Right, right. ⁓

Yes, yes.

Nathan Toups (51:21)

Right? There really is something interesting and new here. Ignore the hype people and ignore the doomsayers. This little PSA in the middle of the episode is just: there's something there. We're not all going to lose our jobs. Our jobs are changing, though, and the expectations are going to shift. And if you have the right attitude about this, I think there's something here that is deeply rewarding, deeply satisfying.

Carter Morgan (51:22)

Right, right.

Yeah.

Nathan Toups (51:51)

It actually makes reading books like DDIA really important. If you want to be the machinist plus the CNC mill, you've got to read books like DDIA. You have to... domain-driven, sorry, data-intensive, Designing Data-Intensive Applications. The domain-driven one is another one we're going to read down the road, but yeah.

Carter Morgan (51:54)

Yes.

designing yeah designing that intensive applications

Well, yeah,

I'm with you on the little PSA in the middle of the episode, but I just think, yeah, I don't know. I tend to think of myself as an optimist, but also a realist. I tend to have a pretty clear-eyed view; I hope I'm not lying to myself about too many things, right? And so I am very, very aware of the current state of LLMs. I'm very aware of what my job constitutes. And as I've mentioned on those podcasts several times, I just don't see my job disappearing anytime soon, right? I am not

incredibly concerned. The job has changed, 100%. I look at our junior engineers and I'm just like, you graduated into a completely different world than I graduated into, right? But still, reading a book like this just confirms to me: this is all really important knowledge. This is something you absolutely have to have to succeed, even in a large-language-model-dominated world. Chapter three? I don't know. Maybe.

So I'm only making fun of chapter three because, before we started the podcast, I was like, Nathan, I'm going to be honest, this chapter kind of went over my head a little bit, right? This is one where, if I were reading the physical book, I would have gone back; since I had the audiobook, I would have had to pause to do it. So we have now reached the point of the podcast where Nathan explains chapter three to me and to you, the audience.

Nathan Toups (53:26)

my gosh.

So there's a famous story from a few years back about the guy who wrote Homebrew, the missing package manager for macOS. He interviewed at Google and he ended up getting passed on... was it Apple? Well, maybe. I thought it was Google, but anyway, there's a mythology. One of the FAANG companies, we'll just say that. But he interviewed and he failed.

Carter Morgan (53:35)

Yes.

It was Apple. I think it was Apple. Yeah, that's what... I know, right.

Nathan Toups (53:57)

this: he couldn't construct a B-tree on a whiteboard. And he's just like, I've got hundreds of thousands of people using my software on a daily basis. I'm writing valuable stuff. And yeah, sure, I can't do this, you know, courting-ritual thing that you've asked me to do.

Carter Morgan (54:07)

Right, right.

Nathan Toups (54:16)

And I kind of feel that way about this chapter. Like, this is really important stuff if you're in this domain, right? If you're having to solve or understand why we're using B-trees versus LSM-trees, or why we use this sorting algorithm, or why LevelDB picked this, these kinds of things are really important when you happen to be in that space. Most of us are not.

most of us should really lean on sane defaults, right? Like, if you're toying with which index type you're using for a column in your database, most likely you're probably over-engineering it. Though I'll walk that back a step: I was actually doing some Postgres stuff and I realized that, for the shape of the data, the default indexing algorithm was not correct

for what I needed, and I actually needed this other thing, because the way I was doing queries was highly optimized; there was like an O(1) relationship with the type of thing I was doing. Luckily, with knowledge like this, I was able to understand the trade-off. This chapter is definitely one that takes close consideration. If you're listening to the audiobook, you're probably gonna have to listen to sections a few times.

I would highly recommend going back and reading the physical copy if you can, because there are diagrams that are really important. But this really digs into things like hash indexes, and also just how some of these algorithms actually work on hardware. And I think this is something we don't think about a lot, even in graduate algorithms, right? You take graduate algorithms and you learn about

Carter Morgan (55:48)

Mm-hmm.

Nathan Toups (56:02)

how dynamic programming works, but they don't get into, oh, and actually this algorithm is really great because of the way that sequential writes on disk work. You can say, I'm gonna take this chunk of memory, and because I know that these LSM entries work like this, there are these chunks of contiguous memory allocation, and because of the way I write these sections on disk, because they are append-only logs or whatever,

Carter Morgan (56:12)

Right, right.

Nathan Toups (56:32)

I can write this and know that its fault-tolerance characteristics are excellent if I get interruptions. And it ties into things like write-ahead logs versus these append-only records and how these things work.

I liked this section because I hadn't thought about it: if you have a write-ahead log, you're actually writing this data two times, right? Versus if you have the thing that writes to disk the first time, you only write it once. And sometimes it makes sense for this to be a write-ahead log, and sometimes it makes sense to just write to disk, because of how disk access works with what you have going on. And if this is making your eyes glaze over, it's okay.
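The simplest storage engine in this chapter, the append-only log with an in-memory hash index, can be sketched in a few lines. This is a toy under obvious simplifying assumptions (one process, no compaction, keys without commas), not Kleppmann's exact code:

```python
import os
import tempfile

class LogKV:
    """Append-only log on disk + in-memory hash index of key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}                      # key -> offset of its latest record
        open(path, "ab").close()             # ensure the log file exists

    def set(self, key, value):
        with open(self.path, "ab") as f:
            self.index[key] = f.tell()       # record starts at current end of file
            f.write(f"{key},{value}\n".encode())  # append-only: never overwrite

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                   # jump straight to the latest record
            line = f.readline().decode().rstrip("\n")
            _, value = line.split(",", 1)
            return value

db = LogKV(os.path.join(tempfile.mkdtemp(), "data.log"))
db.set("name", "kleppmann")
db.set("name", "martin")     # old record stays on disk; index points past it
print(db.get("name"))        # martin
```

Writes are pure sequential appends, which is why this shape is fast on disks, and the stale records left behind are exactly what compaction exists to clean up.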

This chapter really gets into the weeds of things like write-ahead logs and crash recovery, and why certain technologies were picked for certain applications. Was there anything that stood out to you? Maybe we should hop into some more concrete examples.

Carter Morgan (57:40)

Yeah, and like

I, there were some things like, he talked about data warehousing, right? And he talks about ETL, like extract, transform, load. ⁓ Yeah, yeah. Which I did a little bit of work with in my last job. But again, this is a little something, like, I joined... Cause there's so much breadth as far as what you can work on. Like every now and again, I'll see a Reddit post where someone mentions they work on embedded systems. I'm like, I forgot that was a thing.

Nathan Toups (57:50)

Mmm, yeah. actually, yeah, this is a section I loved, actually.

Carter Morgan (58:14)

I forgot that there were software engineers actually doing that sort of work. And that's probably some work I will never do in the history of, you know, throughout my entire career. Yeah. Yeah. ⁓

Nathan Toups (58:14)

Yeah.

They're doing the Lord's work, because that is some deep focus stuff. They're cut from

a different cloth. Yeah, actually, this would be a good one. Whether we got into too much of the details or not: OLTP versus OLAP. I think, ⁓ even if you don't know a ton about databases, especially SQL-type databases, this is a really important concept, because I think everyone runs into this at some point.

if you're dealing with SQL, ⁓ which is the shape of the questions you're asking your database, right? ⁓ OLTPs are the transactional databases. That is what you think of as like a web app database, right? I have a bunch of users that maybe go to the website and log in and are doing stuff. And most of those transactions are tied to just their user behavior, right? If I have a shopping cart, it's my shopping cart,

Carter Morgan (59:12)

Right.

Nathan Toups (59:15)

my credit card history, my shipping addresses, the things that I've purchased and orders and stuff. It's a bunch of transactions, maybe I have millions of customers, but I'm not going across and asking questions of like, I'm not looking into a bunch of other user data for that, right? At Amazon.com, everything in there is either product inventory or your stuff, like your history, right? ⁓

Carter Morgan (59:43)

Yeah, yeah.

Nathan Toups (59:45)

OLAP is the other big acronym here, which is: okay, let's say I'm an analyst at amazon.com and I want to know what the total sales were out of the United States during the month of May 2024, right? ⁓ That's a really big, juicy query that you're asking, and it's going to affect a ton of rows. And if you ask that of your OLTP, your transactional database,

that database is not optimized for that type of workload. And you can actually take the whole thing down if you ask big enough queries. Yeah. ⁓

Carter Morgan (1:00:19)

Well, and this is

a good example, right? Of like, again, I listened to this chapter on the audiobook. To be honest, a lot of it went over my head, but things like OLTP and OLAP, right? And like ETL. Like, ETL at this last job I had, right? I didn't even know what that was until I showed up. So I would have just been a little bit ahead of the curve if I had read this book. And just like OLTP versus OLAP, again, like, I don't really...

Again, a lot of this went over my head, but I have a ChatGPT window pulled up right now, and I just wanted to give some context for the audience. But what's great is that as I give you guys some context on OLTP versus OLAP, in my mind I'm like, yeah, yeah, yeah. I read this. I remember this part from the book, right? ⁓ And so OLTP is online transaction processing, and that's exactly what you're talking about, Nathan, right? This is like user signups, user profile updates, payment processing. And what does ChatGPT list as key characteristics? It's optimized for writes and fast, small queries. It has highly normalized data,

ACID transactions, low latency. OLAP is online analytical processing. So this is going to be used for your business intelligence dashboards, your weekly metric reports, trend analysis. And what are its key characteristics? Again, this is completely different from OLTP. This is optimized for reads and large aggregations. It's often denormalized. It has columnar storage, right? It handles very large data sets, gigabytes to petabytes. And so again, I listened to this.

I could not have told you that until Nathan started talking about it, until I pulled up this window, but I do have a little bit of foundation. So I started looking at ChatGPT like, okay, yeah, yeah, all those things Martin Kleppmann talked about, I'm remembering them now. So again, even if you're listening to this like I am, on a bike commute, right? I think there is some really solid foundational work it's doing in your brain, even if you're not coming away with a really detailed understanding.
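The OLTP/OLAP contrast summarized above is easy to see with two queries against the same toy table. `sqlite3` is just a convenient stand-in for whatever database you use, and the `orders` table and its columns are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER,
        country TEXT,
        amount REAL,
        created TEXT
    )
""")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", [
    (1, 42, "US", 19.99, "2024-05-01"),
    (2, 42, "US", 5.00,  "2024-05-03"),
    (3, 7,  "DE", 12.50, "2024-05-07"),
    (4, 42, "US", 3.25,  "2024-06-02"),
])

# OLTP-style: a small, indexable lookup touching only one user's rows.
my_orders = conn.execute(
    "SELECT id, amount FROM orders WHERE user_id = ?", (42,)
).fetchall()

# OLAP-style: an aggregate that scans a large slice of the whole table.
us_may_total = conn.execute(
    "SELECT SUM(amount) FROM orders "
    "WHERE country = 'US' AND created LIKE '2024-05%'"
).fetchone()[0]

print(my_orders)     # the three rows belonging to user 42
print(us_may_total)  # total US sales in May, ~24.99 for this toy data
```

On a few rows both queries are instant; the point of the distinction is that at millions of rows, a row-oriented transactional engine stays fast for the first shape of query and struggles with the second.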

Nathan Toups (1:02:13)

Yeah, and I will tell you, and we'll kind of talk about this towards the end because we're getting close: the world's moved on as well. This is a really good thing to think about.

Carter Morgan (1:02:15)

Well, ⁓ go ahead.

Nathan Toups (1:02:29)

and understand the concept. Like, a columnar database is important. It's literally storing the columns. And typically it's because these are append-only databases where, for new rows, you go across a bunch of the column files and you're just appending the extra data to the very end. ⁓ But for instance, I think it's Spanner, which is Google's technology. If you ask it for a set of data across, let's say, 10 columns, ⁓

and you tell it to limit 100, a lot of people think that that'll reduce the cost of the query, but it actually doesn't. Because it's a columnar database, it's actually accessing the entire column. So you have billions of rows in that column.

⁓ Limit 100 is like a UI thing, a syntactic thing, but it doesn't actually save you money. And I actually saw this in an organization where we were doing ⁓ really expensive queries on really, really large data sets, and they were acting like it was a transactional database. And I was like, yeah, you're not using this the right way. And it was a knowledge gap. It's just that they didn't understand what the difference in these technologies was. And so...
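A toy column store makes this anecdote concrete. This sketches the general columnar idea, not Spanner's actual engine: each column lives in its own array, so an aggregate reads a whole column end to end, and a LIMIT only trims what comes back, not what gets scanned.

```python
# Toy column store: one Python list per column instead of one record per row.
columns = {
    "country": ["US", "US", "DE", "US", "FR"],
    "amount":  [19.99, 5.00, 12.50, 3.25, 8.00],
}

def total_amount(cols):
    """SUM(amount): touches every value in the column, LIMIT or not."""
    scanned = 0
    total = 0.0
    for v in cols["amount"]:
        scanned += 1
        total += v
    return total, scanned

def first_n_rows(cols, n):
    """SELECT * LIMIT n: stitches rows back together from each column file.
    In a real columnar engine the scan cost can be per column touched,
    not per row kept, which is why LIMIT doesn't make it cheap."""
    return [
        {name: col[i] for name, col in cols.items()}
        for i in range(min(n, len(cols["country"])))
    ]

total, scanned = total_amount(columns)
print(total, scanned)   # every amount value was scanned, all 5 of them
```

With billions of values in `cols["amount"]` instead of five, that full-column scan is exactly the bill the team in the story kept paying.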

Carter Morgan (1:03:29)

Yeah. Right, right.

Nathan Toups (1:03:36)

If you're in this world, I would highly recommend spending some extra time. This is a really juicy chapter. I think I'll kind of close out this section too, which is this: we're in a new world. So there are new databases that kind of blur this line. Databases like ClickHouse, databases like CockroachDB,

where there's even a term that's coming up called hybrid transactional/analytical processing. ⁓ HTAP, I think is what it is. Where there are databases now that allow you to actually ask both types of questions, and they handle both things so that it reduces cognitive load, and it kind of just magically will either do

analytics-optimized queries or transactional queries. And it's probably books like DDIA that inspired folks to think about other ways that we could structure stuff. But it would be surprising if there weren't amazing developments in how data-intensive applications were engineered from 2017 to 2026, right? I mean, it's been almost 10 years since this book was written.

Carter Morgan (1:04:45)

I know, right? And again, it's just kind of crazy to feel that it was 10 years ago. I mean, it is impressive how timeless this book is, but you're right. It's only been 10 years, and still there's a lot of, if not things that have become obsolete from this book, then a lot of missing context as far as what has been developed since then. I wanted to move on to chapter four, because you had mentioned in particular that you really enjoyed chapter four. What about chapter four stood out to you? This is encoding and evolution.

Nathan Toups (1:05:02)

Yeah.

Yeah.

So this gets partly into ⁓ software architecture and platform engineering. So these are near and dear to my heart. And it happens that everybody ends up running into these problems, which is: how do I evolve my system over time? How do I make changes in a way that does not introduce ⁓ errors and ⁓ irreversible changes that can cause major problems? Also, ⁓

He really gets into this idea of like, okay, can I do rolling upgrades? Are they backwards compatible? And he also spends a decent amount of time thinking not just about backwards compatibility, which, of the two that we're about to talk about, is actually the easier of the problems. Can I upgrade my system so that an old version of the data schema is still compatible? Like, I access something from a backup or from an old part of a database and it's structured slightly differently than new data.

The other one's called forward compatibility. And that's actually the harder one, which is can I write my code in a way that it's actually tolerant of changes that I can't imagine to the shape of that data in the future? And that's actually a much harder problem. ⁓ So old code reads new data. That's kind of how he explains it versus new code reads old data. And a resilient ⁓ data infrastructure should be able to do both and should also be able to handle

why it won't do one or the other. Maybe, for whatever reason, we do have to make an incompatible change. And a lot of this goes into what he calls... so, like in Go, we always call it marshaling and unmarshaling, but most people call it ⁓ encoding and decoding, or serialization and deserialization. Which is: you have some shape of data, and it needs to be written into some format that can be written to disk, which means it's a string of bytes of some kind, right? That could be clear text like JSON. It could be

some highly optimized byte-encoded format, ⁓ some binary format. And of course, tons of people have tried to solve this problem in tons of ways. You've probably been in organizations where... like, I remember a data science team that used pickle a lot. That's Python's native way of serializing, and it lets you do things like encapsulate the inner workings of a function into the pickling format,

Carter Morgan (1:07:26)

I see.

Mm-hmm.

Nathan Toups (1:07:36)

which can actually be super dangerous. But the problem we ran into is that pickle was tied to the particular version of Python you were on. So if you're on a version, yes, exactly. And so if you're on version 3.7 and then you go to 3.9, well, you now have an incompatibility, and the pickling format itself didn't give you a good clean way of doing forward and backwards compatibility. And so you either have to re-encode everything every time you're planning to do an upgrade, or you pick

Carter Morgan (1:07:44)

Right, right, he mentions that.

Nathan Toups (1:08:06)

what he advocates in the book, is some sort of data encoding format that is agnostic of the programming interface under the hood.
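The backward/forward compatibility distinction above, and the case for a format that isn't tied to one language, can be sketched with plain JSON. The field names and the `_unknown` convention here are invented for illustration, not taken from the book: backward compatibility is new code filling in defaults for fields old data lacks; forward compatibility is old code tolerating, and ideally preserving, fields it doesn't recognize.

```python
import json

# What *this* version of the code knows about, with defaults for old data.
KNOWN_DEFAULTS = {"name": "", "email": ""}

def decode_user(raw: str) -> dict:
    data = json.loads(raw)
    user = dict(KNOWN_DEFAULTS)
    user.update({k: v for k, v in data.items() if k in KNOWN_DEFAULTS})
    # Forward compatibility: hold on to fields we don't understand, so a
    # read-modify-write doesn't silently destroy data written by newer code.
    user["_unknown"] = {k: v for k, v in data.items() if k not in KNOWN_DEFAULTS}
    return user

def encode_user(user: dict) -> str:
    data = {k: user[k] for k in KNOWN_DEFAULTS}
    data.update(user.get("_unknown", {}))   # write the unknown fields back out
    return json.dumps(data, sort_keys=True)

# Old data (no email field yet) read by new code: backward compatibility.
old = decode_user('{"name": "carter"}')

# Newer data (extra field) read by this older code: forward compatibility.
new = decode_user('{"name": "nathan", "email": "n@example.com", "pronouns": "he/him"}')
```

Because the wire format is just JSON text, the same records can be read by code in any language, which is the interface-agnostic property being advocated here; schema languages like Avro and Protocol Buffers formalize the same two guarantees with explicit rules.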

Carter Morgan (1:08:15)

It's like Steve Flanders and Mastering OpenTelemetry. Like, the whole book is: avoid vendor lock-in, avoid vendor lock-in, use OpenTelemetry, avoid vendor lock-in. It's not necessarily vendor lock-in here, but it's a similar thing, right? If you choose a data encoding format like, again, pickle, which is married to that particular implementation of a particular programming language, well, now you're super locked in, right? And so something like JSON, which is language agnostic, gives you something pretty flexible.

Nathan Toups (1:08:22)

Right.

Carter Morgan (1:08:45)

⁓ You can evolve your system more flexibly so long as you maintain those API contracts. But another point he makes is that your data will live far longer than your code will. And so picking, one, the right data structure, and two, how you're encoding and transporting that data, ⁓ is more important than the programming language. Now,

Nathan Toups (1:08:57)

Yeah.

Carter Morgan (1:09:11)

this is something... I don't know if he was just trying to do his due diligence here, and I don't know if the world has evolved significantly since 2017, but he's kind of throwing out all of these different options for encoding and transporting your data. Whereas I feel like today, the answer is JSON. Like, just use JSON. He's even talking about XML as a viable alternative. I don't really see that these days.

Nathan Toups (1:09:26)

Well...

Well, it is funny, because he does give this whole section on like, oh, SOAP is still around, and you're like, it basically doesn't exist anymore. Except I guarantee you, somewhere... I know that for the longest time, Mechanical Turk over at AWS was famously still a SOAP client, because it was such an old part of the system. I'm sure it's different now. But one thing that I thought was interesting, and I don't remember him mentioning it in the book:

Carter Morgan (1:09:37)

Right, right. I think, I don't know, SOAP's around.

really?

Nathan Toups (1:09:58)

for data science particularly, especially... and this is another thing that didn't really exist then. Now we don't really use data warehouses like we used to, and now they're called data lakes. This is what Snowflake and all these other organizations do, and you literally use blob storage with files. Like, you can use CSV, you can use JSON, but a lot of people use columnar-oriented structures like Parquet.

Carter Morgan (1:10:09)

Yeah, yeah.

Nathan Toups (1:10:22)

Parquet is... if you're in data science, you're probably using Parquet or some similar optimized data structure. And I do think that while maybe some of the things that he's talking about in here are a bit dated, it makes sense, in the sense that, you know, for instance, analytics data is typically very sparse. There's a lot of repetition in a particular column, because maybe, you know, there's lots of zeros and lots of 100s or whatever,

and I can compact that down and store it efficiently. So when I query a terabyte's worth of data, I can do it in an efficient way. ⁓ And so yeah, like, it's really interesting to ⁓ think about why I would want to encode something that's not just JSON. Like...
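The compression win on repetitive analytics columns that Nathan is describing can be sketched with simple run-length encoding. Real formats like Parquet layer dictionary encoding and bit-packing on top of this, but the core idea is the same.

```python
def rle_encode(column):
    """Collapse runs of equal values into [value, run_length] pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1       # extend the current run
        else:
            runs.append([v, 1])    # start a new run
    return runs

def rle_decode(runs):
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

# A sparse analytics column: mostly zeros with the odd spike.
col = [0] * 1000 + [100] * 3 + [0] * 500
runs = rle_encode(col)
print(runs)                 # [[0, 1000], [100, 3], [0, 500]]
print(len(col), len(runs))  # 1503 3
```

1,503 values collapse to three runs, and an aggregate like SUM can even be computed directly on the runs without decompressing, which is part of why columnar scans over terabytes stay affordable.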

Carter Morgan (1:11:11)

Right, right.

Nathan Toups (1:11:12)

Maybe I want to encode... I think it was... which one was the one I was not aware of? It came out of Facebook. ⁓ It came out of Facebook, but I'm trying to remember.

Maybe it was Thrift. Which was the one? Yeah, which was the one that had the ability to say there was a writer schema and a reader schema, and it basically could map, if you've changed... I can't remember which one it was now. It was kind of cool. And it was one that was made for schema evolution. ⁓

Carter Morgan (1:11:35)

Oh yeah, Thrift, Thrift.

Nathan Toups (1:11:53)

And basically, if you've changed the shape of the schema from one version to the other, this tool could reconcile the mappings between the two and then find the most compatible version. Anyway, it was some cool stuff where I'm like, ⁓ that's a really clever way of handling that. And again, this is one of the reasons I love this chapter: if you're doing stuff that's API-heavy, REST-heavy, gRPC-heavy type stuff, ⁓

Carter Morgan (1:12:02)

Honey.

Nathan Toups (1:12:20)

all of the demons that you've run into, all of the nice design decisions, like, how did we get here? It's all in this chapter. So yeah.
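For the record, the writer's-schema/reader's-schema mechanism the hosts are reaching for above is Avro's (Thrift and Protocol Buffers instead rely on numbered field tags). A toy version of Avro-style resolution, with made-up schemas and a deliberately simplified rule set, might look like this:

```python
def resolve(writer_schema, reader_schema, record):
    """Avro-style resolution sketch: match fields by name, ignore fields
    only the writer has, fill in defaults for fields only the reader has."""
    out = {}
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            out[name] = record[name]       # present in both: copy through
        elif "default" in field:
            out[name] = field["default"]   # reader-only field: use its default
        else:
            raise ValueError(f"no value or default for {name!r}")
    return out

# The schema the data was written with (has a field the reader dropped)...
writer = {"fields": [{"name": "id"}, {"name": "nickname"}]}
# ...and the schema the current code reads with (adds a field with a default).
reader = {"fields": [{"name": "id"},
                     {"name": "email", "default": None}]}

old_record = {"id": 7, "nickname": "nato"}
print(resolve(writer, reader, old_record))   # {'id': 7, 'email': None}
```

Real Avro also handles type promotions, aliases, and union types, but this is the "reconcile the mappings between the two schemas" step the transcript describes.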

Carter Morgan (1:12:22)

Right.

Right.

Right. Well, and this would have benefited me in my last job. We were using a lot of gRPC and Protobuf, and it was kind of like, this is stupid, why aren't we just using HTTP and ⁓ REST and ⁓ JSON? But learning more from the chapter, I'm like, okay, I'm starting to see why some of those design decisions were made. It was, you know, a little faster, a little slimmer, and we were handling lots and lots of data. So maybe that was the best decision. ⁓ Well, we've got a...

This podcast, just with the time we record it, is now limited by when I have to leave for work. So we are wrapping up here. And you're seeing from this episode, right, we could devote four episodes to just part one here. ⁓ Really, this is a fantastic book. I've been enjoying it immensely. I'm very excited to finish it. ⁓ Maybe we should do our hot takes. I don't have a ton of hot takes ⁓ aside from, you know, the book's a little outdated. And I guess

my hot take would be: you cannot read this book and be like, this is going to expose me to all these different ways to work with my data, and all of them are equally valid, and so in any project I choose from now on, I need to have this big checklist of, am I going to use Thrift? Am I going to use Protobuf? Am I going to use JSON? Right? No. Your answer most of the time is going to be JSON, HTTP, REST, right? But

Nathan Toups (1:13:50)

Right.

Carter Morgan (1:13:52)

you may wind up in these edge cases. And if you wind up in these edge cases, having this knowledge of all these other options can be very, very valuable. But you gotta know when to break out this knowledge.

Nathan Toups (1:14:02)

Yeah, and

I will say that increasingly, the type of data science work people do has diverged from what people building web apps do. And I've been lucky enough to work on that data side of things, and their tools are starting to look less and less like the web apps that we're dealing with as well. So yeah, I think a couple of hot takes: he spends a section talking about graph databases, and I kind of remember in 2017, everybody was super excited about graph databases.

Carter Morgan (1:14:30)

Right.

Nathan Toups (1:14:31)

they still don't feel like they've had their moment in the sun. I don't know if they will have their moment in the sun. Yeah, graph databases are cool, but I haven't seen one used in a way where I'm like, wow. That's one thing. The other one is, and maybe this is just me...

Carter Morgan (1:14:36)

I imagine they're incredibly useful for Facebook and not for a ton of other people.

Nathan Toups (1:14:51)

I think a "Maintainability for Data-Intensive Applications" would be a great book. I think you could go off and just talk about maintainability of all of these things and not even have the other subject matter, and I think that would be an amazing book for folks.

Carter Morgan (1:14:56)

Yeah, I agree.

Well,

Nathan, what are you gonna do differently in your career? Because you've read part one.

Nathan Toups (1:15:09)

So

I love evolvable systems design. This book touched on some patterns of schema evolution that I hadn't thought about. So, like, I do think a lot about how we have two-way-street, sort of maintainable schema migrations. I'm gonna go back and spend some time with some ideas that were in chapter four, and also see what technologies have come out since 2017, because I have a feeling that there's probably some stuff I could learn about that's modern. Yeah, what about you?

Carter Morgan (1:15:37)

Yeah, as far

as me, I forgot to fill out this section of our notes. So what I'm going to do differently in my career is I'm just going to keep reading. I'm going to keep reading this book, and that is my commitment to everyone: I'm going to finish Designing Data-Intensive Applications. And I feel like you should get put on a leaderboard or something. Everyone talks about Designing Data-Intensive Applications; I want a badge that says, I read Designing Data-Intensive. Yes. Yeah. ⁓ We should make a t-shirt and sell it. We don't have like a merch store, but like,

Nathan Toups (1:15:44)

I love it.

I actually read it. Yeah, that's great.

Carter Morgan (1:16:04)

I read Designing Data-Intensive Applications and all I got was this lousy t-shirt. That's what we should do. Who would you recommend the book to, Nathan?

Nathan Toups (1:16:08)

That would be great. Signed by Kleppmann.

So this is for software engineers who are deeply curious about systems architecture and want to grow in their understanding of the trade-offs. ⁓ And again, I can only speak for part one; I haven't read the rest of it. But this is not a tutorial book. This is not gonna sit here and tell you how to build all this stuff. This is really about systems thinking and the trade-offs. ⁓

Carter Morgan (1:16:21)

Yes.

Nathan Toups (1:16:38)

So if that kind of thing sounds deeply rewarding, if you want to get to that next level, especially if you want to be staff or some sort of engineering leadership, this is a really important book for that kind of trajectory.

Carter Morgan (1:16:49)

Yeah, I think you have to have your feet under you a bit. And this isn't the perfect analysis here, but I would say first read The DevOps Handbook. You'll learn a lot of new things reading The DevOps Handbook. But if, while reading it, you're a little like, okay, yeah, I'm familiar with a lot of these concepts, this all makes sense to me, got it, this lines up with my experience and what I've done, then I would say, okay,

now read Designing Data-Intensive Applications. Again, that's not a perfect comparison, but I just think I would not recommend this to anyone who can't at least explain to me in good detail how their application is built, how it's deployed, how it's monitored, ⁓ you know, a basic understanding of things like scalability. ⁓ So get that first. But if you have a good understanding of how that works, hey, this is

Kind of the next level.

Nathan Toups (1:17:46)

Yeah, I would say, yeah, I like that idea. DevOps Handbook and Fundamentals of Software Architecture. I'd say if you read those two and you're like, I'm hungry, I want more, this is the obvious next step. Like, DDIA, yeah.

Carter Morgan (1:17:51)

Yes, yes.

Both of those two, I would recommend

to any sort of eager, ambitious junior engineer. I might be like, you know what, some of it might be over your head, but this is great to kind of get that understanding of breadth and understand what's going on. But I'd say read those two first before you start tackling this. ⁓ Great. Well, hey, we're so excited. This is gonna be great. ⁓ We're gonna cover the rest of this book across the next three episodes. Thanks for tuning in, everyone. ⁓ You can always contact us at contact at

Nathan Toups (1:18:03)

Yes.

Yes, absolutely. Absolutely. 100%.

Carter Morgan (1:18:26)

BookOverflow.io. You can find us on Twitter at BookOverflowPod. I'm on Twitter at Carter Morgan. Nathan and his consulting business, Rojo Roboto, is at RojoRoboto.com and his newsletter is at RojoRoboto.com slash newsletter. And if you like, ⁓ this is funny. I do a second podcast with my brother. It's a theme park podcast called Please Remain Heated. ⁓ My brother is a professional YouTuber. He's got like 130,000 subscribers. Does it full time. Anyhow, a lot of our audience is like,

his super fans, and so they're very interested in him. Not that they don't like me. I'm saying, if you're listening to this and you're a theme park guy, you gotta come and show up, you know, you gotta show up in the comments, right? And you gotta let people know that there's at least one Carter Morgan super fan who is listening to the podcast because you just like me. So, you know, make me look good for my brother, guys. But anyhow, again... and I'll have you know, on my other podcast, Please Remain Heated, I always close it out saying, if you're an aspiring software engineer, check out Book Overflow. So, you know.

Nathan Toups (1:19:22)

⁓ this is an O'Reilly book, so I didn't even think about it, but ⁓ should we maybe, maybe we can, ⁓ I don't know, come on over, join us on the Discord, and maybe we'll have something related to this on Discord, and maybe we'll do a book giveaway over there? Mm-hmm. Yeah.

Carter Morgan (1:19:22)

we're going to get the most minimal of overlaps.

yes.

We have a partnership with O'Reilly. We're still working out the kinks. ⁓

I can't promise anything, but if you join us on the Discord and ask how to get a free book, or if you post this on LinkedIn and tag us and tag the episode, ⁓ we'll do our best to take care of you. And we'll figure this out by next week to know what exactly we can offer you. Anyhow, I know, right? We're pretty good software engineers. As far as running a podcasting business, we are learning every episode. All right, that was... ⁓

Nathan Toups (1:19:57)

Yeah, exactly. We're amateurs when it comes to this stuff. Okay.

Carter Morgan (1:20:09)

a whole ton of fun. Thanks, folks. We'll see you next week for part two, roughly, of Designing Data-Intensive Applications.