Deployment Strategies for Success - The DevOps Handbook
Book Covered

The DevOps Handbook, 2nd Edition: How to Create World-Class Agility, Reliability, & Security in Technology Organizations
by Gene Kim, Jez Humble, Patrick Debois, John Willis, Nicole Forsgren
Transcript
This transcript was auto-generated by our recording software and may contain errors.
Nathan Toups (00:00)
On my favorite teams that have done the on-call rotation, we used a very simple pattern, which was one week on as primary, one week on as secondary. And basically you... Interesting.
Carter Morgan (00:10)
Yeah. At Amazon, we had a tertiary. It was rough.
Hey there, welcome to Book Overflow, the podcast for software engineers by software engineers, where every week we read one of the best technical books in the world in an effort to improve our craft. I am Carter Morgan, and I'm joined here as always by my co-host, Nathan Toups. How are you doing, Nathan?
Nathan Toups (00:34)
Doing great. Hey, everybody.
Carter Morgan (00:36)
Well, as always, like, comment, subscribe, share the podcast with your friends and coworkers. Tag us on LinkedIn if you want. We like when you guys tag us on LinkedIn. And also, if you're looking for more great Book Overflow content, our interview with Manuel Pais is out. He is one of the authors of Team Topologies. We were fortunate enough to be able to interview him. Lots of fun, and good to get another author interview done. What do you think, Nathan?
Nathan Toups (01:00)
Yeah, I'm excited. Hopefully this is the start of a string of new interviews that we have in the works. Obviously we don't promise anything before it happens, but talking about Team Topologies with the author was cool. There's also a new edition of Team Topologies coming out in just over a month, I think. And so it was cool to hear about what's gonna be new in the book and what he felt was important to think about in the future, especially now that remote work and AI have become such a huge part of our work life, because that wasn't really addressed in the first edition of the book, which, you know, is understandable.
Carter Morgan (01:39)
No, obviously, and it's a sign of how quick this industry can move sometimes. But yeah, great to have another author interview. We had kind of put a semi-pause on those while I was wrapping up my master's degree. And so, yeah, we are making more of an effort these days to reach out to some authors. We have some other ones working through the pipeline, and we're hoping to get more of those published. But in the meantime, we always have our regular discussion episodes, and that's what we've got for you this week. We've continued our reading of
Nathan Toups (01:43)
Yeah.
Carter Morgan (02:08)
The DevOps Handbook. We read parts three and four. It's a six-part book, so we covered parts one and two last week, and we've got parts three and four this week. There are four authors of The DevOps Handbook; I'm just gonna give a quick overview of all of them. Gene Kim, a technology researcher and author who founded IT Revolution and is best known for The Phoenix Project and his work studying high-performing technology organizations. And just to give a bit of synergy here, Team Topologies was published by IT Revolution. So you know what? We should talk to Manuel Pais. He said he'd be happy to put us in contact with any authors, and Gene Kim is one that I have not been able to get in contact with. So we might have an in there. Then we've got Jez Humble, a software engineer, researcher, and co-author of Continuous Delivery, who has worked at companies like Google and currently teaches at UC Berkeley. Patrick Debois is considered the father of DevOps for organizing the first DevOpsDays conference in 2009 and is a consultant who helps organizations with their DevOps transformations.
John Willis is a DevOps evangelist and consultant with over 40 years of IT experience who has worked at Docker, Red Hat, and Chef, and co-hosts the DevOps Cafe podcast. And from the book introduction: The DevOps Handbook is the definitive guide for applying DevOps principles across entire organizations, not just IT departments, recognizing that technology is now core to everyday business. The second edition adds 15 new case studies from organizations like adidas, American Airlines, and the U.S. Air Force, plus new research and insights from Dr. Nicole Forsgren. With over 100 pages of new content, it provides tools and practices for anyone working with technology to create organizational success. So like we said, we read parts one and two last week, and with parts three and four we're right here in the meaty middle of the book. Nathan, what are your thoughts on parts three and four?
Nathan Toups (03:52)
So while we did have some critique early on that this felt like an enterprise-only, sort of philosophy-focused book, we actually are getting into the meat of the book here. So I see why this book has been recommended so much. Parts three and four are more specific in the area of things like GitOps and feature flags and on-call patterns and low-risk deployments, things that are the bread and butter of my career. And so it was really speaking to me, because I'm like, ah, yeah, they put into words really well why you should optimize for these flows to get engineers unblocked. It is interesting, though, and it's a challenge of any technical book: some of the technology recommendations are already outdated. And some of the case studies, again, I will continue this criticism, it feels like success theater. Like, I know every time I get to a case study, I'm like, okay, show me your example of an overwhelming success with, you know, no pushback, and how you hit all your goals. While it is interesting that it's based in reality, it also just feels either reductive or highly selective. I think it would be so much more interesting to show a case study where a company had two or three false starts and then finally has a success. That's fine. Or even a failure story. I would love it if there was a, they tried to do it this way, but they didn't fix the organizational problems and it failed, right? That's a really good cautionary tale. I wish those were there. Yeah, that's sort of my thoughts on where we are in parts three and four. What about you?
Carter Morgan (05:22)
Huh
Yeah, I mean, this book is so very targeted toward enterprise customers. But we've gotten much more into techniques and practices here in the second part, which is great. That's more what I thought we were signing up for, for, I guess, the entire book, and parts one and two are very much like, here's why you should do this. And it was like, I'm on board. I'm already on board for why we should do this; I'd like to know how we should do this. Parts three and four are that. I feel similarly with this book as I do with Gergely Orosz's The Software Engineer's Guidebook. I worked at one of the big cloud providers, and they do a lot of things right and a lot of things wrong. But some things they do really right, like operational excellence: managing and running large services at scale. You learn a lot about that. And I kind of thought, that's the accumulated wisdom over years and years of learning how to do this. Similarly with Gergely Orosz's guidebook, with all the excellent career advice in there. I've had to learn a lot of that just on my own, with, you know, various battle scars from poor experiences. And then he just put it all in a book, and you can just read that book. I feel that way with this book and operational excellence. Like, yeah, you can learn a lot from working at these big cloud providers or big companies and seeing how it's done. And obviously that's very valuable. You can also just read this book. And in parts three and four, reading it, there were a lot of the patterns that I recognized working at one of the big tech companies. And there were even lots of other things that they weren't doing that I thought they could be doing. Yeah, just lots of really, really valuable advice here for how to deploy safely, how to monitor your applications, on-call rotations, like you were saying, Nathan. Yeah. Lots of really, really great stuff this week. Excited to dig into all of it.
Well, let's go ahead. We'll take a quick break, and then we're going to be back with just kind of a wildcard discussion on everything in parts three and four.

And we are back. Thanks for tuning in. Well, let's start. Like we said, this part three is called The First Way: The Technical Practices of Flow. And it's all about technical practices, the actual nitty-gritty of how do you implement all of the principles and ideals in parts one and two? It kicks off right with deployment pipelines, which, Nathan, I know is something you've had a lot of experience with, working as a cloud architect and a tech lead. So, talk to us about that.
Nathan Toups (08:05)
Yeah, so I am ridiculous when it comes to this. I think some would maybe argue this is a premature optimization, though I think this book makes a very good case that it's not. The idea is, if I'm personally doing a deploy from the command line, even if I'm a team of one, I'm going to start doing snowflake operations. I'm gonna start doing things that I can't reason about when I get busier or I haven't touched the code for a while. And so the idea is that your deployment pipeline is your foundation. So this means that everything should be in version control, right? They talk about this; they say that your environment creation should be automated, and that you have a single deployment pipeline. And these three ideas are so powerful. We recently had a project that was greenfield over at my current company, but we had another similar project that was greenfield at my previous company. And I'm a local-first maximalist when it comes to dev, meaning before the ops team is ready to engage with you, I want to see you spin up your containers and get a local dev environment running on your laptop. That's a place for me to have a conversation with you. And the thing is that it means that you already have thought about how to encapsulate your application into a container runtime. The reason I bring this up is because this is exactly what we did. We had a project, it got traction, it got some leadership buy-in. They said, hey, we should invest more resources in this. And I was able to swoop in, and within a couple hours, I had it wired up to our environment deployment pipeline. We decided to adopt a trunk-based deployment process. To give the lay of the ideas here, trunk would be your main branch, right, or master or whatever your repo has. Anything that is merged into that automatically deploys, right? So there's this idea that for all code, there is no release process that you wait for; it just immediately gets put into an environment. If you go all the way to, like, Google or Facebook, or Meta I should say, trunk is actually production. Everything that lands in the trunk branch actually goes into production. You probably have other, more advanced things that we'll talk about, like feature flags or canary deploys or a bunch of other testing before most customers see it. And that is pretty advanced. I'll actually bring up here, if there's any one critique I have about this book: if I was reading this fresh, there is no prioritization framework for, like, what's the thing that I should do first, right? There are so many good ideas in this book, but I would imagine if I was new to DevOps, or if I was new to a team that had DevOps tasks, I'd be like, where do I even start? Which one of these is the one that's gonna be the 10x problem solver? I would actually argue nailing down your deployment pipeline will do a bunch, because it's gonna actually uncover things.
Carter Morgan (10:48)
That's fair.
Nathan Toups (11:13)
You can't do this without good test automation. We're going to get into that. You can't do this without having good containerization, or some way to bundle up what we'll call an artifact. So can I make my code into an artifact that is deployable? Maybe we don't use containers in the future. Maybe we're switching to, like, nano-VMs or these, what are they called, unikernels. It's a new, fun, interesting, weird thing that's, you know, in development. But they really make this case for reasoning about a complex system. And again, I think the big thesis of this book is: how do we reduce the cognitive load on our engineers so that they can default to doing the right behavior, right? And so, my abstraction with you, Carter, if we're working on a code base together, is I say, I have a pull request that's out, I'd like you to do your review. And then when we accept it and squash it or merge it or rebase it or whatever your patterns are at your company, and that goes into whatever trunk environment we have, we can deploy with confidence. And I know that all I have to think about is my code. I'm just thinking about the PR. It goes in, some magical stuff happens in the background, and the code gets deployed. Most people don't do that automatically. They build up a bunch of changes, and then they go, I gotta go deploy this to staging before we put it out to production. Or, YOLO, we only have one environment, it's production, and I can SSH into it and change the code if something breaks, right? Which is so broken, beyond broken. And they make some really good cases for why you really shouldn't have snowflake servers. These deployment processes that are error-prone and manual don't scale; they accrue larger and larger technical debt the more you put off everything in version control, automated environment creation, and a single deployment pipeline. And a single deployment pipeline does not mean... let's say you have microservices, like 10 services. I'll tell you one anti-pattern: it's called the distributed monolith, where you have 10 microservices, and then every time you do a deploy, it redeploys all 10 microservices. That's insane.
Carter Morgan (13:02)
Mm-hmm.
Mm-hmm.
Yeah. ⁓
Nathan Toups (13:30)
Don't do that. I've been there. I've done that stuff before. Your build process should be resilient enough to understand that it should only traverse a dependency graph. It should say, OK, I can tell what's been deployed here, and what I'm deploying right now only touches this one microservice, and that's what I'm going to redeploy. And again: OK, well, glad you told me that. How do I get to that? How do I do that? And there's no obvious way, because there are 50 different ways to wire that up and 50 different build tools to get on that path.
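The selective-redeploy idea Nathan describes, only rebuilding what a change actually touches, can be sketched as a small dependency-graph walk. Everything here (the service names, the path layout, the dependency edges) is hypothetical, just to make the shape of the logic concrete:

```python
# Sketch: redeploy only the microservices affected by a change set,
# rather than all ten. Service names and path layout are hypothetical.

# Map each service to the source directory it owns.
SERVICE_PATHS = {
    "billing": "services/billing/",
    "auth": "services/auth/",
    "catalog": "services/catalog/",
}

# Downstream dependencies: if a service changes, its dependents
# must also be redeployed (a tiny directed acyclic graph).
DEPENDENTS = {
    "auth": ["billing"],  # billing calls auth, so redeploy it too
}

def services_to_redeploy(changed_files):
    """Return the minimal sorted set of services to redeploy."""
    hit = {
        svc
        for path in changed_files
        for svc, prefix in SERVICE_PATHS.items()
        if path.startswith(prefix)
    }
    # Walk the dependency graph to pick up downstream services.
    stack = list(hit)
    while stack:
        svc = stack.pop()
        for dep in DEPENDENTS.get(svc, []):
            if dep not in hit:
                hit.add(dep)
                stack.append(dep)
    return sorted(hit)
```

With this layout, a change to `services/auth/login.py` redeploys `auth` and its dependent `billing`, and a README change redeploys nothing, which is the whole point of not shipping the distributed monolith.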
Carter Morgan (14:05)
My thoughts with all of this have been, and I think this is another example of where this whole book is written with enterprise customers in mind: when does it make sense to do all of this? In the back of my head throughout all of this has been Kent Beck's work, particularly Tidy First and The Good News Factory, both of which we read. And both talk about this idea that different projects are at different stages, and there are different amounts of investment and different kinds of behavior at each stage. A mistake I've made in the past is showing up to a project that is really in its infancy and thinking it needs to operate at the scale of a flagship cloud service. And so I look at all of this, like, for example, at my current place, we don't do any sort of automated deployments to production. Staging is automated, and getting it automated to production is not difficult. It's just a matter of changing the GitHub Actions YAML file and saying, OK, run this every time we merge. We can do that. That's fine. We don't, because we don't have a lot of tests. And in particular, this was something I was hoping this book would answer, and it hasn't yet, and I don't think it's going to answer: what does this look like on the front end? Because right now, what we do is, when we have the changes we need to make, they're deployed continuously to staging. But then before it's time to go to production, we do a bug bash. And so we all log on to staging and just interact with the UI and make sure it works. I don't love that. It's clunky. It's slow. I mean, engineers' time is valuable, and it takes our time and turns us into, like, QA. But the alternative, from everything I understand, is to write a bunch of unit tests to verify that the front end is working as it should. Front-end unit tests can be tricky. They certainly take a lot of time. And at a startup, you know, we're trying to move fast, right? We're trying to just get as many at-bats as we possibly can. And so I see all this stuff like, this makes a lot of sense. For example, some of the case studies are like Adidas, and Adidas is a company that's moving more and more into the online world, but they know they're going to make money in the online world, right? They've made money for years and years and years; it's a clothing manufacturer. We're not quite at that stage yet at my company. So I don't know, what are your thoughts on that, Nathan? When do you start doing this?
Nathan Toups (16:57)
So, you know, if I were there, I would say: anything that can be solved by a human from a bug bash and QA standpoint can be automated. I would recommend getting one in, even if it's a bit cumbersome to get working; get it running as a no-op stage. And on the full UI stuff, and again, I'm not extensive in this, you know, there are integration test frameworks, and some of them do,
Carter Morgan (17:08)
Mm-hmm.
Nathan Toups (17:25)
I know in the past, and this is like a thousand years ago in these worlds, we had something that used a headless Chrome, and it would even take screenshots from the integration testing when it failed. Because again, it gets cumbersome: if you're doing big refactors and it's based off of HTML tags and stuff, things can get a little funky. There are tools that have gotten a little more semantic with large language models.
Carter Morgan (17:33)
Yeah, yeah.
Mm-hmm.
Nathan Toups (17:55)
But I would say, even if it's not used by any test, get a no-op test stage in place. Then lead by example and say, here's something I'm fixing; how do I do this using this integration testing framework? How do I cover this so that the build pipeline fails if a button is moved, or the button doesn't have the same output as I expect it to, right? And you'll basically uncover how testable our code is. You know, are there ways that we can restructure how we do modules so that we introduce fewer breaking changes? Because I think what you're looking for with integration tests is emergent behavior changes, right? Yes, if I write packages in JavaScript as pure functions, they're very easy to run unit tests on. When I compose all of those in a React route, and now state is trickling through 15 different components, I can't reason about this, right? It gets really difficult to reason about. And so I think that would be the, how do I introduce it... I was actually just having a conversation with another person on another project. It's a Linux build pipeline where they hate their testing framework, and they're trying to move to a new testing framework. And we were debating: should they just throw the whole thing out and start from scratch, which I think is almost always a bad idea, or can we do what I was calling parallel paths? Parallel paths being: keep the buggy test framework that you have, but... If you think of a DAG, a DAG being a directed acyclic graph, I start the process here and I fan out, and then I fan back in, right? The end here says deployable, you know, the pipeline's deployable. But I might be able to fan out in parallel and do a bunch of things like linters and unit tests and integration tests, all these things that can happen in parallel to save time. And then they come back and say, hey, look, all of these parallel processes came back and said we're good, so we're good. We should be able to do the same with a new paradigm of testing. And so there are frameworks to do these sort of more UI-based things. And like I said, I would just introduce it with literally zero tests in it, right? Just to get people comfortable with the fact that it's gonna come back and hit the check mark that says, hey, all of the integration tests passed, which is zero of them. And then you introduce maybe the world's simplest one, which is that your logo shows up in the top left corner or something. Something that's incredibly obvious, and you're learning the tool too. And then, if there's a demo day or an internal thing, you could even make a Loom video and say, hey, I've got this new thing. I'm encouraging people to start thinking about integration tests. I've written this really basic one. I know it doesn't do much. I think we should start doing this. I think it'll cut down on our bug bash time, right? I think this is how this fits into the DevOps stuff, which is that you can measure. Because then you can ask stuff like, does this introduce flaky tests? Well, if you write bad integration tests, they're going to be super flaky, right? If you're doing A/B testing, or you're experimenting with layouts and formats and stuff, and you make some huge change, now the integration test breaks. Well, OK, now people are going to be mad, right? They're gonna be like, ugh, these integration tests were a cool idea, but we moved too fast. But I would argue you can't move with confidence
Carter Morgan (20:57)
Yeah
Nathan Toups (21:17)
unless you have that test coverage, right? Because I think the end goal is you want it so that the most junior engineer in your company can ship a feature and hit the deploy-to-production button with the least amount of risk. And the only way you can do that is to automate the tests, right? And that means some upfront pain. What I would not do, though, is stop feature development while you're doing it, right? That was a mistake I've made in the past, where you're like, okay, everything's broken.
Carter Morgan (21:33)
Hmm.
Nathan Toups (21:45)
We can't do any new features until this happens. That's a terrible idea. So I think what you need to do is basically say, well, maybe we won't do new UI features unless there's at least a stub for integration tests for that feature. And as we get more comfortable, we can get stronger and say, hey, look, no UI feature starts shipping until it has integration tests. It won't go to production.
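The fan-out/fan-in pipeline Nathan describes, with a deliberately no-op integration stage wired in from day one, might look something like this in miniature. The stage functions here are hypothetical stand-ins for real tool invocations:

```python
# Sketch of the fan-out / fan-in DAG: run independent checks in
# parallel, and allow the deploy only if every one passes. The
# individual stages are hypothetical stand-ins for real tools.
from concurrent.futures import ThreadPoolExecutor

def lint():
    return True  # imagine invoking your linter here

def unit_tests():
    return True  # imagine invoking your unit-test runner here

def integration_tests():
    # Deliberate no-op so the stage is wired into the pipeline;
    # real UI checks get added one at a time later.
    return True

STAGES = [lint, unit_tests, integration_tests]

def pipeline_passes():
    """Fan out all stages in parallel, fan back in on the results."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda stage: stage(), STAGES))
    return all(results)
```

The point of the no-op stage is cultural, not technical: the green check mark for "integration tests" already exists, so the first real test only has to fill it in, not argue for a new pipeline shape.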
Carter Morgan (22:06)
I'm trying to think about it, 'cause I'm totally bought in on the theory. I think there's a lot of that in this book. We just made some big changes to the way that our coaches can set their own prices; again, I work for an online coaching platform. We kind of all hustled, you know, it was a whirlwind 48, 72 hours.
Nathan Toups (22:16)
Yeah.
Carter Morgan (22:35)
But when we got it done and said, OK, it's done, let's bug bash it, there were 50 bugs. That's how many we found just by fiddling around with the UI. And I'm not going to say that we went over our work with a fine-tooth comb previously and validated every conceivable edge case. But I'm also not going to say that some of these bugs we found were tough, like we wouldn't have known to test for them. If you're Facebook, this is fine, because they talk about this in the book a lot, where Facebook kind of deploys in stages. Well, they don't even deploy in stages. And I guess we'll touch briefly on this idea of decoupling feature releases from deployments: basically, that all feature releases and all features should be hidden behind feature flags. And that you should be able to deploy your code without deploying features, so that when it's time to actually activate a feature, it's not coupled with this whole big deploy, and rolling back a feature is not initiating a new deployment. It's as simple as adjusting a config value and flipping that on and off. So Facebook, they're great about this. They've decoupled their feature releases from deployments, and that makes it so that they do their feature releases in phases. The very first thing they'll do is open the new feature to Facebook employees, and then they'll roll it out to 1% of their users, and then 10%, and blah blah blah. That makes a lot of sense, because if there is some sort of weird, funky issue, you'll catch it at the Facebook-user stage and say, OK, this isn't working, right? We can't do that so much. Well, okay, so, so...
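A common way to implement the staged rollout Carter describes (employees first, then 1%, then 10%) is to bucket users by a stable hash against a rollout percentage stored in config. A minimal sketch, with hypothetical flag names:

```python
# Sketch: a percentage-based feature flag. A stable hash of the
# user id decides the bucket, so the same user always gets the
# same answer as the rollout percentage grows. Flag names and
# percentages are hypothetical.
import hashlib

ROLLOUT = {
    "new-pricing-page": 10,  # percent of users who see the feature
    "dark-mode": 100,        # fully rolled out
}

def is_enabled(flag, user_id, employee=False):
    """Employees always see the feature; others are bucketed 0-99."""
    if employee:
        return True
    pct = ROLLOUT.get(flag, 0)
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < pct
```

Rolling a feature back is then just editing `ROLLOUT`, not cutting a new deployment, which is exactly the decoupling of releases from deploys the book argues for.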
Nathan Toups (24:30)
why.
Carter Morgan (24:33)
Maybe this is the difference, then. Is the difference that we should be embracing automated deployments and embracing feature flags more?
Nathan Toups (24:42)
Thanks
Well, so, to get to the philosophy, and this actually gets us into the next chapter in the book, which is called Enable Low-Risk Releases: the principle is that it should be as low-risk as possible to make an update to your release. And feature flags are a tool that can help with this, right? You need trunk-based deployment, meaning that we ship things quickly. You should encourage folks to build code in a way that reduces side effects, meaning that the person who deploys a feature should take responsibility for the reliability of the system, right? Let me ask this question: do all the software engineers share on-call responsibilities? Okay, excellent. That's a huge part of it, because that helped us with our company culture; nobody wants to be called at three in the morning, and so it encourages fixing. I'd say the other part, though, is to also, like,
Carter Morgan (25:28)
We do, we do share the on-call responsibilities.
Yeah.
Nathan Toups (25:44)
Low-risk releases give you, again, to talk about Kent Beck, or even some of the wisdom that you got from Amazon, the optionality piece. You can write code in which, once I release it, I can't roll it back. A good example of this is a destructive schema update to a SQL database, where if I actually rolled it back, I would delete columns and lose functionality in my application, which, again, I would argue is a bad, bad practice. It was one that I had to work on at a pretty large company that should have known better. And I ended up working with them on this idea of decoupling schema deployments from application deployments. There'll be a thesis to this in a second, which is: well-planned software should know that a schema change is going to happen in the future. You should write your software so that it can handle either the current state of the schema or a future state of the schema, so that it's able to be rolled back. So what I would do is say, I'm going to proactively update the schema so that once the new application features come out, I can use it. But if I have to roll back my application, the old version of the application can handle either schema. If you do this, you actually get application deployments that are reversible. If you don't, you end up with a one-way road that's very high-risk. And again, this violates the idea of low-risk releases. So getting a software engineer to think about the reversibility, the one-way versus the two-way decision, is really important. And at first it might be uncomfortable. Software engineers might even push back and go, that's too much for me to think about. And I would argue: is it too much for us to get customer churn, right? Is it more expensive for us to lose reputation with our end users because you're just feeling kind of lazy, feeling like YOLOing it right now? And if it's not important enough for you to think through, then maybe we should reject this change for production, right? And that sounds kind of mean, but it's actually nice. It's a nice thing, because I care about the customer, and we should always be making customer-centric decisions, or I would say user-centric decisions, because this works for open source as well, right? But the goal should be: how do we reduce the risk of our release? And so...
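The decoupling Nathan describes is often called expand/contract: ship the schema change first, then write application code that tolerates both the old and the new shape, so the app deploy stays reversible. A toy sketch, with hypothetical column names:

```python
# Sketch of the expand/contract idea: the schema migration (say,
# splitting a 'name' column into 'first_name'/'last_name') lands
# first, and the application handles both shapes in one code path,
# so rolling the app back never destroys data. Column names are
# hypothetical.

def customer_display_name(row):
    """Work against either the old schema (single 'name' column)
    or the new schema ('first_name' / 'last_name')."""
    if "first_name" in row:                       # new schema
        return f"{row['first_name']} {row['last_name']}"
    return row["name"]                            # old schema

# Because either version of the app runs against either schema
# state, the schema migration and the app deploy are decoupled,
# and the app deploy becomes a reversible, two-way decision.
```

The "contract" step, actually dropping the old column, only happens after every running version of the application has stopped reading it.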
Carter Morgan (27:42)
Hmm.
Mm-hmm.
Mm-hmm.
Nathan Toups (28:08)
If the quality gate is a bunch of engineers having to do bug bashes before we're actually ready for production, what that tells me is there's some broken process, in which instead of having an assembly line at the Toyota factory, we have artisanal craftsmen who are creating cars by hand, right? Which you don't want at scale. That's just not the way. And again, we all got there. I've been in organizations, I'm not judging, but.
Carter Morgan (28:29)
huh.
Nathan Toups (28:39)
One of the ways that you might be able to measure yourself is: start measuring the number of manual bug bash things that you have to do, and then make a goal to reduce the number of bug bash interventions from a human, and see if you can say, you know, we reduced it by 20% this week, that's awesome. And then, you know, give yourself a feedback loop that way. Again, there's a little bit of pain. In the short term, it might feel like you're slowing down, but there's a...
Carter Morgan (28:58)
Mm-hmm.
Nathan Toups (29:08)
There's a military saying that I'm a big fan of, which is slow is smooth, smooth is fast, right? Slow is smooth, smooth is fast, which is that, hey, if I'm working as fast as I can by myself, I might be able to go around like a crazy person and just go super frantic and do decisions. But as soon as I have like five people on a team or eight people on a team, the cohesion of our team is more important than how fast any individual is working.
Because once we nail down cohesion, then we can get operations smooth, we can start moving really fast. And we can start moving way faster, right? If you asked me to carry a rowboat on my back, I couldn't do it, right, by myself. ⁓ If I'm just yanking at it frantically. But if five people are yanking at it frantically in five different directions, we're not going to move the rowboat either, right? So we have to coordinate. We're all put it on our shoulder. We're going to decide the direction that we're going to go in. And then all of a sudden, we can move a rowboat.
so fast, right? We can move, we can go in a great direction. And so I think sometimes with this, you look at a broken process, go, ⁓ it's a pain every time we have to do this. I really loathe it. ⁓ This is a good example of being like, okay, let's go on the same page. Let's all get the same target. And I'll tell you, once you get over that initial test automation integration, ⁓ or, you know, maybe you can... ⁓
increase the cadence for production releases. So the other argument would be, let's do a production release. Let's experiment with making one feature and then trying to release that immediately to production as well. ⁓ You know, especially if you can do it with a side effect free one. If that's not comfortable, maybe experiment with building a feature that has a feature flag, some basic feature flagging system that you understand what the thing is. You can demo how a feature flag might work. Because again,
there are a thousand ways you can do feature flags. It can also get really expensive. There are really great tools like LaunchDarkly and a few other ones out there. But at scale, it gets really expensive really quickly. So if you're a startup, you may just want to suck it up and take the cost on. There are also really cool open source flagging systems that have come out, depending on what framework you're using. There are Kubernetes-native ones. We were using one at
Carter Morgan (31:05)
Yeah.
Yeah.
Nathan Toups (31:33)
my last company, because we started making this move to feature flags. And ⁓ it's a culture shift, right? ⁓ But it can be really cool. ⁓ You kind of have to do two things at once: you have to get feature flags in place, and you also have to get what's called a canary deployment. So it makes way more sense if you have ⁓ these like cohorts or audiences. So for instance, let's say you've got somebody on your platform who's like an early adopter. They love
Carter Morgan (31:50)
Yeah, yeah.
Nathan Toups (32:03)
your company, they want every new feature as soon as it comes out. You get them to hit the checkbox that says, I want to alpha any new feature. And they know full well they're going to get access to features before everybody else, but it might be buggy, right? And we would actually even hope that they give us some feedback if something weird's going on. You actually shift your relationship with your customer because all of a sudden they're like, oh, I'm part of the team. I get superpowers that other people don't have.
Carter Morgan (32:26)
Yeah, yeah.
Nathan Toups (32:32)
They give a little more forgiveness because they're like, oh man, you know what, this is an alpha feature. I did sign up for that. It is a little buggy. But you can find out from, you know, your five most dedicated customers, or, you know, 20 most dedicated customers, or 1% of your customer base or whatever. You can find out if, like, errors jump 50x or something because they turned this feature on and all of a sudden, you know, things are barfing all over the place, instead of the whole, you know, all-or-nothing kind of releases.
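A minimal sketch of the flag-plus-alpha-cohort pattern Nathan describes might look like this. All the names here (`FeatureFlags`, `new-checkout`, the user ids) are hypothetical; a real system like LaunchDarkly or an OpenFeature provider adds persistence, targeting rules, and percentage rollouts on top of this idea.

```python
# Minimal in-memory feature-flag store with an "alpha" cohort.
# Illustrative only; production flag systems add far more.

class FeatureFlags:
    def __init__(self):
        self._flags = {}           # flag name -> enabled for everyone?
        self._alpha_users = set()  # users who opted into alpha features

    def register(self, name, enabled_for_all=False):
        self._flags[name] = enabled_for_all

    def opt_in_alpha(self, user_id):
        # The "I want to alpha any new feature" checkbox.
        self._alpha_users.add(user_id)

    def is_enabled(self, name, user_id):
        if name not in self._flags:
            return False                  # unknown flags default to off
        if self._flags[name]:
            return True                   # fully rolled out
        return user_id in self._alpha_users  # canary cohort only

flags = FeatureFlags()
flags.register("new-checkout")           # off for everyone by default
flags.opt_in_alpha("early-adopter-42")

print(flags.is_enabled("new-checkout", "early-adopter-42"))  # True
print(flags.is_enabled("new-checkout", "regular-user"))      # False
```

The point of the cohort check is exactly the error-budget idea above: a buggy feature blows up for the opted-in 1%, not for everyone at once.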
Carter Morgan (33:02)
Yeah, it's all just a matter of timing, right? Just in that, like, again, you read these case studies, these big enterprise customers, ⁓ and they're saying, yeah, we only did four releases a year and it was a huge nightmare. And we had to summon the whole release team and it was a week-long affair. I mean, we read The Unicorn Project, right? And that's ⁓ what they have going on there. ⁓ We're not like that. Like, there's a lot of things like...
Nathan Toups (33:07)
Yep.
Carter Morgan (33:32)
I don't want to be the guy who says, this is best practices, we must do what best practices dictate, ⁓ when it's not currently very painful, right? Like, this company I'm working at, I've been very, very impressed with their hiring and recruiting, and all the engineers I work with are great engineers. ⁓ So it's a high trust culture, which is great. ⁓
And it's one where people are making relatively few dumb or silly mistakes, which is also great. I know that doesn't scale. I know that as this company grows from 30 employees to a hundred employees, 200, a thousand employees, right? All of this stuff in this book is going to become necessary. And I see the dark alternative future where, you know, they talk about this a lot in part four, where...
You start imposing more and more cumbersome practices like, okay, now we have this big change release form you have to fill out, or now you can't just get a manager's approval. You need a VP's approval. Now you need the CTO's approval to make a change. I don't want that at all, but we're not in danger of that yet. And if we ever started going down that road, I'd fight hard to make sure that we don't go down that road. ⁓ And I think that there is value in implementing some of this stuff.
Nathan Toups (34:35)
Yep.
Carter Morgan (34:56)
early on ⁓ to set a culture and a precedent. ⁓ But at the same time, there's a lot of features that need to be built. There's a lot of, I mean, I'm always a little nervous when, like, as of now, this company is not profitable, right? Like, and that's very normal for a venture-backed startup, right? But to kind of say, like, we should be focusing on all of these best-practice engineering health stuff when
Nathan Toups (35:14)
Yeah, that's normal startup, yeah.
Carter Morgan (35:24)
Right now we don't make enough money to pay our salaries. It's a little like, I don't know about that. Yeah.
Nathan Toups (35:30)
Again, I would argue
that you're actually slowing yourself down without realizing it. So another good example, right? The team that I just got ⁓ wired up with some of these tools, it didn't take me that long if you pick the right tools. So for instance, we don't have an existing Kubernetes cluster. Kubernetes kind of shifts this a lot because you can abstract away a lot of things. ⁓ We were using a little tool called Fly.io. Fly.io is really interesting because...
The entire abstraction is a container. Under the hood, it's using, like, Firecracker and some other kind of cool stuff. What I love about Fly, and I had a friend who turned me on to Fly, is that for our local dev transition into just having, you know, prototyping environments, it is a no-brainer. It does all the right DevOps best practices. It makes it really easy to wire up to your pipelines. And from a friction standpoint, the engineers love it because, like, you literally go fly logs.
And you're inside of the repo and it knows. You know, we're in a monorepo, so it has a bunch of these, like, microservices. So if you're just in the microservice's directory and you type fly logs, the log stream of that service comes up for the thing that's deployed. Well, yeah, cool. That's fine. But if you have to do that by hand, you're like, okay, well, what logging service am I using? And what's the deploy tool? What's the pipeline? And all these other things. And, like, I think that
how rapidly you can do the right thing is more important than having, like, the most... Are we going to use Fly for production down the road? Possibly, possibly not. You know, there are some DevOps folks who have other infrastructure that we run ourselves. We may transition into that, but what I'm doing is I'm setting the bar high, being like, it has to be at least as good as what Fly's deploy process is. And if it's not, the engineers are going to mutiny. And the engineers, I will tell you, are so happy with this. We've been able to onboard new people.
They've been able to wire up the rest of their ⁓ testing automation. And there's only maybe one or two full-time people working on this, but they have this mature data flow. Everything's in Git. All the tests cover these things. As they run into issues, they just update this. They really care about never repeating a mistake twice. ⁓ And I think that that investment is actually worth it. You don't have to stop the world.
You can dedicate 10% or 20% of your time to doing these things, and especially the stuff that 10Xs your return on investment. And they talk about this in the book, right? For instance, I've actually met, there's a guy, Tom Limoncelli. Tom Limoncelli wrote this ⁓ cloud ⁓ SRE book that's really good. He has a background as a... ⁓
systems administrator, back when we called them that. He was at Google, super smart guy. And Tom Limoncelli was like, if I had my way, I would go to most alerting dashboards and just delete everything. He said this towards the end of the section that we read. I would delete everything. And then the next time that there was an outage, we would build alerts off of those. And we would build these up off of the only actionable alerts. And I think that that's one of the things that comes up here, is that there's a lot of things that we measure,
and there's a lot of things that we alert on, that are going to be the signal-versus-noise thing. ⁓ Every metric that we gather should be to get early detection, to avoid some sort of production issue before it becomes a production issue. And if it's not actionable, you shouldn't be measuring it, right? Unless it's like a compliance thing, unless, like, the government makes you, you know, gather logs or whatever. And I think this is absolutely true. So it's like, for instance, let's say you have a SEV1 ⁓
on your deployment. You did the bug bash, you deployed to production, and despite all of this, some bad thing happens, right? After that postmortem, after you go through the five whys, after you do all the stuff that you're supposed to do, there had better be telemetry and a metric to early-detect this the next time. And if there isn't, I would say that your postmortem process is broken, right? If there isn't a new way to discover this in tests beforehand, and there isn't a way to measure this again in the future,
you're literally doing Groundhog Day. You're literally having this thing in which, and again, we've all been guilty of it. I'm saying this as, like, a message to myself from 10 years ago. ⁓ You're not honestly solving the problem unless there's an organizational learning that comes from some sort of failure. And again, that could be pipeline and process failure. That could be runtime and measurement failure. But...
Carter Morgan (39:53)
Yeah, yeah.
Nathan Toups (40:18)
That's the sort of maturity we have to bring as software engineers to these problems, even at a startup. I would actually say, especially at a startup, because if Google pisses off a few customers because of some bad queries, they're not going out of business. For a startup, if they churn a few core customers, it's devastating, right? Absolutely devastating.
Carter Morgan (40:41)
Let's talk a bit about telemetry. ⁓ That's what ⁓ all of part four is about. ⁓ Telemetry, for our listeners who aren't familiar, is the practice of collecting logs and metrics to monitor your application. ⁓ This is another practice I've seen at the big tech companies, where they really excel. ⁓ There are some kind of basic principles of
Nathan Toups (40:49)
Yeah.
Carter Morgan (41:10)
telemetry. ⁓ One of them is that it should be really, really easy to register a new metric. If you want to log something, you don't want to have to go to some sort of central system and create a new metric. A library I've used in the past, and is mentioned in this book, is StatsD, which I didn't realize was created by an Etsy engineer. ⁓ StatsD is really easy in your code base too. It's just one line.
Essentially, as soon as you want to log a metric, you just write a line like, log this thing and call it this. StatsD picks up on it, automatically categorizes it as a new metric. And then you can feed that into your telemetry visualization tool and ⁓ start making graphs on it. ⁓ Do you have any telemetry tools you've liked in the past, Nathan? This is one where I
Nathan Toups (42:09)
Yeah.
Carter Morgan (42:09)
I've been
in a weird spot, because at the big companies I've been at, they've had their own in-house versions that they've built for telemetry tools. But I don't know. What do you like off the shelf?
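The StatsD pattern Carter describes, where one line of application code both emits and implicitly registers a metric, works because the wire format is just a tiny UDP datagram. A toy stdlib-only sketch (host, port, and metric names are illustrative; in practice you'd use an official StatsD client library):

```python
# A toy StatsD client, showing why adding a metric is "just one line":
# the payload is a tiny string like "checkout.completed:1|c" sent
# fire-and-forget over UDP. Metric names here are made up.
import socket

def statsd_payload(name, value, metric_type):
    # counters are "|c", timers "|ms", gauges "|g"
    return f"{name}:{value}|{metric_type}".encode()

def send_metric(name, value, metric_type, host="127.0.0.1", port=8125):
    # UDP is connectionless: no handshake, no ack, near-zero overhead,
    # which is what makes sprinkling metrics through a codebase cheap.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(statsd_payload(name, value, metric_type), (host, port))
    sock.close()

# The "one line" in application code:
send_metric("checkout.completed", 1, "c")     # increment a counter
send_metric("checkout.latency_ms", 87, "ms")  # record a timing
```

Because the send is fire-and-forget, instrumented code keeps working even when no StatsD daemon is listening, which is part of why the friction is so low.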
Nathan Toups (42:15)
Right.
First I'll say: what a time to be alive. ⁓ We're in the best spot for telemetry tools we've ever been in. ⁓ I remember the good old days when telemetry meant Nagios and a bunch of other weird stuff, and ⁓ we were trying to get things set up and it was a mess that we all got used to as sysadmins, but it was just a mess. Things are great now. ⁓ So...
Obviously the big ones: ⁓ OpenTelemetry, I think, has sort of won the standards war, whether you use them or not. OpenTelemetry is a really important paradigm because ⁓ there are really sort of, like, three types of telemetry, right? ⁓ There are logs, there are ⁓ sort of, like, metered statistics, and then there's tracing, right? ⁓ And traditionally those were all kind of, like, three separate things.
Carter Morgan (42:52)
Yeah.
Nathan Toups (43:16)
OpenTelemetry gives us a way to sort of take this time series data and cross-reference it, so that logging events and distributed traces and metrics can all ⁓ be referenceable across one another. And in reality, that's really kind of how you want things, because yes, maybe I care about some gauge that tells me how saturated a load balancer is, but I also probably want to cross-reference that with the distributed trace right behind it, right?
⁓ OpenTelemetry kind of gave us a way to have this conversation. ⁓ And I'll tell you that setting up logging is pretty easy. ⁓ Setting up good logging is not. A lot of companies, and I'll still see this today, ⁓ haven't even settled on a structured log format, right? Structured logs are much better than just regular logging to standard out or standard error. Structured logs typically come in the form of, like, a JSON-formatted ⁓ log line that gives me, like,
the server it was running on, the timestamp, the log that came out of the runtime itself, and, if there are additional annotations, things like the Kubernetes cluster. It makes it really nice. It makes it easy to query. To answer your question, though: if I was a startup and I was doing this from scratch right now, and I didn't want to put too much effort into it, Grafana Labs has an amazing cloud service. Grafana is the visualization dashboard that most people use these days.
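The structured-log shape Nathan describes, one JSON object per line with the host, timestamp, level, message, and extra annotations like a cluster name, can be sketched with the stdlib logging module. Field names and the cluster value here are illustrative; teams usually standardize on their own schema.

```python
# A minimal structured-log formatter: every record becomes one
# queryable JSON object per line instead of free-form text.
import json
import logging
import socket
import time

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%S",
                                time.gmtime(record.created)),
            "host": socket.gethostname(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # static annotation a log shipper might add (made-up value)
            "k8s_cluster": "prod-us-east-1",
        }
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("payment authorized")  # emits one JSON object on stderr
```

Once every service emits this shape, a backend like Loki can index and query on the fields rather than grepping raw text.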
Carter Morgan (44:44)
Mm-hmm.
Nathan Toups (44:44)
But
they have their own logging service called Loki. They have their own metrics. They also have a really cool load testing tool called k6, which has nothing to do with Kubernetes. It's just called k6. ⁓ And it's an amazing tool. It gives you a tremendous amount of resources if you want to stress test API endpoints or get the metrics stuff. If I was going to do something right now and I just wanted to have low effort, high reward, I would probably just use Grafana Cloud straight up. ⁓
Carter Morgan (45:11)
Interesting.
Nathan Toups (45:13)
There's another company I like a lot called Coralogix. They have a little bit different approach. I've used them in the past. ⁓ What Coralogix does is take a bunch of open source stuff and then put a really cool ⁓ Kafka ⁓ event stream in front of it. So you can put tremendous volume through it. It was a fintech company I was working at, so we were, like, bombarding it with terabytes worth of data a day, ⁓ and it handled it without a hiccup. And it also let us store our data
long-term into S3 buckets that we owned. So we would basically have, like, hot logs, you know, warm logs, and then, like, cold storage, because for compliance with the SEC, you have to store things for, like, seven years. ⁓ And so it was kind of cool to be like, we didn't have to build out our own observability infrastructure. We could just use Coralogix and then, like, make it work. ⁓ There's a lot of great solutions that you can basically just, like, hook in. You know, for more front-end stuff, there's, like, Sentry, there's, like, a bunch of other ones
that kind of, like, plug into things and then extend functionality. Obviously there's the big one, which is Datadog. Datadog works really well, but oh man, is it expensive. Like, it is so expensive. I will tell you, though, I was at a company where we would do, like, 2 million a month in cloud spend. And I think we spent around, like, 30 or 40,000 a month on Datadog.
Carter Morgan (46:22)
That is what I've heard.
Nathan Toups (46:40)
So, you know, and that was considered, like, low spend, because Datadog will tell you that about 20% of your cloud spend will be observability. Like, 20%! So, of course, they're like, oh yeah, you're not even spending 200,000 a month, you know. So yeah, observability stuff, though, I will tell you, is incredibly important, and it does need to be low friction.
Carter Morgan (46:40)
This is
crazy.
Yeah.
Nathan Toups (47:07)
There really needs to be, like... I'm a big fan of one place for all of this. And this is where Datadog, they have what they call the Hotel California model, right? Which is, I don't know if you know this one, but it...
Carter Morgan (47:18)
You can check
out anytime you like, but you can never leave.
Nathan Toups (47:20)
You can never leave. So Datadog
does enough unique things that it's so painful to switch off of them that you just are like, ⁓ I'll suck it up and pay the bill, right? Which is why, if your instinct is like, I don't want to be in that relationship, then... But the weird thing, what has shifted a bit, is even Datadog understands this. OpenTelemetry has this idea of sinks. And again, we know sinks from ⁓ graph theory, right? From grad school.
Carter Morgan (47:27)
Interesting.
You
Nathan Toups (47:47)
A sink is, there's a source in a graph, and then there's a sink. A sink is, like, a termination point where something comes in and doesn't come out. You can now use OpenTelemetry and have a Datadog sink. And so I could actually have private observability stuff and Datadog, and send them off to separate sinks. And this is, like, a pattern that's come out. This also allows you to do things like... OpenTelemetry lets you, like, pre-filter. So let's say, you know, I don't want to send all 200 OK
requests, I want to give a sample. And so I'll sample my logs so that for healthy logs, I'm only going to send 10% of those to Datadog. There are, like, all kinds of interesting things you can do. Again, I would not optimize this as a startup. But if you become a successful startup and you're doing billions of transactions or something, sampling and a bunch of these other filters and other things kind of come up. I'll tell you this: telemetry, we would actually...
use telemetry not just for production, though. Our staging environment and our dev environments and our ephemeral dev environments, we would also ship those off into whatever telemetry platform we had. ⁓ And we had a bunch of rules, like, you were allowed to use debug logs in dev. ⁓ You were allowed to use debug logs in staging, but there was a budget for it. So, like, we would only allow so many debug logs, because with debug logs, typically, you'll run up your bill really fast.
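The sink fan-out and sampling Nathan described a moment ago, keep everything in a sink you own, but send only errors plus a sample of healthy traffic to the expensive hosted sink, could be sketched like this. The sink names and the 10% rate are assumptions; a real OpenTelemetry Collector expresses this as pipeline configuration rather than application code.

```python
# Sketch of dual-sink routing with head sampling of healthy records.
import zlib

def keep_healthy_sample(record, percent=10):
    # Errors always pass; healthy records are sampled deterministically
    # by hashing the trace id, so every span of one trace makes the
    # same keep/drop decision.
    if record["status"] >= 400:
        return True
    bucket = zlib.crc32(record["trace_id"].encode()) % 100
    return bucket < percent

def route(records):
    private_sink, hosted_sink = [], []
    for r in records:
        private_sink.append(r)        # keep everything in-house
        if keep_healthy_sample(r):
            hosted_sink.append(r)     # only errors + a sample
    return private_sink, hosted_sink

records = [{"trace_id": f"t{i}", "status": 200} for i in range(1000)]
records += [{"trace_id": "boom", "status": 500}]
private, hosted = route(records)
print(len(private), len(hosted))  # hosted is far smaller than private
```

Hashing the trace id rather than rolling a random number is the standard trick for keeping sampled traces internally consistent across services.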
Carter Morgan (49:05)
Hmm.
Yeah.
Nathan Toups (49:11)
And then we had an immediate alert if debug logs showed up in production. So we would do these kinds of things to make sure that somebody didn't accidentally turn a flag on that said debug in production, because with our production workloads, we would literally spend, like, 3x our normal Datadog spend just on debug stuff if we weren't careful. So ⁓ we had really strict rules in place for that. But the observability stuff, I think it was...
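The environment-based debug-log policy Nathan outlines, free in dev, budgeted in staging, alert-on-sight in production, could be sketched as follows. The environment names, the budget number, and the alert strings are all made up for illustration.

```python
# Sketch of a debug-log guardrail per environment.

DEBUG_BUDGET_STAGING = 1000  # max debug records per deploy, say

def check_debug_policy(env, debug_count):
    """Return (allowed, alert) for debug logging in this environment."""
    if env == "dev":
        return True, None                       # debug is free in dev
    if env == "staging":
        if debug_count > DEBUG_BUDGET_STAGING:  # budgeted in staging
            return False, "staging debug budget exceeded"
        return True, None
    if env == "production":
        # any debug log in prod pages someone before the bill does
        return False, "debug logging detected in production"
    return False, f"unknown environment: {env}"

print(check_debug_policy("dev", 50))        # (True, None)
print(check_debug_policy("staging", 5000))  # budget exceeded
print(check_debug_policy("production", 1))  # immediate alert
```

In practice the "alert" branch would be wired to the telemetry platform itself, so the guardrail is enforced by the same system it protects.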
Carter Morgan (49:15)
Hmm.
Nathan Toups (49:39)
I've heard this about Amazon. I don't know if this is true of your team or your experience, but we had this at my fintech company, which was: we actually had to propose the dashboard along with the software that we were going to work on. Like... yeah, so, like, I've heard that for ⁓ some products at Amazon, you had to write the press release before you started the project. And then, yeah, okay. So the press release, and then some teams would do press release plus
Carter Morgan (49:55)
Interesting.
Yeah, that was very true.
Nathan Toups (50:08)
⁓ observability dashboard. So the idea would be, I'm going to propose which things would be potential things that would fall over, what I would need to measure, because other teams might want to consume those alerts. ⁓ It's really useful if you're talking about a big feature. Or if you go back and look at your postmortems, this is a good point, this is something I did: when I came into a team that I had not been on, they were having a lot of deliverability problems. They were introducing a lot of bugs into production.
Customers were upset. We looked at the postmortems over the last year before I joined to see if there was a pattern of stuff that was going on. And I used that to decide what should go in the observability dashboard. So this is a cool practice, which is that you go back and go, hey, I saw this happened. I saw this was the resolution. Do we actually measure this now? Like, how do we know that this isn't going to happen again? Luckily,
a decent amount of the time there was something. Maybe the thresholds weren't set properly, or, you know, it wasn't exactly how I would have done it. But there were a few cases in which we were literally having that Groundhog Day experience, where people knew it was a problem, no one was measuring it, and everyone kind of just, like, you know, crossed their fingers and prayed that ⁓ the next deploy wasn't going to break stuff. And that's what we shifted. And when we did that, we earned a lot of confidence from leadership, right? Because all of a sudden we could say, hey, ⁓
you know, we're looking at this in staging and things went up this much. And, you know, when we do a stress test, it does this. We can't release this yet. ⁓ And so a lot of it's just, like, yeah, get out of hunch world. You know, a lot of smart people can have a good hunch, but you should really back it up with data. And that's where the observability stuff comes in.
Carter Morgan (51:55)
Do we want to talk at all about ⁓ on-call? We've talked about on-call in the past. ⁓ The book dedicates some time to talking about on-call. ⁓ There are some practices here that are standard at most companies these days. And if they're not, you're living in the Stone Age a bit. But the idea is that developers are on-call. There is no separate ops team. ⁓ I remember when I worked at
a big entertainment company right out of school, a little more enterprisey. I'm surprised they're not in here as a case study. Maybe that's good, maybe that's bad. ⁓ And I remember I was young and dumb, and my manager was floating, like, I'm thinking of getting some separate on-call engineers, you know, that are going to be permanently on call, and we'll get contractors or something. And as an employee, I'm like, that sounds great, so we don't have to do any on-call. ⁓ Not...
great. You decouple, your feedback loop is less tight, right? And engineers are no longer incentivized to not push buggy code. You don't feel the pain of what you're developing. I've been pretty upfront on this podcast before: I am not one of those engineers who clutches their pearls or gets upset if you ask them to do front-end work or back-end work or DevOps work or any kind of work.
I don't really believe in super clear separations, ⁓ I guess, as you get bigger and maybe become incredibly specialized. In general, though, I think we're engineers. I think we build things, and I think we own those things we build, and we're in charge of fixing them when they break. So ⁓ yeah. I have a couple of questions I ask when joining a new company, just to kind of suss out, like, okay, what are your development practices? One of my favorite ones is,
Nathan Toups (53:42)
Yeah.
Carter Morgan (53:49)
how does a change on my local computer get into production? Because that'll kind of, it lets you suss out, like, is there CI/CD? Even when I joined this company, it was funny, because in my interview, there was one of our junior engineers and then my boss, the VP of engineering. And I asked that question, and the junior engineer talked about it from kind of a more nuts-and-bolts perspective. But my now boss, after that, he asked me, you're asking about CI/CD,
Nathan Toups (53:53)
That's a great question.
Carter Morgan (54:18)
aren't you? I'm like, yeah, that's what I want to know. And he told me, he's like, we have some of it done. We don't have all of it done, but it is a company goal and the CTO has committed, blah, blah, blah. And that was enough for me. I think we talked about this on the podcast last week, but I'm just becoming more and more convinced that you can't really change a company when you join it. You can be an accelerant for values the company already has, right? ⁓ But it's hard,
Nathan Toups (54:43)
Right.
Carter Morgan (54:45)
as a rank-and-file employee, to inject new values into a company. And so that's really important for me: not trying to suss out exactly whether they're doing all the things I would like them to be doing, but do we share the same values, and do we have the same understanding of what good looks like? Because if you're aligned there, then you can be a huge force multiplier. ⁓ Anyhow. ⁓ Yeah, so...
Nathan Toups (54:50)
Right.
Carter Morgan (55:11)
⁓ And another thing I ask about is on-call rotations when I join. Just because, one, you want to make sure it's not, like... There are kind of the two sides. It's like, we don't have any on-call rotation, we have a bunch of contractors out of India and they do it. To me, that's a big red flag. Another big red flag is, we don't have an on-call rotation, everyone's on call all the time. Red flag. ⁓ But I'm just a big fan of that total ownership model, and I've said that in the past.
Nathan Toups (55:27)
Great.
Great.
Carter Morgan (55:42)
⁓ In the past, when I've talked to recruiters and they've asked, what's important for you in joining a company? I say, end-to-end ownership of a product. I just want to be in control and have not only the power, but the responsibility.
Nathan Toups (55:55)
Yeah, I love that. My favorite teams that have done the on-call rotation, we used a very simple pattern, which was one week on as primary, one week on as secondary. and basically you... interesting.
Carter Morgan (56:07)
Yeah. At Amazon, we had a tertiary.
Nathan Toups (56:12)
Yeah, so, yeah, because your team has to be big enough to handle this, right? Because you have this, like, combinatorial problem. So what we would do is you would start as secondary. Basically, you were shadowing the person who was primary.
Carter Morgan (56:11)
It was rough.
Nathan Toups (56:23)
There was a handoff, and we would do ours on, like, Monday afternoon. We thought Monday afternoons were nice, just because typically there just wasn't a lot of deployment activity going on; it would get busier later in the week. ⁓ That handoff would be around lunch, around noon for us. ⁓ And you basically were like, hey, here's the weird stuff I had to deal with last week. Just kind of be aware, you know, this comes up. You probably were aware because you were secondary, but, you know, this is the stuff I handled without even bringing it up in Slack. ⁓
And then the secondary would become primary, and then you'd get a new secondary. So there would basically be three people in your meeting when you did a handoff: new secondary, secondary becoming primary, and then the primary who's leaving. And that was a really nice cadence, because it just kind of helped you keep track of that stuff. Everybody in our company ⁓ who was an individual contributor would
be part of this. I was actually hired as an SRE. And the book actually talks about this. I've never been... I know folks who worked at Google and stuff where a service would get booted. So SREs, SRE is this elite group at Google. Yeah, you're right, you're right, site reliability engineers. They're like the Navy SEALs of the company. They build this excellence of software engineering
Carter Morgan (57:30)
Yeah, yeah.
Site reliability engineer for any.
Nathan Toups (57:50)
up around reliability. And the thing is that if your application was built so poorly that on-call was getting paged all the time, and all these other things, they would actually kick you out. And you would lose access to SRE on your project, and then you'd have to just handle it yourself until you got back to a certain level of excellence. I've never been in an organization like that. We've all tried, best effort, to get things into place. But one of the things that we would do for on-call, though,
is the way that we used PagerDuty. Typically, it would always hit the primary first. If they didn't respond within five minutes or three minutes or something, it would fail over to the secondary. And then we had what we called... it was like the third level was tattling to your boss. So the third level was the head of engineering. And it wasn't just whoever was on call; the head of engineering would get notified that both of you missed the call. ⁓
You didn't want that to happen, I'll just put it that way. Like, that was a... you're like, how did both of you not respond to this? Right? Because you were waking up the head of engineering at three in the morning, ⁓ and he's like, why did this process fail? And it'd get fixed really quick, I'll put it that way. But yeah, ⁓ I think that that kind of stuff matters. And then giving folks the autonomy to actually fix systems, to troubleshoot, to have access to logs...
We kind of take all that for granted, but obviously if you're on call, you need to be able to access machines, do resets, do other things. And that should always turn into a postmortem. Those should be blameless postmortems, as we've talked about in the past. Because again, at the end of the day, any individual in the company may have been the person who caused the problem, from, I actually wrote the code, or I accidentally, you know, did our own
Carter Morgan (59:30)
Big fan of that on the podcast.
Nathan Toups (59:44)
rm -rf and deleted everything off the file server, or dropped the table in a database. But the company let you do that. The whole idea of a blameless postmortem is: something broke in our organizational processes to allow these failure points to happen, whether they're humans or externalities or whatever. A blameless postmortem says, okay, as an organization, how do we prevent this from happening in the future? And that's the only way. Sharing that is the only way to build trust,
to take risks, right? If you're gonna get your head on the chopping block because you took a risk to try to make a positive change in the company, and it blew up in your face, and the company blames you for it, then you're not gonna take risks anymore. You're just gonna, like, take your paycheck and do a couple of tickets and call it a day, and you're not gonna be innovative. And obviously the biggest thing that you want out of your software engineers is innovation, and you have to cultivate that. And so that's the whole idea here: if you get called at three in the morning...
Necessity is the mother of invention, right? You're like, I don't want a call at three in the morning. How do I fix this system so that I don't have that happen again, right? ⁓ And then, yeah, blameless postmortems. And then, of course, they also talk about launch readiness reviews. Do you all do anything like that in your organization? Okay. Yeah. Yeah.
Carter Morgan (1:00:49)
Yeah.
No, I mean, I've done this at big companies, right? Like, holy cow. Yeah. I remember
my big feature at Amazon was user-defined IP, which, just to give you an idea of big tech scale and how slow things move... ⁓ I worked on a product called PrivateLink. What PrivateLink is, is it's a frontline AWS service.
You have your virtual private cloud. You've got your EC2 instances within it. You might want those EC2 instances not connected to the public internet, for security purposes usually, but you still might want those EC2 instances and the applications running within them to be able to access ⁓ AWS services like Lambda. Maybe you want them to trigger a Lambda, or maybe you want them to be able to access your S3 buckets, right? What PrivateLink would do is provision
an elastic network interface within your virtual private cloud. So it'd have an IP address. And when you query that IP address, we would route your traffic purely along the Amazon internal network. And so that was cool, because you'd be in the box and you'd try to ping google.com and nothing gets out, but you could use the AWS CLI and it worked like a charm. Really, really neat product. ⁓ My big project, and this was like a six-month affair, was about how that IP address was provisioned. Previously,
It had just been randomly generated within the IP cider range for that subnet. My project was making it so you could choose that IP address upon provisioning. This took six months and in part took a lot of time because of these big launch readiness reviews. and yeah, security reviews and penetration testing and lots and lots and lots of stuff. Again, kind of getting at the theme of this, which is like.
Amazon had a product that was generating a billion dollars in revenue each year. And it took maybe a team of like 20 or 30 engineers to keep it running. So hugely, hugely profitable. So Amazon's biggest goal is keep the beast running. Don't bring it down. We have tons and tons and tons of customers. When HBO Max changed their name to Max, for some reason that was my problem, because they were a huge customer of ours and they had to rename a bunch of resources. So now that they've changed it back to HBO Max,
thinking about my old team at PrivateLink, I'm like, I wonder if they had to deal with that again. Anyhow, so yes, those big launch readiness reviews, like, I get it, because you have so many customers and you can't jeopardize it. At my current place, it's a startup, no, we don't really do anything like that because there's not a ton we can jeopardize.
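The user-defined IP change Carter describes boils down to one rule: the chosen address must actually fall inside the subnet's CIDR range, where previously it was picked at random. A rough sketch of that check using Python's standard `ipaddress` module (the function name and shape are hypothetical illustrations, not actual PrivateLink code):

```python
import ipaddress
import random

def pick_endpoint_ip(subnet_cidr, requested_ip=None):
    """Pick the private IP for a (hypothetical) endpoint network interface.

    Old behavior: randomly generated within the subnet's CIDR range.
    New behavior (the user-defined IP feature): honor a requested address,
    after checking that it actually falls inside that CIDR.
    """
    subnet = ipaddress.ip_network(subnet_cidr)
    if requested_ip is None:
        # legacy path: any usable host address in the CIDR
        return str(random.choice(list(subnet.hosts())))
    addr = ipaddress.ip_address(requested_ip)
    if addr not in subnet:
        raise ValueError(f"{requested_ip} is not inside {subnet_cidr}")
    return str(addr)
```

So `pick_endpoint_ip("10.0.0.0/24", "10.0.0.5")` honors the user's choice, while an address outside the subnet is rejected up front.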
Nathan Toups (1:03:19)
That's hilarious.
So I'll
tell you, there's a lightweight version that we did at a startup. We had about 30 ICs, okay? So relatively small team in the grand scheme of things, but large enough that we had to have organizational structure. Obviously it would have been better if we had the continuous delivery stuff, but one of the things they did was kind of interesting. We had a staging environment and they would do a git log,
like the git log from the last production deployment to this staging deployment, with attribution of which user it was. And everyone had to sign off on their commit. So if you had a commit that was merged in, that was a diff from the last production release, you basically had to say, yes, I've checked to make sure that this was working as I expected on staging. And once everybody who made it into the next production release had signed off on it,
that was our readiness review. And I'll tell you, it helped so much, because every once in a while an engineer would be like, actually, no, we don't want this in production. And we didn't have feature flags at the time. And so they would literally just yank their commits out of that batch. They would redeploy into staging. And this was actually kind of like our Andon cord.
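Nathan's lightweight readiness review is easy to sketch: collect the commits between the last production deployment and staging (something like `git log --format='%h|%an' last-prod..staging` would yield the sha/author pairs), then gate the release on every author signing off. A toy model of the bookkeeping, with hypothetical names rather than their actual tooling:

```python
def pending_signoffs(commits, approved):
    """commits: list of (sha, author) pairs from the prod..staging diff.
    approved: set of shas whose authors have signed off.
    Returns the authors still blocking the next production release."""
    return sorted({author for sha, author in commits if sha not in approved})

def ready_to_release(commits, approved):
    # the "readiness review" passes only when every commit is signed off
    return not pending_signoffs(commits, approved)
```

With `commits = [("a1b2", "alice"), ("c3d4", "bob")]` and only `"a1b2"` approved, `pending_signoffs` names bob as the person holding up the assembly line.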
Carter Morgan (1:04:45)
Mm-hmm.
Nathan Toups (1:05:01)
So you never wanted to be that person who held up the assembly line. If your stuff landed in staging, you better have tested it and made sure that you're ready for it to be in production next release cycle. And so this was actually really useful. That was our launch readiness review. And it was pretty informal. It was small enough. At any given time, it was maybe 15 engineers, because we released on a weekly cadence, typically. And so 15 engineers or so,
Carter Morgan (1:05:10)
Yeah.
Nathan Toups (1:05:30)
who were primarily responsible for those commits. And that was enough. That was enough for us. We very rarely... and we actually did have feature flags for important things. We did have the alpha testers and stuff like this. But it was just a nice sort of sanity check. I was trying to change the culture so that we could have a more aggressive release cadence. At the very least, what we probably would have done is maybe done twice a week, so that that set of
people who had to manually do the readiness review was smaller, right? Because there would just be a smaller chunk of changes. So it would be easier to do. It would also be easier to roll back, you know, because it's kind of counterintuitive, but the more frequently you release, the easier it is to roll back any one little piece. So, yeah.
Carter Morgan (1:06:13)
Mm-hmm.
Well, I think that about wraps it up for this week, at least in the discussion points. Yeah, I was really happy with these two sections. This is more what I was hoping to get out of this book. And honestly, I would say to anyone who's looking at picking up The DevOps Handbook: if you're already sold on the idea of DevOps, you can skip the first two sections and not miss a ton.
Nathan Toups (1:06:43)
Yeah.
Carter Morgan (1:06:44)
I think sections three and four are for if you're kind of like, okay, yeah, I'm on board with DevOps and with, you know, continuous deployment and increasing our deployment velocity. How do I do that? This is what you want, parts three and four. So I guess I'll flip it around. Usually we ask who we would recommend this book to last. I'd say that's who I would recommend it to at this point: just anyone who is on board with those principles, but is wondering,
how do you do that? What does that look like at other companies? Obviously this is going to be best for enterprise leadership, but in general, that's who I recommend this to. And then Nathan, who'd you recommend it to? And then we'll flip and talk about career stuff.
Nathan Toups (1:07:24)
You know, I'm leaning in, as we got some good feedback from one of our listeners who basically was just like, hey, a very large percentage of all software engineers are actually enterprise software engineers. So, you know, if you're enterprise leadership and you're struggling with modernizing, this is a no-brainer. You should absolutely read this book. If you're new to DevOps and you just want to drink from a fire hose and really get
Carter Morgan (1:07:37)
Mm-hmm.
Yeah.
yeah. yeah.
Nathan Toups (1:07:52)
introduced to a lot of these concepts, so that when you see things at your organization you at least have the vocabulary, I think this is great. Engineers who have kind of been thrown into DevOps kicking and screaming. I've seen this: I have a friend who's at a company that recently got Series B funding. You know, and up to that point it's kind of like shared responsibility, but now some of the engineers are kind of naturally moving into more DevOps-y responsibilities. This would be good. This will kind of get you caught up to speed on the
big concepts. I want to add something that I hadn't thought about till recently, which is: who would I recommend avoid this book? Yeah, it was a new one that came up for me yesterday. If you are an experienced DevOps person and you watch a bunch of videos and you are kind of up to speed on this, I don't think you're going to get a lot out of this book. I will like 100% tell you that,
Carter Morgan (1:08:33)
interesting.
Yeah, yeah.
Nathan Toups (1:08:49)
unless you wanna kind of just pat yourself on the back about what an awesome job you're doing and know that Gene Kim will give you the gold stamp of approval, you know, I think that you can skip this. This really is for folks who maybe lack the vocabulary and lack the understanding. That's how I approached this book: the whole time I'm reading it, I'm like, yeah, I mean, I'm sure, yes, these are all correct.
And so I wasn't looking at this and being like, everything I've been doing is wrong, and there's some great truth here that I hadn't thought of before. Yeah.
Carter Morgan (1:09:21)
Yeah.
just want to call out one thing before we talk about what we're going to do for our careers. This is a pattern I had never heard of before and I thought was super cool. Because sometimes when you're building a feature, one of the problems you can have is like, well, how do we know this is going to work? We need to simulate production-level load or production-level variety, right? That whole classic: a QA engineer walks into a bar,
orders one beer, orders negative one beers, orders a lizard, right? Your customers are gonna be your best engine for that. I thought it was interesting that basically they said, so let's say, for example, you're launching a new search feature, and you know that when the customers enter queries, it should return some sort of new results. And this is gonna replace your old search feature. They suggested launching that into production and basically having it run silently. It's like, when a customer makes a query with the old search feature,
you also make that query with the new search feature. And maybe you do that for 5% of customers or something like that. And so you don't show the results to anyone, but you can watch, okay, here's what everyone's doing. You can test if there are errors and if you're handling it well. You can just kind of fail silently, but still alert. Like, I'd never heard of that before. That seems really, really cool. Again, that works if you're dealing at a big scale, but I was like, wow, that's clever. That's very clever.
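The silent launch Carter describes, often called a dark launch or shadow traffic, can be sketched roughly like this: a stable hash puts about 5% of customers in a shadow cohort, the new search runs alongside the old one for them, and any mismatch or error is logged without ever reaching the customer. All names here are hypothetical:

```python
import hashlib
import logging

def in_shadow_cohort(user_id, percent=5):
    """Deterministically put ~percent% of customers in the shadow cohort,
    using a stable hash so the same user always lands in the same bucket."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

def search(user_id, query, old_search, new_search, percent=5):
    """Serve the old results; silently exercise the new path for the cohort."""
    results = old_search(query)  # this is all the customer ever sees
    if in_shadow_cohort(user_id, percent):
        try:
            shadow = new_search(query)  # run it, but never show it
            if shadow != results:
                logging.warning("shadow mismatch for query %r", query)
        except Exception:
            # fail silently from the customer's view, but still alert
            logging.exception("new search path failed for query %r", query)
    return results
```

Even if `new_search` throws, the customer still gets the old results; the operators just get an alert, which is exactly the "fail silently, but still alert" behavior described above.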
Nathan Toups (1:10:42)
Yeah, same. That's super
clever. I hadn't thought about that either in that, yeah, we absolutely should steal this. That's really cool. It's like a canary canary.
Carter Morgan (1:10:52)
Yeah. Well, okay. So what are we going to do differently? Yeah,
exactly. What are we going to do differently in our career? You want to go first, Nathan?
Nathan Toups (1:10:59)
Yeah, so they have this concept of, if your team's having trouble building the confidence to ship to production, they have this concept of "ship something day," where you literally just pick a thing, could be the tiniest thing, and ship it to production. I'm stealing this, because we've had the research and development projects that we've kind of had internally, but I'm trying to shift us to building in public. It's not real production, it's like our labs environment,
Carter Morgan (1:11:29)
Mm-hmm.
Nathan Toups (1:11:29)
but
we should be shipping incremental changes that are deployable and working, to our labs environment, on a daily basis. And I think it would be cool for me to challenge our team with this. And I think we might start having some official ship something days.
Carter Morgan (1:11:45)
Nice. Me: decoupling releases from deployments. I've just been thinking a lot about, you know, what's the right level of this to bring to my company. And I think this would be a good mindset change for our team and the product people. Right now, when you push something to production, it's like, okay, it's going live, right? And we do have a degree of feature flags. We're actually pretty good about feature flags, but...
Nathan Toups (1:11:49)
Yeah
Carter Morgan (1:12:14)
There is still the idea of when it goes into production, it's like, now that's real. And I'd like it to be a little less like that. I'd like it to just be, no, we deployed to production; it's just something that happens, right? It happens automatically. Now the real thing is when you turn the feature flag on, right? And then if it breaks, that's okay, we'll just turn the feature flag right off. So I think... we're actually doing a trip next week. Where we live, I'm in Utah, and obviously there's lots of beautiful canyons, so...
Nathan Toups (1:12:24)
Yeah.
Carter Morgan (1:12:43)
We have a Starlink gizmo, and the whole team next week, we're going to the American Fork Canyon and we're going to work out of the canyon, just for the work day. And part of that is, they're calling it the Thunderdome, which is, you know, they said, come with any wacky proposal you have about processes you'd like to change, and we'll hash it out and figure out if it's a good thing. So I think this is going to be my proposal, this idea of, yeah, how do we decouple releases?
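Decoupling deploys from releases, as Carter describes, usually comes down to a flag check at the decision point: the code ships to production dark, and the "release" is flipping the flag. Real setups use a flag service rather than an in-process dict, but a minimal sketch (all names hypothetical) looks like:

```python
FLAGS = {"new_checkout": False}  # code is deployed; feature is not yet released

def is_enabled(flag):
    return FLAGS.get(flag, False)

def legacy_checkout(cart):
    return sum(cart)

def new_checkout(cart):
    # hypothetical rewrite with the same contract as the legacy path
    return float(sum(cart))

def checkout(cart):
    # Deploy != release: both paths are live in production, but customers only
    # hit the new one once the flag flips, and "rollback" is just flag-off.
    return new_checkout(cart) if is_enabled("new_checkout") else legacy_checkout(cart)
```

Flipping `FLAGS["new_checkout"]` to `True` is the release; flipping it back is the instant rollback, with no redeploy either way.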
Nathan Toups (1:13:05)
That's cool. My last
company had these kind of cool hackathon days periodically. It was a couple of times a year. What was kind of cool about it, though, is whoever the winner was, the award was that the company would officially back continued efforts in it for the next six months. And so it was one of those, like, bring your wackiest idea: re-imagining a new database, whatever, right? Whatever thing. And if it showed a glimmer of hope over this like 48-hour or 72-hour period,
Carter Morgan (1:13:11)
Mm-hmm.
interesting. Cool.
Nathan Toups (1:13:35)
then they'd be like, hey, we love this. We love this perspective and we'll back it; you know, some percentage of your time will get dedicated resources for it. It was fun, because there were people who'd come in with really cool ideas, and sometimes it would fundamentally shift how we did things.
Carter Morgan (1:13:53)
Well, I think that wraps us up for this week. I've been joking with my coworkers, I'm like, if you want to know what I think about the company, it's all on Book Overflow. And so who knows? Maybe my coworkers are listening and getting insights into my machinations. But yeah, you can contact us at contact@bookoverflow.io. You can find us on LinkedIn, we're Book Overflow. You can find us on Twitter at BookOverflow.io.
Nathan Toups (1:14:08)
HR's got a file on you now. ⁓
Carter Morgan (1:14:20)
You can find me on Twitter at Carter Morgan. You can find Nathan in his newsletter, Functioning Imperative, at functioningimperative.com. Yeah, stay in touch with the podcast. We love hearing from you guys. If you have any suggestions, listeners, on how you could become more involved with the podcast... we have debated starting a Patreon. We don't want to make a Patreon unless we feel like it could offer actual real value to you as a listener. But if there's any Book Overflow superfans out there that
would be interested in a Patreon, let us know what you think might make that worth it.
Nathan Toups (1:14:53)
or Discord or whatever other community
features that you wish that we had. And who knows, maybe... if you're a Patreon member, would you want to come to a private livestream of our backlog grooming or something like that, that we would release later? Maybe some perks like that could be cool. But yeah, let us know.
Carter Morgan (1:15:08)
Yeah, yeah.
This podcast is still a hobby at this point. If you listened to the Manuel Pais interview, you'll notice that we did have our first sponsor, which is very cool. But yeah, it's still a labor of love. And so, you know, we don't want to be greedy, but also, you know, you guys are smart and understand that if we are economically incentivized to build a community, that obviously makes things easier for us, and it isn't just another chore we have to take on. So let us know if that describes you.
And we'd love to hear your feedback, and we'll see you next week. See you, folks.
Nathan Toups (1:15:49)
See you.