Ep. 84 · Monday, October 6, 2025

OTel at Scale - Mastering OpenTelemetry and Observability by Steve Flanders

Book Covered

Mastering OpenTelemetry and Observability: Enhancing Application and Infrastructure Performance and Avoiding Outages


by Steven Flanders


Book links are affiliate links. We earn from qualifying purchases.

Author

Steven Flanders

Hosts

Carter Morgan (Host)
Nathan Toups (Host)

Transcript

This transcript was auto-generated by our recording software and may contain errors.

Nathan Toups (00:00)

we had to build an alert because every once in a while folks would release debug logs to production, right? It was just like a thing. A team would have the debug log flag turned on in dev and then they would accidentally deploy it to production. And you can, like, denial-of-service attack yourself by accident.

Carter Morgan (00:09)

Yeah, yeah.

Hey there, welcome to Book Overflow, the podcast for software engineers by software engineers where every week we read one of the best technical books in the world in an effort to improve our craft. I am Carter Morgan and I'm joined here as always by my cohost, Nathan Toups. How are you doing, Nathan?

Nathan Toups (00:37)

doing great everybody.

Carter Morgan (00:39)

As always, make sure to like, comment, and subscribe, and leave a five-star review if you're on an audio platform. Share the podcast on LinkedIn with your friends and coworkers. And if you want to chat with Nathan and I more, you can book a coaching session with us on Leland at the links in the description. Nathan, you are in a new location this week, aren't you?

Nathan Toups (00:59)

Yeah, and I'm still moving in, hopefully it's not too too echoey, but if it is, I apologize. But yeah, this is my permanent office that I'll have set up, so excited.

Carter Morgan (01:06)

Hehehe.

There we go. Yeah, you might hear a little bit of echo listeners, but that's just what you get when you move in. We've recorded from lots of places. We recorded from hotel rooms, from a work phone booth, from, yeah, this is a versatile podcast.

Nathan Toups (01:17)

Yeah.

Yeah.

Carter Morgan (01:29)

Well, speaking of versatility, we got a lot to talk about this week with the back half of Mastering OpenTelemetry and Observability by Steve Flanders. Just to give you a quick introduction on Steve Flanders again: Steve Flanders is a founding member of the OpenCensus and OpenTelemetry projects and has over a decade of hands-on experience in the monitoring and observability space. As a senior director of engineering at Splunk, a Cisco company, he oversees the Splunk observability platform and spearheads Splunk's

OpenTelemetry contributions. He was previously instrumental in building what is now the Splunk APM product at Omnition and the Log Insight product at VMware. A sought-after speaker and blogger, Steve frequently shares his insights at prominent conferences like KubeCon and on his blog. He holds an MBA from MIT, underscoring his blend of technical expertise, strategic vision, and entrepreneurial spirit. I feel like I used to hear a lot about Splunk and I don't hear as much about them anymore. Do you know people who use a lot of Splunk?

Nathan Toups (02:23)

Mm-mm.

And most of the folks that I know of are in big enterprises, and they've had positive experiences with them in the past and it kind of comes with them. I wouldn't be surprised if they're trying to reinvent themselves, which is probably why they have folks like Steve Flanders, because they were around before OpenTelemetry. And then this book makes a really strong case, and I think they talk about this later, we'll get into it, that, like,

Carter Morgan (02:33)

Yeah, yeah.

Right, right.

Nathan Toups (02:49)

not supporting OpenTelemetry is its own form of tech debt. And yeah, so I have a feeling that that's what's going on, but I actually haven't paid attention to Splunk in a while.

Carter Morgan (02:59)

I remember when I used them way at the beginning of my career, they were big on logging and kind of, like we were talking about last week, the old way of getting metrics just by parsing logs. But maybe that was just one pattern that my team knew how to do. Anyhow, because we know that that's not the pattern most people use, or at least it's not the recommended pattern these days. The recommended pattern is OpenTelemetry, and

let me give you the book introduction real quick. It is: discover the power of open source observability for your enterprise environment in Mastering OpenTelemetry and Observability: Enhancing Application and Infrastructure Performance and Avoiding Outages. Accomplished engineering leader and open source contributor Steve Flanders unlocks the secrets of enterprise application observability with a comprehensive guide to OpenTelemetry (OTel). Explore how OTel transforms observability, providing a robust toolkit for capturing and analyzing telemetry data across your environment. Well,

we finished the book this week. We did the first half last week, and this week we did chapters six through 12. Give me your thoughts, Nathan, on chapters six through 12 and maybe the book overall.

Nathan Toups (04:10)

So.

This is one of those books where I really like it, but it's hard for the Book Overflow format, meaning that we're slogging through these books. I think with this book, you need to have some time to spend with it and do the examples and things like this. Also, six, seven, and eight are very technical, with a lot of reference-type material, and it was hard to listen to, I mean, hard to read. But nine, 10, 11, and 12, those are like the meat of what I was excited about. And so when I got to those, I kind

Carter Morgan (04:18)

Yeah, yeah.

Nathan Toups (04:41)

of perked up a bit. And so hopefully we'll dive into the latter half of the second part in this episode.

Carter Morgan (04:51)

Yeah, the structure of this book is strange. It's almost like... one, I agree. We just have to grade things on how they fit into the podcast format, which we recognize is not how a lot of people read these technical books. And I'll also say that some books, which I've since come to really appreciate, I had a really hard time reading. Working Effectively with Legacy Code was like that. Like, I remember

the night before recording, slogging through the end of Working Effectively with Legacy Code and being like, my gosh, what on earth am I doing? And that's one I think about a lot. I was joking with my coworkers, I'm like, you know it's a good day at work when you start thinking, maybe I should bust out Working Effectively with Legacy Code again. Take a look at that. But that's been a book that's been really foundational for me, but was a tough read.

I think this book struggles with that a bit. There are some weird decisions it makes. Like, I was really surprised. We're in chapter 10, and this is a small thing, but I think it's kind of emblematic of what's going on with this book: he uses the word OpenTelemetry and then puts in parentheses again, OTel. And I'm a little like, this is chapter 10 of the book. It almost feels like it's 12 different essays kind of cobbled together.

I'm not exactly sure. I think it could have used another pass by the editor. And at any rate, lots of valuable stuff in here. Not great for the book overflow format, but we still think we learned a lot and we're going to have a good discussion today.

Nathan Toups (06:34)

Yeah, yeah, super excited.

Carter Morgan (06:37)

Well, we're going to take a quick break, and then when we come back we will discuss everything we have about the back half of Mastering OpenTelemetry and Observability.

And we're back. Thank you so much for tuning in. Well, we're gonna talk about the back half of this book. Like we'd said, chapters six, seven, and eight are kind of like a deep dive on how to use OpenTelemetry. There's a lot of code examples for a Python application. Yeah, we're not gonna talk about it really at all. And I think, I don't know, personally, I think I would have liked if that had maybe appeared

Nathan Toups (07:01)

Yeah.

Carter Morgan (07:09)

earlier in the book. I don't know, maybe like, I don't know, what do you think?

Nathan Toups (07:15)

Yeah, kind of, because we've read so many books at this point, we get to compare and contrast formats. I love concept, concrete example, why this is important. Like, that structure, kind of like how Refactoring did it, I think in a lot of ways. Because you can kind of skim. Like if I don't really need to know the concrete implementation, I can skim that. Maybe I want to understand the concept and then anti-patterns and, you know, whatever, and kind of oscillate between those. Because that helps me

Carter Morgan (07:21)

Right.

Yes, yes.

Nathan Toups (07:45)

remember those things well. This book is structured a bit differently. I think now that I have this, I'd be more comfortable skipping around, because he also talks about this; he'll be like, oh, like we said in chapter one, you know. He'll bring these kinds of concepts up. But again, six, seven, and eight are very in the details. Like, if I needed them for implementation reasons, I would do that, but otherwise,

once we got to chapter nine and through the end of the book, I perked up. I think that's what's going to be the meat of this episode: talking about these bigger concepts.

Carter Morgan (08:24)

Well, let's talk about chapter nine. Chapter nine is all about choosing an observability platform, which is an interesting chapter because this book is not about any particular observability platform, but OpenTelemetry. I found it funny how many times in the book, the book's just like, beware vendor lock-in, which, of course, is what you'd write if you're, you know, the inventor of OpenTelemetry, and it's just good advice in general. So that does let you do a chapter all about, okay,

as long as you're choosing a platform that supports OpenTelemetry, let's talk about all the different platforms you can choose, and let's talk about everything you should consider when choosing a platform. So what stood out in this chapter to you, Nathan?

Nathan Toups (09:06)

Yeah, so this one really starts thinking about costs, both capital and operational. They call it CapEx and OpEx, right? So capital expenditures are things like infrastructure; OpEx is the operational expenditures. And, you know, to do this responsibly, you have to kind of get into that world. Actually, at the previous company I was in, I was actually hired to be on the FinOps team.

So FinOps is literally: we're big enough that we're having to think about the financial implications of how we're running infrastructure, how we're running observability. And it was interesting, this book gives a framework saying that a good threshold is for companies to spend between 1 and 15% of cost of goods sold, which they call COGS. And so it gives you a little formula to estimate what you should expect. And it's funny, because some companies want to spend zero, right, on

observability, and some companies give a good healthy budget of 10 to 15% of their total spend. I think when we were talking to Datadog they were like, oh yeah, you should spend 20%, and we were like, get out of here. Of course Datadog would say that. But yeah.
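The little formula Nathan mentions can be sketched in a few lines. This is a hedged illustration, not the book's exact formula: the function name is made up, and the choice of monthly COGS as the base is one interpretation (as discussed next, teams often substitute total cloud spend as the base).

```python
# Rough observability-budget bounds from cost of goods sold (COGS),
# using the 1%-15% range mentioned in the conversation.
def observability_budget(cogs_monthly: float,
                         low_pct: float = 0.01,
                         high_pct: float = 0.15) -> tuple[float, float]:
    """Return (low, high) monthly observability spend bounds."""
    return cogs_monthly * low_pct, cogs_monthly * high_pct

# e.g. $200k/month COGS -> roughly a $2k-$30k/month observability budget
low, high = observability_budget(200_000)
print(f"${low:,.0f} - ${high:,.0f} per month")
```

The point is less the arithmetic than having an explicit upper bound you can measure actual spend against.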

Carter Morgan (10:17)

Now, when we talk about COGS here,

is COGS, is that including... that's not including salary, is it? Or are they kind of advocating, like, 15% of cloud SaaS spend?

Nathan Toups (10:32)

I thought it was... I don't actually know. I'm, like, not a finance person. I think a lot of times we were basing it off of total cloud spend, and observability was a function of the total cloud spend, roughly, to estimate. There's probably more sophisticated ways, and somebody who's got an MBA, like the author of this book, probably thinks in spreadsheets much better than I do, but

Carter Morgan (10:39)

Right, right.

Yeah, yeah.

Yeah.

Nathan Toups (11:02)

There is some

percentage. The idea here, though, is that you should understand what a healthy amount of spend is and what you should expect. Now, if you figure out how to use a tool that costs half as much, that's awesome. But I think the company needs to have a budget. And the cool thing is that if you have a healthy relationship with observability data, you should have a budget. You should say, hey, I expect to spend up to this amount. And then if you can come under that, one of two things happens. One, it unlocks

maturity levels that you didn't have access to. And we'll talk about this. He gets into this idea of levels of maturity, which is, you know, first there's your basic tooling, then you have some intermediate, and then more and more advanced. You can't get to the more advanced if you're... and we'll get into anti-patterns too. So if you're using observability in really inefficient ways, no one's going to clear you to use the advanced tools, right? Like if you're blowing your entire budget capturing all of the trace data,

and you're not sampling properly, they're not gonna let you get into AI-driven predictive observability analysis and stuff like this. They're gonna go, you've exploded the budget. You can't do this. I thought this was a healthy chapter and I kind of perked up because I was like, in the real world, I've said use observability data. I've made some recommendations on how we tool it up, but where does it go? Do we build this thing?

Do we buy this thing? And how do we manage it? And I think, again, this is another great section there.

Carter Morgan (12:35)

Yeah. And they talk about that. They talk about this idea of build versus buy versus manage. So build would be this idea that, like, hey, just build your own platform entirely. And I'll say, I don't think I've ever seen this. Have you ever seen someone build an OpenTelemetry collection platform kind of from scratch? Right.

Nathan Toups (13:01)

Not completely from scratch, a lot of times when I, when I'll

think of build, I also put self-hosting in that as well. So this might be that you have a Kubernetes cluster and you're just like, I'm going to run Elastic and Grafana and all of these things. Like, I don't want to pay for them; I'm just going to run them in my Kubernetes cluster. The only problem with that is that you end up having to have an ops team that's very advanced, right? Because you get into that who's-watching-the-watcher situation:

Carter Morgan (13:07)

Yeah, yeah.

Yeah, yeah.

Mm-hmm.

Nathan Toups (13:30)

if your observability infrastructure is on your Kubernetes cluster and the Kubernetes cluster is being observed by that observability infrastructure, what happens when that goes sideways? You lose both, right? So maybe you spin up another Kubernetes cluster to run your observability stuff. Okay, well then what's watching that? And so you get into this weird chicken-and-egg problem. I've been in environments in which the ops team was quite advanced and they were able to handle this, or maybe their workloads were so large that

Carter Morgan (13:39)

Yeah.

Nathan Toups (13:59)

the salaries to manage this were less than a Datadog bill, right? They would host that themselves and do those things themselves. But yeah, most people... I think, Carter, you and I are both in this camp: we're both pretty aggressive about buying over building. And so I think in most cases you should buy, right?

Carter Morgan (14:19)

Yeah, open.

And I agree. I think with metrics, I am constantly shocked at the cost of metrics collection. Even on AWS, it's expensive. I think every custom metric you upload to CloudWatch, and every custom metric counts (if you have different dimensions on the same metric name, each counts separately), it's like 10 cents a month or something like that. And like,

I think when I just kind of automatically set up OpenTelemetry to export to AWS, like, okay, just log everything, right? It came up with like 6,000 metrics, and that was before we had done any custom ones. Is it 30? I think it is 30 cents. Yeah. Yeah. And so, I mean, if you just run that math, it's like 6,000.

Nathan Toups (15:04)

I think they're actually, I think they're like 30 cents a month. Yeah, yeah.

Carter Morgan (15:16)

6,000 times 30.

And yeah, it's like $1,800 a month. And that was just our back end. I'm shocked by that. And then I know Datadog is, yeah, known for how expensive it is too. What's been your experience with vendors?
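Carter's math checks out. A quick sketch, with the caveat that the $0.30/metric/month figure is CloudWatch's published first-tier price for custom metrics; volume tiers bring the per-metric rate down, so check current AWS pricing before relying on it:

```python
# Back-of-envelope CloudWatch custom-metric cost. Priced in cents
# and converted at the end to avoid float rounding surprises.
PRICE_CENTS_PER_METRIC_MONTH = 30  # ~$0.30 per custom metric per month

def monthly_metric_cost_usd(num_metrics: int) -> float:
    return num_metrics * PRICE_CENTS_PER_METRIC_MONTH / 100

# The example from the conversation: ~6,000 metric series from default
# instrumentation, before any hand-written custom metrics.
print(monthly_metric_cost_usd(6_000))  # 1800.0 -> ~$1,800/month
```

Remember that every distinct dimension combination on the same metric name is its own billable series, which is how 6,000 appears "automatically."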

Nathan Toups (15:38)

It's the biggest thing where we're like,

so at a couple of my jobs I've been in charge of managing our infrastructure costs, including observability. And it's the most humbling experience, because you do your best estimate, and then every month you kind of open your email and go, man, that's what we really spent. And you get better at predicting it over time, but I'll tell you that, like, Datadog... and I keep calling them out, but it's okay. I've been burned by them a few times, so I'm gonna call out Datadog:

Carter Morgan (15:55)

Yeah.

Nathan Toups (16:06)

we ended up creating this very complex cost calculator because Datadog wouldn't provide one. They publish all of the data. Like, I will say, to AWS's credit, to Google Cloud's credit, and I would imagine Azure has something similar, they have these pretty sophisticated cost calculator web apps. And you can even export them and do all this other cool stuff. But I leaned on those so much, because you're like...

Carter Morgan (16:13)

Interesting.

Yeah, yeah.

Nathan Toups (16:32)

There's a base cost. There's a utility cost. If you have backups, those are going to have a cost. If you're going to do regional replication, you have to, as a cloud architect, which is a lot of the work I've done, kind of set the baseline. And a lot of times I would give myself a buffer. I would, depending on how good I was at that particular area, give it, you know, a 25% or 50% buffer and say, hey, here's an upper bound; I think we can actually operate at this cost. And I got pretty good at estimating that. But it's not for the faint of heart.

Or, the other thing with, like, Datadog and telemetry is, imagine: we had to build an alert because every once in a while folks would release debug logs to production, right? It was just like a thing. A team would have the debug log flag turned on in dev, and then they would accidentally deploy it to production. And you can, like, denial-of-service attack yourself by accident.

Carter Morgan (17:16)

Yeah, yeah.

Nathan Toups (17:25)

And so we had an alert and we would actually, like,

Carter Morgan (17:25)

Yeah.

Nathan Toups (17:29)

stop that, because it would also be a huge logging expense. I think we blew through our Datadog logging budget in like five days because a couple of teams who had run stuff in debug mode had uploaded stuff. And of course it looks like amateur hour on our side if this happens and we have to do a postmortem and figure out what to do next. So it's non-trivial. And I think that's why there's so much of this cost piece in this book, because

Carter Morgan (17:37)

That's crazy.

Nathan Toups (17:58)

and this comes into anti-patterns as well. It's crazy. I think this is why these observability platforms keep popping up, why there's vendors like, I've talked about them before, but Coralogix, which I've been really impressed with, because they can operate at a fraction of the cost of what AWS does at scale. So when you're smaller, even 1,800 bucks a month, it's not great, but it's not the end of the world. But let's say you get to 60,000

Carter Morgan (18:13)

Yeah, yeah.

Yeah, yeah.

Nathan Toups (18:27)

custom

metrics, or you get to half a million custom metrics. That's where it's not tenable to do it on AWS that way anymore, and you need to use something else. Because yeah, now we're talking about annual salaries of observability teams and stuff like this. So yeah, it's interesting. It's also constantly evolving. And one of the things I do like about this book is that

they kind of give you a framework for assessing the viability of using a set of tools. So like, if I'm using OpenTelemetry and a new technology comes along and they support OpenTelemetry, then we can switch to them with the least amount of pain, right? That's a pretty decent breakdown, because I think we've all felt this, where you get into the Datadog ecosystem, or maybe you're on New Relic or any of these other kind of big players, or

Splunk, where you get really used to their vendor-locked ways of doing stuff. It can be a multi-year transition if you're not careful and your company's big enough, right? And so I think these cautionary tales are really important.

Carter Morgan (19:42)

Yeah, I've seen this transition happen at larger companies. The last company I was at had been a big Datadog customer, and then the bill just got too large. And so they actually did go and kind of build their own observability platform. And that's why I say, when we talk about build versus buy versus the last one, manage. And that's where he talks about, like, okay, this is you

setting up Prometheus, setting up Jaeger, using these kinds of open source components, because even at this big company I was at that decided to make their own platform, really that's what they were doing. They were self-hosting Grafana at scale. So I'm curious what that kind of true build versus just managing looks like. But anyhow, I saw, no, I saw a company

Nathan Toups (20:33)

Yeah, yeah.

Carter Morgan (20:38)

throw out Datadog and switch entirely to doing this kind of on their own because it gets so stinking expensive.

Nathan Toups (20:45)

Yeah, it's kind of nuts, and I think there are some novel ideas that'll come out and kind of disrupt things from time to time, and that's cool. Most companies aren't that, and I would say that for most companies, building it isn't interesting to your shareholders. It's not interesting to your users, unless your company is actually trying

Carter Morgan (20:54)

Yeah.

Nathan Toups (21:04)

to reimagine how logs get gathered. If that's your business, then sure, go for it. And maybe you can sell this product to other people. And I think we have seen some of this; there are interesting platforms, like Observe is one of them, that I think was founded by somebody from Snowflake. And so they really understand data lakes really well.

Carter Morgan (21:09)

Mm-hmm.

Nathan Toups (21:27)

And so there's this idea that, like, we collect all this data and then we can do multi-pass processing on it, and we have this very efficient way of gathering things up. There's all these interesting emerging patterns with big data pipelines happening with AI and ML. Those are going to positively impact observability, because observability at the end of the day is time series data, right? All of this is time series data. There is a timestamp. It comes from some source.

And it has some sort of context. And then OpenTelemetry, of course, gives us the way to correlate across these events. And so as technology advances for time-series-based data, whether it's Internet of Things or, you know, stuff happening in the ML space, I think these innovations are going to go upstream. And so if we go back to ideas like building evolutionary architectures, which I think is another kind of interesting thing to overlay with this book:

are you building your observability system so that it's evolvable, right? We know that there's a trend, and in the final chapter of this book they even talk about the future of where things are going and all the new signals that they're interested in pulling in. And as those signals mature, and as companies come to expect those new signals to understand how their business functions, the tools to back it, the build versus buy versus manage,

are going to be there. And sometimes that means you're going to have to self-host something because it's so new there's no infrastructure and ecosystem around it. And other times it's going to be new players coming out of Y Combinator who are like, everything you think about signals is wrong, and it's these five signals that are actually going to make your business scale up, use our tool. And you're like, okay. If you're in a place of embracing that,

of using those innovations when they're there, observability is just another dimension of how you innovate in your own company.

Carter Morgan (23:30)

Mentioning Y Combinator, just, I'm looking, let me see here. Yeah, I was looking through the Y Combinator batch recently, just the most recent one, and I was just surprised. Let me see if I can. Yeah. I was really interested in this company, because everything in Y Combinator these days is AI, like that's all it is. And so when there's a non-AI company, it really jumps out at me.

Nathan Toups (23:57)

Right.

Carter Morgan (23:59)

This one, shout out to them: S2.dev. I was really impressed with this. This is their description. They say: S2 is a serverless data store for real-time streaming data. The more we worked on data systems, the more we felt like there was a building block missing. The seamless experience of object storage simply did not exist for durable streams. So we set out to fix that. S2 reimagines streams as an unlimited and access-controlled resource, like if Kafka and S3 had a baby. And then they advertise it

for agents: building agents, use it to create streams, manage context in real time, make workflows auditable, and coordinate between agents. Anyhow, I don't know a ton about it, but I was just like, that's really interesting.

Nathan Toups (24:33)

No, this is funny because this is actually

tooling that we've been building out internally. These are conversations that I've had with folks. This is a real unsolved problem. And I think bigger companies have come up with these kind of, yes, Kafka things, where I needed an event stream but I also need big associated data, because Kafka is limited. I think it's up to like six megabytes by default, and 20 megabytes if you max it out, for how much you can put into an event.

But a lot of times what I want is actually a big blob of data stored in something like S3, with a reference to it in my Kafka stream so that I know how to get back to it. And then I can look at these events over time and replay them and come up with clients that consume this data. And if you look at that pattern, again, that's time series data, right? Kafka gives us this really clear way of saying, here's a sequence of events on a specific stream. This structure

is increasingly important for the types of large-scale data processing that we're doing. And observability is just a subset of that, right? Observability is a time series data stream that has meaning to the business. It's just that, typically for us, the audience is internal, right? The customer doesn't care. Where the customer does care is that if the thing that they rely on from you breaks, are you able to, you know,

signal on this and then ideally are you able to predict that something's gonna break before it does, right? And that's the real gold standard.
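The pattern Nathan describes, a big blob in object storage with a small reference in the event stream, is often called the claim-check pattern. Here is a hedged, dependency-free sketch; a dict and a list stand in for S3 and a Kafka topic, and all the names are illustrative:

```python
import json
import time
import uuid

blob_store: dict[str, bytes] = {}  # stands in for an S3 bucket
event_stream: list[str] = []       # stands in for a Kafka topic

def publish(payload: bytes, source: str) -> str:
    """Store the big payload out-of-band; emit a small event with a reference."""
    key = f"events/{uuid.uuid4()}"
    blob_store[key] = payload
    event_stream.append(json.dumps({
        "timestamp": time.time(),  # time series: every event is timestamped
        "source": source,          # ...and attributed to a source
        "blob_ref": key,           # the "claim check" back to the blob
    }))
    return key

def consume(raw_event: str) -> bytes:
    """Replay side: dereference the claim check to get the payload back."""
    return blob_store[json.loads(raw_event)["blob_ref"]]

payload = b"x" * 10_000_000  # 10 MB: too big to inline in a Kafka message
publish(payload, source="checkout-service")
assert consume(event_stream[-1]) == payload
```

Note how the event itself stays tiny and keeps exactly the shape Nathan lists: a timestamp, a source, and some context.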

Carter Morgan (26:05)

Well, he talks about that maturity framework. And there's five levels. Maybe we can review the levels just briefly. He says level one is basic monitoring. Reading from the book, he says: at this level, organizations have basic monitoring in place, focusing primarily on uptime and system health. Monitoring is often manual, limited to simple metrics like CPU usage, memory utilization, and disk space. I think if you are using any sort of cloud provider, you kind of get this out of the gate.

Nathan Toups (26:13)

Yeah.

Carter Morgan (26:34)

If you're even just deploying a single EC2 instance, you'll be able to look at the metrics for that EC2 instance and get some idea of CPU usage, memory usage, things like that. I would, yeah. Nowadays, how would you even not be at level one? Like, I guess if you're doing stuff on-prem, but, like, I get...

Nathan Toups (26:52)

Yeah, if

you're doing stuff on-prem, or you're just not looking at it, right? So it could be that you have no discipline and you're not aggregating it. Like, you're doing ClickOps at your company and you're like, what's going on? And you go in and look at the EC2 tab and scroll down to health and metrics, and then you're like, yeah, our CPU is maxed out, right? So ideally, you're at least taking that stuff and shoving it into some queryable system that you can at least put alerts on.

Carter Morgan (26:56)

I guess so.

Nathan Toups (27:20)

Enhanced monitoring, I think, is level two. Let me hear it. Do you have it pulled up? Yeah.

Carter Morgan (27:24)

Yeah, I got it. It says

this level introduces more detailed metrics and some level of automation. It includes application-level monitoring and the initial use of more than one signal for diagnosing issues. I'm proud to say this is where I've taken my company to. We were level one before; we're now level two. We've got some OpenTelemetry going. We've got automatic metrics for our GraphQL requests. And then actually, just recently, I was very proud of myself. We shipped a feature.

Nathan Toups (27:38)

Bye.

Carter Morgan (27:51)

And after we kind of got it all done, at the very end, I'm like, you know what? We should have some metrics around this. It's a feature for redeeming discount codes. I'm like, we should track how many codes are getting redeemed, how many of those codes are failing. So, yeah, we are a level two enhanced monitoring organization. That's as far as we're at right now.
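A minimal sketch of the redemption metric Carter describes. The metric and attribute names here are made up for illustration; in real OpenTelemetry you would use a Counter instrument from the metrics API (roughly `counter.add(1, {"outcome": ...})`), and each distinct attribute set becomes its own billable series, which ties back to the CloudWatch cost discussion earlier:

```python
from collections import Counter

redemptions = Counter()  # stand-in for an OTel Counter instrument

def redeem_code(code: str, valid_codes: set[str]) -> bool:
    ok = code in valid_codes
    outcome = "success" if ok else "failure"
    # In OTel this would be: redemption_counter.add(1, {"outcome": outcome})
    redemptions[("discount.redemptions", outcome)] += 1
    return ok

valid = {"SAVE10", "WELCOME"}
redeem_code("SAVE10", valid)  # counted as a success
redeem_code("BOGUS", valid)   # counted as a failure
```

Keeping success and failure as one metric name with an outcome dimension, rather than two separate metric names, makes the success rate a single query later.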

Nathan Toups (28:07)

Yep.

And I'll tell you

from most organizations who have been struggling with this, that's...

It's low-hanging fruit. And I think, from an 80/20 principle, you can do so much with just getting to level two, right? Just getting monitoring. You'll learn a lot about how the business functions if you have, for instance, SLAs. And again, an SLA (we're not getting too much into this for this episode), an SLA is a service level agreement. It's typically some sort of legal contract. Like, we promised the customer that we'll be up at three nines of availability, 99.9%

Carter Morgan (28:25)

Yeah, yeah.

Mm-hmm.

Nathan Toups (28:47)

uptime,

that's a legal contract, right? That's a declaration, and you say, we'll give you a partial refund for all minutes below that over the month, or something. Internally, though, you need to be able to measure this. And so in enhanced monitoring, you might tie that to what's called an SLI, which is a service level indicator. This is the metric that lets you manage, I mean, measure, what your

SLO and SLA are. So these are the things where you would say, like, we measure uptime this way; this is how we measure it. So there's some tooling that goes into that. And so enhanced monitoring will get you into this, like when you work backwards from: what is uptime for us? Is it just that the servers are up and running, or is it that we're resolving 200s or 400s for all traffic and we don't serve 500s, or something, right? That's when you're gonna get into the world of enhanced monitoring.
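The SLI Nathan is pointing at can be as simple as a good-responses ratio checked against the SLO target. A hedged sketch, where "good" means anything but a 5xx, which is one common but not universal definition of availability:

```python
def availability_sli(status_codes: list[int]) -> float:
    """Fraction of responses that are 'good' (anything but a 5xx)."""
    if not status_codes:
        return 1.0  # no traffic: trivially within SLO
    good = sum(1 for s in status_codes if s < 500)
    return good / len(status_codes)

SLO_TARGET = 0.999  # three nines, typically as strict as or stricter than the SLA

codes = [200] * 998 + [404] + [500]  # one server error in 1,000 requests
sli = availability_sli(codes)
print(sli >= SLO_TARGET)  # the 404 counts as "good"; only the 500 hurts
```

Working backwards from "what is uptime for us," as Nathan says, mostly means deciding which status codes (or which requests) belong in the numerator.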

Once you get above that though, let's say you want to not violate your SLA. You don't want to find out that you violated your SLA. You get into what's called level three, which is proactive observability.
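To make the SLA/SLI/SLO distinction from the conversation concrete, here's a minimal Python sketch. All the numbers and names are hypothetical illustrations, not from the book: the SLI is the measured ratio, the SLO is an internal target deliberately stricter than the contractual SLA.

```python
# Hypothetical sketch: compute an availability SLI and check it against
# an internal SLO that is stricter than the externally promised SLA.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, nothing violated
    return good_requests / total_requests

SLA_TARGET = 0.999   # "three nines" promised to the customer
SLO_TARGET = 0.9995  # internal objective, deliberately stricter

def slo_status(good: int, total: int) -> str:
    sli = availability_sli(good, total)
    if sli < SLA_TARGET:
        return "SLA breach"
    if sli < SLO_TARGET:
        return "SLO burn"  # still inside the SLA, but eating the buffer
    return "healthy"
```

With `slo_status(999600, 1000000)` you'd be "healthy"; at 99.9% exactly you're inside the SLA but burning the SLO buffer, which is the "train 15 pull-ups to compete at 10" idea Nathan gets to below.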

Carter Morgan (29:51)

Yeah, this says organizations move from reactive monitoring to proactive observability at this stage. This includes advanced metrics, comprehensive logging, and the widespread use of tracing to understand system behavior. So, I have not gotten tracing working. But they go on to say key aspects of this level include comprehensive metrics collection, centralized logging with advanced search and analysis, distributed tracing for understanding request flows, dashboards and visualizations for real-time insights, and correlation across signals.

Now I will say, I think there's some overlap here with level two. Like, what is comprehensive metrics collection, centralized logging with advanced search and analysis? I mean, again, if you're on AWS, you get a lot of that for free. Distributed tracing, I mean, I think, like we talked about, is by far the hardest of the three signal pillars to implement. And then kind of dashboards and visualization for real-time insights and correlation across signals.

Nathan Toups (30:23)

There it is.

Great.

Carter Morgan (30:49)

If you're pumping your stuff into Prometheus and hooking that up to Grafana, you get a lot of that too.

Nathan Toups (30:54)

Yep. This also gets into some culture and behavioral stuff. I think, one of the things too, if you read the Google SRE book, which is a great one, we should probably read that one for this podcast. A service level objective is sort of an internal set of targets that are typically stricter than SLAs, not always tied to SLAs, but a service level objective is

how we measure ourselves. Like, this is healthy behavior for the company if we operate within this objective. And in lots of ways of measuring ourselves, this is where you start getting into this idea of proactive. So we know that if our SLOs are met, we're going to be well within the boundaries of our SLAs, right? We would like to have our operations, you know...

I used to do CrossFit, there's a CrossFit gym near my house, and we always talked about training beyond the standard. So if I was going to compete and the expectation was to do like 10 pull-ups or something, I should train 15 pull-ups, right? I should train 15 pull-ups so that when I'm in competition and I'm tired from these other things, I can do 10 pull-ups, no problem. Proactive observability is that. It's basically saying I'm going to hold myself to a higher standard than the absolute bare minimum expected by the customer. And this gives us some wiggle room if things come up.

And so, yeah, I think from a maturity standpoint, you look at that and go, yeah, that is a mature way of looking at things, and we should aspire to get there, right? So, level four. And I've only been in level four in a couple of companies. Most of the time, I think proactive observability is probably what I would call terminal observability for a lot of companies. That's like, okay, you know, 95% of your value is coming from this. But if you want to be elite level, or doing things at like

Carter Morgan (32:25)

Yeah.

Nathan Toups (32:36)

breakneck pace and be able to respond and react to things, level four and level five are what get you there. So tell us about predictive observability.

Carter Morgan (32:44)

Yeah, predictive observability as outlined in the book is: this level leverages ML and advanced analytics to predict potential issues before they impact the system. This level focuses on proactive problem resolution and optimization. Key aspects of this level include ML models for anomaly detection, predictive analytics for identifying potential failures, automated root cause analysis, and continuous improvement through feedback loops. Sounds real nice. I don't think I've worked at a place like that.

Nathan Toups (33:11)

Yeah, no,

the tools to do this are pretty advanced, and you have to have a team that's obsessed with doing this well. I've heard great stories, and I think companies like Netflix are a good example of this. This is where you're starting to get into chaos engineering and all these other pieces. I know a tool Netflix used that's pretty famous was, they had an IAM tool that would constantly check access. Like a... so, you know,

IAM roles, there's a way that AWS resources have access to things internally. So you can have an EC2 instance that's allowed to read and write to S3 buckets, and you'd make a role that says, yes, I'm allowed. And you want to scope that to a specific bucket and the specific API operations. It would constantly have an AI tool

that would check the config it had, and then the actual calls it made over the last 15 days, and then say, hey, your permissions are over-provisioned. You should actually restrict the IAM role to this lower level. You can do the same thing with telemetry. So you can basically say, hey, we're gathering all of these metrics, but there's no alerts tied to them, there's no dashboards. Imagine having these predictive tools. And then also, I'll bring up Coralogix again.
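The permission right-sizing idea just described boils down to a set difference: what a role is granted versus what the call logs show it actually used. A hypothetical illustration in Python, not Netflix's actual tool (the action names are made up for the example):

```python
# Hypothetical sketch of the over-provisioning check described above:
# compare granted actions against actions observed in recent call logs.

def unused_permissions(granted: set[str], observed_calls: set[str]) -> set[str]:
    """Actions the role may perform but never actually used."""
    return granted - observed_calls

granted = {"s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"}
observed = {"s3:GetObject", "s3:PutObject"}  # e.g. from 15 days of call logs

# Candidates for removal from the role's policy:
excess = unused_permissions(granted, observed)
```

In a real setup the `observed` set would come from something like CloudTrail logs rather than a hard-coded literal.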

They have a really great one, which is, you can tell it to watch your git pushes to production. Like when I do a tagged release, I can say: after a tagged release, if the error rate goes up more than 15%, automatically do a rollback of this release. So it's basically looking for, I don't know what's gone wrong, but the error rate has gone up a bunch,

Carter Morgan (34:50)

Yeah.

Nathan Toups (34:57)

let's revert back and send out an alert and just say, hey, look, something's wrong with this deployment. Those are cool. Again, most places don't have a rollback system that's mature enough to do that automatically, so you can't just do that. But if you do, you end up getting these self-healing, you know, tools that kind of come out. It's pretty cool.
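The rollback policy Nathan describes, revert if the error rate rises more than 15% after a tagged release, is just a threshold decision. A minimal, hypothetical sketch, not any vendor's actual feature:

```python
# Hypothetical sketch of the "auto-rollback on error-rate spike" policy.

def should_roll_back(baseline_error_rate: float,
                     post_release_error_rate: float,
                     max_relative_increase: float = 0.15) -> bool:
    """Roll back if the error rate rose more than the allowed fraction
    over the pre-release baseline."""
    if baseline_error_rate == 0:
        # any errors after a previously clean baseline are suspicious
        return post_release_error_rate > 0
    increase = (post_release_error_rate - baseline_error_rate) / baseline_error_rate
    return increase > max_relative_increase
```

For example, a 1% baseline rising to 1.1% (a 10% relative increase) keeps the deploy, while 1% rising to 1.5% (a 50% relative increase) triggers the rollback.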

Carter Morgan (35:17)

Yeah, we've,

we've got like rudimentary stuff set up, like immediately after deploys. We're on AWS ECS, and they have some tooling set up where you can hook it up to a CloudWatch alarm, and after a deploy it bakes for a certain amount of time, and then if there's an error, it'll roll back. But I think this idea of truly self-healing infrastructure, like it will detect, okay, wait a minute, one of the hosts has gone bad, let me kill it, spin up a new one...

Nathan Toups (35:20)

You

Carter Morgan (35:48)

Yeah. To be tying that not just to a basic health check, but also to actual error rates. I think, again, Netflix is kind of the example you hear about all the time here. I'd love to learn more about how all of that works at Netflix because, yeah, really fascinating stuff. But I find that you can't really get anywhere close to a level five observability unless you're working at kind of a pure tech, often a big tech, company,

where, I mean, you know, like Netflix, that is Netflix's business. If Netflix can't stream, what are they doing? And so obviously there's a lot of observability set up around that to make sure that stays uninterrupted.

Nathan Toups (36:30)

Right.

Yeah, and so let's talk about level five, which I've never worked in an organization that has level five. I was looking at it and being like, whoa, that would be so cool. But what is autonomous observability?

Carter Morgan (36:43)

Yes, yes. And maybe, I think I got ahead of myself; that's more what I was referring to with Netflix. This level represents the pinnacle of observability maturity, where systems are self-healing and capable of automatic adjustments. Human intervention is minimal and the system can adapt to changing conditions dynamically. Key aspects of this level include self-healing infrastructure and applications, including things like Kubernetes operators and chaos engineering; automated remediation response using tools like runbook automation; use of AI

or ML to dynamically adjust monitoring and alerting thresholds; and fully integrating observability practices with the continuous integration and continuous delivery (CI/CD) pipelines. So I think, again, some of these things... these levels bleed into each other a little. Like, I don't think you have to be at level five to integrate observability with your CI/CD pipelines. That's kind of like step two for me.

Nathan Toups (37:28)

Right.

So

we were, when I was at Flyer, which is this airline software company, we were at like 4.5, I'll say. It was the most mature SRE and platform engineering team I've been on. I was on the platform engineering team at staff level. We were always looking for things that could drive cultural change. And

while I was there, I helped with some of the deeper work. We were transitioning off of Datadog, well, we were on Datadog, but we were transitioning from Datadog-native tooling into OpenTelemetry, deeply tying that into CI/CD, and getting better Kubernetes operator patterns. We ended up building this really cool thing. One of the best DevOps folks I've ever worked with, he was like the Terraform master.

And we had a Terraform that would provision Terraforms. Like, you could spin up a project and it would build a Terraform stack, but that Terraform stack could reference upstream stacks. So if we made an organization-wide Kubernetes change, it would propagate through all of these SDLC pipelines and stuff.

Carter Morgan (38:31)

interesting.

Nathan Toups (38:47)

And we had really amazing control over it. But if you were working on a team, all you thought about was your little slice of the universe. And so you could zoom out as much as you needed to. And then we would provide all these modules, like, here's our OTel, you know, observability module. If you use it this way, you get access to this vast set of resources, and Datadog will automatically provision, and all of your namespacing will show up, and really cool stuff where you're like, this...

This is like nothing I've ever seen. And it was neat. And we were a pretty small company. I mean, we had, you know, 700 employees total, I think it was 300 ICs. But the stakes were very high. You know, the airline industry has very high standards. We weren't running like flight stuff. It was more like ticket purchasing and upgrades and, you know, the add-ons, like getting a hotel booked through the airline and things like this. But

Carter Morgan (39:31)

Mm-hmm.

Nathan Toups (39:45)

they have very high standards for software just because they use it in other places with lots of stuff. Yeah, but that's the only environment like that I've been in. Again, we weren't even full-on five, we were at like 4.5. We had a lot of the predictive stuff and were on the path to getting to these fully self-healing things. A tremendous amount of effort. I mean, if you look at the salaries, we were probably spending, you know, several million dollars a year on efforts around this, right? And so, yeah.

Carter Morgan (40:15)

Yeah, I think I would love to, if not work at one of them one day, at least do a deep dive one day with an engineer who works at a true level five. Like I was saying, Netflix is, I think, the one you hear about all the time. Because even Google, right? Even as, you know, the cloud provider, we weren't level five, we weren't autonomously healing.

Nathan Toups (40:28)

Yeah.

obviously Google, you know, Google's gonna be there too, yeah.

Carter Morgan (40:44)

Maybe we were, and it was just kind of... I don't know. We definitely weren't, because our on-calls were brutal. If we were autonomously healing, I shudder to think of what it wasn't picking up, because my gosh, the pages were awful. Well, so that was the maturity framework. Maybe let's move on to chapter 10. This is one of our favorite chapters in any book we come across. We love a good anti-patterns chapter.

Nathan Toups (40:53)

Right.

Carter Morgan (41:13)

Uh, this chapter is called Anti-Patterns and Missteps. I was excited when this popped up. I was like, oh yeah. And so let's talk about some OpenTelemetry anti-patterns and missteps. We can't cover all of these; maybe we can just give an overview of what our favorite ones are. I guess I'll lead with at least what Steve Flanders's favorite misstep is, which is vendor lock-in. That is the whole thesis behind OpenTelemetry:

Nathan Toups (41:30)

Yeah, I like that. I like that.

Carter Morgan (41:43)

instrument your application the OpenTelemetry way, and then that buys you the ability to swap out vendors underneath the hood. I'm a big fan of open source and non-vendor lock-in. We talked a bit about this last week. Sometimes I think it's a little more... like we've been talking about. I mean, yeah. Have you ever used AWS CDK instead of Terraform?

Nathan Toups (42:05)

Yes, and I regretted it. Yeah.

Carter Morgan (42:08)

You regretted it. Okay. I wanted to ask why, because we're

at the point where, like, one of the big promises of Terraform, again, is no vendor lock-in. But like we talked about last week, really, are you really going to design your Terraform in such a way that it's completely cloud neutral, so you can swap it out?

Nathan Toups (42:17)

great.

So if I were gonna start today, and actually I am starting to, me and a buddy are working on this kind of open source idea, and I've been a Terraform maximalist forever, since probably 2016. We're using Pulumi, and it's been a really... have you heard of Pulumi? Are you familiar? So imagine AWS CDK,

Carter Morgan (42:41)

I have not heard of Pulumi.

Nathan Toups (42:45)

but vendor neutral. It is, and not only that, but you can pick the language that you're comfortable in. So if you want to do it in TypeScript, if you want to do it in Go, you pick your favorite language, and they have an SDK. It has the idea of stacks, which is very similar to CDK. And it is much more, well, it's a Turing-complete language, so you can do a lot more if you need to use loops and do the other kinds of

Carter Morgan (42:46)

Okay, interesting.

interesting.

Nathan Toups (43:13)

interesting things, like make your own library that gets a lot of context or builds custom naming patterns. But I've been really impressed. Also, they've really embraced this idea of IDPs, or internal developer platforms. The goal for things like this is, I would like it if I had a really simple way, a template or something, for a team to spin up a new project and have it just be done right. So Pulumi has this ability: not only do you manage infrastructure, but you can also give this sort of

self-service pattern on top of it. And because it's just code that you run yourself, they have a Pulumi cloud, but you can also self-host. You know, or I should say self-manage, right? You can point it to an S3 bucket and the state is managed there. But I've been really impressed so far, and we're doing some weird stuff. Like I'm making it so that you can dynamically provision a Pulumi stack based off of something. And so we're doing dynamic runtime stuff with Pulumi.

But if you wanna use it more like Terraform plus your own custom code, it's super popular. And I kinda had it on the back burner for a while. I had friends I respected who had made the transition, especially when Terraform had the big falling out. I don't know if you heard about this, but they changed their license, and so there's this community fork of Terraform called OpenTofu.

And it's not controlled by HashiCorp. And now HashiCorp is, of course, owned by IBM. So people are like, are they cooked? Or is there a future? And I'm not a doomsdayer. I think that Terraform has done some really important things, but I do think there's a need for these CDK-type patterns. I didn't love AWS's CDK because it uses CloudFormation under the hood, and CloudFormation I've just...

found to be really frustrating. So, one of the things I like about Terraform, and one of the things I like about Pulumi, is that if I need to import existing infrastructure and then use it to describe things moving forward, it's pretty straightforward. It's not necessarily easy, you have to wrap your head around how the state works, but with CloudFormation it's near impossible. At least it was three years ago when I was using it. It was near impossible. We basically would just have to blow away

Carter Morgan (45:08)

Yeah, yeah.

Yeah, yeah.

Nathan Toups (45:35)

entire stacks to bring them back up from scratch, and we had to build our stuff out that way. Which is, yeah, if you're trying to retroactively describe a production system, it's bad, really bad. So yeah.

Carter Morgan (45:38)

Yeah.

Well,

all that to say, vendor lock-in, no good. So, yeah.

Nathan Toups (45:49)

Yeah, right, and so yeah, if you're

looking for an alternative to Terraform, I would look at other things. Like Pulumi, again, is a new way of imagining it. Yeah, you should check them out. It's worth a look, and it's pretty quick to get into if you understand how Terraform works. Pulumi does do some things differently, but it's less of a jump than if you're just starting from scratch. But yeah, that's what I would do today. If I wanted something that was more expressive than Terraform, I would use

Carter Morgan (45:58)

Yeah, I've never heard of that. That's really interesting. Yeah.

Nathan Toups (46:18)

Pulumi, like I am for this other project, which has been kind of a fun stretch for me.

Carter Morgan (46:23)

Let's go back and forth here. I'll say that's one of my favorite missteps or anti-patterns. What about you, Nathan? You give me one, I'll give you one. Let's trade back and forth a bit.

Nathan Toups (46:31)

Yeah, so

this is one that's bit me, so I'm going to bring it up. So he gave three big categories. There's telemetry data missteps; platform missteps, which is where vendor lock-in is; and then company culture. My two favorites: one is in telemetry and the other one's in company culture. So I'll start with this one and we'll go back and forth. Inconsistent naming conventions. And I know naming is one of the two hard things in computer science, cache invalidation and naming things. This is doubly true because

Carter Morgan (46:43)

Mm-hmm.

⁓ okay. Yeah.

Nathan Toups (47:00)

naming conventions are incredibly important, and you want one that gives you room to grow how a company describes stuff. So you should spend some time being like, should we namespace stuff and have the org and the vertical and the team and, you know, all these other things? I would say don't just phone that one in. Spend some time thinking about how you want to structure your naming of stuff.

It's not a death knell if you come up with a bad one, but it should be well reasoned and it should be documented: here's how we name stuff, here's why. And it should roughly break down to how the company organizes projects. Otherwise, it's going to be a free-for-all. And when you need to actually filter, if you have a good naming convention, you can do things like put wildcards in and say, I want all projects underneath this vertical in the company,

and then, boom, I have these really expressive and simple, you know, pattern-matching queries. If every team's come up with their own naming convention, you end up having this sort of if-this-then-that statement that is quite cumbersome, lots of footguns. And so, yeah, spend the time and come up with one. And if you don't want to come up with one yourself, go pick a company you like,

go look at their technical blog and see how they name stuff, right? Like go copy somebody, because they probably put a bunch of thought into it and probably came up with something that's pretty good.
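The wildcard queries described above become trivial once names follow one convention. A sketch using Python's standard `fnmatch` module; the `org.vertical.team.service.metric` scheme and all the names here are hypothetical examples:

```python
import fnmatch

# Hypothetical convention: <org>.<vertical>.<team>.<service>.<metric>
metrics = [
    "acme.payments.billing.invoicer.request_latency",
    "acme.payments.refunds.worker.request_latency",
    "acme.search.indexer.crawler.request_latency",
]

# One wildcard pulls in every project under the "payments" vertical:
payments = fnmatch.filter(metrics, "acme.payments.*")
```

Here `payments` matches exactly the two payments-vertical metrics. Without a shared convention, the same query turns into a pile of special cases per team.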

Carter Morgan (48:26)

Ha ha ha.

It's funny hearing you talk about naming conventions, because we're just not big enough yet for that to be a huge concern. So I hear that and I'm like, I'm excited for the day we have that problem, right? We're just one team of engineers at this point. So mostly, you know, we communicate by just yelling at each other from, you know...

Nathan Toups (48:51)

Well,

a perfect example is, one of the first projects I did on that FinOps team was coming up with a consistent way that we actually tag stuff and annotate things, mainly so we could even just run reports for FinOps, right? It was just like, man, all these teams have solved their own problems in 50 different ways. And so I had to go around and say, here's the way that we do things at Flyer,

here's the teams that are already doing it right or mostly right. And then I would go in and open PRs across projects or encourage folks to come help. And then we got it. And I remember, like two months later, after we got two full billing cycles, leadership was so excited that we could actually measure things. And it was just because

we came up with proper naming conventions. It's not like I did anything magical. It was literally just: some teams capitalize and some teams lowercase, or they use this tag but attribute it to a team, and in this one the owner is a team, and in this one the owner is an email address. Well, which one are we doing? What's the value of this? And again, as pedantic as it seems, these are the ideas that'll scale with your business over time.
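That kind of tag cleanup can be sketched in a few lines of Python. The convention choices here, lowercase keys and values, owners recorded as team names rather than email addresses, are hypothetical examples of the decisions Nathan describes, not Flyer's actual rules:

```python
# Hypothetical sketch: normalize ad-hoc resource tags into one convention
# so cost reports can group reliably.

def normalize_tags(tags: dict[str, str]) -> dict[str, str]:
    normalized = {}
    for key, value in tags.items():
        key = key.strip().lower()    # one casing convention for keys
        value = value.strip().lower()
        if key == "owner" and "@" in value:
            # convention decision: owners are teams, not email addresses,
            # so "billing-team@example.com" becomes "billing-team"
            value = value.split("@", 1)[0]
        normalized[key] = value
    return normalized
```

For instance, `normalize_tags({"Owner": "Billing-Team@example.com", "ENV": "Prod"})` yields `{"owner": "billing-team", "env": "prod"}`, so a report grouping by owner no longer splits one team across three spellings.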

Carter Morgan (50:02)

I'll give my next one. This falls in the platform missteps anti-patterns: alert storms. The book defines alert storms as configuring alerting systems with excessive or poorly defined thresholds, which can cause teams to be inundated with a high volume of alerts, many of which are false positives. We talk about this a lot on my team: the only thing worse than no alert is a noisy alert, an alert you don't pay attention to. I was actually really, really impressed with one of our junior engineers.

I did a lot of work to set up OpenTelemetry and to get alerts going, and I defined the thresholds pretty generously. For example, for one of our applications, I think I said: if 5% of requests in a five-minute period are 500 errors, that's a problem, right? And again, I think that's pretty darn generous.

And that one was alarming more than we wanted it to. And he kind of came up to me and was like, hey, this alarm, it's too noisy. And I'm like, I agree that it's noisy, I just think 5% is a really generous threshold as is. I don't know if we can tune that down; we've got to figure out why we're getting all of these errors. And he actually took it on himself. He created a ticket and fought to get time for it in the sprint, just to go and audit all of our alerting and see, okay, why

are these pretty noisy? And then he found out that those 500 errors, it was some random path that was completely non-critical and was getting pinged as part of a secondary flow, and for some reason was returning 500 errors. And so I was really, really impressed with him. He set up a separate staging alert Slack channel so that we could separate out the staging environment and the production environment and not just have one alerts channel.

Really, really impressed by him to take that on himself. And I think he's right, because I just kind of set up a lot of this OpenTelemetry stuff almost in my spare time, and so I didn't have a ton of time to go and fine-tune it. So, very impressive work by one of our junior engineers, and I think it's a great path to, you know, not having an environment with alert storms.
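The 5%-of-requests-in-five-minutes rule Carter describes reduces to a ratio check over one window. A minimal sketch, assuming you already have the window's status codes in hand; a real setup would query CloudWatch or another metrics backend rather than pass raw lists around:

```python
# Hypothetical sketch of the alert rule discussed: page only if more than
# 5% of responses in a five-minute window are 5xx server errors.

def should_alert(responses: list[int], threshold: float = 0.05) -> bool:
    """`responses` is the list of HTTP status codes from one window."""
    if not responses:
        return False  # no traffic, nothing to alert on
    errors = sum(1 for code in responses if 500 <= code <= 599)
    return errors / len(responses) > threshold
```

Three 500s out of 100 requests (3%) stays quiet; eight out of 100 (8%) pages. Tuning `threshold` per service, like the junior engineer did after auditing, is what keeps this from becoming an alert storm.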

Nathan Toups (51:56)

Yeah.

Yeah. We hit this problem at one of the companies I was working in. For whatever reason, they were using 500 errors to signal stuff, and I was like, guys, 500 errors are like the world's falling apart. And for the uninitiated, I guess, a little inside baseball: HTTP status codes are these codes that signal how a request went. 200 OK, maybe you've seen this in places; 200 OK means the response came back correct and it was proper.

Carter Morgan (52:28)

Interesting.

Yeah.

Nathan Toups (52:46)

There's 300-level codes, which are typically around redirects. 400s are errors, like unauthorized, things like this. And then, yeah, yeah, exactly: 400 is maybe you don't have access, maybe you didn't, whatever, or the resource doesn't exist, things like that. 500 errors are literally server errors, like something existential happened.

Carter Morgan (52:55)

Usually the client made a mistake.

Nathan Toups (53:10)

Those really should be rare, and they really should be something where you need to go check it out pretty immediately if something's going on. And so I was very pedantic about meaningful 500 errors. I would go in and do the same thing. I'd be like, we need to alarm on these, and they're like, it's really noisy. And I'm like, well, we need to go fix it. We're either giving a 500 error when it's not needed, or something's really broken, right? Something's really broken and it's gonna bite us if we don't fix it. So...

Carter Morgan (53:21)

yeah.

Nathan Toups (53:40)

Meaningful, again, just like naming stuff: proper HTTP codes, they're so important. It's just good hygiene. You should know, if you're building REST APIs and stuff, that a 200 versus a 203 matters, right? A 300 versus a 303 matters. A 400 versus a 403. These different codes should have meaning, and you should put a little bit of thought into why

you're returning what you're returning, because other people should be able to take that and say, you know what, I want to process this differently because it's a 401 versus a 403, or whatever we have going on. So yeah, I do think that matters a lot. It's cool too, though. I think that what you did was healthy, which was you kind of came in with a wrecking ball, right? Which is like, we went from zero to, I'm gonna put something out there

and I'm gonna throw spaghetti at the wall and see what sticks. And then somebody comes in and goes, you know what, this is noisy and I don't know if I should actually pay attention to this or not. And then they fix it. That's a healthy org, I think, yeah.
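The payoff of distinct status codes is exactly what Nathan describes: callers can branch on the code instead of parsing error text. A hypothetical client-side sketch, where the recovery actions are illustrative choices, not from the book:

```python
# Hypothetical sketch: a caller choosing different recovery behavior
# based on the HTTP status code alone.

def handle_status(code: int) -> str:
    if code == 401:
        return "refresh credentials and retry"           # unauthenticated
    if code == 403:
        return "stop: this identity lacks permission"    # forbidden
    if code == 404:
        return "resource does not exist"
    if 500 <= code <= 599:
        return "server fault: alert and retry with backoff"
    return "ok"
```

A 401 means re-authenticating might fix it, while retrying a 403 forever just hammers the server, which is why collapsing everything into one generic error code throws that signal away.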

Carter Morgan (54:46)

yeah, and I had told

my team, I think someone at some point had referred to it as Carter's Alerts, and I said, I'm like, they're not my alerts, these are the team's alerts. If you think they're too noisy, adjust them. And so I was really proud of this engineer. He really took that to heart and was like, okay, this is my team's work too.

Nathan Toups (54:57)

Right.

Great.

So this is, I love this too

because this gets into company culture. Yeah, that's my other one. So short-term thinking is an anti-pattern. And again, short-term thinking would have been, well, just turn Carter's alerts off, right? It's annoying. And then you're like, okay, cool, yeah, failed experiment, and that was it, and we just put a bow on it. A culture that says,

Carter Morgan (55:11)

Okay, was that your other one?

Right.

Nathan Toups (55:30)

no, we need these alerts but maybe we can tune them, or, we need these alerts and we can address this load, that's longer-term thinking. That is a "let us maintain this system" mindset. That's a culture that... yeah, it's really awesome that you have people like that in your company, and that you were able to push back and say, these aren't my alerts, these are our alerts, right? Our infrastructure is doing this; I just shone a light on it.

And I love that, because that kind of ties into this other anti-pattern, which is lack of ownership and accountability. You kind of have to have these together. And I've been in those organizations. Have you been in those, where everyone just kind of throws their hands up? The lack of ownership piece, yeah, yeah.

Carter Morgan (56:10)

Oh yeah. Oh yeah.

And you know, it's interesting. I can't remember which chapter it is exactly, but there's a chapter, I think it might be this anti-patterns and missteps one, which is basically about how to convince leadership that this is valuable. And I think that's a good skill; you should be able to convince leadership this is valuable. But I'll say it's really, really frustrating working at a place where you're trying to get leadership on board with, like,

basic ideas like this. That's something I've liked about doing Book Overflow: and I've said this before, I mean, we learn a lot of new stuff here, but I think the better value I get out of this podcast is just hardening my ideas around what quote-unquote good is. And I think there's a lot of interesting questions to ask around OpenTelemetry when you're implementing it as an organization. Like, what should our alerting thresholds be? Should we be doing automated rollbacks, or should we just be

using these kinds of alerts to get some observability? Should we be selecting a real-deal telemetry provider, or should we just be kind of self-hosting using Prometheus and Grafana? Lots of interesting questions. There's not really right or wrong answers; it just kind of depends on the context of your organization. But the question of, should we be collecting metrics? Should we be using those metrics to alert?

And similarly, I think it's more or less a solved question at this point. Should we be using OpenTelemetry to do all this stuff? That's basically solved. And so I've really liked working at a place now where we're all agreed on what we want to ultimately get to. We're all agreed on these kinds of basic principles. Exactly how you implement them is going to differ from team to team, environment to environment. And so, yeah, I think.

When you're feeling out new jobs, really try to work with people who are agreed on those high-level concepts, even if they're not fully implemented yet. Because, yeah, it can be really, really painful trying to convince technical leadership to embrace OpenTelemetry, because you wind up with kind of what you're talking about, short-term thinking. It's just like, that didn't work, turn it off. And I'm fortunate my team isn't like that. We say, okay, this is noisy, we should fix it, but no one's saying, well, I guess we shouldn't use OTel.

Nathan Toups (58:15)

All right.

Yeah.

Great.

No, that's so powerful. And I think, you know, this lays the groundwork. You've taken a very pragmatic approach, it feels like, just from what you've described on Book Overflow. And it's one of those... again, it gets back into, you know, Neal Ford, Building Evolutionary Architectures,

you know, those sorts of pieces. This is the foundation; it's well reasoned, it's solid. We know that there's a lot of potential in where this can take us. I guess actually this ties us into chapter 11 really well, which is observability at scale, right? So how does this stuff work as the company grows and as things get more complex? What you build as a foundation and what you're relying on

really, really matters. ⁓ And what's cool is that you can grow into OpenTelemetry for it to become as complex as needed to meet the organizational needs. Was there anything in this chapter that stood out to you, or is this just mostly a me thing?

Carter Morgan (59:32)

No, I mean, this

one I kind of clocked as like, I'm not gonna really care about this one right now. Like we're just, we're not there. We're not gonna be there for a while. And, you know, I think there's only so much information you can absorb at any given time. I had to kind of, ⁓ you know.

Nathan Toups (59:37)

Yeah. Mm-mm. Right.

So I will say, I'm hoping that Leland is wildly successful, that you guys are like, you know, 3x-ing headcount annually for the next few years or something. These things will come up. And I think this chapter 11 will probably be something that you'll come back to and be like, okay, super neat. They start kind of addressing ideas on like, well,

Carter Morgan (59:50)

Yeah, that-

Yes.

Right, right.

Absolutely.

Nathan Toups (1:00:13)

A lot of stuff has crept in, and microservices is a good example. Why OpenTelemetry has become so important is because microservices have emerged as like a big pattern, you know, domain-driven design where teams are kind of owning their own little piece of the business and they're doing their own thing. But at the same time, you need this sort of, as again, as we've talked about with Neal Ford, orthogonal coupling. Was it Neal or was it Kent Beck? I forget now. Was it Richards?

Carter Morgan (1:00:16)

Mm-hmm.

Mark Richards. I mean, I believe this is a concept from Fundamentals of Software Architecture.

Nathan Toups (1:00:46)

I think you're right, actually, you're right.

And so this orthogonal coupling is this idea that like, hey, we let all these teams work autonomously, but we all agree on the language of observability so that there's a way for us to interface. And so this allows us to take the distributed system complexity and minimize the risk of how we observe these distributed systems. There's also kind of a cool section. And again, I've run into this with that 4.5 company I was in,

doing anomaly detection with things like z-scores and clustering and time series analysis, where we actually had folks who had analytics backgrounds who could come in and do deep data analysis on the behavior inside of our organization. And if you have the observability data, they can do that, right? If it's like nice, clean, well-formatted, well-named data, I can go get a data scientist to go analyze a process that we have.

And what a cool problem to have. I loved it cause I'm like, oh, we're doing like real science on top of what we do. Like, you know, some manager comes in and says, this is how we should ship code. We can go, was that effective? Like, did you reduce the error rates in production, or did these things happen, or did you actually increase the risk to the company? And operationally we can go and analyze that and do cool stuff.
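The z-score check Nathan describes can be sketched in a few lines of Python. This is an illustrative toy, not anything from the book, and the latency numbers are invented:

```python
import statistics

def zscore_anomalies(values, threshold=3.0):
    """Flag points whose z-score exceeds the threshold.

    The z-score measures how many standard deviations a point
    sits from the mean of the series.
    """
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []  # a flat series has no outliers
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mean) / stdev > threshold
    ]

# Latency samples (ms) with one obvious spike at index 5.
latencies = [102, 98, 101, 99, 103, 450, 100, 97]
print(zscore_anomalies(latencies, threshold=2.0))  # [(5, 450)]
```

Real anomaly detection on observability data would layer on seasonality and trend handling, but the core idea is this simple.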

Carter Morgan (1:01:42)

Hahaha.

Well, maybe let's touch on chapter 12, the future of observability, just real quick. That's what's nice about this book having been written recently; I think this book was written in 2023. So one of my favorite things, any theme park fans out there will appreciate this. The Carousel of Progress is like a little stage show of animatronics that Walt Disney made for like the World's Fair back in the 1960s. And it's really cute. It's like a rotating theater. Like you sit in the seat and then

Nathan Toups (1:02:09)

Yeah. Yep. Yep.

Carter Morgan (1:02:35)

The stage in the middle is stationary, then your seats kind of rotate around the stage. And so it's divided into like five different sections, and in each you'll get a different scene. It's charming. I really like it. But what's so funny is that like each section, it follows the same family as time moves on. And so it's like the family starts at the 1880s, and it's like the 1900s, the 1920s and 1940s. But then like the 60s were the present when Walt Disney made it. So then it just jumps from the 1940s to the future.

And the future is like, I think they updated it in the 80s, but it's still like an 80s vision of what the future would be. And it's so dorky. And I think they're redoing it, which will be fun. But anyhow, all that to say, you get that sometimes with the books that are written even like five or 10 years ago. It's like, now let's talk about the future. And then you're like, what is this? This book does not have that problem because it was written in 2023. But we'll see. Any thoughts on the future of observability that stood out to you here?

Nathan Toups (1:03:34)

So I remember there's a famous quote from Jeff Bezos in a fireside chat he had, about people asking what will change over the next 10 years. And he said the far more interesting question is what won't change, right? That people are gonna want cheaper products, delivered faster. Challenges and opportunities, I don't think these will change, right? And so one of the things that they outline is like four areas that are all Cs. So cost,

Carter Morgan (1:03:47)

Yeah.

Mm-hmm.

Nathan Toups (1:04:00)

particularly cost related to infrastructure and licensing of managed tools. ⁓ Complexity and complexity overload. ⁓ Compliance, which is increasingly becoming a big deal with GDPR, SOC 2 Type 2, HIPAA compliance, SEC stuff, depending on whatever regulated industry you're in. The expectations just keep getting higher. The rules keep getting stricter. And then code, meaning...

are you managing observability in the same way that you do SDLC, CI/CD, you know, the other parts of your infrastructure? Is observability also in that domain? ⁓ I think that those are four areas in which, number one, all four of those would be great startup ideas. Number two, you could own one or more of these in your organization if you want to be an innovator when it comes to observability.

Diving deep into cost, there's a whole thing, certification track and everything, called FinOps. ⁓ Complexity has a lot to do with developer experience. Compliance is its own track; you can get into the security and compliance world. And then code, of course, is also, again, platform engineering. ⁓ These are huge opportunities. Some of this is complexity that we put on ourselves, because OpenTelemetry kind of requires this and is pretty complex. ⁓ But I think that...

⁓ That was sort of the groundwork of it. He outlined some trends. And so maybe you can dive into some trends that you see coming up, but I wanted to kind of give that framework there. ⁓
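To make the "code" idea concrete, here is a hypothetical sketch of observability-as-code: an alert rule stored as plain data in the repo, reviewed and deployed through the same pipeline as everything else, and evaluated by a script. Every name here is made up for illustration:

```python
# A hypothetical alert rule, versioned alongside the service code
# and shipped through the same CI/CD pipeline as the service itself.
ALERT_RULES = [
    {
        "name": "HighErrorRate",
        "metric": "http.server.errors_per_min",
        "threshold": 50,
        "comparison": "gt",
    },
]

def evaluate(rules, metrics):
    """Return the names of rules whose metric crosses its threshold."""
    fired = []
    for rule in rules:
        value = metrics.get(rule["metric"], 0)
        if rule["comparison"] == "gt" and value > rule["threshold"]:
            fired.append(rule["name"])
    return fired

# Simulated metric readings for one evaluation cycle.
print(evaluate(ALERT_RULES, {"http.server.errors_per_min": 75}))
```

Real systems express this as Prometheus rule files, Terraform, or vendor configs, but the point is the same: alerting lives in version control, not in someone's dashboard clicks.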

Carter Morgan (1:05:31)

No, I love that. And I really liked that Jeff Bezos quote you shared about what isn't going to change in 10 years. ⁓ I think as far as trends, the one that stands out to me the most is obviously what everyone's talking about: AI. I think there's obviously this really bold vision for how AI is used, in that it's something that never sleeps, consuming all of your metrics, and it's also watching your alerting, and then it's looking at your logs and your runbooks, and it's kind of like

automatically resolving things for you. I think that's a billion-dollar idea: an on-call agent, basically. ⁓ I think, I mean, geez Louise, that sounds like a hard thing to build, and who knows if that's even possible. ⁓ But very, very interested to see the developments in that space.

Nathan Toups (1:06:20)

Pager, yeah,

PagerDuty is actually trying this. There's a few others. Firehose and... FireHydrant, I'm sorry, FireHydrant. There's a few other of these, the sort of companies that are focused on reducing mean time to recovery and support tickets and things like this. Again, unsolved. They're all kind of figuring this out. One of the things that I thought was really interesting and I keep seeing come up,

Carter Morgan (1:06:38)

Right.

Nathan Toups (1:06:46)

There was an open source project, and I guess a company now, called n8n. I don't know if you've heard about them. Okay, so you'll see them where they've kind of positioned themselves as like a click, you know, it's like a GUI workflow tool. It's basically IFTTT, if this then that, that you can self-host. And so you basically can say, here's an agent and here's some stuff, and tie in all these other pieces, and then you can build these low-code, sort of no-code workflows. What's funny is that n8n has been out for a while.

Carter Morgan (1:06:52)

Uh-uh.

Mm-hmm.

Nathan Toups (1:07:16)

They were doing some kind of cool workflow automation stuff, but they kind of had trouble figuring out exactly how to position themselves. And now that AI, agentic AI stuff, is out, they've repositioned themselves as like the tool of choice for folks who want to do, you know, AI agent chatbots and all these other kinds of more sophisticated workflows. It is genius on their side. You know, I think that these kinds of workflow-oriented

architectures are going to become increasingly important with MCPs and all these other things that are happening. Another one I'll bring up, a little inside baseball, is eBPF. Are you familiar with eBPF? Okay, cool. So this is one of those magical things that happens kind of under the hood. eBPF stands for extended Berkeley Packet Filter; the original BPF was a way for you to do processing of

Carter Morgan (1:07:59)

I'm not.

Nathan Toups (1:08:14)

packets in the kernel of a Unix operating system. And it would basically be in this special place, but it would get priority. So it's very, very efficient, but it's not Turing complete, in that you can't write infinite loops. It has this verifier that makes sure that you're doing something very efficient. So you can say, anytime network traffic on this port comes through, do a thing, right? And it goes and does it, and it's in the hot path of this stuff. ⁓ And so eBPF

has become huge for telemetry and observability. Kubernetes, if you're using it, there's a Kubernetes networking tool called Cilium, which ⁓ gives you this high-performance monitoring and security framework on top of Kubernetes. And so I would say anybody who's using Kubernetes these days should probably just default to using Cilium. And what makes Cilium so magical is it uses eBPF under the hood. And it can do...

amazing things with observability of the containers and all this other stuff, without the performance hit that you'd get when you would traditionally use a sidecar, using containers to monitor containers and all these other things. ⁓ It's super efficient, and it's those kinds of technological breakthroughs that I think are gonna have a huge impact on observability. So that's why I'm riffing on eBPF for a second. ⁓ Yeah, it's really cool technology, regardless of whether you're using Kubernetes or not. ⁓

Carter Morgan (1:09:35)

Interesting.

Nathan Toups (1:09:41)

And you'll see this. I think for most people, eBPF is going to be invisible. Most people aren't thinking about what's the Linux kernel driver that's doing TCP routing or something. You're not thinking about it. You just get a socket open and send traffic to it. eBPF is a lower-level tool. ⁓

Carter Morgan (1:10:05)

Well, I

think that takes us through the content of the book. Maybe let's wrap up with our usual. Give me your hot takes, Nathan. We've now finished the book. We are fully qualified to give hot takes.

Nathan Toups (1:10:17)

Yeah, yeah. So my hot take, really, after sitting with it for a week: ⁓ I think that your observation last week, the Taylor Swift ⁓ thing, which is like, who is this book for, ⁓ is doubly true. Like, we focused this whole episode on chapters nine, 10, 11 and 12, right? ⁓ We read six, seven and eight before that. And

Carter Morgan (1:10:41)

Mm-hmm.

Nathan Toups (1:10:46)

they felt like very different audiences for those. And I think the biggest value of the book was 9, 10, 11, and 12. Like, that by itself could have been a book. And I think that, yeah. And so, yeah, I didn't know if I fully agreed with you last week, because I think the first part of the book kind of spoke to me a bit. But then I really

Carter Morgan (1:10:49)

Yes.

Yeah, I agree.

Nathan Toups (1:11:15)

understood where you were coming from, because I had a hard time getting through six, seven, and eight. I mean, this is my subject matter expertise. I was like, come on. But yeah, yeah. So that's my hot take.

Carter Morgan (1:11:23)

Yeah.

Yeah, I feel the same way, and I'm just checking real quick. Like, who was this book published by? Does it say?

Nathan Toups (1:11:38)

yeah, that's a great question.

Carter Morgan (1:11:43)

It's a Wiley book. And I mean this like, I don't know, this book read to me like it was self-published. ⁓ Like there was just a lot in here where I was like, again, I mentioned it kind of up at the top, like why in chapter 10 is it telling us again that OTel is the abbreviation for OpenTelemetry? I'm like, did the editor not catch this? Like, it doesn't flow well in parts. Like...

Nathan Toups (1:11:45)

Interesting, yeah.

Carter Morgan (1:12:12)

It's, yeah, it's confusing who this book was written for. Again, like all of chapter one is kind of like the history of OpenTelemetry. It's really hard for me to... What I can say confidently about this book is that this is the most comprehensive book there is out there about OpenTelemetry. Who I would recommend it to is just, I guess, if you want to know literally everything about OpenTelemetry. I wish I could say a little more like, this is perfect, this is OpenTelemetry for beginners,

read this and it'll get you kind of up and running; or, this is for your experienced SREs who really want to contemplate more about the economics behind OpenTelemetry, the future of OpenTelemetry, ⁓ some of the nitty-gritty behind the scenes of OpenTelemetry. But instead it's like this jumble of everything, and it makes it, yeah, it's a tough one to recommend, ⁓ and not because

it's not full of great content. It is full of great content, but it's organized a little strangely. So I think this could have used another pass through by the editor. ⁓ But still, still a good book. ⁓ And I think it's definitely going to motivate changes in our career. ⁓ Nathan, what are you going to do differently in your career, ⁓ having now finished Mastering OpenTelemetry?

Nathan Toups (1:13:31)

Yeah, again, I love the anti-patterns section. I think using this to call out ⁓ anti-patterns and pitfalls to leadership. So if I see that we're going on a treacherous path, having this book, which is

unquestionably the authority on OpenTelemetry, and saying like, hey, this thing we're doing is actually identified as one of the anti-patterns, and this is why it's dangerous, and actually there's some tactics, like I think we should move in this direction. It's just nice to give you this little air of authority, showing that, hey, this isn't me just saying I don't like this or, you know, the frozen caveman problem or something else, and me just kind of being freaked out. It's like, hey, OpenTelemetry says don't use it this way, or hey, this is a pitfall that you're going to run into. ⁓ I think...

I think I'm really happy that I'll have these to kind of lean on.

Carter Morgan (1:14:22)

Nice. ⁓ For me, I just want to be better about demonstrating business value with OpenTelemetry, I think. Yeah, I really want to be careful. Like, I'm fortunate to work at a place where there is a lot of leash granted for, let's just improve our technical processes. But at the same time, the only reason we improve our technical processes is because it helps us generate business value. ⁓

Nathan Toups (1:14:28)

Ooh, yeah.

Carter Morgan (1:14:49)

And so I just want to be good about kind of showing, like, that's why, with this new feature we shipped, I wanted to add some OpenTelemetry metrics for it, so I can kind of show: see, look, now we're getting more insight into this new feature we shipped. And this feature has kind of like an outside partner, so it's good to have, to kind of keep ourselves accountable. I'm actually doing this with CI/CD. We do a monthly demo day where engineering demos for the whole company what we've worked on. And so I'd kind of

teased last month the CI/CD stuff we were working on, but ⁓ this month I actually took the time, because I had basically two clean months. I'm like, okay, let me compare ⁓ our deployment metrics from last month to this month. And so I ran some Python scripts and got basically our change lead time. So last month, the change lead time, meaning from when a change was merged into the main branch to when it actually made it out to production,

averaged 20 hours. And we have now gotten it down to 47 minutes. And so I'm going to kind of demonstrate to the company, and I've got some other metrics with that: hey, look, we've got a 97% reduction, we're getting features out to customers faster, we have fewer regressions. ⁓ So yeah, I think that's just a whole space I've been thinking about, and I really want to apply that thinking to OpenTelemetry. And then finally, we get to who we would recommend this book to. ⁓
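The kind of script Carter describes can be sketched in a few lines of standard-library Python: given merge and deploy timestamps for each change, compute the average lead time. The data shape and the timestamps below are invented for illustration:

```python
from datetime import datetime
from statistics import mean

# Hypothetical records: when each change merged to main and when it
# reached production, e.g. pulled from your Git host and deploy logs.
changes = [
    {"merged": "2025-09-01T10:00", "deployed": "2025-09-01T10:45"},
    {"merged": "2025-09-02T09:30", "deployed": "2025-09-02T10:19"},
]

def avg_lead_time_minutes(records):
    """Average minutes from merge to production deploy."""
    deltas = [
        (datetime.fromisoformat(r["deployed"]) -
         datetime.fromisoformat(r["merged"])).total_seconds() / 60
        for r in records
    ]
    return mean(deltas)

print(round(avg_lead_time_minutes(changes)))  # 47
```

The same shape of script works for the other DORA-style metrics (deployment frequency, change failure rate) once the raw timestamps are exported.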

Nathan, yeah, like I said, it's kind of a hard one to recommend, but what do you got?

Nathan Toups (1:16:21)

Yeah, so I dropped general software engineers, SWEs, from this. ⁓ This book is really for platform engineers, SREs, DevOps, folks in software architecture ⁓ who really need to super deep dive into OTel: both the context and history, where the project's going, and the proper best practices as they stand currently. ⁓ If you're tasked with being a deep expert in OpenTelemetry, this is a phenomenal book for that.

Carter Morgan (1:16:25)

Yeah.

Yeah.

And I would agree with that. And I'd also say, give yourself permission to skim parts of this book. We can't do that because we're Book Overflow and we have to read the whole thing. But there were definitely parts where I think, if you are like an SRE and you already know how to do basic OpenTelemetry implementation, you don't need chapters six, seven, and eight. You can either skim them or skip them entirely. But yeah,

We're glad to have read it, glad to be moving on to other books. We're doing React. I can't remember the name exactly, but we're doing a deep dive on React these ⁓ next couple of weeks, which will be a lot of fun. And thanks for sticking around, everyone. You can always ⁓ message us at contact@bookoverflow.io. You can find us on Twitter at BookOverflowPod. You can find me on Twitter at Carter Morgan. And you can find Nathan's work with Functionally Imperative at functionallyimperative.com. Thanks for sticking around, everyone. We will see you next week.

Nathan Toups (1:17:49)

See you.