OTel Fundamentals - Mastering OpenTelemetry and Observability by Steve Flanders
Book Covered

Mastering OpenTelemetry and Observability: Enhancing Application and Infrastructure Performance and Avoiding Outages
by Steve Flanders
Transcript
This transcript was auto-generated by our recording software and may contain errors.
Nathan Toups (00:00)
in the olden days, everybody basically grabbed logs and you would extract metrics out of logs. That was kind of like a thing. You would use some MapReduce type stuff.
and you would do some regex and pull out the number of successful requests out of your logs that would come in. And we knew that that wasn't an efficient or good way of doing it. And so this idea of metrics, which has things like gauges and counters and all these other things, would literally just give you a snapshot at request time.
Carter Morgan (00:16)
Right, right.
Hey there, welcome to Book Overflow, the podcast for software engineers by software engineers, where every week we read one of the best technical books in the world in an effort to improve our craft. I am Carter Morgan and I'm joined here as always by my co-host Nathan Toups. How are you doing, Nathan?
Nathan Toups (00:45)
doing great. Hey everybody.
Carter Morgan (00:47)
Well, as always, make sure to like, comment, subscribe, leave a five star review on whatever audio platform you're on. Share the podcast on LinkedIn with your friends and coworkers. If you're a student, share it with your fellow students. If you're a teacher, share it with your students. It all helps to grow the podcast. And if you're looking for more help with your career personally, you can book a coaching session with either Nathan or me on Leland. The link is in the episode description. And we are excited about today's episode. This is
OpenTelemetry. I just think of this as the OpenTelemetry book, but it's got a real name.
Nathan Toups (01:23)
yeah, I think I did a bad job of putting that in the notes. I think it's mastering open telemetry.
Carter Morgan (01:30)
Mastering open
telemetry
Nathan Toups (01:33)
Yes, that's it. Mastering Observability in OpenTelemetry. There we go. It's a mouthful.
Carter Morgan (01:41)
Wait, is it observability first? I've got OpenTelemetry first.
Nathan Toups (01:45)
Hm, then... I typed it that way.
Carter Morgan (01:48)
Okay, I got Mastering OpenTelemetry and Observability: Enhancing Application and Infrastructure Performance and Avoiding Outages.
Nathan Toups (01:52)
Yep, this is what I get for typing stuff and not just letting other things do it for me. Okay.
Carter Morgan (01:58)
All right. So that's the book this week. We'll give you the author introduction and the book introduction so we can all get on the same page here. So it's written by Steve Flanders. Steve Flanders is a founding member of the OpenCensus and OpenTelemetry projects and has over a decade of hands-on experience in the monitoring and observability space. As a senior director of engineering at Splunk, a Cisco company,
he oversees the Splunk observability platform and spearheads Splunk's OpenTelemetry contributions. He was previously instrumental in bringing to market what is now the Splunk APM product at Omnition and the Log Insight product at VMware. A sought-after speaker and blogger, Steve frequently shares his insights at prominent conferences like KubeCon and on his blog. He holds an MBA from MIT, underscoring his blend of technical acumen, strategic vision, and entrepreneurial spirit.
And the book introduction is: discover the power of open source observability for your enterprise environment in Mastering OpenTelemetry and Observability: Enhancing Application and Infrastructure Performance and Avoiding Outages. Accomplished engineering leader and open source contributor Steve Flanders unlocks the secrets of enterprise application observability with a comprehensive guide to OpenTelemetry, or OTel. Explore how OTel transforms observability, providing a robust toolkit for capturing and analyzing telemetry data across your environment.
So we've done a few weeks of more culture stuff or career-focused stuff. We're back in the thick of it with a deeply technical book. We read the first half, we read through chapter five this week. Give me your thoughts, Nathan. How are you feeling about this?
Nathan Toups (03:35)
Yeah.
Yeah, this was a big pendulum swing. I mean, first of all, you couldn't read a book about OpenTelemetry by a more qualified person. And I think this has really taken the whole industry by storm. I went through this shift in thinking, and I think that'll be something kind of cool to talk about. Also, I've always called it OTel, just like you naturally did. And then I just watched a session the other day from, like, GopherCon UK, I think. And there was someone who called it
Carter Morgan (03:48)
Absolutely.
Nathan Toups (04:11)
"auto," and I had to do a double take, and apparently there's like a whole thing. Exactly, I was like, Carter's going to hate this, that's immediately what I thought. But it's just like with kube control versus kube cuddle. For some reason, SREs and DevOps folks just want to give the weirdest pronunciations to things. And I think some of it's because we read stuff all day and maybe don't talk to other humans. But also, just... auto.
Carter Morgan (04:15)
I don't like that at all.
hahahaha
Right, right.
Nathan Toups (04:41)
Really? Tell us in the comments if you like to call it auto versus OTel. I'm an OTel person myself, but yeah. To get into the general thoughts, though: this book is deeply technical. It also kind of changes one of the rules that we typically have, which is that we've been largely avoiding frameworks and specific tooling. I think we're gonna deviate from that a bit when something has enough of an industry focus. So OpenTelemetry is sort of like...
Carter Morgan (04:48)
Me too.
right.
Nathan Toups (05:10)
And you'll see the book kind of outlines this really well. Everything was fragmented. Everyone had different ways of solving this problem. And a consortium kind of came together and said, hey, actually, if we think about it this way, if we structure it this way, we can get past all the madness. And they were correct. Every observability platform, everything from, you know, Datadog to New Relic to all these vendors that are involved, have all kind of gathered around OpenTelemetry for the betterment of everybody, right? And so I think that this...
helps a lot. I've been working with OpenTelemetry for a few years now, and I'd never really gotten into the deep dive of the innards of how things are working, or the vision of where things are going. And so I appreciated this book a lot, because I was like, oh, that's why it's this way. Or, that's why we're transitioning from this pattern to that pattern, or this thing's getting deprecated. So I think some of this was speaking to me. But I will say, you know, even I, my eyes would glaze over a little bit in certain parts where I'm like, okay,
I'm reading this for the podcast. I got to get through this. I'm not going to skim it. It's a deeply technical book. I think it's like 50% "here's everything about OpenTelemetry" and 50% deep reference manual, which is not a fun thing to read cover to cover. So yeah.
Carter Morgan (06:10)
Yes.
Yes.
Yeah, I agree. I wish it had been a little more like Refactoring by Martin Fowler, where it was kind of like, here's a scenario without OpenTelemetry, here's a scenario with, and then as we apply OpenTelemetry, it turns into this, right? And you're right, it does lean a little too much into reference manual. I was excited to read this book, because
I've been doing a lot of this stuff at work. We were not in a place where you had auto-instrumented metrics for the application. We were just getting whatever you got for free through AWS. But even then, what you get for free with AWS is kind of limited, and you can't use something like Prometheus, which has the very powerful PromQL query language. Anyhow, so I picked this up, excited to read it, and...
Yeah, it's this weird mix of too broad and also too focused at times, right? Which, like, I can't fault the book for being that way. I think I was looking a little more for a how-to guide, maybe. And instead this kind of starts with, okay, how did this project come to be? Where is it going in the future? What does it mean? What is the status of the project? And what does it mean if the project is...
I don't even remember all the different statuses, but it mentions all the different components of OTel, what status each one is in, and how you can know what the status means and how it's changing. At a certain point, I start wondering, okay, wait, who is this book really for? Which is not to say that there aren't good things in this book. I'm learning a lot from this book. But it is kind of like Steve Flanders said, I'm going to put everything I possibly know about OTel into a book.
Nathan Toups (07:56)
Mm-hmm.
Great.
Carter Morgan (08:18)
Like, yeah, all things come back to Made to Stick. And it's a little like, I think he might be suffering a little from the curse of knowledge that Made to Stick talks about. He knows so much about OTel, right?
Nathan Toups (08:22)
Yeah, that's a... yeah.
It also, this book in style reminds me a lot of The DevOps Handbook, which I think suffered from the same problem. The other thing that I think is worth bringing up is he does make these sort of fictionalized anecdotes of this one SRE that's implementing this thing. And again, the reason I bring up The DevOps Handbook is the same critique that I gave, which was it's always the rosy path.
Carter Morgan (08:36)
Yeah, right.
Yes.
Nathan Toups (08:57)
Like every time it's like, and then there's this clever way that I've introduced this idea that we just read about in the chapter. I think it would have also been useful to have examples of, like, the risk of cargo culting, right? Just saying, hey, I don't know much about OpenTelemetry, but I saw this cool thing at a conference and I'm going to advocate for OpenTelemetry. And it's a disaster because I don't really know what I'm doing. Those are the cautionary tales. Because I've seen this, I've seen this where somebody just
Carter Morgan (08:58)
Right.
Right, right.
Nathan Toups (09:24)
spouts a couple of phrases, and they don't really understand what instrumentation is, or they maybe don't really understand how difficult actually getting useful spans is, for instance. Now the world's changed. There's a lot of great auto-instrumentation that exists now, and you actually can get pretty far with just using the well-supported stuff. And we'll get into the details of what this means, because I think some of these terms that we're using right now are going to be completely alien to someone who's not used to this.
Carter Morgan (09:41)
Mm-hmm.
Nathan Toups (09:54)
But yeah, categorically, OpenTelemetry solves problems that were not solvable before OpenTelemetry, and that's a full stop thing. Before, we didn't have this focus of: we actually care about context. We care about your metrics and your logs and your spans, and we'll talk about tracing and stuff. If we can correlate across those,
there's actually really interesting questions we can ask that we couldn't ask when they were three separate silos. And that's a really powerful idea. I mean, I remember I was doing all three of those things separately, and actually I wasn't, I was doing logs and metrics, which was very common back in the mid-2010s. When distributed tracing became a thing because of microservices, it changed the entire conversation. We'll get into all of that, but I...
Carter Morgan (10:27)
Right.
Nathan Toups (10:49)
I think that was my other critique, though: this is a hard one to do. I don't know how I would approach this book either. And so, you know, maybe it is the curse of knowledge. Maybe it's also that there's a version of this that would be good for a general software engineering audience, and this book's maybe a little bit more focused on an SRE, kind of DevOps person, where they're kind of talking inside baseball a little bit. Because I came to the table with a lot of
Carter Morgan (11:14)
Yeah, yeah.
Nathan Toups (11:17)
this context, right? I was nodding my head and going, yeah, I experienced that. I remember that. So.
Carter Morgan (11:21)
And there are also points where it's trying to explain what an API or an SDK is. And I'm a little like, wait, who's reading this book who doesn't understand what an API or an SDK is? It's a little like cracking open Refactoring and being told, by the way, if you don't know what a computer is... I'm like, right.
Nathan Toups (11:26)
Yeah.
Yeah, that's a good point.
Right. Yeah,
that's interesting. I hadn't thought about that. Or getting into "an array is a sorted list." Yeah.
Carter Morgan (11:42)
Right, right.
Yeah. So, like, I'm with you, obviously this guy Steve Flanders is very qualified, and there's lots of good stuff in here. You know, I wonder, has anyone done a tech blog, like a Substack, kind of like The Phoenix Project but essay style? Instead of having one giant overarching narrative, each
post is a narrative about one specific thing, like someone implementing OpenTelemetry. I wonder if there's a tech blog that's kind of purely narrative like that. Do you know anything like that?
Nathan Toups (12:23)
I experimented with a single post like that a while back, like over a year ago, on Functionally Imperative. And I was like, yeah, I'll write about this more later, and then I never did. It's actually kind of fun. I think, first of all, I'd be surprised if no one has done it. And secondly, we need more of it. So I think both of those are true statements. Yeah. Yeah.
Carter Morgan (12:28)
Uh-huh.
Right, Yeah. Let us know in the comments if you've seen anything like that,
Well, let's get into it. So I guess we've been talking a lot about OpenTelemetry, OTel, or "auto," as the Brits call it, apparently. If you're British, let us know: is that a British thing, or is that just a one-guy thing? But okay, so let's talk a bit about this, because you might be hearing all this like, I don't even know what OpenTelemetry is. So...
Nathan, you've written here, and did you grab this quote from anywhere or did ChatGPT generate it? This: OpenTelemetry is an open source observability framework.
Nathan Toups (13:16)
Yeah, so this was a summary of ideas. So one of the things I want to do a little bit differently is that, because this is such a deeply technical book, I think that we're just going to talk about concepts, the concepts that we learned and things that stuck out to us. So this will be a little more free-form. I don't think we're going to go chapter one, chapter two, things like that, because it doesn't make a lot of sense. I do think, for anybody who... like, this book is about mastering OpenTelemetry. We're not going to teach you how to master that in a podcast, or a couple of podcasts, about it.
Carter Morgan (13:27)
Right.
Right.
I think so, right.
Yes.
Nathan Toups (13:49)
But what I can do is introduce you, what we can do is introduce you to concepts and talk about... I was thinking that maybe there's these three big ones: there's the three pillars of OpenTelemetry, there's this idea of pipelining, and then there's the core architecture stuff. Again, they make this kind of big to-do about the API layer versus the SDK layer and
Carter Morgan (14:02)
Right, right.
Nathan Toups (14:16)
collectors, which is an area that, I remember when I first learned about collectors and what they did for data pipelining, it kind of blew my mind a little bit. And so I think maybe it would be good to touch on that. Have collectors stuck out to you, Carter? Have you had an aha moment about how mind-blowing these are? Or is this like, you look at it and you're like, duh, that's how I would do it anyway? So, yeah. Yeah.
Carter Morgan (14:37)
No, I think there's a lot of great stuff here and I
do have thoughts on collectors. But yeah, I like all of that. So maybe we just start at the high level, the 30,000 feet, and then kind of go there. So you've got written here in our notes, you say: OpenTelemetry (OTel) is an open source observability framework that standardizes how we generate, collect, and export telemetry data, meaning metrics, logs, and traces, from applications and infrastructure. Think of it as a universal translator and collector for observability data.
Nathan Toups (14:42)
Yeah, cool.
Yeah, sounds great. Yep.
Carter Morgan (15:07)
It provides a vendor-neutral way to instrument your code once and send data to any observability backend. So those are the three pillars: you've got your metrics, your logs, and your traces. We're going to talk about those in more detail in a second, but that was really the big revolution with OpenTelemetry: open sourcing the standard and saying, we're going to have one way of sending your metrics, logs, and traces to the backend. And that way, you know...
Nathan Toups (15:22)
Yeah.
Carter Morgan (15:35)
You can, it's vendor-neutral, so you can swap out, like we said, whether they're using Datadog or New Relic or any of the other observability platforms. And if you're using something like Grafana for viewing your data and you've got Prometheus as a data source, Prometheus understands all the data that's coming in. Or I think actually what happens is OTel translates the data into Prometheus's desired format. I can't remember which way that goes.
Nathan Toups (16:05)
Yeah, yeah, well, the spec now is Prometheus-compatible. And so that's the kind of cool thing about it. And that's the nice thing: they very pragmatically have taken lots of stuff. Like, Jaeger had its own method for doing a bunch of distributed tracing. Well, it turns out that there's a small translation piece to it, but a lot of the ideas are there. And then Jaeger decided to just deprecate their
Carter Morgan (16:27)
Right.
Nathan Toups (16:31)
sort of special way of doing it and said, we're just going to use the OpenTelemetry structure. Jaeger is still a web interface that allows you to visualize distributed tracing and look at latency and all this other kind of cool stuff. But the big thing that happened for me, and I've gone into several startups with this... perfect example is Datadog. Love them, hate them, Datadog's pretty amazing, but they definitely take advantage of what they call the Hotel California business model,
Carter Morgan (16:36)
Mm-hmm.
Yeah.
Nathan Toups (17:01)
which is that
once you check in, you can never leave. But one of the things I saw, and it was like a code smell to me, is if the Datadog SDK starts sneaking into your code and you have annotations for Datadog all over the place. It's incredibly useful, but you have the worst vendor lock-in for your business logic. Step one would be to use inversion of control, right? Have some sort of, you know,
Carter Morgan (17:12)
Yes, yes.
Nathan Toups (17:28)
wrapper around it so that you're instantiating the Datadog whatever, and then inside of your code you have these other things. But even better is to not even have that Datadog stuff in there at all, and use OpenTelemetry and shove that into your inversion of control. And now you have what Neal Ford talked about in the evolutionary architecture stuff, of orthogonal coupling, right? So we all agree that we use OpenTelemetry. So there is a coupling in that we, like,
Carter Morgan (17:39)
Right.
Nathan Toups (17:56)
We're using the same language, but none of us have to care about which framework we're using in any language and where it's going at the end. All I do is I say, I instrument my code. I measure metrics, I create logs, I annotate my spans with my trace IDs. And if I do all of these things, I have faith that it's going to be gathered up in some collector and that collector is going to do something that my company cares about.
Carter Morgan (18:05)
Mm-hmm.
Nathan Toups (18:26)
Right? And it takes cognitive load off of the developer. Like, you know, it's your responsibility to do those things, but you don't have to think about, what's the Datadog SDK? What's some special weird incantation that I have to do? And that was a big change, right? Across the board. Again, we had a huge refactoring that we did, and we showed why this was so useful. We were helping teams get off of the Datadog SDK. We were still shipping to Datadog.
Carter Morgan (18:54)
Yeah,
Nathan Toups (18:55)
You know, we still found
a lot of value in Datadog, but we also wanted to have the ability to not have to use Datadog forever if we didn't want to.
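[Editor's note: a minimal sketch of what that vendor-neutral instrumentation looks like, in Python with the opentelemetry-sdk. The application only speaks OTel; where the data ultimately lands is decided by whatever is listening on the OTLP endpoint, typically a local collector. The service name is a hypothetical example.]

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Ship spans to a local collector over OTLP. Swapping vendors later means
# changing the collector's exporters, not this application code.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical name

with tracer.start_as_current_span("process_order"):
    ...  # business logic, with no vendor SDK in sight
```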
Carter Morgan (19:03)
Well, and to me, it's such an interesting example of the victory of the open source software movement. I think if you were to pitch the open source software movement at the beginning, or to someone who's not familiar with software, and say, well, there's all this software that people just contribute to for free and anyone can use it, it's like, why on earth would you do that? Why would a business ever support that? But then you see the big observability providers, like Datadog and New Relic,
Nathan Toups (19:11)
Yeah.
Carter Morgan (19:33)
all start conforming to OpenTelemetry, and you're like, why would they do that? Wouldn't they want you locked into their SDK, into their proprietary way of doing things? But when you all conform to the same standard and you're not competing on standards anymore, now you're competing on feature offerings, and you're making it easier for people to get into your ecosystem. Like, as a team, one of the reasons we had kind of put off observability for a while
Nathan Toups (19:51)
Right.
Carter Morgan (20:02)
was because we really didn't want to make that kind of big scary decision about what provider we were going to use. But then OpenTelemetry lowered that barrier. Like, we don't even need to choose a provider. Right now we're just using Amazon Managed Prometheus and Amazon Managed Grafana, and so we're just kind of sending it there.
Nathan Toups (20:23)
That's awesome.
Because I had great success with... to get a little inside baseball: are you doing a multi-account org, or is it a single, monolithic AWS account right now? Sweet, okay, awesome. So yeah, that's the cool thing about Managed Prometheus and Managed Grafana on AWS: they're natively multi-account. You can look across all these things, and yeah, it's super pragmatic. If you decide later you want to go to Datadog or Grafana Labs or
Carter Morgan (20:34)
No, it's a multi-account org.
Yeah, yeah.
Nathan Toups (20:52)
Coralogix, there's all these cool vendors. If you're using OpenTelemetry, you literally just change the sinks in your collector, or you could dual ship. That was something we did where we wanted to de-risk: we continued to use Google Cloud services, but we also wanted to trial a new vendor. And so we added a new sink and dual shipped, shipped to both places.
Carter Morgan (21:01)
Yes.
All right, right.
Yeah,
Nathan Toups (21:21)
Yeah, it's amazing. It actually gives you a little more... it's like liquidity in the market. You have more options. Yeah, it gives you optionality, and it also makes the vendors have to be awesome, right? You have choices, and you're not stuck in something where you have to do some multi-year refactor to get out.
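[Editor's note: a sketch of what changing or adding a sink looks like, as OpenTelemetry Collector config in YAML, the Collector's native format. The endpoints are hypothetical, and exporter names vary by Collector distribution.]

```yaml
exporters:
  otlphttp/vendor:                  # hypothetical new vendor being trialed
    endpoint: https://otlp.vendor.example.com
  prometheusremotewrite:            # existing metrics backend
    endpoint: https://prometheus.example.com/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      # dual ship: the same data fans out to both sinks
      exporters: [prometheusremotewrite, otlphttp/vendor]
```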
Carter Morgan (21:28)
Yes, optionality as a,
Right. But now that we're in it, now that we're gathering our metrics through OpenTelemetry, I think we're now looking more seriously at, okay, do we need a real-deal observability provider, and who should we go with? And so, yeah, I think it helps the observability providers compete where they're actually competitive, rather than maintaining their own kind of standards. And then also, for them, it helps with switching costs, right? You kind of see the opposite with
Nathan Toups (22:04)
Great. Yep.
Carter Morgan (22:12)
AWS versus Azure, where it's kind of, theoretically... have you ever seen anyone do this? Because I know one of the big advantages with Terraform, theoretically, is that you can make it cloud-neutral. I've never seen that actually happen.
Nathan Toups (22:24)
This is the...
I've had deep conversations with other friends of mine that are SREs. Everyone wants this. I've very rarely seen it, because the thing is, it's basically like saying Linux has driver support for all the major graphics cards or whatever, right? You're like, yeah, kind of, unless you have some proprietary thing. To make Terraform multi-cloud, you basically have to have a module abstraction
Carter Morgan (22:34)
Right.
Mm-hmm.
Nathan Toups (22:54)
that kind of abstracts that away. And so it's non-trivial, because you are writing basically custom drivers. Yes, both cloud vendors have a blob store, but the way that you set up permissions might be slightly different. And so you might have to make a blob store module, and now you kind of don't get native use of either one. So there's some thought that goes into it. You can do it, but it's a non-zero cost.
Carter Morgan (22:54)
Yes.
Yes.
Nathan Toups (23:22)
Technically, yes, if you can describe the work in a way that is API compatible across cloud providers. And most of them, I will say, most of them have some version of key management service, which you have to have. Most of them have Blobstore. Most of them have some sort of API cloud gateway type thing, a Kubernetes layer. If you have managed databases, all of them have kind of a flavor of those things. But perfect example, Google Cloud.
doesn't have an equivalent to SES, the Simple Email Service. Google Cloud's like, yeah, we don't want to be in that business, go use SendGrid or something, you know? And you're like, okay, well, I guess I can. Okay. And so there are some things that are not one-to-one, or there's no equivalent to DynamoDB, right? DynamoDB is this magic special sauce that AWS has. Or some of the weird global-scale time series databases that Google has. You're not going to replace, you know,
Carter Morgan (23:53)
really? Yeah.
Right. Yeah.
Right, right.
Nathan Toups (24:20)
that with something that a cloud provider natively has; you might have to move to something like CockroachDB. There are these edge cases where you're like, well, I can't exactly move. I almost can. It's like 95% overlap. But yeah.
Carter Morgan (24:36)
Yeah, it's...
right. Like, I've never... and it's so much work. It's so much work to be like, okay, let me abstract all of my Terraform so I can swap it out, so if one day in the future we decide to move from AWS to one of the two other options, it's a little easier. It's kind of like, guys, if we're going to switch cloud providers, it's going to be a nightmare no matter how you do it. OpenTelemetry is on the other end of that, where actually
you get a lot of value immediately by just instrumenting your application with the OpenTelemetry SDK and doing everything with the OpenTelemetry configuration. And then it makes it very easy to swap out providers underneath. So it's like... I don't know. I've never been like, okay, just give me Datadog, and I'm just going to start with Datadog from the ground up. I could imagine a scenario where
it's easier to do that, because Datadog wants you to be able to do that, but it's not that much easier than just doing OpenTelemetry, and the benefit you get from not having that vendor lock-in is huge. So.
Nathan Toups (25:39)
Right?
One thing that surprised me, I did not know this detail. So there's the Cloud Native Computing Foundation; it's like anything that goes into the Kubernetes ecosystem. And OpenTelemetry is a big part of the Kubernetes ecosystem. You don't have to use Kubernetes to use OpenTelemetry, that's a really important distinction. But it's the second most popular project, other than Kubernetes, in the Cloud Native Computing Foundation, which
Carter Morgan (25:52)
Yes.
interesting.
Nathan Toups (26:12)
Yeah, and so it's had this meteoric rise, because again, it's one of those where, in retrospect, this is the obviously correct answer, the correct approach to take here. But I remember being very proud about... I remember when I really thought about metrics. So in the olden days, everybody basically grabbed logs, and you would extract metrics out of logs. That was kind of a thing. You would use some MapReduce-type stuff,
and you would do some regex and pull out the number of successful requests out of your logs that would come in. And we knew that that wasn't an efficient or good way of doing it. And so this idea of metrics, which has things like gauges and counters and all these other things, would literally just give you a snapshot at request time. It would just... Prometheus is a good example of some history.
Carter Morgan (26:47)
Right, right.
Nathan Toups (27:01)
This is a good one. I think it's a good segue: push versus pull. This actually comes up in the book. So for a long time, you would ship
your observability data, meaning you'd use Munin or some of these other tools, and you would point it at something externally. As I was generating logs, I would ship those off to some log endpoint. If you've ever heard somebody talk about the ELK stack, that's Elasticsearch, Logstash, and Kibana. That was a very popular observability stack that people ran for a long time.
Elasticsearch was full-text search. People have used that for all kinds of things, but we'd also use it for logs, and it was what powered the stack. Logstash was this way to very powerfully go through your logs, and then Kibana is the visualization piece, with dashboards and visualizations and stuff. But you would ship it off to that thing. Companies like Google and Facebook and some of these other places were generating so many logs that they would actually cause
denial of service on their own services with the back pressure of trying to ship things off. You can imagine the control flow: if the collector got too backed up, it would not accept new traffic, and then they wouldn't be able to ship stuff. So they ended up scraping. So with Prometheus, if you look at traditional Prometheus stuff, you actually expose a metrics route, and the Prometheus scraper will come in and hit the metrics
Nathan Toups (28:37)
endpoint and grab a snapshot. So at whatever frequency it's pulling in stuff, it's updating its database. There's this whole thing here: OpenTelemetry allows both push and pull. Pull is nice because, as the thing is available, it'll go out and find these things. And if you have Kubernetes, you can have things like service discovery, so things would auto-register and Prometheus would then start scraping from them. It's a cool pattern if you have a Kubernetes environment.
Carter Morgan (28:52)
Right.
Nathan Toups (29:08)
But OpenTelemetry... oh, the reason I'm getting back to this is that the collector pattern and the agent pattern that we'll talk about with OpenTelemetry shifted, because all of a sudden, registering and then scraping really lost its meaning when you got into a world of serverless. When you had Lambdas that live for maybe 200 milliseconds and then put themselves away, there's no way for a scraper to come out and grab that in time. And so they're like, maybe we should ship these off again. So we've gone full circle.
Carter Morgan (29:26)
Right, right.
Absolutely.
Yeah.
Nathan Toups (29:37)
And so anyway, I wanted to give that background, because OpenTelemetry introduces not just the three pillars, the types of messages or signals that we're sending, but also some really creative ways to gather those signals up and ship them off.
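[Editor's note: a minimal sketch of the two models in Python, assuming the opentelemetry-sdk, opentelemetry-exporter-prometheus, and OTLP exporter packages; the port and endpoint are placeholders.]

```python
from prometheus_client import start_http_server
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Pull: expose a /metrics endpoint for a Prometheus scraper to hit.
start_http_server(9464)
pull_reader = PrometheusMetricReader()

# Push: periodically export over OTLP to a collector -- the model that still
# works for short-lived things like Lambdas, where no scraper would catch them.
push_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="localhost:4317", insecure=True),
    export_interval_millis=10_000,
)

# One provider can feed both models at once.
provider = MeterProvider(metric_readers=[pull_reader, push_reader])
```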
Carter Morgan (29:54)
So let's talk about those three pillars: metrics, then logs, and then traces. Because I feel like traces is the one that people don't understand as much. Or maybe let's start with logs, just because if you're in programming, you understand what logs are, right? You're logging to the console, you're logging information about your system. You should be doing structured logging.
Nathan Toups (30:08)
Yeah, everybody gets logs, right? Yeah.
Carter Morgan (30:22)
Right? Instead of just printing a random line to the console, you should be using a logging library, like, you know, Log4j is the big one in Java, one that automatically attaches to each log the timestamp, the host it's running on, lots of information about it. That'll make your life a lot easier.
Nathan Toups (30:40)
Great.
Exactly. We call
those enrichments. Enrichments are the context around it. You'll see this in a structured log too. If you have a good log platform, this will let you filter by, let's say, if you're running a Kubernetes cluster, maybe I want to look for the logs on a particular pod. So I get the pod ID, or a particular cluster that I'm running. I can slice and dice by those, or by log level, you know, these other pieces.
Carter Morgan (31:08)
Mm-hmm.
Nathan Toups (31:09)
And this was huge, right? When you started doing this, it gave you the ability to run structured queries on top of your structured logs. And I think most people who've spent maybe five minutes thinking about it are like, yeah, this is a good thing, this is a net positive. I think we've all seen those bad logs. I remember doing some Python data pipelining where the logger would literally just take whatever came out of standard out. And so
it would do these multi-line logs, but each line would show up as a separate timestamped record, and you're reading it backwards because that was maybe the way it was sorted. You're reading a stack trace upside down and you're like, this is the worst, this is the worst thing ever. Structured logs at least shove all of that multi-line stuff inside of the message portion, and you can actually make heads or tails of it.
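[Editor's note: a minimal sketch of structured logging with enrichments, using only the Python standard library. The service and pod names are hypothetical examples of the kind of context a platform would attach.]

```python
import json
import logging
import socket

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, with enrichment fields attached."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            # multi-line text (stack traces, etc.) stays inside this one field
            "message": record.getMessage(),
            "logger": record.name,
            # enrichments: context you can filter and slice by later
            "host": socket.gethostname(),
            "service": "checkout-service",        # hypothetical
            "pod": "checkout-7d4f9c-xkz12",       # hypothetical
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
logging.getLogger(__name__).info("payment authorized")
```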
Carter Morgan (31:36)
Yep. Yep. Yeah.
And so OpenTelemetry supports sending... you know, it has an API and a specification for logging, for sending logs. I will admit, I've not done a ton of work with this. I think for a lot of people, with logging, you get pretty robust logging capabilities out of the gate with AWS, between CloudWatch Logs, log groups, and Logs Insights. They have a pretty decent, proprietary,
I think they call it the CloudWatch Logs Insights query language, for searching through your logs. So I have not felt the need to get this all set up the OTel way, necessarily. Have you ever done that?
Nathan Toups (32:36)
Yeah, we ended up having everything go in. And again, if you think of the pipelines, you can imagine OpenTelemetry acting as sort of a middleware between your application and the destination. So you can just say: log to standard out, use structured logs, and then in AWS you tell it, okay, anything that's in standard out and standard error, ship off to our logging system, CloudWatch, right?
Carter Morgan (32:41)
Mm-hmm.
All right.
Nathan Toups (33:06)
And that's fine, that's totally fine. But as soon as you want to start doing special things, and this is where the collectors and some of the processors inside of them come up, let's say you wanted to have a consistent way to filter out personally identifiable information, right? Or maybe there are certain types of logs that you never wanted to have in your long-term system. In your collector, you can set these collector-side rules that
do some sort of post-processing outside of your service. And a lot of the time, companies with a good security posture will have these filters in the application itself, right? We put a bunch of asterisks on top of social security numbers or something so they never show up in logs. But you can also do this in the collector: the security team, your DevSecOps folks, can make another rule that says, hey, even if a programmer makes a mistake, we'll make sure that if it
matches this regex, we will filter it out. And so you can do this for logs too. We actually did this: I did a SOC 2 Type 2 compliance thing, and we had a bunch of collector pipelining logic just to be extra sure that we were handling data properly. You can also do this thing called sampling, though I've not seen that as commonly with logs in the collectors, but I'm sure you could do it. I think we'll get into that more with metrics and traces,
because those can get really verbose. But yeah, the other cool thing about OpenTelemetry is that you don't have to have everything go through the collector to use OpenTelemetry, right? You just need a consistent way to annotate things. And I think we'll loop back to logs after we talk about traces, because there's actually an important relationship between the two.
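[Editor's note: a sketch of that collector-side safety net, as OpenTelemetry Collector config in YAML. This assumes the contrib distribution's transform processor and OTTL; exact processor names and syntax vary by version.]

```yaml
processors:
  transform/redact:
    log_statements:
      - context: log
        statements:
          # belt-and-suspenders: mask SSN-shaped strings even if an
          # application-side filter missed them
          - replace_pattern(body, "\\d{3}-\\d{2}-\\d{4}", "[REDACTED]")

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [transform/redact, batch]
      exporters: [otlphttp]
```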
Carter Morgan (34:45)
Right, right.
Yes.
Well, so let's talk about metrics. Metrics, I think, is what people classically think of when they think of OpenTelemetry, when you think, okay, we need observability into the application. Again, most applications are going to have logs, and it's not very hard once you ship to your cloud service. I mean, a lot of places, especially if you're using a more managed deployment infrastructure, like
Elastic Beanstalk, which is obviously very old news at this point, but if you were to ship with Elastic Beanstalk, they take care of logging for you, right? Logging is the very first thing that a cloud service is going to help you take care of. Metrics are more complicated. And with metrics, I think that's when you start to think, okay, I'd like to have more observability into the platform. And so you might be thinking, well, Carter, I get metrics automatically
with AWS. And you do get some metrics. With AWS, for example, if you're using target groups with a load balancer, AWS will tell you, okay, this is how many requests were made to the target group, and this is how many of those were 500 errors or 400 errors or 200 errors. It always gives you that right out of the box. That's great. But what if you want to go lower level? What if you want to understand, okay, how many failures did I get at this particular endpoint?
Nathan Toups (36:01)
Yep.
Carter Morgan (36:23)
What's the latency on this particular endpoint? We're using GraphQL, a lot of thoughts on GraphQL. And with GraphQL, if you want to say, okay, what's my most common GraphQL operation? What are my most common resolved fields? How long are those resolved fields taking? What's the P99, the 99th percentile latency, on those fields? That you can't get out of the box with AWS. And that's when you have to start instrumenting
your application on its own. Then you can go even further: what if I want very custom metrics? What if I want to know every single time a customer signs up using a particular promo link, right? Then you've got to start instrumenting your code more explicitly, by actually using the OpenTelemetry SDK within the code itself, right, instead of a configuration layer. So that's
where metrics come in handy. There are kind of three big metric types. You have counters, and counters tend to be what they call monotonically increasing, which means the value of a counter only ever goes up. And that can be a little funny if you're looking at it, because, for example, HTTP requests is a counter. So you want to know, okay, how many requests is this endpoint getting? But if you look at just that metric
on its own, it's only ever going to be going up until the system restarts, and you're going to go, what the freak is going on? I want to know, I got three requests these past 10 seconds, but I'm seeing that my counter is going from three to six to nine to 12. And that's just how... I don't know. Do we know why, like, the design decision behind that? Because it really freaked me out at first.
Nathan Toups (38:18)
Yeah, and it's interesting. There's
always a trade-off. If you are able to express it in a way that's stateful and not eventually consistent, you actually know atomically the correct value for this. Having the one that just adds up over time makes the query very simple: if I look at a span of time, I don't have to do an aggregate lookup.
I can actually just say, what's the total storage used over the last 30 minutes? And then let's say we sample this once a minute, so I should have 30 data points, and it tells me the total storage used, and that data should go up. And then there's also the one where you just record that an event took place. I can say that there was a new signup, so I emit "new signups."
Then when I do a query, say, how many new signups did I have in the last 30 minutes, it's going to go back and look at the last 30 minutes and give me the sum of that. It'll do that post-processing. And you'll see this in a dashboard: we'll look at the number of subscriptions that we got on YouTube, and I can slice and dice this by the number per day, the number in the month, are we trending up or down?
Carter Morgan (39:38)
Right, right.
Nathan Toups (39:43)
If I ask it just the total number of subscriptions, that's maybe a metric that it keeps. But because you don't want to have to go back and sample every piece of data since the beginning of time to get that, versus if I want to know the number of subscriptions I got in the last 30 days, the easiest thing to do is just to go and ask for every new subscription metric event and then just sum that up, right? Or I could ask it, give me the average sign-up
Carter Morgan (39:54)
Right, right.
Right.
Nathan Toups (40:09)
per day for the last 30 days. And so then it would just tell me, you know, the average amount, maybe that's the people who signed up minus the people who unsubscribed or whatever. Yeah, metrics are important for a lot of things. In the traditional sense, you'd probably think of something like CPU, right? CPU is an excellent example where you're getting a percentage; it's gonna tell you 0.8, right? Or 0.1. And that's the actual sample in that moment.
Carter Morgan (40:29)
Right.
Nathan Toups (40:38)
that data point is meaningful, right? 80% of my CPU is being used at this moment, or 100%, or 120%, whatever, I'm actually backed up. These are the kinds of things that are meaningful in that case. But that might also give you a good example: if I look at a moment, I see that it's at 100% CPU utilization. But if I actually look at the last 15 minutes, I can see whether it was a spike versus whether it was sustained.
A lot of times you're gonna end up doing... and this is where we start getting into the world of, if you get into metrics, the more you know about statistics, the better your job will be. Because there are a lot of statistics, like P95 versus P50, these other things, that will tell you aspects of your data that are important, right? Like, are a lot of people experiencing long latency, or is it really that
Carter Morgan (41:19)
Right.
Nathan Toups (41:34)
95 % of people are not experiencing long latency, but 5 % are experiencing horrible latency, right? Like, and maybe I really just, I care about that 5 % that, you know, most people are getting sub 200 millisecond responses, but 5 % are getting six second responses, right? If I look at the average, it looks like everyone's getting second and a half responses. And you're like, well, that's not a meaningful number. Most people are actually experiencing great service.
Carter Morgan (41:40)
Right, right.
Nathan Toups (42:00)
but something's happening where 5 % of people are getting six second response times, right? Like that's bad. And so you, again, we can trick ourselves with the numbers if we're not careful. And I think ⁓ it's a whole rabbit hole. Like you could, we could probably find a book and do some deep studies on like just statistical analysis of your metrics.
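[Editor's note: a tiny worked version of that example in Python, using only the standard library: 95% of requests are fast, 5% are terrible, and the average hides it.]

```python
import statistics

# 95 fast requests at 200 ms, 5 terrible ones at 6000 ms
latencies_ms = [200] * 95 + [6000] * 5

print(statistics.mean(latencies_ms))    # 490.0 -- the average masks the pain
print(statistics.median(latencies_ms))  # 200.0 -- p50 looks great
p99 = statistics.quantiles(latencies_ms, n=100)[98]
print(p99)                              # 6000.0 -- there's the real story
```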
Carter Morgan (42:21)
Yeah. Well, and that's what's so interesting, because with counters, for example, Prometheus expects your counter to be monotonically increasing, to always go up. Now, granted, when you kill a host and stand up a new host, OpenTelemetry isn't going to keep track of what the metric count was on the previous host; it's going to set it to zero.
But Prometheus expects that. So if you look at the graph of your metric, you'll see it going up and up and up, and then you'll do a deployment and it goes down. You're like, what the freak is that about? Prometheus understands that and knows how to handle it. I don't know what magic it does in the backend, but it does. I found out, though, that CloudWatch metrics does not understand that. And so I had to set up, and we're going to talk a bit about processors when we talk more about the agent and the collector,
Nathan Toups (43:10)
interesting.
Carter Morgan (43:18)
but I had to basically say, okay, when exporting to CloudWatch... I have these kind of dual export pipelines. I say, when you're going to Prometheus, just do it like normal. And when you go to CloudWatch, I have to use this special processor called cumulativetodelta, which says: don't ship the cumulative metric, the one that's always increasing; instead, ship the delta between each period. And that helped CloudWatch, because otherwise, with CloudWatch, you want to take a rate. You basically want to say, okay, what's my
Nathan Toups (43:40)
Interesting. Yep.
Carter Morgan (43:48)
error percentage, right? Per minute or whatever. So you do that by taking a rate of the requests, and then the rate of the errors, which of those were errors, and dividing one by the other, right? But that rate doesn't understand when those metrics drop. And so you wind up with this really weird thing where it thinks you had negative requests per second, because it's just saying, okay, well, before, I was at 100, then 200, then 300. So
Nathan Toups (44:11)
Right.
Carter Morgan (44:18)
it's doing good, like, okay, I'm at a hundred requests per second. But then it dips down to zero because it reset, and it's like, okay, the past minute we had negative 300 requests per second. You're like, what? That doesn't make any sense. So I had to switch it up, and now I just get that steady 100, 100, 100.
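[Editor's note: a sketch of that dual-pipeline setup as Collector config in YAML, assuming the contrib distribution's cumulativetodelta processor and the awsemf exporter for CloudWatch; names and endpoints are illustrative.]

```yaml
processors:
  batch: {}
  cumulativetodelta: {}   # convert ever-growing counters into per-period deltas

exporters:
  prometheusremotewrite:
    endpoint: https://prometheus.example.com/api/v1/write
  awsemf: {}              # CloudWatch, via Embedded Metric Format

service:
  pipelines:
    metrics/prometheus:   # Prometheus handles cumulative counters natively
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    metrics/cloudwatch:   # CloudWatch gets deltas so the rate math behaves
      receivers: [otlp]
      processors: [cumulativetodelta, batch]
      exporters: [awsemf]
```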
Nathan Toups (44:31)
Have fun.
So yeah, metrics
become really important, right? There's systems-level stuff that you might really care about, like you wanna avoid out-of-memory errors, and you watch the average memory usage. Then there's application runtime stuff, where maybe I care about request latency or some of these other things that are good to keep track of. And then there's straight-up business logic stuff. Maybe you care about, you know, like the
Carter Morgan (44:46)
Right, right.
Nathan Toups (45:00)
number of signups or something like this that's actually a good, useful business thing. Maybe a signup represents the fact that they made it successfully through an entire pipeline or a funnel, right? You actually had to get them to do five steps, and a signup is an indication that they actually converted, right? Your business team would really appreciate having access to some of this data. And you might even wanna do interesting correlations with, like,
Carter Morgan (45:11)
Mm-hmm.
Nathan Toups (45:31)
user churn or something. Like, have our signups gone up, but the amount of time spent on the website gone down? Or, you know, these other kinds of pieces: maybe our conversion pipeline is really high, but what we're offering, or how we got those people to sign up, isn't compelling, right? There's stuff that you can overlap between the business needs and this other stuff. And again, it's so that you can ask questions of the data.
You could also ask, okay, they get through the signup, but then do they also get an error message? We can look at the correlation between some other pieces of the application. Or if they do it in this particular order or sequence, that causes an error condition. They used the application in a surprising way: they did the signup page, but then they hit the back button, then they signed up again, and now they have an error. There might be some weird things that we just hadn't thought about before.
Carter Morgan (46:23)
Yeah.
And then, aside from counters, counters are kind of the big one, you have gauges. Now, gauges are reporting just one value, and that value does change. So CPU usage is a great example there, right? Where, like,
Nathan Toups (46:30)
Mm. Yep.
Right, yep, yep, memory usage, CPU
usage, disk usage, things like that.
Carter Morgan (46:44)
Right.
Right. You wouldn't want that to be increasing. You just want to know, okay, I want the snapshot of what this was at this moment. And then histograms. Histograms are useful for things like request latency. What a histogram will actually do is bucket things out. So for request latency, you've created these 10 buckets or whatever: 10 seconds, five seconds, three seconds, two seconds, one second, half a second, whatever.
And then for every request that comes in, it will update those buckets. And it kind of looks funny, because it's like, okay, how many requests were under 10 seconds? Let's say you had 10 requests in the past minute. So 10 go under that bucket. And then 10 were also under three seconds, and 10 were also under two seconds, but only seven were under one second, and then three were under 500 milliseconds, right? And then
that can be really helpful, because that helps you determine your P99 latency. You can figure out, my 99th percentile performance, what was that? And you can either explicitly define the buckets, or you can use, what are they called? Exponential bucketing, which kind of automatically defines the buckets for you.
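[Editor's note: a minimal sketch of the three instrument types in Python with the opentelemetry-sdk, including a View that pins explicit histogram buckets; the instrument names and attributes are hypothetical.]

```python
import resource  # Unix-only, used for the gauge example below

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.view import View, ExplicitBucketHistogramAggregation

# Explicitly define latency buckets (in seconds) for the histogram.
views = [View(
    instrument_name="http.request.duration",
    aggregation=ExplicitBucketHistogramAggregation(
        boundaries=[0.5, 1.0, 2.0, 3.0, 5.0, 10.0]
    ),
)]
metrics.set_meter_provider(MeterProvider(views=views))
meter = metrics.get_meter("checkout-service")  # hypothetical name

# Counter: monotonic, only ever goes up (resets to zero on restart).
signups = meter.create_counter("signups")
signups.add(1, {"promo": "podcast-launch"})  # hypothetical attribute

# Histogram: each recorded value lands in the buckets above.
duration = meter.create_histogram("http.request.duration", unit="s")
duration.record(0.42, {"route": "/signup"})

# Gauge: a point-in-time value, read via callback at each collection.
def read_rss(options):
    rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    yield metrics.Observation(rss_kb)

meter.create_observable_gauge("process.memory.rss", callbacks=[read_rss])
```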
Nathan Toups (48:06)
Yeah, it allows you to, again, slice and dice. And this is the interesting thing too: if you know what the shape of your data should look like, and you can store it in that format, there are some nice performance efficiencies. A lot of these things can also be converted and aggregated over time, right? If you're actually just grabbing these individual events in your metrics,
I can turn that into a counter, right? I can turn that into "only give me this growing piece," or I can make a histogram out of other pieces. But I think maybe this is something we should bring up too. OpenTelemetry encourages this, and this is still how it is: logs typically go to a log-optimized database, right? So yes, we're doing OpenTelemetry, yes, it goes through a collector, but you're going to put it into something that's made to efficiently get that data in and out. Metrics as well.
Metrics typically have their own sort of specialized database, and those are typically some very optimized time series databases. And have you heard about the "looks good to me" stack, the LGTM stack? Yeah, I hadn't heard this till recently. It's really funny. LGTM is the Grafana Cloud stack; they call it logs, Grafana, traces, metrics. So, LGTM.
Carter Morgan (49:21)
No.
Interesting.
Nathan Toups (49:34)
But it's a good way of thinking about it, regardless of whether you're using Grafana or not. Basically, you're going to have some sort of log store and query system. The G is Grafana, which is some sort of dashboarding tool, and a dashboarding tool typically will talk to those different database types. For tracing, I think Grafana uses one called Tempo, but there's lots of others,
several other databases that are pretty common. Then for metrics, this would be like Prometheus or Mimir or TempoDB; there's a bunch of places that you can throw metrics data. And then Grafana is tooled out to query all of these things; you can grab stuff out of them. And so while we are talking about OpenTelemetry being this one thing, if you look at most of the diagrams,
it understands that it's shipping off to various backends. You're not shoving metrics and logs and tracing, which we'll get into in a second, all into one big database. It's not something like that, not typically. Yeah.
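[Editor's note: that fan-out is visible right in a Collector config, where each signal type gets its own pipeline and its own backend. A sketch in YAML, with Grafana-stack backends (Loki, Tempo, Prometheus) as the assumed destinations and hypothetical endpoints.]

```yaml
exporters:
  otlphttp/loki:                    # logs -> a log-optimized store
    endpoint: https://loki.example.com/otlp
  otlp/tempo:                       # traces -> a trace store
    endpoint: tempo.example.com:4317
  prometheusremotewrite:            # metrics -> a time series database
    endpoint: https://prometheus.example.com/api/v1/write

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [otlphttp/loki]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```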
Carter Morgan (50:36)
Right.
Right, right.
Well, so let's talk about traces. These are interesting. I feel like everyone has logs, right? A lot of places are going to have metrics. But traces, traces I have seen less. Of all the companies I've worked at, I've only had traces at one of them, and even then they had kind of implemented their own in-house solution for it. It was, you know, one of the cloud providers.
Nathan Toups (50:47)
Yeah.
Carter Morgan (51:11)
So traces are, yes, distributed request tracking across services. This shows the journey of a request through your entire system. They're composed of spans, which... what is a span, exactly, Nathan?
Nathan Toups (51:12)
This is the white whale.
Yeah, so here's the fundamental building block. When a request comes in (we'll start with web servers and stuff, but you can use this in other contexts; we're going to talk about a REST API kind of server), the service asks: do I have a trace ID? If it doesn't, it creates one. Then, for the rest of the data flow, it annotates everything with this trace ID, typically in a header. Again, for a REST API, we're talking about adding stuff to the headers that travel between requests. So a trace ID is basically saying: hey, there's some unit of work end to end, and I'm going to attach a trace ID to it so that we can start doing traces. But the trace ID by itself doesn't tell us anything. What we really care about are the spans. If you've ever done systems work with pprof or similar tools, you've seen flame graphs: you look at a function call, you see all its child function calls, and it grows into this shape that looks like little flames. There's the largest call waiting for all of the little pieces, and once all of those return, the span of the function call ends. A span measures some unit of time, and it contains all the children that come out of it. So anytime I do something, and then maybe call another service or make another request, anyone in that chain can take the trace ID, take the parent span if there is one, and create a new span. And then I can reconstruct the entire path. I can take this whole annotated, nested tree and actually say: cool, across this distributed system, I understand how the data flowed, where the latencies were, all these other pieces. It lets you ask questions. Again, imagine having flame graphs for distributed systems; I think that's the easiest way to think about it.
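(As a sketch of what that looks like in code, here's the parent/child span structure with the OpenTelemetry Python SDK. The service and span names are invented, and a console exporter stands in for a real backend; in production the SDK would also propagate the trace ID downstream in the W3C traceparent header.)

```python
# Minimal sketch: one parent span with two children sharing a trace ID.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

with tracer.start_as_current_span("handle_request") as parent:
    # Everything opened inside this context inherits the same trace ID,
    # which is what lets the backend reconstruct the whole call tree.
    with tracer.start_as_current_span("query_database"):
        pass  # time spent here becomes a child span
    with tracer.start_as_current_span("call_downstream_service"):
        pass
    print("trace id:", format(parent.get_span_context().trace_id, "032x"))
```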
Carter Morgan (53:40)
But this is really valuable in distributed systems, right? I remember when I was at the cloud provider, we had that, and you could track and see: okay, how long did this request take in each part of the system? It can be helpful even in a monolith environment, because it can still tell you, say, you've got to go to the database, right? So it can tell you how long the request is taking at the database. But as far as traces go, they're definitely most helpful in a larger, more distributed environment, and maybe that's why you don't see them as much on other teams.
Nathan Toups (54:17)
Yeah,
I think the pain had gotten so high. And again, if we went back and looked at some of the fundamentals of software architecture and some of the other books, where they start advocating things like event-based systems, you're just dead in the water if you don't have distributed tracing in place. I will say it's still useful for a single monolith, though, and I have a good example here. I was actually helping a team that was struggling. We actually
Carter Morgan (54:35)
Right, right.
Nathan Toups (54:46)
set up OpenTelemetry and used Datadog as the backend. And there were software engineers who had never seen the visualization of some of these functions, ever. We found some places where, because of a misconfigured database, the reason this service was taking over two seconds was that it was actually having to retry connecting to a database three times, every time. It would never be successful until the third retry, because there was a latency race condition problem, but no one had ever seen it. Some people kind of suspected maybe something was going on, but there was no investigation. And we had a picture. We had a picture that we could show people and say: every time you make a request, it fails twice, it succeeds on the third time, then it goes off and does this thing. And that was just on a single service; it wasn't even going off and doing other things. They fixed the bug within a day. We were able to ship it. It was
Carter Morgan (55:15)
Interesting.
Nathan Toups (55:44)
really kind of crazy that they'd been struggling with this for, I think, a couple of years, which is insane. But it was one of those things where the system was so complex that two seconds of latency in this one very complicated pipeline wasn't really noticed, even though it had a huge impact on available resources and other things. Where it gets really useful, though, and this is where OpenTelemetry becomes magic: if you put the trace ID in your metrics, and you put the trace ID in your logs,
Carter Morgan (55:48)
Wow.
Nathan Toups (56:14)
you can now find the associated metrics and logging for a set of traces. So I can actually correlate: I can say, okay, something funky is going on in this call, look at all the tracing stuff, and then say, give me the logs that were associated with this. I can do that now, right? Because the trace ID was associated with it. It's really powerful. You can just dig right in; for troubleshooting, for trying to figure out production issues, it will absolutely change the entire way you think about stuff.
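(A minimal sketch of that correlation trick in Python: a logging filter stamps the current trace ID onto every log record, so the log backend can join log lines to traces. The logger name and format are illustrative.)

```python
import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    """Attach the active OpenTelemetry trace ID to every log record."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter("%(asctime)s trace_id=%(trace_id)s %(message)s"))

logger = logging.getLogger("app")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("segment processing started")  # carries the trace ID when one is active
```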
Carter Morgan (56:49)
Yeah, I mean, this is huge, because when I was at the cloud provider, we couldn't run our application locally. You could do some amount of local testing, but if you were working on a complicated flow and needed to test it end to end, you just had to use the staging environment. So they had set up tracing across all of the different services within our division at the company, and it was so critical for debugging, because you'd say: okay, send out the request, I have my request ID, right? And then I'd put it into the system. I didn't recognize that's what I was working with at the time, but when I was reading through this, I realized: oh, that was a trace. I was using tracing back then. And yeah, absolutely critical to understand, okay:
Nathan Toups (57:34)
Yeah.
Carter Morgan (57:43)
Where's my request going? How is it interacting with my code?
Nathan Toups (57:46)
I forget the name of it, but AWS has this thing, their distributed state machine service. I forgot what it's called, but it uses X-Ray. There's this distributed state machine where you basically draw out this big DAG, and any step in it can use Lambdas, Fargate, anything, all inside this big state machine workflow thing. And
Carter Morgan (57:55)
X-Ray?
Okay. Okay. Okay.
Nathan Toups (58:13)
even if you didn't have X-Ray, you just literally turn on a switch and it auto-instruments your tracing. You can put more spans inside of it if you actually use the SDK; they give you context and you can add more spans. But even if you don't touch it, it'll at least give you spans for when the Lambda was called and when it wrapped up, and when the next thing comes in. And again, the visualization: if you've never seen what the visualization for distributed tracing looks like, it will blow your mind. The first time I saw it, I was like, I can just look at this? It's really cool to think about what the instrumentation pulls off. And once you've seen it, and you get to a certain part of an application where you don't have it, you'll feel a pull and you'll be like, I want to be a part of making this happen. Because
Carter Morgan (59:06)
Yeah.
Nathan Toups (59:10)
Again, when you need to troubleshoot something or reason about a part of the system, you're like, I know something's wrong, but I don't know where. And then you go: why is this thing taking six seconds? Oh, it's mostly in this area. Why is it taking that long there? Oh, it's because of this. And all of a sudden you can just pinpoint it.
Carter Morgan (59:29)
Yeah, it's really, really cool. So maybe let's take a step back now. We've talked about our metrics, our logs, our tracing, and we understand that OpenTelemetry rolls all that up: they've created a standard, and that helps you avoid vendor lock-in. So what does the flow look like at this point? You have your code, and you're going to outfit it with instrumentation using the OpenTelemetry SDK, right? And the OpenTelemetry SDK often works in a way where you can inject it almost at the base layer and it does this automatic metric generation for you. And then if you need something more detailed, like we talked about, say you want to know if someone signed up with a specific promo code, or if someone made it through a user flow. At my last company, we did processing for what we called audience segments; basically, it was this workflow process that would determine which customers had entered a particular audience and which had left it. And if one of those segments failed, we would emit a metric, so we would know: okay, segments are failing. So the OpenTelemetry SDK is going to gather all those things from your application and emit them, and it's going to send them out using what's called OTLP, the OpenTelemetry Protocol. And that is then going to send it, and here's where you have options, because
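(Here's roughly what emitting that kind of custom counter looks like with the OpenTelemetry Python SDK, exporting over OTLP. The meter name, metric name, attribute, and endpoint are all made up for illustration.)

```python
# Sketch: count audience-segment failures and ship them over OTLP.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True)
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("segment-processor")            # hypothetical name
failures = meter.create_counter("audience_segment_failures")

# Somewhere in the processing loop:
failures.add(1, {"segment": "high-value-customers"})      # illustrative attribute
```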
Nathan Toups (1:00:34)
Yep. Yep.
Carter Morgan (1:00:54)
that's going to go either directly to your backends, your data sinks, right? To your Prometheus, or your X-Ray, or, you know, your Loki for logging. Or it can go to a collector, and the collector then sends it to those backends. But let's talk a bit about the collector. What is the collector doing? Why is it valuable?
Nathan Toups (1:01:22)
Right.
Yeah, collectors are neat. There are a couple of pieces to this. A collector follows this receive, process, export pattern. So let's say you consistently need to transform something: you're collecting logs in a certain structure, but you know you need to transform them or add some extra annotation, and you don't want the team that owns the business logic to have to worry about that. I can have a processing step inside of my collector that just injects the extra annotation and then ships it off, exports it for me, right?
Carter Morgan (1:02:04)
Or, for me, remember, I had to switch from cumulative metrics to delta metrics, right? The collector handles that.
Nathan Toups (1:02:08)
Exactly! Perfect example, perfect example.
Exactly. And so what's nice is that you get this ability to change the way you ship stuff without having to make runtime changes to your application. I can even evolve that over time. I could split it, so that maybe for legacy reasons I keep the old version of the logs, but I also send a new version to a new system, and the new system expects the logs to have some extra annotations. I can do all that in the collector. I don't have to go out and do a pull request across 15 services; the collector can be the central point across those 15 services, and it just magically does it, right? Very, very cool. The other piece is that if you're in an environment like Kubernetes, these very naturally fit what they call the sidecar pattern: you can have a collector that sits right there in the pod, super tight. You can have a collector that sits in what they call a DaemonSet, which sits right there on each host, so that you know there's a predictable spot on the host where the collector lives. And then there are other patterns and models; they call them the gateway pattern or the agent pattern. Those have to do with cases where maybe you need the collector to absolutely never be down, to never lose records, so I put a load balancer in front of a high-availability cluster of collectors and ship everything there. Say I have 10,000 Lambdas that are going to spin up and shut down, and I want all 10,000 of them to talk to my collector gateway. These are the patterns they give you; you have the flexibility. In my application runtime, a lot of times it's literally an environment variable that just says: where's the collector? What's the address and the port? And that's the only environment change you have to make to your application.
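(A rough sketch of that receive/process/export idea as a collector config, using processors from the contrib distribution: the attributes processor injects an extra annotation, and cumulativetodelta does exactly the cumulative-to-delta conversion mentioned above. The endpoint is a placeholder. And on the application side, pointing the SDK at the collector is typically just the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable.)

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  attributes:            # add an annotation without touching application code
    actions:
      - key: deployment.environment
        value: production
        action: insert
  cumulativetodelta: {}  # convert cumulative metrics to delta (contrib processor)

exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318   # placeholder backend

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes, cumulativetodelta]
      exporters: [otlphttp]
```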
Carter Morgan (1:04:02)
Right. Right.
Yeah, I really liked the collector pattern. I liked the separation of responsibilities. Your application is responsible for generating the metrics and sending them off, but that's it. It just says: hey, give me the endpoint, tell me where to put these metrics. And then the collector, like you said, is responsible for what actually happens with those. Whether that's your processors modifying the metrics, or, in my case, my collector is responsible for sending our metrics to two places. We send all of our metrics to Prometheus.
Nathan Toups (1:04:19)
Yep. Yep.
and
Carter Morgan (1:04:38)
But we also send a subset of metrics to CloudWatch so that we can alarm on them. My application knows nothing about that; it's the collector that handles all of it.
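(Sketching that fan-out as a collector config: one pipeline ships every metric to a Prometheus-compatible store, and a second, filtered pipeline ships a subset to CloudWatch via the contrib awsemf exporter. The metric name in the filter and both endpoints are hypothetical.)

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  filter/alarms:                      # keep only the metrics we alarm on
    metrics:
      include:
        match_type: strict
        metric_names:
          - checkout.request.duration # hypothetical metric name

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  awsemf:                             # CloudWatch Embedded Metric Format
    region: us-east-1

service:
  pipelines:
    metrics/all:                      # everything -> Prometheus
      receivers: [otlp]
      exporters: [prometheusremotewrite]
    metrics/alarms:                   # filtered subset -> CloudWatch
      receivers: [otlp]
      processors: [filter/alarms]
      exporters: [awsemf]
```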
Nathan Toups (1:04:42)
Cool. Nice.
Right. The other thing that's really nice, and they talk about this in the book, is head-based versus tail-based sampling. I don't have as much experience there; I've always done head-based sampling because it's just easier. Tail-based sampling sounds awesome, but I don't know as much about it. I can talk a bit about this, though. Sampling becomes a big deal at scale. If you're an early-stage or mid-stage startup, you can probably just eat the cost of
Carter Morgan (1:04:57)
Yeah.
Nathan Toups (1:05:14)
accepting all the metrics and all of the traces, but traces can carry a pretty big cost over time. With Datadog, that was a big thing we had to deal with, so we started doing sampling. One of the reasons we adopted OpenTelemetry is that we could pre-process which traces we actually cared about, right? We don't need most 200 OK, happy-path traces. We really care about errors;
Carter Morgan (1:05:18)
Mm-hmm.
Right, right.
Nathan Toups (1:05:43)
anything that has an error in one of its spans, we probably care about. So maybe we'd take all of those, but if it was the happy path, maybe we'd take 5% of them. We still want to know the general behavior of the thing, but we'd disregard most of them because it's just too cost-prohibitive. Datadog traditionally (things have changed) was just like: no, no, we'll take it all in, we're very cost-effective. But they're not, at scale. If I sample, I get to make that decision myself and ship off only what's meaningful to Datadog. And again, that's the beauty of this pattern.
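(That policy, sketched as a tail-based sampling config for the contrib collector: keep any trace that contains an error span, plus 5% of the happy-path rest. Policies are OR'd together, and the decision_wait value is illustrative.)

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans until the whole trace can be judged
    policies:
      - name: keep-errors       # any trace with an error status is kept
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-happy-path # 5% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```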
Carter Morgan (1:06:19)
Okay.
Well, that was a lot on open telemetry.
Nathan Toups (1:06:30)
Yeah,
maybe you feel like you're mastering open telemetry right now.
Carter Morgan (1:06:34)
I think all the time about how, when I graduated college, you do that dumb thing on your resume where you list all the skills you have and rate yourself: are you a beginner, an intermediate, an expert? And I put something dumb like "advanced in Java." Now I have ten more years of experience with Java than I did when I first wrote that resume, and I would say I'm a beginner at Java, right? And
Nathan Toups (1:07:00)
That's funny.
Carter Morgan (1:07:01)
I feel that way with OpenTelemetry. I think someone listening to this podcast might be like, wow, Carter and Nathan know a lot about OpenTelemetry. Personally, I'd say I am a baby when it comes to OpenTelemetry. I've successfully instrumented it in an application, and I'm proud of that, but I only just barely understand what's going on here. I guess that's good, though; it keeps you humble, keeps you hungry. Well, why don't we talk about some of our hot takes? We read this book. Give me your hot takes, Nathan. What do you think,
Nathan Toups (1:07:12)
Great.
Mm-hmm.
Carter Morgan (1:07:31)
having read the first half of Mastering OpenTelemetry?
Nathan Toups (1:07:33)
Yeah.
So, and this is still true, though I think it's gotten a lot easier: when I first started making the transition to OpenTelemetry, you have to really commit to being vendor neutral. It's so easy to just have your Datadog rep going, man, just use the Datadog SDK, it just works, it's great. And when you're first making this transition to being vendor neutral, it's a little scary. So I guess my hot take here is that you still have to commit to this, and you might have some self-doubt: am I making the right decision? Am I drinking the Kool-Aid? Because, I don't know, anybody who gets excited about OpenTelemetry, people are like, okay, here's the OpenTelemetry guy over here, he won't shut up about it. Yeah, OTel, do you know about OTel? Yeah.
Carter Morgan (1:08:20)
That's all my coworkers right now.
I'm
like turning my monitor towards them, like, guys, look at this. And it's just a Grafana dashboard, and they're like, Carter, we know.
Nathan Toups (1:08:35)
Also, the book kind of breezes over this (maybe in the second half we'll get to it more), but statistics and sampling are hard. I think that's probably the biggest disservice we see: they say, yeah, you can sample, or yeah, you should, and the same with the way you do visualizations, but you can do it wrong very easily. So a lot of people just skip sampling, to their own detriment. It's the same thing: as soon as you start getting into the cooler features of OpenTelemetry, it's non-trivial. It's a very complex beast, as amazing as it is. There's a lot to think about, to take on your shoulders. So this book made that more apparent, not less, just like your expert-level Java. I read this book and I'm like, I thought I knew a lot about OpenTelemetry, and now I realize I've got so much stuff to learn. So yeah.
Carter Morgan (1:09:28)
Yeah, yeah.
I think with this book, my hot take is that this book would be better if it just knew who it was writing to. I have a lot of respect for when someone prioritizes and says: we are going to purposefully ignore this segment of the audience because we are going to write something that's very well crafted for that other segment. I kind of feel that way about Taylor Swift, in that when she was doing her big Eras tour, obviously Taylor Swift was super, super hot. And so I was like, you know what?
Nathan Toups (1:09:38)
Great.
Carter Morgan (1:10:01)
I listened to Taylor Swift a while back, you know, Love Story and Shake It Off and her old 1989 album. I liked that. But I wasn't familiar with her newer stuff, and she's super, super popular, so I was like, I'm going to listen to Taylor Swift. This is a cultural phenomenon; I should be in on it. And I listened to some of her newer stuff and I was just like, this is not for me. I am not a twenty-something woman, right? But I was like, you know what? I respect it. I think it's great that Taylor Swift has made something that resonates really, really strongly with a certain segment of the population, even if the cost is that it doesn't resonate strongly with me. And I feel like this book could have learned that lesson from Taylor Swift. They should have said: you know what, this book is only for software engineers who want to implement OpenTelemetry in their application. Or: it's only for site reliability engineers who want to know the nittiest,
Nathan Toups (1:10:55)
Great.
Carter Morgan (1:11:00)
grittiest details of OpenTelemetry. But instead you wind up with this weird thing where there's a whole chapter devoted to explaining what APIs and SDKs are, but then it's getting into the very lowest levels of the OpenTelemetry configuration file. And it's weird. I don't know; I think there are like three people in the world this book is perfectly suited for. And I wish that Steve Flanders had just said,
Nathan Toups (1:11:12)
Right.
Yes.
Carter Morgan (1:11:29)
I know exactly who I'm writing for, and I'm going to write for that person, even if it means other people might feel a little lost, or the opposite, might feel like I'm covering too much of the basics. A little more focus here would have been good. But yeah, I'm excited to finish it, and I'm excited personally to be reading it, because there's a lot of this stuff I want to know and don't yet. As far as who I would recommend it to, maybe let's flip it this week; we've got a good transition here. As far as who I'd recommend this book to: it's tough, but I'd say OpenTelemetry in general I'd recommend to any senior software engineer who's looking to make a big impact on their project. If you haven't instrumented this stuff already, this is some really, really fertile ground. If you have instrumented this stuff already, find out how it all works. Because I remember when I was working with systems that had this in the past, they'd say, okay, make a new metric, and I was kind of like, my gosh, how do I make a new metric? That's probably going to be a nightmare; I probably have to register the metric somewhere. But no, if you're doing OpenTelemetry, you just register the metric in your code, and it takes care of it on the backend. If I had understood OpenTelemetry better, I would have understood that process better. So I would recommend understanding OpenTelemetry to any senior software engineer. This book in particular, I don't know if it's the best vehicle for understanding
Nathan Toups (1:12:53)
Interesting.
Yeah, that's interesting. I do think the cool thing about OpenTelemetry is that it's something you can very much consume through YouTube videos from tech conferences, and I will tell you, there are excellent resources online. Because I would much rather see, you know, an SRE from Uber
Carter Morgan (1:12:55)
OpenTelemetry.
Right.
Nathan Toups (1:13:18)
talk about how they transformed their tracing stack by switching to OpenTelemetry. That's a useful talk, right? Somebody who's a domain expert talking about their struggles, the false starts they went through, all those kinds of things. Blog posts are excellent. But you also kind of have to get your hands dirty. And I will tell you, I didn't talk about this, but while reading the book I had it running (I mean, I forgot and left it on and it was draining my battery): they actually have this toy shopping website, fully instrumented with OpenTelemetry and collectors. It's literally a single run kind of command: you download their repo, run the make command, and if you have Docker set up, you can run a local environment and see how all the metrics and stuff work. You've got to spend time with it, right? These books are great, but you're not going to read this book and just go do it. You need to go get a collector set up
Carter Morgan (1:13:51)
Right, right.
Nathan Toups (1:14:17)
and play with some processors and make some custom metrics and go visualize some traces. You need to wrap your head around it to really have that aha moment, right? Now, if you're asking why it works like this, or what the vision is, or what other cool things you could do: this book really is, I think, for SREs and platform engineers; that's the audience, if there was a focus. And yes, a senior or staff software engineer would get a lot out of this book if they really want to deep dive, or maybe you're at a company that doesn't really have SRE as a discipline, and you take some of that stuff onto your own shoulders and want to be a domain expert. There's also a ton of resources: even if you just use this as a reference book, it has a ton of links to additional reading and additional websites. I didn't click all of them, but I clicked a decent amount, and I was familiar with a decent amount. So if you just need a well-organized resource and your company is going to pay for the book, go for it. I think that would be the sweet spot: having the authoritative text by one of the people who wrote the spec. It's not a bad thing to have.
Carter Morgan (1:15:29)
Right, right.
Well, what are you gonna do differently in your career, Nathan, having read the first half of this book?
Nathan Toups (1:15:36)
Yeah, you know, reading the book, I realized I only had a tip-of-the-iceberg understanding of the collector processing stuff. There's a ton of things you can do inside of the collector that I haven't spent the time to poke around with. So I wrote down: I want to master collector pipelining. I think I'm going to use that toy app and start playing around with weird things I could do with the collector, just so I can kick the tires and see, can I do something funky? Can I do really weird mad-science experiments? What about you?
Carter Morgan (1:16:15)
Yeah, I want to look into traces more. I don't know if it makes sense for us at this scale, but it's funny: I've been so proud to get OpenTelemetry set up for our application, and we have a lot of automatic metrics flowing in, but in some ways it's just given us more insight into our pain without telling us how to resolve the pain. Someone will say the website's slow, and I can open the dashboard and confirm: yes, the website is slow. Our latency has gone up. Why is it slow? That's anyone's guess, right? And traces might be really helpful there.
Nathan Toups (1:16:44)
Right.
Well, yeah, because you could absolutely hit that GET request on the root URL and then see: what's the graph telling us? Yeah.
Carter Morgan (1:16:58)
Exactly.
So I need to figure out, now that we've got it in OpenTelemetry, if we were to send it to something like AWS X-Ray, how much of a lift is that? I don't know. I need to play around with it.
Nathan Toups (1:17:13)
I bet it's both more and less than you think it is. Yeah, X-Ray is kind of magical, and it is cool that it talks to OTel. So if you don't want to run Jaeger or some of the other visualization stuff, it's really nice. I think you could probably get a lot of mileage with X-Ray. Now, pricing-wise and all the other stuff, that's the other part of this equation. You're like,
Carter Morgan (1:17:15)
Right, right. That was my experience with metrics, so we'll see.
Right.
Yes.
Nathan Toups (1:17:42)
Is our observability bill going to explode because of what we're doing?
Carter Morgan (1:17:42)
Yes. I know. Right. Well, we'll find out. Well, thanks for tuning in, everyone. We'll be back next week; we'll finish up the book next week. And yeah, thanks for sticking around. You can always email us at contact@bookoverflow.io. You can find the podcast on Twitter at Book Overflow Pod, you can find me on Twitter at Carter Morgan, and you can find Nathan and his newsletter, Functionally Imperative, at functionallyimperative.com.
Nathan Toups (1:17:49)
Nice.
Carter Morgan (1:18:10)
Thanks for sticking around and we'll see you next week, folks.