How Much Should I Be Spending On Observability?
Part 1: Is your observability a cost or an investment?
In 2018, I dashed off a punchy little blog post in which I observed that teams with good observability seemed to spend around 20-30% of their infra bill to get it. I also noted this was based on absolutely no data, only my own experiences and a bunch of anecdotes, heavily weighted towards startups and the mid-market tech sector.
This post should have ridden off into the sunset years ago. To my horror, I have seen it referenced more in the past year than in all preceding years combined. I’ve stumbled over it in funding announcements, analyst briefings, startup pitches, podcasts, editorials… and even serious news articles written by actual journalists.
I’ll be reading some piece of tech news or whatnot, and it’s talking about “the soaring cost of observability, which experts say can be as high as 20-30% of your infra bill” and oh shit, that’s a link to my post?
But I pulled that number out of my ass! And I said I was pulling it out of my ass!
Also: that was almost eight years ago. The world was very different then! ZIRP was in full bloom, infrastructures were comparatively simpler (and thus cheaper), and a lot of people were pursuing a “best of breed” tooling strategy where they tried to pick the best tracing tool, best metrics tool, best APM, best RUM, etc., even if they were all from different vendors. All of which drove up costs.
So how would I update my rule of thumb today, in 2025? I would lower my estimate by a little, but complicate my answer by a lot.
In 2025, how much are people paying in observability costs?
After seeing what lots and lots of people pay for observability, I'd say an estimate of around 15-25% of your infra bill is straight down the middle. This will buy you quality observability, but you typically have to make some tough choices and sacrifices. People won't get everything they want.
My rule of thumb does not scale up linearly to higher cost ranges. If your infra bill is $100m/year, you shouldn’t be paying $20m/year in observability costs—but if your infra bill is $100k/year (and your product is live and you have real, paying customers) you’re probably paying at least $20k/year. Sounds about right.
People who claim that they pay less than 15%, in my experience, usually fail to include the cost of engineers and engineering time in their estimates. Or they have a very simple architecture, a system that does not change very often, and/or customer experience is not a priority or a differentiator for them. (If you claim to pay <10% of your infra bill on observability and think you have high fidelity observability, I have many questions and would love to speak with you. Seriously, would you drop me an email?)
Worth noting: my data is still unduly skewed toward midmarket companies, because enterprises are very tight-lipped about how much they spend on infra (and/or they genuinely don't know). It's actually really hard to find someone at, say, an international bank who can tell you these things, and even harder to find someone both knowledgeable and willing.
Analysts tell me that a 10% number should be achievable for a large enterprise with discipline, agency, and executive buy-in. But I’m still searching for the evidence. (As Corey Quinn points out, you typically spend at least 7% of your infra bill on internal network transfers. It beggars belief to me that good, high quality observability could cost anything in the neighborhood of network transfer costs.)
Thank you, Gartner, for publishing some real numbers
Gartner put out a webinar on observability costs earlier this year that allows us to attach some numbers to what has been a fairly vibes-based conversation. They don’t tell us how much people are paying as a percentage of their infra bill, but they do give us some high-level data points:
- 36% of Gartner clients spend over $1M per year on observability, with 4% spending over $10M
- Over 50% of observability spend goes to logs alone
- Many enterprises are using 10-20+ observability tools simultaneously
They also give the example of one particular Gartner customer, who spent $50k/year on observability in 2009 and is now spending over $14m/year (as of 2024). If you’re wondering whether those exponential cost increases have leveled off over the past few years, the answer is no; this includes 40% year-over-year growth for the past five years.
No wonder VPs and CTOs everywhere are pushing to contain or cap spend at all costs. But before we start blindly slashing costs, let’s take a look at what it is we’re trying to buy.
Investments pay off, costs do not
The #1 pro tip I can give anyone who wants to dramatically lower their observability spend is this: care less about the customer experience. Precision tooling for complex systems is not cheap. Is it worth paying for? I don’t know. Is observability a cost for your business, or is it an investment?
The difference is that investments pay off; costs do not. A good investment will not only pay for itself, it will deliver compounding returns over years to come. It doesn’t make sense to penny-pinch an investment that you expect to return 3x or 5x over the next two years. Instead, you should invest up until you near the point of diminishing returns. Costs that you don’t expect a return on, however, should be strictly controlled and minimized.
How can software instrumentation turn into an investment capable of generating returns? Tons of ways, which we can bucket into two groups: the external, customer-facing set of use cases, and the internal ones, linked to developer experience and productivity.
The external component is typically the one that is easier to calculate, estimate, and quantify, so let’s start there.
From a product perspective, when should you invest in observability?
It always starts with your business model. How do you make money? What are your core differentiators? What are customers sensitive to?
Does your business rely on ensuring that every package gets delivered, every payment transaction succeeds, and every customer complaint gets a swift, accurate response? Do high latency and laggy UX translate directly into lost revenue? We've all seen things like Amazon's findings that every 100ms in latency cost them 1% in sales, or Google saying that a half-second delay costs them 20% of results, or that one in five shopping carts is abandoned due to slow load times.
If you can draw a straight line from performance improvements to revenue, then good observability should be an investment for you; every penny you invest should pay for itself, many times over.
If you’re approaching this from the perspective of a team or group of teams, start by examining the portfolio of services that you own. Are they forward-facing, under active development, subject to frequent change, latency-sensitive, or directly tied to revenue-generating activities?
If you’re working at a startup or midmarket company, you should make this decision based on your business model. If you’re working at a large, profitable enterprise, you’re going to want to define tiers of service with different levels of observability.
The one exception here is when it comes to internal tooling—CI/CD pipelines, deploy tooling, that kind of thing. This is so deeply, inextricably linked to developer productivity that lack of visibility will cause you to slow down and suffer in compounding ways. If you care about speed and agility, you need precise, granular observability around these core feedback loops. Which brings us to the next point.
From an engineering perspective, when should you invest in observability?
The internal component is related to how swiftly engineering teams can ship, iterate, and understand the changes that they’ve made in production. This one is harder to quantify, identify, or even see.
I sometimes think of this as the dark matter of software engineering: the time we can't see ourselves wasting, because all we can see is the endless toil and the obstacles in our way.
Observability isn’t just about tracking errors and latency. It’s the grease, the lubrication, the sense-making of software delivery. When you frontload good observability, it makes everything else faster and easier. When sense-making itself is difficult, fragile, unintuitive, or requires a high bar of expertise or additional skill sets, it becomes a barrier to entry and a drag on development.
When should you invest in observability from an engineering perspective? I might be biased here, but I think the answer is “always, in theory.” What engineering org doesn’t want to ship faster and understand their software? In reality, it may not make sense to invest in better tooling if there’s no organizational alignment or commitment from management to follow through.
Why exactly are costs soaring? Why now?
The three big reasons Gartner cites under “Reasons for Increasing Costs” are:
- Organic growth
  - Reality: costs increase linearly with business growth
  - Desire: disassociate rising application/infra costs from observability costs
- Telemetry complexity
  - Orders-of-magnitude increase in the quantity of data
  - New telemetry types—logs, metrics, traces, frontend, RUM, etc.
  - Overwhelming noise, which requires an analytics platform to make sense of
- Increased expectations
  - Growing dependency
  - Rapid adoption across organizations
I have a bone to pick with the first one. They claim that costs increase linearly with business growth, but the problem is that observability costs are increasing at an exponential rate, detached from business growth. I think cost increases pegged to the rate of business growth would make everyone very happy.
The second one seems like a bit of a truism. Costs are rising because we’re collecting more data and more types of data? Ok, but why?
The meta reason that all of this is happening is that our systems are soaring in complexity and dynamism. I gave a keynote way back in 2017 where I talked about the Cambrian explosion in complexity that our systems were undergoing, and things have only accelerated since then.
Back in 2009, when our friend from the Gartner example was spending $50k/year on observability, they probably had a monolithic application, a web tier, and a primary database, and the tools they paid for told them whether it was up or down and what the latency was, plus a bunch of low-level system metrics like CPU, memory, and disk. Which brings us to reason number three.
The not-so-benevolent reasons why costs may be exploding
Those are my best-faith arguments for why observability bills have reached astronomical heights in the past few years, but it’s easy to come up with some not-so-benevolent reasons why vendors may be ratcheting up the costs. Here are a few of them:
- They may be passing on the high costs of their own technical cost drivers. We haven't talked about technical cost drivers at all in this article (that will be coming next week, in part two!), but any vendor that was built using the multiple pillars ("observability 1.0") model is paying to store your data many times, in many different formats.

  If you're using 10 different tools from an observability platform, this means they're storing your data (at least) 10x for every request that enters your service. They pay for that 10x, and so do you. They also have to staff up 10 different development, product, and design teams, market 10 different products, etc., and then pass all those cost multipliers along to you.

- Because they can. If companies will pay it, vendors will charge it. If a company was built on the assumption that it can charge particular margins, its entire business model comes to depend on it. Your investors expect it, your operating costs expect it. I am not a business cat, but my understanding is that this can be an extremely tough thing to change (I think this is what they call the "Innovator's Dilemma").

- Data has gravity, even in an OTel-ified world. Changing vendors is a pain in the ass. It's a lot of labor that could otherwise go to value-generating activities, and you have to update your links, dashboards, and workflows, train everyone in the new system, field complaints… It's just no one's idea of a good time. Vendors know this, alas.
I also recently heard an intriguing hypothesis from Sam Dwyer, of Nine.com.au. He said,
“I think it’s impossible for vendors to provide the meaningful value people are looking for in their observability tools, because you can’t get that depth of introspection into your systems without manual instrumentation—all you can get is breadth.
But people don't want to hear that, so from the very first sales conversation, vendors promise their customers that they can just 'drop in our magic library/agent/whatever, and everything magically works!' Then they get stuck in a product development cycle that is entirely dependent on auto-instrumentation output, because trying to build any other features breaks the initial contract. Even if it would result in a better outcome for the customer and a better product in the long term!
So customers keep saying 'we need more value' and vendors keep frantically trying to provide more value without having to go back to the customer and say, 'To do that, you need to add some manual instrumentation.' So they keep building more and more features on top of their platform, which provide more and more 'surface' observability without the depth that customers actually need and want."
I don’t know exactly how prevalent this situation is, or how to quantify it, but it certainly plays right into my existing beliefs and biases, so I had to include it. 😉
It’s good that people are learning to rely on observability
Let’s talk about those “increased expectations” for a moment. Gartner reports that companies are seeing rapid adoption of observability tooling across the organization and are growing increasingly dependent on their observability tooling. Two client quotes:
“Without our observability tool, we realized that we were blind.”
“Once implemented, it spread like wildfire.”
This is a good thing. It’s an overdue reckoning with what has long been the case: that our systems are far too complex and dynamic to understand by reasoning about the system model we hold in our heads.
Without realtime, interactive, exploratory observability at every level of the system, we are flying blind. Observability is not just an operational concern, it’s a fundamental part of the way we build software, serve users, and make sense of the world around us.
To some extent, the rapid spread of tooling and instrumentation is us playing catchup with reality. It reminds me of the early days of moving to the cloud, when it was the Wild West and we didn’t yet have best practices and rules of thumb and accounting tools. We’ll get there.
Who owns your observability bill?
I mentioned earlier the difference between costs and investments. If your observability tools budget is owned by IT or rolls up to the CIO, it’s going to get managed like a cost center. Not for any nefarious reasons, just because that is their skill set. This is what they do.
If you want your observability to be a differentiator for customer experience or engineering excellence, it needs to be managed like an investment. This means bringing it in under the development umbrella (or, interestingly, the product or digital umbrellas).
There are very few blanket recommendations I will make, due to the sprawling complexity of the topic, but this is one of them. Move your observability budget under engineering, or some other org that knows how to manage tools as investments, not cost centers.
I’ve seen so many engineering orgs kick off well-intentioned, seemingly well-resourced transformation agendas, only to see them founder due to poor sense-making abilities and lossy feedback loops.
It really is that important.
The observability cost crisis is a rare window of opportunity
I believe that the work of controlling and managing costs can go hand in hand with making your observability better.
This doesn’t have to be a choice between spending more money and getting better outcomes vs spending less money and getting worse outcomes. In fact, it shouldn’t be.
The more useless data you collect, the harder and slower it gets to analyze that data. The more widely you scatter your critical data across multiple disconnected tools, pillars, and data formats, the worse your developer experience gets. The more fragmented your developer experience, the more work it takes to reconcile a unified view of the world and make good decisions.
These top-down mandates for cost control can actually be an enormous opportunity in disguise. Under normal operating conditions, it can be hard to summon enough organizational will to support the amount of work it takes to transform internal tools. Every team is already so busy, with features to ship and deadlines to hit, that allocating developer cycles to internal tooling takes last priority, and is the first to get bumped.
The cost crisis changes this. This is a rare opportunity to rethink the way we instrument and the tools we use, and make decisions that lay the groundwork for the next generation of software delivery. A window like this comes along only once in a while and does not stay open for long. We should not waste it.
Easy to say, harder to do. ☺️
Next week, we will publish the second half of this piece, which will be a practical guide to the cost drivers for both observability data models—the multiple pillars model and consolidated storage model (also called “observability 2.0”)—and levers for controlling those costs.
Part 2: Observability cost drivers and levers of control
I recently wrote an update to my old piece on the cost of observability, on how much you should spend on observability tooling. The answer, of course, is “it’s complicated.” Really, really complicated. 🥴 Some observability platforms are approaching AWS levels of pricing complexity these days.
In last week’s piece, we talked about some of the factors that are driving costs up, both good and bad, and about whether your observability bill is (or should be) more of a cost center or an investment. In this piece, I’m going to talk more in depth about cost drivers and levers of control.
Business cost drivers vs technical cost drivers
The cost drivers we talked about last week, and the cost drivers as Gartner frames them, are very much oriented around the business case. (All Gartner data in this piece was pulled from this webinar on cost control; slides here.)
In short, observability costs are spiking because we’re gathering more signals and more data to describe our increasingly complex systems, and the telemetry data itself has gone from being an operational concern that only a few people care about to being an integral part of the development process—something everyone has to care about. So far, so good. I think we’re all pretty aligned on this. It’s a bit tautological (companies are spending more because there is more data and people are using it more), but it’s a good place to start.
But when we descend into the weeds of implementation, I think that alignment starts to fray. There are technical cost drivers that differ massively from implementation to implementation.
Executives may not need to understand the technical details of the implementation decisions that roll up to them, but observability engineering teams sure as hell do. If there’s one thing we know about data problems, it’s that cost is always a first class citizen. There is real money at stake here, and the decisions you make today may reverberate far into the future.
Model-specific cost drivers
The pillars model vs consolidated storage model (“observability 2.0”)
Most of the levers that I am going to talk about are vendor-neutral and apply across the tooling landscape. But before we get to those, let’s briefly talk about the ones that aren’t.
The past few years have seen a generational change in the way instrumentation is collected and stored. All of the observability companies founded pre-2020 (except for Honeycomb and, it seems, New Relic?) were built using the multiple pillars model, where each signal type gets collected and stored separately. All of the observability companies founded post-2020 have been built using a very different approach: a single consolidated storage engine, backed by a columnar store.
In the past, I have referred to these models as observability 1.0 and observability 2.0. But companies built using the multiple pillars model have bristled at being referred to as 1.0 (understandably). Therefore, as I recently wrote elsewhere, I will refer to them as the “multiple pillars model” and the “unified or consolidated storage model, also called observability 2.0” moving forward.
Why bother differentiating? Because the cost drivers of the multiple pillars model and unified storage model are very different. It’s hard to compare their pricing models side by side.
Controlling costs under the multiple pillars model
When you are using a traditional platform that collects and stores every signal type in a separate location, your technical cost drivers are:
- How many different tools you use (this is your cost multiplier)
- Cardinality (how detailed your data is)
- Dimensionality (how rich the context is)
Your number one cost driver is the number of times you store data about every incoming request. Gartner tells us that their customers use on average 10-20 tools apiece, which means they have a cost multiplier of 10-20x. Their observability bill is going up 10-20x as fast as their business is growing! This alone explains so much of the exponential cost growth so many businesses have experienced over the past several years.
Metrics-heavy shops are used to blaming custom metrics for their cost spikes, and for good reason. The “solution” to these billing spikes is to delete or stop capturing any high-cardinality data.
Unfortunately, high-quality observability is a function of detail (cardinality) and context (dimensionality). This is a zero-sum game. More detail and context == better observability; less detail and context == worse observability.
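To make the cardinality math concrete, here is a back-of-the-envelope sketch (tag names and counts are hypothetical): the number of time series a single custom metric generates is, in the worst case, the product of its tags' cardinalities.

```python
# Back-of-the-envelope: the number of time series a single custom metric
# produces is (in the worst case) the product of its tags' cardinalities.
# All tag names and counts here are hypothetical.
tag_cardinalities = {
    "service": 50,          # 50 services
    "endpoint": 200,        # 200 endpoints
    "status_code": 10,      # a handful of status codes
}

def time_series_count(cardinalities: dict[str, int]) -> int:
    """Worst-case number of distinct time series for one metric."""
    total = 1
    for c in cardinalities.values():
        total *= c
    return total

print(time_series_count(tag_cardinalities))    # 100,000 series

# Add one genuinely high-cardinality tag and the same metric explodes by that
# factor -- which is why "delete the user_id tag" is the standard fix, and why
# that fix costs you exactly the detail you wanted in the first place.
tag_cardinalities["user_id"] = 100_000
print(time_series_count(tag_cardinalities))    # 10,000,000,000 series, worst case
```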
The original sin of the multiple pillars model is that two of the primary drivers of cost are the very things that make observability valuable. Leaning on these levers to rein in costs cannot help but result in materially worse observability.
Controlling costs under the unified storage model
When you are using a platform with a unified, consolidated storage model (also known as observability 2.0), your cost drivers look very different. Your bill increases in line with:
- Traffic volume
- Instrumentation density
Instrumentation density is partly a function of architecture (a system with hundreds of microservices is going to generate a lot more spans than a monolith will) and partly a function of engineering intent.
Areas that you need to understand on a more granular level will generate more instrumentation. This might be because they are revenue-generating, or under active development, or because they have been a source of fragility or problems in your stack.
Your primary levers for controlling these cost drivers are consequently:
- Sampling
- Aligning instrumentation density with business value
- Some amount of filtering/aggregation, which I will sum up as “instrumenting with intent”
Modern sampling is a precision power tool—nothing like the blunt force trauma you may remember from decades past. The workhorse of most modern sampling strategies is tail-based sampling, where you don’t make a decision about whether to keep the event or not until after the request is complete. This allows you to retain all slow requests, errors, and outliers. Tuning sampling rules is, of course, a skill set in its own right, and getting the settings wrong can be costly.
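To make that concrete, here is a minimal sketch of a tail-based sampling decision. This is not any particular vendor's or the OpenTelemetry Collector's implementation; the latency threshold and baseline keep rate are assumptions you would tune per service.

```python
import random
from dataclasses import dataclass

@dataclass
class Span:
    duration_ms: float
    is_error: bool

# Hypothetical thresholds -- in practice these get tuned per service.
SLOW_THRESHOLD_MS = 1000
BASELINE_KEEP_RATE = 0.01   # keep 1% of boring, successful traces

def keep_trace(spans: list[Span]) -> bool:
    """Tail-based sampling: decide only after the whole trace has completed."""
    if any(s.is_error for s in spans):
        return True                                  # always keep errors
    if max(s.duration_ms for s in spans) > SLOW_THRESHOLD_MS:
        return True                                  # always keep slow outliers
    return random.random() < BASELINE_KEEP_RATE      # sample the healthy bulk
```

The point is that the keep/drop decision happens with full knowledge of how the request turned out, which is what lets you keep every error and outlier while aggressively sampling the healthy bulk.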
It is a simplification, but not an unreasonable one, to say that under the multiple pillars model you throw away the most important data (context, cardinality) to control costs, and with observability 2.0, you throw away the least important data (instrumentation around health checks, non-customer-facing services) to control costs.
There's a perception in the world that observability 2.0 is expensive, but in our experience, customers actually save money—a lot of money—as a side effect of doing things this way. If you use a lot of custom metrics, switching to the 2.0 way of doing things may save you 50% or more off the top, and the rate of increase should slow to something more aligned with the growth of your business.
Model-agnostic levers for cost control begin with vendor consolidation
Ok, enough about model-specific cost drivers. Let’s switch gears and talk about cost control from a vendor-agnostic, model-agnostic point of view. This is a conversation that always starts in the same place: vendor consolidation.
Vendor consolidation is a necessity and an inevitability for so many companies. I have talked to so many engineers at large enterprises who are like, "name literally any tool, we probably run it somewhere around here." To some extent, this is a legacy of the 2010s, when observability companies were a lot more differentiated than they are now and lots of customers were pursuing a "best of breed" strategy where they looked around to adopt the best metrics tool, best tracing tool, etc.
Since then, all the big observability platforms have acquired or built out similar offerings to round out their services. Every multiple pillars platform can handle your metrics, logs, traces, errors, etc., which has made them less differentiated.
There can be good reasons for governing devtool sprawl with a light touch—developer autonomy, experimentation, etc. But running many different observability tools tends to get expensive. Not only in terms of money, but also in terms of fragmentation, instrumentation overhead, and cognitive carrying costs. It’s no wonder that most observability engineering teams have been tasked with vendor consolidation as a major priority.
Vendor consolidation can be done in a way that cuts your costs or unlocks value
There are two basic approaches to vendor consolidation, and these loosely line up with the “investment” vs “cost center” categories we discussed earlier.
Companies that make their money off of software are more likely to treat consolidation as a developer experience problem. They see how much time gets lost and cognitive packets get dropped as engineers spend their time jumping frantically between several different tools, trying to hold the whole world in their head. Having everything in one place is a better experience, which helps developers ship faster and devote more of their cognitive cycles to moving the product forward.
Companies where software is a means to an end, or where the observability budget rolls up to IT or the CIO, are more likely to treat observability as a cost center. They are more likely to treat all vendors as interchangeable, and focus on consolidation as a pure cost play. The more they buy from a single vendor, the more levers they have to negotiate with that vendor.
Right now, I see a lot of companies out there using vendor consolidation as a slash and burn technique, where they simply make one top-down decision about which vendor they are going to go with, and give all engineering teams a time window in which to comply. This decision increasingly seems to take place at the exec level rather than at the engineering level, sometimes even CEO to CEO.
I think this is unfortunate (if understandable, given the sums at play). I think that vendor consolidation can be done in a way that unlocks a ton of value. I also think that in order to unlock that value, the decision really needs to be owned and thoroughly understood by a platform or observability engineering team who will be responsible for unlocking that value over the next year or two.
Telemetry pipelines as a way to orchestrate streams and manage costs
Telemetry pipelines have been around for a while, but they’ve really picked up steam lately. I think they’re going to be a key pillar of every observability strategy at scale.
Telemetry pipelines often start off as a way to route and manage streams of data at a higher level of abstraction, but they also show a ton of promise in the realm of cost containment. Just a few of the many capabilities they unlock:
- Make it easier to define tiers of instrumentation, and assign services to each tier
- Make it easier for observability engineering teams to practice good governance at scale
- Make it easier to visualize and reason about where costs are coming from
- Get your signals into an OTel-compatible format without having to rewrite all the instrumentation
- Make decisions earlier in the pipeline about what data you can discard, aggregate, sample, etc.
- Offload raw data to a cheap storage location and “rehydrate” segments on demand
- Leverage AI at the source to help identify outliers and capture more telemetry about them
- Create feedback loops to train and improve your instrumentation based on how it’s actually getting consumed in production
This doesn’t have to be an all or nothing choice, between stripping all the context and detail at the source (like metrics do) or storing all the details about everything (structured log events/traces). Pipelines bridge this gap. There’s a lot of activity going on in this space, and I think it shows a ton of promise.
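As a rough illustration (not any particular pipeline product), here is the shape of the per-event decision a pipeline stage might make; the service names, field names, tiers, and rates are all hypothetical.

```python
import random

# Hypothetical per-service policies a pipeline stage might apply before export.
POLICIES = {
    "checkout":      {"keep_rate": 1.0,  "drop_health_checks": True},  # top tier
    "internal-cron": {"keep_rate": 0.05, "drop_health_checks": True},  # low tier
}
DEFAULT_POLICY = {"keep_rate": 0.2, "drop_health_checks": True}

def route_event(event: dict) -> str | None:
    """Decide what happens to one telemetry event: drop it, archive it, or export it."""
    policy = POLICIES.get(event.get("service"), DEFAULT_POLICY)
    if policy["drop_health_checks"] and event.get("http.route") == "/healthz":
        return None                        # discard health-check noise entirely
    if event.get("error") or random.random() < policy["keep_rate"]:
        return "export"                    # send to the (expensive) observability backend
    return "archive"                       # cheap object storage; rehydrate on demand
```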
These are going to require us all to learn some slightly different skills—to think about data management in different ways; ways more like how business analytics teams are accustomed to managing their data than the way ops teams do. Telemetry pipelines are going to emerge as a place where a lot of decisions get made.
Over the long run, I think observability is moving towards a “data lakehouse” type model. Instead of scattering our telemetry across dozens of different isolated signal types and custom storage formats, we’ll store the data once in a unified location, preserving rich context and connective tissue between signal types.
What role does OpenTelemetry play in cost management?
In a word: optionality. Historically, the cost of ripping one vendor out and replacing it with another was so massive and frustrating that it kept people locked into vendor relationships they weren’t that happy with, at price points that became increasingly outrageous.
If you invest in OpenTelemetry, you force vendors to compete for your business based on being awesome and delivering value, not keeping you trapped behind their walls against your will.
That’s mainly it. But I think that’s a pretty big reason.
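As a sketch of what that optionality looks like in practice: OpenTelemetry SDKs read a standard set of environment variables for the OTLP endpoint and headers, so pointing your telemetry at a different backend is mostly a configuration change. This assumes the Python `opentelemetry-sdk` and `opentelemetry-exporter-otlp-proto-http` packages; the service and span names are placeholders.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# The instrumentation never names a vendor. Where the data goes is configuration:
# the exporter picks up OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS
# from the environment, so switching backends means changing those values (plus
# credentials) and redeploying -- the application code itself is untouched.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")          # hypothetical service name
with tracer.start_as_current_span("handle_request"):
    ...                                        # your instrumented work here
```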
Note that OpenTelemetry does not solve the problem of data gravity, because observability is about much more than just instrumentation. Changing vendors will also involve changing alerts, dashboards, bookmarks, runbooks, documentation, workflows, API calls, mental models, expertise, and more. It’s not as hard as changing your cloud provider, but it’s not as easy as switching API endpoints. There are things you can do to ameliorate this problem, but not solve for it. (This stickiness is one of the less-savory reasons that I hypothesize that bills have risen so far, so fast.)
As time goes on and the world adjusts to OpenTelemetry as lingua franca, my hope is that more of the sticky bits will unstick. Decoupling custom vendor auto-instrumentation from their custom generated dashboards will help, as will moving from dashboards to workflows.
Tiered instrumentation
If you’re in charge of observability at a large, sprawling enterprise, you’re going to want to define tiers of service. This is actually Gartner’s top recommendation for controlling costs: “Align to business priorities.” They suggest breaking down your services into groups according to how much observability each service needs. Their example looks like this:
- Top tier (5%): External-facing, revenue-generating applications requiring “full” observability: metrics, logs, tracing, profiling, synthetics, etc.
- Mid tier (35%): Important internal applications needing infrastructure, logging, metrics, and synthetics
- Low tier (65%): Internal-only applications requiring just synthetics and metrics
I have no idea where they got those percentages from—the distribution strikes me as bizarre, and I’m not big on synthetics beyond simple end-to-end health checks—but the concept is sound.
Here’s how I would think of it. You need rich observability in higher fidelity for services or endpoints that are:
- Under active development or collaboration
- Customer-facing
- Sensitive to latency or user experience
- Revenue-generating
- Prone to breaking, or changing frequently
Services that are stable, internal-facing, offline processing, etc., don’t need the works. Maybe SLOs, monitoring checks, or (in an observability 2.0 world) a single, arbitrarily-wide structured log event, per request, per service.
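For the curious, here is a rough sketch of what "a single, arbitrarily-wide structured log event per request" might look like. The field names are illustrative and the business logic is stubbed out; the point is that everything you know about the request ends up on one event, emitted once, at the end.

```python
import json
import time
import uuid

def do_the_work(request):
    """Stand-in for your actual business logic (assumed, not real)."""
    ...

def handle_request(request, user):
    """Emit exactly one wide, structured event per request, at the end of the request."""
    event = {
        "timestamp": time.time(),
        "trace_id": str(uuid.uuid4()),
        "service": "checkout",
        "endpoint": request.get("path"),
        "user_id": user["id"],               # high cardinality is fine in a wide event
        "plan": user.get("plan"),
        "build_id": "abc123",                # placeholder: your deploy artifact id
        "feature_flags": user.get("flags", []),
    }
    start = time.perf_counter()
    try:
        do_the_work(request)
        event["status"] = 200
    except Exception as exc:
        event["status"] = 500
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = (time.perf_counter() - start) * 1000
        print(json.dumps(event))             # one line of NDJSON per request
```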
Other Gartner recommendations
Gartner made three more recommendations in this webinar, which I will pass along to you here:
- Audit your telemetry
- Implement vendor-provided cost analysis tools and access controls
- Rationalize and consolidate tools
They suggest you can save 10-30% in telemetry costs through regular audits and cost management practices, which sounds about right to me.
I don’t love relying on ACLs to control costs, because I’m such a believer in giving everyone access to telemetry tooling, but I recognize that this is the world we live in.
Is open source the future?
I recently wrote the foreword to the upcoming O’Reilly book on Open Source Observability. In it, I wrote:
A company is just a company. If your ideas stay locked up within your walls, they will only ever be a niche solution, a power tool for those who can afford it. If you want your ideas to go mainstream, you need open source.
People need options. People need composable, flexible toolkits. They need libraries they can tinker with and take apart, code snippets and examples to tweak and play with, metered storage that scales up as they grow, and more. There is no such thing as a one-size-fits-all solution, and people need to be able to cobble together something that meets their specific needs.
There seems to be a recent uptick in the number of companies thinking about bringing observability in-house. Why? I think it's partly due to the flowering of options. When we started building Honeycomb in 2016, we built our own columnar storage engine out of necessity. Now, people have options when it comes to columnar stores, including ClickHouse, Snowflake, and DuckDB.
I think the cost multiplier effect puts the whole multiple pillars model on an unsustainable cost trajectory. Not only is it intrinsically, catastrophically expensive, but as the number of tools proliferates, the developer experience deteriorates. I think a lot of people are catching on to the fact that logs—wide, structured log events, organized around a unit of work—are the bridge between the tools they have and the observability 2.0-shaped tools they need. And running your own logging analytics just doesn’t sound that hard, does it?
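To gut-check that last thought, here is roughly what "running your own logging analytics" over wide, structured events looks like with one of the columnar engines mentioned above. The file path and field names are made up (they match the hypothetical wide event sketched earlier), and it assumes the `duckdb` Python package.

```python
import duckdb

# Query newline-delimited JSON wide events straight off disk (or object storage)
# with a columnar engine. Path and field names are hypothetical.
con = duckdb.connect()
rows = con.sql("""
    SELECT endpoint,
           count(*)                           AS requests,
           approx_quantile(duration_ms, 0.99) AS p99_ms
    FROM read_json_auto('events/*.jsonl')
    WHERE status >= 500 OR duration_ms > 1000
    GROUP BY endpoint
    ORDER BY p99_ms DESC
    LIMIT 10
""").fetchall()
print(rows)
```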
However, I predict that most larger enterprises will ultimately steer away from building their own. Why? Because once you count engineering cycles, it stops being a bargain. My rule of thumb says it will cost you $1m/year for every three to five engineers you hire. There are a limited number of experts in the underlying technologies, and the operational threshold is higher than people think. Sure, it’s not that hard to spin up and benevolently ignore an ELK stack… but if your reliability, scalability, or availability needs are world-class, that’s not good enough.
On the other hand… Infrastructure is cheap, and salaries are predictable. So maybe open source is the glorious future we’ve all been waiting for. Either way, I think open source observability has a bright, bright future, and I’m excited to welcome more tools to the table. Having more options is good for everyone.
Observability engineering teams, meet data problems
One prediction I feel absolutely confident in making is that observability engineering teams are poised to thrive over the next few years.
Observability engineering has emerged as perhaps the purest emanation of the platform engineering ethos, bringing a product-driven, design-savvy approach to the problems of developer experience. Their customers are internal developers, and their stakeholders include finance, execs, SRE, frontend, mobile, and everyone else.
As the scope, mandate, budget, and impact of observability engineering teams continue to surge, I think they are also going to need to skill up on the skills traditionally associated with data engineering.
These are, after all, data problems. And the cheapest, fastest, simplest way to solve any number of data woes is to fix them at the source, i.e. emit better data. Which runs up headfirst against most software organizations’ deeply ingrained desire to leave working code alone, lean on magic auto-instrumentation, and just generally think about their telemetry as little as possible.
Observability engineering teams thus sit between a rock and a hard place. But I think — I hope! — that clever teams will creatively leverage tools like AI and telemetry pipelines to identify ways to bridge this gap, to lower the time commitment, risk and cognitive costs of instrumentation, so that telemetry becomes both easier and more intentful. Good observability engineering teams will accrue significant political capital in the course of their labor, and they will need every speck of it to guide the org towards richer data.
Earlier I mentioned that there is a perception that o11y 2.0 is particularly expensive. I find this frustrating, because it doesn’t match my experience. But cost is a first class consideration of every data problem, always. There is no such thing as a “best” or “cheapest” data model, any more than you can say that Postgres is “better” or “cheaper” than Redis, or DynamoDB, or CockroachDB; only ones that are more or less suited for the workload you have (and better or worse implementations).
A few caveats and cautionary tales
Be wary of any pricing model that distorts your architecture decisions. If you find yourself making stupid architecture decisions in order to save money on your observability bill, this is a smell. One classic example: choosing massive EC2 instances because you get billed by the number of instances.
Be wary of any pricing model that charges you for performing actions that you want to encourage. Like paying for per-seat licenses (when you want these tools to be broadly adopted), or for running queries (when you want people to engage more with their telemetry in production).
Be mindful of what happens when you hit your limits, or what happens to your bill when things change under the hood. Make sure you understand what happens with burst protection and overage fees, and be wary of things like cardinality bombs, where you can go to bed on a Friday night feeling good about your bill and wake up Monday owing 10x as much without anyone having shipped a diff or intentionally changed a thing.
Be skeptical of cost models where the vendor converts prices into some opaque, bespoke system of units that mere mortals do not understand (“we charge you based on our ‘Compute Consumption Unit’”).
Simpler pricing is not always better pricing. More complicated pricing schemes can actually be better for the vendor and the customer by letting you align with what you actually use and what it costs the vendor to serve you.
Be wary of price quotes pulled from the website. Any website. Engineers have a tendency to treat website quotes like gospel, but nobody pays what it says on the website. Everybody negotiates deals.
People are fond of pulling up metrics price quotes and saying, “But I get hundreds of metrics per month for just a few cents!” No, the number of metrics per month gets layered on top of ingest costs, bandwidth, retention, data volume, and half a dozen other pricing levers. (And the “number of metrics” is actually referring to the number of time series values.)
As Sam Dwyer says: “Beware vendor double-dipping where they charge you for multiple types of things—for example, charging for both license per user and data consumption.” If you are using a traditional vendor, they are probably charging you for many different dimensions at once, all stacked.
So about that rule of thumb
In 2018, I wrote a quick and dirty post where I shared my observation that teams with good observability seemed to spend 20-30% of their infra bill on observability tools.
This entire 7000-word monstrosity on observability costs started off as an attempt to figure out how much (or if) the world has changed since then. After 2.5 months of writing, researching, and talking to folks, I have arrived at this dramatic update:
Teams with good observability seem to spend 15-25% of their infra bill on their observability tools.
I have heard at least one analyst (who I respect) and two or three Twitter randos (who I do not) state they believe a number like 10% should be achievable.
I am not so sure. For now, I stand by my observation that companies with good observability tooling seem to spend somewhere between 15-25% of their infra bill to get it. I don’t think this rule of thumb should scale up linearly over $10m/year or so in infra bills, but I honestly have no solid data either way.
(If you work at a large enterprise and would like to show me what it looks like to spend 10% and get great observability, please send me an email or DM! I would love to learn how you did this!)
Maybe it’s not (primarily) about the money
If I had to guess, I’d say the absolute cost is less of a big deal to large, profitable enterprises than the seemingly unbounded cost growth. Finance types are annoyed that costs keep going up at an escalating rate, while engineering types are more irritated by the fact that the value they get from their tools is not keeping pace, or does not seem worth it.
With the multiple pillars model, the developer experience may even go down as your costs go up. You pay more and more money, but the value you get from your tools declines.
What people really need are predictable costs that go up or down in alignment with the value they are getting out of their tools. Then we can start having a real conversation about observability as investment vs observability as cost center, and the (hidden) costs of poor observability.
In the long run, I don’t think we’re trending towards dramatically cheaper observability bills. But the rate of growth should ease up, and we should be wringing a lot more value out of the data we are gathering. These are entirely reasonable requests.
You can’t buy your way to great observability
It doesn’t matter how great the tool is, or how much you’re shelling out for it; if your engineers don’t look at it, it will be of limited value. If it doesn’t change the way you build and ship software, then observability is a cost for you, not an investment.
If you weave observability into your systems and practices and use it to dramatically decrease deploy time, test safely in production, collaborate across teams, enable developers to own their code, and give your engineers a richer understanding of user experience, these investments will pay handsome returns over time.
Your engineers will be happier and higher performing, covering more surface area per team. They will spend more time delighting customers and moving the business forward, and less time debugging, recovering from outages, and coordinating with or waiting on each other.
But buying a tool is not magic. You don’t get great observability by signing a check, even a really big one, any more than you can improve your reliability just by hiring SREs. Turning the upfront cost of observability tooling into an investment that pays off takes vision and work. But it is work worth doing.