Buy Don't Build: Avoiding Ops For Fun and Profit

Standing up and managing a service or building a custom service is a common desire for engineers. It’s usually a major mistake that ends up costing a ton of time and money. The desire to build custom versions of everything seems to come from a few places:

  1. The hope that it will be cheaper to build than buy.
  2. The idea that their company’s process is special so industry-standard stuff will not work.
  3. That they need to have total control over what the service does.
  4. To avoid vendor lock-in.

All four of those things are less true and less important than you would think. It’s worth building when something is core to your business or provides a significant competitive advantage. Otherwise, it’s probably worth using the services that your cloud provider offers or another SaaS product. Running your own stuff has a significant operational burden and a large opportunity cost. If you only get one thing out of this, let it be: Building stuff is fun, but being paged at two in the morning about a Rube Goldberg contraption of a system to handle customer analytics isn’t.

Running services isn’t easy

Keeping systems up in production takes time and energy. Building them isn’t where most of the expense lies. Instead, that comes with running and maintaining complicated systems. Most enterprise systems require an engineering team to keep them running. Engineers aren’t cheap to hire and there is also additional complexity that gets introduced to keep a large number of teams coordinated. This all results in slower decision making.

Slower decisions happen because more teams are needed to maintain more services. These teams then need to work together and coordinate. All of a sudden, to make a change there are a million teams that need to be informed and handoffs that have to be managed. That can lead to fiefdoms for managers and way more politics, since there are now more teams and a more complicated organizational structure.

If you are following a DevOps model, the team that builds the service will also end up maintaining it. The more moving pieces that there are for the team to maintain, the less time that they will have for new feature development. This is painful, especially in young companies with a rapidly evolving product. Slowing down the time it takes to find product-market fit in exchange for getting to run your own stuff is a bad trade.

You also have to consider the level of operational excellence that exists in your organization. To put it bluntly, who do you trust more for uptime - Amazon or you? The answer, for services which you absolutely depend on for survival, might be you. Other systems though may suffer from getting less time and attention, because it becomes harder to justify the expense of keeping them up and running.

Vendor lock-in

At this point, you might be thinking, “But I don’t want to be stuck on a vendor’s special snowflake of a system.” My counter-argument is that there is also lock-in with internal systems. The most common version of this is The Keeper of The Spreadsheet. Now, if you’re going “what spreadsheet?”, that’s a fair question. It’s the spreadsheet behind some important internal process, and maintaining it has turned into The Keeper of The Spreadsheet’s job. Most large companies have at least one spreadsheet like this. If you work at a large company, you probably realize that is a gross understatement - there can be many.

The Keeper(s) of The Spreadsheet will defend their process at all costs and resist any change to it, because they are worried that they’ll get fired if that process gets automated or is no longer necessary. You also see this with engineering teams, where they become the Keepers of A Database or ticketing system. All of a sudden you’ve got a system that sucks, and nobody wants to advocate for getting rid of it because their co-worker is convinced that they’ll lose their job if it happens. This also creates a political trap for the unwary when they try to fix that process.

Being The Keeper of The Ticketing System usually isn’t all that fun either. It’s a good way to get pigeonholed into boring work. It also means that you end up owning a system that isn’t the most important thing to the business, instead of letting an outside company take it on. That outside company is likely to specialize in solving that problem, and has built deeper expertise because of that. Unless of course you’re slightly evil and looking for awful projects to exile people to.

All of this makes being locked into a vendor less of a concern than most people think. There is lock-in no matter what you do. The thing that you want to avoid is giving wholesale pricing power to any vendor. This can be avoided by making sure that the key differentiators for your business are in-house.

Engineering time is expensive

Software and systems engineers aren’t cheap to hire. As a group, we also tend to undervalue our time. Think about how often you hear “Oh I could build that in a week”, or “That’ll be easy”. With luck that’s just a comment on Reddit or Hacker News, but if it’s at work then it usually turns into a total slog.

It’s common to express the cost of owning or maintaining a service in terms of the total cost of ownership (TCO). This is often really hard to calculate since many of the things that go into TCO aren’t tracked. The major issue that you’ll run into is that it’s not just the cost of the engineer. The metric we care about is the opportunity cost of the other things that engineers could be producing.
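As a rough illustration of that arithmetic, here is a minimal Python sketch. Every name and number in it is a made-up assumption, not a figure from a real company, and `forgone_value_per_engineer` is assumed to mean the yearly value, net of salary, that an engineer would add if they worked on the core product instead.

```python
from dataclasses import dataclass


@dataclass
class BuildEstimate:
    """Hypothetical yearly inputs; replace every number with your own."""
    engineers: int                    # engineers tied up running the service
    loaded_cost_per_engineer: int     # salary plus overhead, per year
    infra_cost: int                   # hosting, licences, and so on, per year
    forgone_value_per_engineer: int   # net value each engineer could add elsewhere


def yearly_build_cost(e: BuildEstimate) -> dict:
    """Split the yearly cost of running it yourself into direct and opportunity parts."""
    direct = e.engineers * e.loaded_cost_per_engineer + e.infra_cost
    opportunity = e.engineers * e.forgone_value_per_engineer
    return {"direct": direct, "opportunity": opportunity, "total": direct + opportunity}


# Illustrative numbers only.
estimate = BuildEstimate(
    engineers=2,
    loaded_cost_per_engineer=200_000,
    infra_cost=30_000,
    forgone_value_per_engineer=400_000,
)
costs = yearly_build_cost(estimate)
vendor_price = 120_000  # hypothetical yearly subscription

print(costs)
print("buying is cheaper:", vendor_price < costs["total"])
```

The direct cost is the part that usually gets tracked. The opportunity term is the one that tends to be missing from the spreadsheet, and with numbers like these it dominates.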

Another reason that will come up for building something custom is unique company processes, usually with the idea that you couldn’t customize existing software to make it work, or that doing so would cost more than just building it. While these can be valid reasons to build, they are true less often than you would think. Many processes are shared across a large number of businesses. Also, processes tend to get bloated over time. Much of the custom work needed to match a business’s processes is stuff that could just not be done. For an absurd example that happened at a company I worked with:

  1. Our process for recommending articles is complicated and requires tons of joins on fuzzy data.
  2. We can build our own database system that is designed specially to handle this.
  3. A few months of intense development go by.
  4. It turns out operating this thing is hard, we don’t know why some queries take the system down, and we don’t know why our customers are complaining about the recommendations.

You really don’t want that to be you. It’s demoralizing to have built and maintained a product for something that doesn’t even work correctly. When things are getting so complicated that no existing tooling will work for it, you should be asking if all of the complexity is fundamental to the domain or if the model you are using is flawed.

Loss of focus

A significant problem that comes from running your own version of a service is that it’s another thing that engineers have to pay attention to. There is a limit on how many things can be important. What then happens to all of the non-core services that you are running is usually some form of neglect, where they are kept in a barely good enough state.

The problem then with that is everyone who is working on those services is usually trying to get off of them. After all, no one wants to work on something that their boss doesn’t care about. So you end up with a ton of maneuvering since people are trying to change teams, and that increases drag.

Compounding this problem even further is that things that don’t generate revenue are frequently ignored. Yes, your CI/CD system is absolutely critical, but it’s easy for executives to not think about. This leads to a failure mode where you have a lost garden of internal tools. Whereas if you are paying someone money to do the same thing, it is their business so they keep working on it.

Opportunity cost

In many ways, the biggest problem with building a service is opportunity cost. The reason isn’t salary but instead what else could be done. It’s basically the same problem that you see when a company is building a one-off feature to close a sale. The big difference is that engineering is doing it to themselves so there may be willful blindness to the damage being done.

Engineers like building things, and many like to be in control of all the buttons and knobs. Many times this is a good thing. After all, it’s how anything super cool actually gets built. The problem with it arises when that impulse exists, without a keen sense of the business effects of decisions that are made.

The question you should be asking is what else could be done instead of tuning your own stuff or building a new internal system. The answer is usually spending more time coming up with the correct architecture, or developing actual customer-facing features, instead of fighting fires.

Having large operational footprints usually results in reduced velocity and fewer changes happening per engineer. Think about the difference in speed between big companies and startups. This isn’t because startups hire smarter people. Instead, it’s the amount of stuff that is tied up with any change at a big company.

This is an area where engineering can’t just think about the software that is being built. Instead, you have to think of the health of the entire product. It’s about building stuff that is useful for the customer and letting go of things that aren’t critical. By keeping a tight focus on core projects, things are built faster. There is also less cruft and maintenance work that goes along with the product.

Summing up

None of these reasons might apply in your case. There are many good reasons to build. However, if you haven’t considered whether you can just buy something to solve a problem instead of building it yourself, you should.

Think about whether it’s worth being paged at 2 in the morning over. If you are willing to be paged over it, also consider if someone else can maintain it better than you. You know what is going on with your business and the details of your product way better than I do. However, this is a question which you should be asking and not just waving away.

I suggest that you prioritize buying over building, unless building will provide a real sustainable advantage for the business. Reducing the long-term operational burden is one of the easiest ways to maintain developer velocity and happiness over time, and reduce costs for an organization as well.



2020-12-06 23:17 +0000