Tags: development, industry, opinion

On dependencies and resilience

Jan 5, 2024 - 13 minutes

I’ve been thinking a lot about dependencies while working on my new website, built on Astro. Astro, with no additional packages, uses 400 total dependencies weighing in at about 122 MB. Andrei Kashcha has made a lovely visualizer where you can see Astro and every other package as a graph. My website, with all of its additional packages, has 467 dependencies, which brings me up to over 500 MB, and that’s insane. Even a “lightweight” project such as 11ty has 213 dependencies.

However, I don’t want to discuss the JS ecosystem dependency issue, which has been talked about by some people way smarter than I am [1]. I also won’t directly discuss software bloat [2] or supply chain security [3]. Instead, I want to explore the theory that the software development culture’s shift towards more dependencies, especially in the web development space, is creating more products and companies that rely on closed-source software or APIs for their full functionality, which could have negative implications down the road.

From the video game industry with love

In September of last year, Unity announced [4] a new per-download fee for developers that would start in January 2024. For Personal plans (Plus was being retired into Personal), games that had made more than $200,000 in the last 12 months and had at least 200,000 lifetime installs would qualify; for the more expensive Pro and Enterprise plans, the thresholds were $1,000,000 and 1,000,000 installs respectively. In exchange for the pricing change, the Personal plan would become free. Personal would be charged $0.20 per install, while Pro and Enterprise would start at $0.15 and $0.125 per install respectively, decreasing as installs per month increased.

As Unity had previously charged a subscription fee for each plan, this move was met with immediate outrage, the prime concern being that the system could be abused with repeat installs to punish developers. However, what struck me the most, besides the fact that this move was purely profit- and stock-driven, is how quickly game studios left Unity despite it later walking most of the changes back. Some studios even described Unity as “an operational risk” [5].

Independent developers are hit the hardest, as switching software so quickly is extremely difficult, especially when all of the tooling, from libraries to game editor to distribution, even the programming language, can change. Unity, in this case, has truly caused irreparable damage to its reputation with developers, but it raises the question: what if they had been more subtle? What if they slowly increased prices with little to no notice to developers? Even though the boiling frog analogy is false [6], if the water gets hotter slowly enough, will developers or studios not feel as much pressure to switch engines, allowing Unity to extract ever-increasing profits?

I think that Unity’s actions should be a reminder not just to entire companies but to individual developers that they need to be aware of the closed-source dependencies they have and maintain a plan to switch if the need arises. The Unity situation is easier for larger studios to solve – they could create an in-house engine if need be – and it hurts smaller studios and indie developers the most. However, any creator of any product of any size, in my opinion, needs to be thinking from day one about what it depends on, whether that’s YouTube for video revenue or AWS for hosting, how trustworthy those dependencies are, and how to respond if their terms change.

An open door shut.

Even open source isn’t “safe”, as HashiCorp demonstrated when it switched [7] from MPL 2.0 [8] to the BSL [9], which keeps the source code available while preventing certain commercial use. In this case, HashiCorp’s BSL prevents commercial competitors from using its software without a license. This switch is problematic for two reasons: (1) HashiCorp decides who a “commercial competitor” is, and (2) some open-source developers fear that using HashiCorp software in their work might poison their license and force them to adopt HashiCorp’s BSL.

For medium to large companies using HashiCorp software internally, this likely required a legal discussion or even an audit to double-check that their use of HashiCorp software, primarily Vault and Terraform, doesn’t compete with HashiCorp [10]. Other companies, especially new DevOps startups, are unlikely to be able to use these products without a (likely expensive) license – which is probably the exact product segment HashiCorp was trying to target with this change. HashiCorp’s change also seems to be one of the more extreme license switches in open-source history, compared to MongoDB switching to the SSPL [11] in 2018 or CockroachDB switching to the BSL [12] in 2019 [13].

The Linux Foundation has since sponsored OpenTofu, a forever-open-source fork of Terraform, with the guarantee that the license won’t change at any point in the future. While the Linux Foundation, and many other open-source foundations like the Apache Software Foundation, provide logistical support and a guarantee of open-source licensing for all of their projects, very few small dependencies fall under their umbrella. Modern companies of any size need to inspect where their dependencies come from and who manages them to ensure that they’re not at risk of future licensing changes.

Open-source projects with corporate ownership seem the riskiest, and it’s vital to examine the owner’s incentives to keep the software open-source or eventually change its license. Google, known as a strong open-source contributor, is unlikely to take, for example, Angular off its open-source license [14]. On the other hand, Automattic’s product offerings like WooCommerce are products with open-source code – there would be nothing stopping them from restricting the license in the future. In my opinion, the safest dependencies are those run by a foundation, then ones with multiple corporate sponsors, and then historically trustworthy companies. Even open-source software run by individuals is likely safer in terms of license changes than software with a single corporate owner. Despite these risks, open-source software is almost always better than closed-source, as even in the event of a licensing change, the community can fork it, as shown with OpenTofu.

Dealing with providers

Stripe, for example, lists 99.999% uptime for the last 90 days [15]. When I was at Stripe this past summer, Slack went down twice [16] during those twelve weeks, which was frustrating at the time, but Slack still maintained 99.98% uptime for the quarter [17]. Despite rarely going down, a Slack outage can majorly affect office productivity and internal communications if no alternative communication channel is readily available. Luckily, most companies also have email, with Slack as the ephemeral, less-professional alternative.
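For a sense of scale, here’s a rough sketch of what those percentages translate to in absolute downtime (both windows approximated as 90 days):

```typescript
// Rough downtime allowed by an uptime percentage over a given window.
function downtimeBudget(uptimePercent: number, windowDays: number): string {
  const windowSeconds = windowDays * 24 * 60 * 60;
  const downtimeSeconds = windowSeconds * (1 - uptimePercent / 100);
  return downtimeSeconds < 120
    ? `${downtimeSeconds.toFixed(0)} seconds`
    : `${(downtimeSeconds / 60).toFixed(1)} minutes`;
}

console.log(downtimeBudget(99.999, 90)); // ~78 seconds of downtime in 90 days
console.log(downtimeBudget(99.98, 90));  // ~26 minutes of downtime in 90 days
```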

But what happens if one of your product or business dependencies goes down, like Square did in September of last year? Square was down for nearly two days globally, leading to a $1bn loss in sales and serving as a wake-up call for customers to switch providers or develop in-house [18]. AWS went down briefly in June of the same year, affecting operations in its US-EAST-1 region [19]. Cloudflare was fully down at the end of October for about half an hour, with a smaller but longer outage less than a week later [20].

I think that there are some clear lessons to be learned from all the outages we see daily. While it can be costly or extremely difficult, maintaining a secondary provider is, in most cases, the safest option for companies of all sizes. However, this can be overly complicated, especially when working with highly integrated APIs. Therefore, it’s important to have consistent abstractions on top of these APIs so that, even if there’s no automatic fallback system, swapping out the underlying provider is much easier.
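As a rough sketch of what such an abstraction might look like – the PaymentProvider interface and both adapter classes here are hypothetical, not any real vendor SDK:

```typescript
// A thin, provider-agnostic interface: only this layer knows about real SDKs.
interface PaymentProvider {
  charge(customerId: string, amountCents: number): Promise<{ ok: boolean; id?: string }>;
}

// Hypothetical adapters wrapping two different vendors' SDKs.
class PrimaryProvider implements PaymentProvider {
  async charge(customerId: string, amountCents: number) {
    // ...call vendor A's SDK here...
    return { ok: true, id: `primary_${customerId}_${amountCents}` };
  }
}

class BackupProvider implements PaymentProvider {
  async charge(customerId: string, amountCents: number) {
    // ...call vendor B's SDK here...
    return { ok: true, id: `backup_${customerId}_${amountCents}` };
  }
}

// The rest of the codebase depends only on PaymentProvider, so swapping the
// vendor (manually, or automatically as below) doesn't touch business logic.
async function chargeWithFallback(
  providers: PaymentProvider[],
  customerId: string,
  amountCents: number
) {
  for (const provider of providers) {
    try {
      const result = await provider.charge(customerId, amountCents);
      if (result.ok) return result;
    } catch {
      // Provider unreachable or erroring: move on to the next one.
    }
  }
  throw new Error("All payment providers failed");
}
```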

Additionally, I think that software designers and developers should be thinking about fail-safe vs fail-secure – that is, whether something should unlock in an emergency or the opposite. Most software is fail-secure: if a license manager can’t reach the host, you can’t get in; if payments go down, you can’t buy a product; if authentication providers aren’t responding, you can’t log in. This is typically the best default – you usually don’t want someone turning off WiFi to skip license verification or DDoS-ing an API to get a free item. However, in some cases, writing fail-safe code might be the better option for your business. Let’s say that you run a website with subscriptions managed through an API, and that your website keeps track of who’s subscribed and checks the API periodically for updates. If the API goes down, should subscribers remain listed as subscribed? I’d say yes. This is a relatively tame example, but it serves as a simple case study that every uncontrolled dependency should go through.
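Here’s a minimal sketch of that fail-safe choice, assuming a hypothetical third-party subscription API client (all names are illustrative):

```typescript
// Hypothetical client for a third-party subscription API.
interface SubscriptionApi {
  isSubscribed(userId: string): Promise<boolean>;
}

// Last status we successfully fetched for each user.
const lastKnownStatus = new Map<string, boolean>();

// Fail-safe check: if the API is unreachable, fall back to the last known
// status instead of locking paying subscribers out mid-outage.
async function checkSubscription(api: SubscriptionApi, userId: string): Promise<boolean> {
  try {
    const subscribed = await api.isSubscribed(userId);
    lastKnownStatus.set(userId, subscribed);
    return subscribed;
  } catch {
    // The fail-secure version of this line would be `return false`.
    return lastKnownStatus.get(userId) ?? false;
  }
}
```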

Lastly, medium to large companies (maybe even smaller ones) should be doing regular chaos and outage testing. The multi-day Cloudflare outage was caused by power issues at one of their major Oregon data centers and should’ve been prevented by high-availability multi-data-center clusters [21]:

In particular, two critical services that process logs and power our analytics — Kafka and ClickHouse — were only available in PDX-04 but had services that depended on them that were running in the high availability cluster. Those dependencies shouldn’t have been so tight, should have failed more gracefully, and we should have caught them.

While Cloudflare did pretty robust outage testing, they didn’t take the entire data center offline – only a portion of it – and therefore didn’t catch the issue. You should read the entire blog post, as it details the more rigorous testing and prevention measures Cloudflare will be using in the future, but I think it serves as an example that this is a struggle for all companies. AWS now lets you simulate region outages [22], and I’m hoping that more providers will catch up as this issue grows.
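Even without a dedicated chaos-engineering platform, you can start small by wrapping dependency calls so that a test or staging environment can randomly inject failures. A sketch of that idea (the billingApi.fetchInvoices call in the usage comment is made up):

```typescript
// Wrap any async dependency call so that, when chaos mode is enabled (e.g. in
// staging), a configurable fraction of calls fail - exercising the fallback
// paths before a real outage does.
function withChaos<T>(
  call: () => Promise<T>,
  options = { failureRate: 0.2, enabled: process.env.CHAOS === "1" }
): Promise<T> {
  if (options.enabled && Math.random() < options.failureRate) {
    return Promise.reject(new Error("chaos: simulated dependency outage"));
  }
  return call();
}

// Usage with a hypothetical dependency call:
// const invoices = await withChaos(() => billingApi.fetchInvoices(customerId));
```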

GPT-4 / Claude / Llama / the new one from 5 minutes ago

What got me thinking about this issue was the growth of “AI” products that simply wrap OpenAI’s API, offering either proprietary embeddings or prompt engineering as their additional service. All of the previous examples have been about partial dependencies, where most of the work built on top can be salvaged in the event of a dependency change. I worry that many of these GPT-based applications are making themselves overdependent on OpenAI, giving OpenAI a ton of bargaining power and creating a lot of risk for the product. Furthermore, since many of these applications are just adding a new embedding source, like PDFs, this makes sherlocking [23] easy for OpenAI, as they did last September [24].

At the end of the day, I think that we only need to ask two questions for each dependency:

1. What is the impact on my business if it disappears today?
2. How likely is that to happen?

There are a bunch of questions that go along with the first – some of which I’ve already discussed: do we have a game plan if something goes wrong; how much time would it take to switch; can we automate this process; how graceful is failure; etc? The second has others as well: who’s behind the dependency; what’s their track record (uptime, behavior, etc.); what have similar dependencies done; etc? However, I think that these simple questions are enough to understand the basic dangers and hopefully kickstart an internal effort to decrease that potential impact. You could even put it on a chart:

[Chart: a 2×2 matrix plotting likelihood (low to high) against impact (low to high), with quadrants labeled “🚨 Move away ASAP”, “Decouple and abstract”, “Automatic fallbacks”, and “Safe (for now)”]
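If you wanted to make that chart executable, a tiny sketch could look like the following – note that the exact quadrant assignments are my reading of the chart, not something spelled out above:

```typescript
type Level = "low" | "high";

// One possible reading of the chart's quadrants.
function recommend(impact: Level, likelihood: Level): string {
  if (impact === "high" && likelihood === "high") return "🚨 Move away ASAP";
  if (impact === "high" && likelihood === "low") return "Decouple and abstract";
  if (impact === "low" && likelihood === "high") return "Automatic fallbacks";
  return "Safe (for now)";
}

console.log(recommend("high", "low")); // "Decouple and abstract"
```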

Many companies are switching away from a direct OpenAI dependency to other providers, like Hugging Face [25]. I even remember seeing on a Bay Area freeway – though I can’t find the company – an advertisement for a product that lets you switch between multiple LLM providers to improve your product’s uptime. The irony, of course, is that this product is now itself a dependency.
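This is the same fallback pattern as the payment sketch earlier, just applied to LLM completions – the LlmProvider interface here is hypothetical, not any vendor’s real SDK:

```typescript
// Hypothetical vendor-agnostic interface; real SDK calls would live in adapters.
interface LlmProvider {
  complete(prompt: string): Promise<string>;
}

// Try each provider in order until one answers.
async function completeWithFallback(providers: LlmProvider[], prompt: string): Promise<string> {
  for (const provider of providers) {
    try {
      return await provider.complete(prompt);
    } catch {
      // This vendor is down or rate-limiting: try the next one.
    }
  }
  throw new Error("No LLM provider available");
}
```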

Solutions? Not really

We’re not going to escape from the interdependent world any time soon. Unless you’re the largest company in the world and have become fully vertically integrated (Apple’s #1 wish), there are always going to be dependencies, and that’s good. It’s good that we have companies like Twilio to handle communications and Mapbox for GIS, as these companies can achieve economies of scale and abstract away complexity that would be much too difficult for even some large corporations to handle themselves. It’s good that we have open-source software like React and Hadoop that also abstracts complexity and often has a strong community behind it, all for free.

Despite your best efforts, you’ll always have dependencies. If you’re a product company, even if you do your logistics in-house, the weather becomes a dependency. For most companies, even the internet itself is a major dependency. At some point, we can’t worry about things outside of our control – we can only plan for the most likely scenarios. I hope that this blog post sparks a discussion within your product teams and your company about how to build resilient software, as it could make or break your product when something goes wrong.

Thoughts? Leave me feedback!


Footnotes

  1. JavaScript’s Dependency Problem, Ride Down Into JavaScript Dependency Hell

  2. Software disenchantment

  3. No Unaccompanied Miners: Supply Chain Compromises Through Node.js Packages

  4. Unity plan pricing and packaging updates

  5. How a Pricing Change Led to a Revolt by Unity’s Video Game Developers

  6. Boiling Frog

  7. HashiCorp adopts Business Source License

  8. MPL 2.0 FAQ

  9. Business Source License (BSL 1.1): Requirements, Provisions, and History

  10. I’m curious to know what the reactions at my previous employer Stripe were, as they use Terraform pretty significantly. Also, they should write more engineering blog posts.

  11. MongoDB now released under the Server Side Public License

  12. Why we’re relicensing CockroachDB

  13. I think it’s funny that most of these examples are database companies, as Elastic did the same thing. These companies are primarily trying to avoid commercial redistribution or hosting without a license, but I think that HashiCorp’s changes were received worse as their software was integrated by other projects unlike databases (which are typically not an integrated dependency).

  14. If Google ever were to take an open-source product private, it would be Android in my opinion. I think it’s still unlikely, however, as they rely on the open-source nature of Android to capture the market and increase users for Google products.

  15. Stripe Status

  16. Slack Experiences a Brief but Widespread Outage, Slack briefly experienced some major issues

  17. Slack Status

  18. Square’s outages will have global impact

  19. Amazon says AWS is operating normally after outage that left publishers unable to operate websites

  20. Cloudflare is (still) struggling with another outage - here’s what to know

  21. Post Mortem on Cloudflare Control Plane and Analytics Outage

  22. You’re so worried about AWS reliability, the cloud giant now lets you simulate major outages

  23. a developer’s guide to apple, sherlocking, and antitrust

  24. A minor ChatGPT update is a warning to founders: Big Tech can blow up your startup at any time

  25. Pivot! AI Devs Move to Switch LLMs, Reduce OpenAI Dependency