Fire up any application you’ve ever written code for, and try out just one of its features.
How many different ways could that feature break due to an unexpected change elsewhere in your system?
This isn’t an especially easy question to answer, even with the source code for your application in front of you. But even in clean and well-written systems, the honest answer ranges from “I don’t know” to “Probably quite a few different ways.”
For example, think of the complications that can arise from data alone. What happens when there is no data in the dataset? What happens when there is a ton of data in the dataset? What happens when a validation isn’t completely accurate, and data gets corrupted in a hard-to-predict way? What happens when data is lost due to a hard-to-detect silent error? What happens when a query that once took 20ms to run starts taking 2 minutes because of <some rare problem>?
Now apply a similar line of investigation to any dependencies on web services. To libraries and frameworks. To workers and queues. To operating systems and platforms. To network infrastructure and CDNs. To someone with a credit card who needs to pay the bills for every service that is required to keep the lights on.
How many ways can a single feature break due to an unexpected change at a distance? How many ways can a whole system come crashing down because of a cascade of failures and minor errors that snowball into something larger?
The answer is a lot, always a lot. But how good are we at reasoning about and guarding against these sorts of systemic failures?
These are universal challenges in software development; they’re not in any way language- or framework-specific. So why are they not discussed more often?