“Yesterday, as a result of a server configuration change, many people had trouble accessing our apps and services.”
Facebook’s explanation of its 14-hour outage last week sounds simple enough, but it very possibly belies an incredibly complex series of failures across an infrastructure that spans data centers around the world. Fourteen hours is an awfully long time for a company whose systems are more or less designed to maximize uptime, and that employs some of the smartest software engineers on the planet.
But Facebook is hardly alone in suffering lengthy outages caused by seemingly inconsequential things. Just about every large website, web company and cloud provider has been through the same thing, including AWS, Google, Microsoft and Apple. At their scale and with the complexity of their architectures—physical and software—all the automation and engineers in the world sometimes aren’t enough. One thing goes wrong, and it cascades.
This is one of the reasons why some people have a difficult time understanding, or at least accepting, the rush toward microservices architectures and all things Kubernetes. As the saying goes, “Shit happens.” When it does, it’s probably easier to debug a relatively simple monolith than to track down the cause across a collection of interconnected microservices running on ever-changing infrastructure.
That being said, when a company’s software footprint, user count and ambitions reach a certain scale, which is true of just about any large enterprise, microservices (done right) are almost certainly the right option for bringing order and agility to its IT organization. Depending on its application portfolio, Kubernetes might be, too. Companies like Facebook and Google don’t operate globally distributed systems and build the tools they build because they want to; they do it because they have to.
Of course, there are also business benefits to these types of architectures when they’re done well. Google’s just-announced streaming gaming service is perhaps an extreme example, but the software engineering culture and technologies the company has put in place do help it jump into new digital markets when it sees an opportunity.
However, the trick for most mainstream enterprises is taking advantage of the architectural lessons large web companies have taught the world (and the software they’ve developed) without taking on their do-it-yourself and/or not-built-here attitudes. Finding the budget, the people and, frankly, the institutional DNA to tackle every part of enterprise IT is hard work (thus the upcoming PagerDuty IPO). For example, standing up a Kubernetes cluster might be easy enough; operating it and all the complementary components at any reasonable scale, security level, etc., can prove to be a different story.
That’s also why there’s a raging debate over open source licensing happening right now. The gist of the argument is who has the right to serve enterprise customers with commercial versions of popular projects.
The great message of Amazon CTO Werner Vogels in the early days of cloud computing was that companies shouldn’t invest in “undifferentiated heavy lifting,” by which he meant managing data centers and provisioning servers. The message seems to have resonated (if the success of AWS and its peers is any indicator), only now that heavy lifting has shifted to operating complex data center software and application architectures. Technologies like Kubernetes (or Hadoop or OpenStack before that) might not cost anything to install, but that’s where the free lunch ends.
Perhaps the rash of recent outages at webscale services, including Facebook, will be a useful reminder for enterprises to not fall into that old trap.