This past Monday, Facebook experienced an outage which lasted almost six hours. This had knock-on effects. Facebook's pile of services all failed, from the core application to WhatsApp to Oculus. Many other services use Facebook for authentication, so people lost access to those (which highlights some rather horrifying dependencies on Facebook's infrastructure). DNS servers were also strained as users and applications kept trying to find Facebook, and kept failing.
Cloudflare has more information about what went wrong, but at its core: Facebook's network stopped advertising the routes to its DNS servers. The underlying cause of that may have been a bug in their Border Gateway Protocol automation system:
How could a company of Facebook’s scale get BGP wrong? An early candidate is that aforementioned peering automation gone bad. The astoundingly profitable internet giant hailed the software as a triumph because it saved a single network administrator over eight hours of work each week.
Facebook employs more than 60,000 people. If a change designed to save one of them a day a week has indeed taken the company offline for six or more hours, that's quite something.
Now, that's just speculation, but there's one thing that's not speculation: someone effed up.
IT in general, and software in particular, is a rather bizarre field in terms of how skills work. If, for example, you wanted to get good at basketball, you might practice free-throws. As you practice, you'd expect the number of free-throws you make to gradually increase. It'll never be 100%, but the error rate will decline and the success rate will increase. Big-name players can expect a 90% success rate, and on average a professional player can expect about an 80% success rate, at least according to this article. I don't actually know anything about basketball.
But my ignorance aside, I want you to imagine writing a non-trivial block of code and having it compile, run, and pass its tests on the first try. Now, imagine doing that 80% of the time.
It's a joke in our industry, right? It's a joke that's so overplayed that perhaps it should join "It's hard to exit Vim" in the bin of jokes that need a break. But why is this experience so universal? Why do we have a moment of panic when our code just works the first time, and we wonder what we screwed up?
It's because we already know the truth of software development: effing up is actually your job.
You absolutely don't get a choice. Effing up is your job. You're going to watch your program crash. You're going to make a simple change and watch all the tests go from green to red. That semicolon you forgot is going to break the build. And you will stare at one line of code for six hours, silently screaming, WHY DON'T YOU WORK?
And that's because programming is hard. It's not one skill, it's this whole complex of vaguely related skills involving language, logic, abstract reasoning, and so many more cognitive skills I can't even name. We're making thousands of choices, all the time, and it's impossible to do this without effing up.
Athletes and musicians and pretty much everybody else practice repeating the same tasks over and over again, to cut down on how often they eff up. The very nature of our job is that we rarely do exactly the same task twice (if you were doing the same task over and over again, you'd automate it), and thus we never cut down on our mistakes.
Your job is to eff up.
You can't avoid it. And when something goes wrong, you're stuck with the consequences. Often, those consequences are just confusion, frustration, and wasted time, but sometimes it's much worse than that. A botched release can ruin a product's reputation. You could take down Facebook. In the worst case, you could kill someone.
But wait, if our job is to eff up, and those mistakes have consequences, are we trapped in a hopeless cycle? Are we trapped in an existential crisis where nothing we do has meaning, god is dead, and technology was a mistake?
No. Because here's the secret to being a good developer:
You gotta get good at effing up.
The difference between a novice developer and an experienced one is how quickly and efficiently they screw up. You need to eff up in ways that are obvious and have minimal consequences. You need tools, processes, and procedures that highlight your mistakes.
Take continuous integration, for example. While your tests aren't going to be perfect, if you've effed up, it's going to make it easier to find that mistake before anybody else does. Code linting standards and code reviews are tools designed to help spot eff ups. Even issue tracking on your projects and knowledge bases are all about remembering the ways we effed up in the past so we can avoid them in the future.
Your job is to eff up.
When looking at tooling, when looking at practices, when looking at things like network automation (if that truly is what caused the Facebook outage), our natural instinct is to think about the features they offer, the pain points they eliminate, and how they're better than the thing we're using right now. And that's useful to think about, but I would argue that thinking about something else is just as important: How does this help me eff up faster and more efficiently?
New framework's getting good buzz? New Agile methodology promises to make standups less painful? You heard about a new thing they're doing at Google and wonder if you should do it at your company? Ask yourself these questions:
- How does it allow me to eff up?
- How does it tell me when I've effed up?
- When I inevitably eff up, how hard is it to fix it?
- How does it minimize the consequences of my eff up?
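To make those four questions a little more concrete, here's a toy sketch of a change pipeline that is built around effing up. Everything in it is made up for illustration: the `Routes` shape, the `validate` rules, and the `apply_change` helper are all hypothetical, and this is emphatically not how Facebook's automation (or anybody's real network tooling) works. The point is only the shape: check the change before it ships, say loudly what's wrong, and fall back to the last known-good state when something fails.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical shape for a change: route name -> "do we advertise it?"
Routes = Dict[str, bool]


def validate(proposed: Routes) -> List[str]:
    """Tell me when I've effed up: list the obvious problems before shipping."""
    problems = []
    if not any(proposed.values()):
        problems.append("change would withdraw every route")
    if not proposed.get("dns", False):
        problems.append("change would withdraw the dns route")
    return problems


def apply_change(
    current: Routes,
    proposed: Routes,
    health_check: Callable[[Routes], bool],
) -> Tuple[Routes, str]:
    """Return the configuration that ends up active, plus a short report."""
    problems = validate(proposed)
    if problems:
        # Minimize the consequences: a rejected change never goes live.
        return current, "rejected: " + "; ".join(problems)
    if not health_check(proposed):
        # Make it easy to fix: fall back to the last known-good state.
        return current, "rolled back: health check failed"
    return proposed, "applied"


if __name__ == "__main__":
    live = {"dns": True, "web": True}
    oops = {"dns": False, "web": False}
    active, report = apply_change(live, oops, health_check=lambda r: True)
    print(report)
    print(active is live)
```

Nothing here stops you from writing the bad change in the first place. It just makes the eff up obvious and cheap, which is the whole idea.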
Your job is to eff up.
The more mistakes you make, the better a programmer you are. Embrace those mistakes. Breaking the build doesn't make you an imposter. Spending a morning hunting for a syntax error that should be obvious, but that you can't spot for the life of you, doesn't mean you're a failure as a programmer. Shipping a bug is inevitable.
Effing up is the job, and those eff ups aren't impediments, but your stepping stones. The more mistakes you make, the better you'll get at spotting them, at containing the fallout, and at learning from the next round of mistakes you're bound to make.
Now, get out there and eff up. But try not to take down Facebook while you do it.
This post originally appeared on The Daily WTF.