CodeSOD: Cool Power

Power outages are never good, and they're even worse when your facility needs to run 24/7. Now, Jaroslaw's organization didn't do a great job setting up for round-the-clock, always-on operations. It was the kind of thing where the organization grew, annexed the neighboring building, and kept growing. The result was hundreds of workstations, two separate power lines, two server rooms, three different Internet uplinks, and huge piles of switches responsible for making this network work.

That added another problem: after a power outage, nothing came back on exactly right, either. It always took some time to find the one switch that opted not to reboot.

Now, many years earlier, someone had the bright idea of installing a generator. The hookup offered no easy way to switch over to generator power, and thus required an electrician with keys to the electrical boxes to actually make the change. While the servers had small UPSes, enough of the environment went down during a power outage that, by the time they had the generator on and everything powered back up properly, the outage was usually over.

And so it went for years, until someone higher up looked at the problem and freed up the budget to fix things. The generator was replaced, and there was a plan to change the wiring so that it was faster to switch over, but it turned out that would have tripled the budget and shut the facility down for days while electricians redid the whole electrical system. Instead, the budget was used to upgrade from small, consumer-grade UPSes to a big honking 10kW unit.

It was the size of a large refrigerator, and had enough power to keep all the critical elements of the facility powered on for twenty hours, time enough to switch over to the generator, if needed.

And then, miracle of miracles, they tested their switchover plan. They cut main power, saw the UPS come on, ensured work could continue, then had the electricians switch on the generator, and then reversed the whole process. It went off without a hitch.

And then a week later, the UPS screamed about an overload. It lasted for about 40 seconds, then cleared up. Considering that the UPS had way more capacity than they needed, that seemed like a serious problem. Two hours later, it happened again. And again. And again. Jaroslaw went through everything in the server room, trying to find the badly behaved device. At one point, he found an unplugged electric kettle sitting not far from the server room, and went on a hunt to see if anyone had been making tea in the server room, thinking that was the culprit. No one had.

Over three days, after checking all the equipment, Jaroslaw went to the building wiring diagram and started checking every outlet. He found one, hidden in the back of the server room, ostensibly unused, that had an extension cord plugged in. The cord was neatly tucked into the cable chase, as if it were part of the plan. Jaroslaw followed the cable around the room until he found a hidden refrigerator. Some of the 24/7 staff wanted easy access to snacks and drinks, and didn't want to constantly badge in and out of the server room to get them. While there were plenty of non-UPS-protected outlets they could have used, someone had decided this was a better option.

And sure enough, while Jaroslaw was looking at the fridge, he heard the compressor kick on, and the UPS scream about an overload at the same time.

The immediate fix was easy: remove the fridge and extension cord, and have a serious discussion about proper server room safety. The longer-term fix was spending the last bits of the budget to add keyed switches to all of the outlets in the server room, ensuring no one could plug things in without going through the proper channels.

This post originally appeared on The Daily WTF.
