Back in the antediluvian times, when I was in college, people still used floppy disks to work on their papers. This was a pretty untenable arrangement, because floppy disks lost data all the time, and few students had the wherewithal to make multiple copies. Half my time spent working helldesk was breaking out Norton Diskutils to try and rescue people's term papers. To avoid this, the IT department offered network shares where students could store documents. The network share was backed up, tracked versions, and could be accessed from any computer on campus, including the VAX system (in fact, it was stored on the VAX).
I bring this up because we have known for quite some time that companies and governments need to store documents in centrally accessible locations, so that they're not reliant on end users correctly managing their files. And if you are a national government, you have to make a choice: either you contract out to a private sector company, or you do it yourself.
South Korea made the choice to do it themselves, with their G-Drive system, a government file store hosted primarily out of a datacenter in Daejeon. Unfortunately, "primarily" is a bit too apropos- last month, a fire in that datacenter destroyed data.
The Interior Ministry explained that while most systems at the Daejeon data center are backed up daily to separate equipment within the same center and to a physically remote backup facility, the G-Drive’s structure did not allow for external backups. This vulnerability ultimately left it unprotected.
Someone, somehow, designed a data storage system that was structurally incapable of doing backups? And then told 750,000 government employees that they should put all their files there?
Even setting aside that backup failure, the other services at the site did have backups, but they had no failover site, so when the datacenter went down, the government went down with it.
In total, it looks like about 858TB of data got torched. 647 distinct services were knocked out, and at least 90 of them were reported to be unrecoverable (the source for that last figure is a company selling lithium-ion safety products, but it's a good recap). Shortly after the accident, a full recovery was predicted to take a month, but as of October 22, only 60% of services had been restored.
Now, any kind of failure of this scale means heads must roll, and police investigations have gone down the path of illegal subcontracting. The claim is that the contractor the government hired broke the law by subcontracting the work, and that those subcontractors were unqualified for the work they were doing- that while they were qualified to install or remove a li-ion battery, they were not qualified to move one, which is the task that started the fire.
I know too little about Korean government contracting law and too little about li-ion battery management to weigh in on this. Certainly, high-capacity batteries are basically bombs, and need to be handled with great care and protected well. Though it does seem that if one knows how to install and uninstall a battery, moving one is covered by those same steps.
But if I were doing a root cause analysis here, while that could be the root cause of the fire, it is not the root cause of the outage. If you build a giant datacenter but can't replicate services to another location, you haven't built a reliable cloud storage system- you've just built an expensive floppy disk that is one trip too close to a fridge magnet away from losing all of your work. In this case, the fridge magnet was made of fire, but the result is the same.
I'm not going to say this problem would have been easy to avoid; actually building resilient infrastructure that fails gracefully under extreme stress is hard. But while it's a hard problem, it's also a well-understood problem. There are best practices, and clearly not one of them was followed.
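For the curious, the best practice in question is usually summarized as the 3-2-1 rule: keep three copies of your data, on two different pieces of equipment, with one copy off-site. Here's a minimal sketch of the idea; all of the paths, the dr-site.example host, and the rsync-over-SSH approach are hypothetical placeholders, not a description of how G-Drive was (or should have been) built.

```python
# A minimal sketch of the 3-2-1 rule for a hypothetical document store.
# Paths and the remote host are made up; rsync over SSH is just one common
# way to get a copy of the data physically out of the building.
import shutil
import subprocess
from pathlib import Path

PRIMARY = Path("/srv/docs/live")                  # copy 1: the live document store
LOCAL_BACKUP = Path("/mnt/backup-array/docs")     # copy 2: separate equipment, same site
REMOTE = "backup@dr-site.example:/backups/docs/"  # copy 3: a physically remote facility

def back_up() -> None:
    # Copy 2: separate equipment within the same facility survives a disk
    # failure, but not a building fire.
    LOCAL_BACKUP.mkdir(parents=True, exist_ok=True)
    shutil.copytree(PRIMARY, LOCAL_BACKUP, dirs_exist_ok=True)

    # Copy 3: the off-site replica is the part G-Drive reportedly couldn't do,
    # and the part that matters when the whole datacenter burns.
    subprocess.run(
        ["rsync", "-a", "--delete", f"{PRIMARY}/", REMOTE],
        check=True,
    )

if __name__ == "__main__":
    back_up()
```

None of that is exotic, which is rather the point.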
This post originally appeared on The Daily WTF.
