Not-so-Simple Network Management Protocol

"Phillip, are you familiar with SNMP? Do you know Python?" asked Phillip's new boss.

"No."

"Would you like to be?"

That's how Phillip got thrown into the deep end of a conversion project that was tasked with updating their infrastructure monitoring system. It was developed by a team a few years ago, and they all quit at the same time, which made it hard to maintain.It was a homegrown set of Python scripts which polled a Prometheus service, which itself was gathering metrics from a pile of Docker containers. These scripts gathered those events and turned them into SNMP alerts for a third party receiver.

The upgrade promised to be simple: take the SNMPv2c code and update it to use SNMPv3, which supports encryption. Phillip started by pulling the existing code from their repository, only to find that while there was a repository and it had a pile of branches, nothing had ever been merged into main/master/trunk. So Phillip pinged one of the other developers: "Which is the correct version of the code?"

He started playing with the PySNMP library while he waited. From reading the docs, Phillip thought that it should be a very simple change- essentially a find/replace to update one class name. After a few days of poking at the library, the previous developer gave him about three candidate branches which were probably what was actually deployed. Phillip found a Docker image that seemed to have the same code as one of those branches, and was probably the one deployed- no one knew for certain- and set to work.

His first attempt didn't work at all, which wasn't surprising. There was no logging, the configuration was hard-coded in half a dozen places in the code and didn't always agree with the other places, and someplace along the way it needed an SNMP command, which nothing was sending. It was mess, but eventually, Phillip got it running.

For a few minutes, anyway. Then it would crash. It took Phillip two months of research and experimentation to discover that the SnmpEngine object wasn't threadsafe. It wasn't his choice to use it across multiple threads- that was the architecture that the previous developers had used.

That's when it dawned on Phillip: this program had never worked. Sure enough, the deployed version crashed all the time, too. So Phillip set about refactoring and debugging it, and after two more months, he had a thing that could read the metrics and report them to SNMP.

Except the third-party receiver that expected to receive those messages didn't get them. Phillip could confirm he was sending it, but it wasn't arriving. Cue another month of scrambling and head-scratching, only for Phillip to discover that the network team had blocked the routes from the server he was running on and the third-party receiver.

"Oh, it was sending so much traffic, we assumed it was broken," was the best explanation he could get.

So the service never worked in the first place. Even if it had worked, the network team had blocked it from working. No one talked to anyone else about what was going on. And, as the cherry on top of the sundae, the main customer for this data didn't believe in "on-call rotations"- they got ahold of the developers' cellphone numbers and called them directly whenever they had an issue.

All that added up to Phillip leaving the company after a very long six months. It also neatly explained why the previous dev team had all quit en masse.

"Still," Phillip writes, "at least I got some Python experience out of it."

[Advertisement] Utilize BuildMaster to release your software with confidence, at the pace your business demands. Download today!

This post originally appeared on The Daily WTF.

Leave a Reply

Your email address will not be published. Required fields are marked *