Jon supports some software that's been around long enough that the first versions of the software ran on, and I quote, "homegrown OS". They've long since migrated to Linux, and in the process much of their software remained the same. Many of the libraries that make up their application haven't been touched in decades. Because of this, they don't really think too much about how they version libraries; when they deploy they always deploy the file as mylib.so.1.0. Their RPM post-install scriptlet does an ldconfig after each deployment to get the symlinks updated.
For those not deep into Linux library management, a brief translation: shared libraries in Linux are .so files. ldconfig is a library manager, which finds the "correct" versions of the libraries you have installed and creates symbolic links to standard locations, so that applications which depend on those libraries can load them.
In any case, Jon's team's solution worked until it didn't. They deployed a new version of the software, yum reported success, but the associated services refused to start. This was bad, because this happened in production. It didn't happen in test. They couldn't replicate it anywhere else, actually. So they built a new version of one of the impacted libraries, one with debug symbols enabled, and copied that over. They manually updated the symlinks, instead of using ldconfig, and launched the service.
The good news: it worked.
The bad news: it worked, but the only difference was that the library was built with debug symbols. The functionality was exactly the same.
Well, that was the only difference other than the symlink.
Fortunately, a "before" listing of the library files was captured before the debug version was installed, a standard practice by their site-reliability-engineers. They do this any time they try and debug in production, so that they can quickly revert to the previous state. And in this previous version, someone noticed that mylib.so was a symlink pointing to mylib.so.1.0.bkup_20190221.
Once again, creating a backup file is a standard practice for their SREs. Apparently, way back in 2019 someone was doing some debugging. They backed up the original library file, but never deleted the backup. And for some reason, ldconfig had been choosing the backup file when scanning for the "correct" version of libraries. Why?
Here, Jon does a lot of research for us. It turns out, if you start with the man pages, you don't get a answer- but you do get a warning:
ldconfig will look only at files that are named lib*.so* (for regular shared objects) or ld-.so (for the dynamic loader itself). Other files will be ignored. Also, ldconfig expects a certain pat‐
tern to how the symbolic links are set up, like this example, where the middle file (libfoo.so.1 here) is the SONAME for the library:libfoo.so -> libfoo.so.1 -> libfoo.so.1.12
Failure to follow this pattern may result in compatibility issues after an upgrade.
Well, they followed the pattern, and they found compatibility issues. But what exactly is going on here? Jon did the work of digging straight into the ldconfig source to find out the root cause.
The version detecting algorithm starts by looking directly at filenames. While the man page warns about a convention, ldconfig doesn't validate names against this convention (which is probably the correct decision). Insetad, to find which filename has the highest version number, it scans through two filenames until finds numeric values in both of them, then does some pretty manual numeric parsing:
int _dl_cache_libcmp(const char *p1, const char *p2) {
while (*p1 != '\0') {
if (*p1 >= '0' && *p1 <= '9') {
if (*p2 >= '0' && *p2 <= '9') {
/* Must compare this numerically. */
int val1;
int val2;
val1 = *p1++ - '0';
val2 = *p2++ - '0';
while (*p1 >= '0' && *p1 <= '9')
val1 = val1 * 10 + *p1++ - '0';
while (*p2 >= '0' && *p2 <= '9')
val2 = val2 * 10 + *p2++ - '0';
if (val1 != val2)
return val1 - val2;
} else
return 1;
} else if (*p2 >= '0' && *p2 <= '9')
return -1;
else if (*p1 != *p2)
return *p1 - *p2;
else {
++p1;
++p2;
}
}
return *p1 - *p2;
}
NB: this is the version of ldconfig at the time Jon submitted this, and the version that they're using. I haven't dug through to check if this is still true in the latest version. That's an exercise for the reader.
While we have not hit the end of the first string, check if the character in that string is numeric. If it is, check if the character in the second string is numeric. If it is, keep scanning through characters, and for as long as they're numeric, keep parsing them into numbers. If the numbers aren't the same, we return the difference between them.
If the first string contains numbers at this point, but the second string doesn't, return 1. If the second string contains numbers but not the first, return -1. Otherwise, increment our pointers and go to the next character. If we reach the end of the string without finding numeric characters, return the difference between these two characters.
Also, correct me if I'm wrong, but it seems like a malicious set of filenames could cause buffer overruns here.
Now, I'll be honest, I don't have the fortitude to suggest that ldconfig is TRWTF here. It's a venerable piece of software that's solving an extremely hard problem. But boy, DLL Hell is an unending struggle and this particular solution certainly isn't helping. I'm honestly not entirely certain I'd say that there was a true WTF here, just an unfortunate confluence of people doing their best and ending up laying landmines for others.
But here's the fun conclusion: the 2019 version of the library actually had been updated. They'd deployed several new versions between 2019 and 2024, when things finally blew up. The actual deployed software kept using the backup file from 2019, and while it may have caused hard-to-notice and harder-to-diagnose bugs, it didn't cause any crashes until 2024.
This post originally appeared on The Daily WTF.
