Discerning the type of data stored in a file is frequently a challenge. We've come up with all sorts of ways to do it, like including magic bytes at the start of a file, using file extensions, appending MIME type information where possible, and frequently just hoping for the best. Ivan was working on a Python system that needed to handle XML data. Someone wanted to make sure that the XML data was actually XML, and not some other file format.
Any string of text which starts with < is clearly an XML file. This certainly won't give any false positives. If we assume that they at least trimmed the whitespace off, I think we can be fairly confident that there won't be any false negatives. Though if there is some way to generate a valid XML document where the first non-whitespace character isn't a <, I'd be curious to see it.
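For illustration, here is a minimal sketch of the kind of check described above, next to a stricter alternative. The original code isn't reproduced here, so the function names `looks_like_xml` and `parses_as_xml` are hypothetical, and the assumption that whitespace gets trimmed first is exactly that: an assumption. The stricter version simply hands the text to the standard library's `xml.etree.ElementTree` parser and sees whether it complains.

```python
import xml.etree.ElementTree as ET


def looks_like_xml(text: str) -> bool:
    """Naive check (hypothetical reconstruction): after trimming leading
    whitespace, does the text start with '<'?"""
    return text.lstrip().startswith("<")


def parses_as_xml(text: str) -> bool:
    """Stricter check: let an actual XML parser decide whether the text
    is well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Unclosed tags: passes the naive check, but is not well-formed XML.
broken = "<html><p>never closed"
print(looks_like_xml(broken), parses_as_xml(broken))

# A small well-formed document passes both checks.
valid = "  <root><child/></root>"
print(looks_like_xml(valid), parses_as_xml(valid))
```

The naive check only tells you that the input is probably not JSON or CSV. Whether that is enough depends on what the system was actually trying to filter out. If the input is untrusted, a hardened parser may be a better fit than `xml.etree.ElementTree`.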
The real question is: what if this check is actually successful at filtering out a large number of invalid files? If this check is basically useless, that's a WTF. If this check is actually valuable, that's a bigger WTF.
This post originally appeared on The Daily WTF.