Filenames.WTF
Publication date: 3 February 2012
Originally published, in a much shorter version, 2011 in Atomic: Maximum Power Computing
Last modified 03-Feb-2012.
File extensions are ridiculous.
Here we are, with gigabytes of RAM and CPUs that do tens of billions of operations per second, and we're still using three-character suffixes to tell different kinds of file apart.
"Suffix typing", identifying a file's type by the letters on the end of its filename, was getting old back when home computers seldom even had directories, and Bon Scott was still alive. But here suffix typing still is.
There's been some progress. We've got to the point where a.file.name.can-look-like-this.doc.exe.xxx.txt and still work (provided it really is a text file). And you can have suffixes with more than three characters.
But you usually don't, do you? My copy of Windows knows six hundred and something suffixes, and the great majority are three or fewer characters.
Why, you might fairly say, does my amazing megacomputer not just look at files to figure out what they are? It might not be able to determine every file type that way, but common things like MP3s and JPGs and Word DOCs should be easy, right?
Well, yes, that is right. It's what applications do when you tell them to open a file. Sometimes, if you've got a file of format A that the app understands, but it's suffixed as format B that the app also understands, the app will smile at you patronisingly and open the file without complaint. And pretty much every flavour of UNIX back to time immemorial (which, in this case, means the mid-Seventies) has a command inventively titled "file", whose purpose is to figure out what a file is, regardless of the file's name.
You can have "file" on a Windows system too, if you install a UNIX-alike shell like Cygwin. Mac OS X is UNIX-based and thus has "file" built in, along with numerous other standard UNIX commands.
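The heart of a "file"-style identifier is just a table of known byte signatures and the offsets at which they appear. Here's a minimal sketch in Python; the signatures shown are real, but the table is a toy compared to file(1)'s thousands of entries, and the function name is my own invention.

```python
# A toy "file"-style identifier: match the leading bytes of a file
# against a table of (offset, signature, type name) entries.
MAGIC = [
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"\xff\xd8\xff",      "JPEG image"),
    (0, b"%PDF-",             "PDF document"),
    (0, b"PK\x03\x04",        "ZIP archive (also DOCX, JAR, ...)"),
    (0, b"GIF8",              "GIF image"),
]

def identify(head: bytes) -> str:
    """Identify a file from its first few dozen bytes (the caller
    reads them with something like open(path, 'rb').read(64))."""
    for offset, sig, name in MAGIC:
        if head[offset:offset + len(sig)] == sig:
            return name
    return "unknown"
```

The real file(1) tests many more signatures at many more offsets, and falls back on textual heuristics when none match, but the basic lookup really is this cheap.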
And, if I may be allowed yet another digression into nostalgia for a long-vanished age, version 2 of AmigaDOS introduced "datatypes", which allowed it to identify any kind of file you had a datatype for, and which instantly made new kinds of file accessible to programs that dealt with that kind of data - image files, sound files, word-processor documents, et cetera.
Suppose you want to make your Amiga paint program understand a new file format, say PNG, which was created more than two years after Commodore released the last Amiga and then went bankrupt. You just find yourself a PNG datatype, and copy it into the datatypes folder. And now your paint program can load PNGs.
Stepping back into the twenty-first century, it's now normal for personal computer OSes to include some sort of background indexing service - Spotlight on the Mac, Windows Search in, uh, Windows - which tootles along through all of your data files and indexes their entire contents. Identifying the file type along with the contents wouldn't be a big deal. If the user opens a folder that has files in it that the OS hasn't yet scanned, it could just whip through them on demand, in the same way that it creates thumbnails for images and video files.
And yet, nobody does that. Even the Macintosh has moved away from "smart" file identification; suffixes are more important in Mac OS X than they ever were before.
(Before OS X, Macs didn't need filename suffixes at all; every file carried its own type and creator codes, alongside the famous "resource fork". This generally worked pretty well, as long as the files were never touched by any heathen foreign systems that didn't understand that metadata. OS X has resigned itself to having to cope with dumb, suffixed, forkless files.)
So why is all this so?
Basically, it's because smart file typing can be very slow.
There are an awful lot of file types out there. This alphabetical list of file formats has about 3500 entries, as I write this. This other Wikipedia list prunes out a ton of old and/or extremely obscure formats that most people don't care about, and still lists more than 1250 formats.
You might think that, by modern computer standards, telling four-digit numbers of file types apart doesn't sound too daunting. But this is not one of those situations where a large problem becomes tiny when you can throw billions of arithmetical operations per second at it.
The core of the problem is that files don't all conveniently identify themselves in the first kilobyte. Even if you ignore pathological examples like raw and uncompressed video from cheap webcam software and RAM dumps from program crashes, there are still plenty of file formats that can't be neatly identified by looking at the beginning, or the end, or any other particular offset within the file.
Modern "wrapper" or "container" formats like WAV or Ogg, and simple but wide-ranging file headers like Internet media types, are easy to identify. That's the whole idea of a wrapper; it communicates what format the file is, so even if you don't have any software that can make use of the data inside, you can at least tell what it is. That's basically how the Amiga datatypes work, too, and in the early days of the Amiga before datatypes had been invented, there was a vain hope that most data would come in an IFF wrapper. The only native Amiga files that had suffixes at all were the "dot info" files that contained icon data for a matching other file.
(Later, though, Amiga users really made things fun for anyone trying to use Amiga files on some other system, by inventing naming conventions that had a three-character filename prefix!)
Many file types don't have wrappers, or any "magic number" file signature conveniently provided at some consistent point in the data. Actually, it's worse than that; as that list of file signatures shows, there are often multiple file types that have the same "signature" and, furthermore, there's no guarantee that a file that has the right bytes at the right offset to be some particular file type isn't, actually, a different file type that by pure coincidence happens to have those bytes in that location!
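A concrete illustration of the collision problem: WAV, AVI and (later) WebP files all begin with the same four "RIFF" bytes, and only a form tag further into the file tells them apart. A sketch, with function and variable names of my own invention:

```python
# Three real formats share the same leading "RIFF" signature; the
# four-byte form tag at offset 8 is what actually distinguishes them.
RIFF_FORMS = {
    b"WAVE": "WAV audio",
    b"AVI ": "AVI video",
    b"WEBP": "WebP image",
}

def identify_riff(head: bytes) -> str:
    if head[0:4] != b"RIFF":
        return "not a RIFF file"
    form = head[8:12]
    return RIFF_FORMS.get(form, "RIFF container, unknown form %r" % form)
```

So a matcher that only checks the first four bytes can honestly report no better than "one of several RIFF formats", and that's a well-behaved family; plenty of formats give you less to go on.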
To positively identify a file type by "scanning" it, therefore, the only option is to do whatever quick signature scanning you can, make a tentative identification of the file type, and then try to load the file quietly in the background with some software that uses that file type, and see if the test software swallows the file happily or pukes it back up.
This is suddenly a monstrous undertaking. Either you're trying to robo-run whole commercial applications in the background just to index files, and flogging your storage so hard that SSD owners start to suspect someone's swapped their old hard drive back in, or you have to reverse-engineer those applications into a lighter-weight multi-format file tester that won't get you sued by Adobe, Symantec, Intuit, Microsoft (if you yourself are not Microsoft, and possibly even if you are), et cetera.
An example of how badly this can suck, even when it seems simple at first: there are a zillion programs that store their configuration data in plain-text files with the suffix INI or CFG or something. Dealing with those seems easy. ID them loosely as looking not unlike plain text, or more carefully by making sure there's no high-ASCII weirdness in there that a text editor can be expected to barf on. Now send the file to a text editor; if your computer can't figure out which of the hundreds of different script, settings-file and data-storage formats that're all plain text it's looking at, you probably won't mind if it just shows it to you as plain text. Job done, right?
If there are weird non-text characters in there, though, you're probably looking at a config file from one of those annoying programs that stores its settings in some opaque binary format that humans are not expected to be able to read or edit. There is very probably no "right program" to view this file, anywhere.
But perhaps it's not binary at all; perhaps it's "plain text" that's encoded in a way you didn't think of.
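The "does this look like plain text?" test itself is easy enough to sketch. The thresholds and the particular list of encodings tried here are assumptions of mine, not anybody's standard:

```python
# Heuristic text detector: try a few plausible encodings, and reject
# anything whose decoded form is full of control characters (other
# than tab, newline and carriage return). The 5% threshold and the
# choice of encodings are arbitrary; latin-1 in particular decodes
# anything, so the control-character check does the real work there.
def looks_like_text(data: bytes) -> bool:
    for encoding in ("utf-8", "utf-16", "latin-1"):
        try:
            decoded = data.decode(encoding)
        except UnicodeDecodeError:
            continue
        control = sum(1 for c in decoded
                      if ord(c) < 32 and c not in "\t\r\n")
        if control / max(len(decoded), 1) < 0.05:
            return True
    return False
```

Note that UTF-16 text without a byte-order mark, or text in some eight-bit code page you didn't think to try, can still slip past or get wrongly rejected, which is exactly the trap described above.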
Now enjoy explaining all this to customers who call the help line asking why they can open this.ini but not that.ini. There won't be that many of them, of course, because most users never open anything.ini, and a large subset of those who do know that sometimes such files are text and sometimes they aren't, and have no need for fancy file-ID magic at all.
So you, the programmer who implemented this cursed thing, might actually find yourself simultaneously annoyed by the complaints about un-openable .inis and .cfgs, and annoyed at how few complaints there are.
I've got another one. You could use an auto-file-ID system as insurance against accidental idiocy or practical jokes.
Suppose you have a JPEG file that's accidentally, or as a joke, been named something.txt. You try to open it, the OS sends it to Notepad or whatever, and you get some mangled garbage in the window, and/or an error saying that the text editor can't open it because it's not a text file.
If the text editor can then say "misidentified file problem, help!" to the OS, the OS can run "file", or something similar, on the file and figure out what it really is, and then give you the option to try to open it with a program that can actually open JPEGs. If that works, the OS could then give you the option to automatically rename the file. Applications would have to be rewritten to actually make the request to the OS, but that would be an easy enough change to make to the operating system's own built-in viewers and players and editors for common file types.
Now, suppose that you have accidentally, or someone else has as a joke, set all .JPG files to be opened by Notepad. Again, the program will fail to open the file and call for help, the OS will scan the file, discover that it is what the suffix says it is but this program can't open it, and perhaps ask you if you'd like to return that file association to the default, or even roll it back to whatever it was before the most recent change.
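The shared logic of those two rescue scenarios can be sketched like this. Every name here is hypothetical, since no real OS exposes such an interface; identify-by-scan is assumed to have already happened:

```python
# Hypothetical sketch of the fallback the article proposes: when an
# app reports "I can't open this", the OS compares what the suffix
# claims against what a file(1)-style scan found, and decides which
# repair to offer. All names are invented for illustration.
def suggest_fix(suffix_type: str, scanned_type: str, app_can_open) -> str:
    if scanned_type != suffix_type:
        # Misnamed file: a JPEG called something.txt.
        return "offer to open with a %s handler and rename" % scanned_type
    if not app_can_open(scanned_type):
        # Right file, wrong association: .jpg files sent to Notepad.
        return "offer to reset the %s file association" % scanned_type
    return "no fix needed"
```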
Again, this system is not something that'd require a lot of OS-development work, provided you don't mind most third-party software never working with it.
But, again, neither of these is a problem that most users ever have. And lack of third-party software support would seriously reduce the system's usefulness. And a significant subset of the people who do have either of these problems will be able to solve it themselves.
You could avoid a lot of the agony by, when you do your (relatively) quick file-ID scan, also checking the filename suffix! A file that looks like a Rich Text Format document, and has the suffix RTF, is almost certainly an RTF document, after all. So there!
(RTF is a particularly well-defined format. There's a published spec for it, but experienced programmers know that they can save a lot of time by using the more accurate definition that RTF is "whatever you get when you tell Word to save in RTF format".)
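That combined check is almost embarrassingly cheap. A sketch, in which the signatures (RTF files really do start with an open brace and a backslash-rtf keyword) are real but the function name and the confidence wording are mine:

```python
# Combined check: does the file's content agree with its suffix?
# Agreement from two independent weak signals gives near-certainty.
SIGNATURES = {
    "rtf": b"{\\rtf",
    "png": b"\x89PNG\r\n\x1a\n",
    "pdf": b"%PDF-",
}

def check(suffix: str, head: bytes) -> str:
    sig = SIGNATURES.get(suffix.lower())
    if sig is None:
        return "unknown suffix - trust it, or scan harder"
    if head.startswith(sig):
        return "content agrees with suffix - almost certainly %s" % suffix
    return "suffix says %s but content disagrees - misnamed?" % suffix
```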
If you're looking at the darn filename suffix anyway, though, you only gain two things from checking the file content too.
First, you can detect a file that has a particular suffix but which isn't actually the format indicated by that suffix, as in the accident-or-practical-joke bit, above. But, again except for accidental renamings, jokes, and malware files called nudecelebrity.jpg.exe, this is an uncommon problem.
Second, you can detect clashing suffixes, where more than one valid file format shares the same suffix. This is even less common, unless you're a digital archivist trying to tell which failed 1980s home computer a particular ".PIC" file came from.
Since these minor features are all you gain from flogging through files looking for possibly-nonexistent, possibly-misleading identity clues in addition to looking at the suffix, you might as well call the whole thing off, and only look at the suffixes in the first place.
I'm not exaggerating that "flogging" part, either. A file-identifying system may have to hunt through the entire file to find an identifying "magic number", presuming there's even one there. And files, today, can be both very large and very numerous.
What all this adds up to is that there's no way to truly auto-ID files, beyond simple suffix typing, without making compromises that'll annoy a user who just wants to open the file he just clicked (but can't, because the OS is busy hunting through some other file for ID data), or mystify a user who just paged through 20 windows' worth of icon-view files and is now going to be looking at a window full of question marks for the next 30 seconds (because the OS has been avoiding the previous problem by only examining files when the user actually opens a window and looks at the icons, except that's exactly what the user just did).
This would create confusion and frustration among users, because they'd never know whether a folder's contents would ID quickly or not. Sometimes such an on-demand scan would be really fast, because the OS would, say, strike obvious JFIF (JPG) headers on every file in there and take essentially zero time on top of the standard Windows thumbnail-scan. But sometimes it would be really slow, because the files are all a format that can't be quickly identified, so the OS has to plough through all of every file hunting magic numbers, and then never find them because the format is, say, some manufacturer's brand new improvement on their previous RAW photo format, or it's some other manufacturer's slight misinterpretation of a format you'd otherwise be able to ID.
You could attack these problems by popping up notifications that say things like "gee, this file called 150_encrypted_pr0n_dvds seems to be taking a bit of a while to identify, do you want to wait?", but the user most likely to decide that long file-typing waits mean the computer is broken is also the one most likely to click "OK" without reading such a message.
Now that we're transitioning from mechanical to solid-state storage, though, the disk thrashing that makes it difficult to identify large numbers of files in a timely fashion is going away. It may never be practicable to do the final see-if-a-program-can-actually-load-it test for everything, but scanning for magic-number IDs, at least in the background when nothing important is happening, is a low-impact feature for an SSD system with tons of transfer bandwidth and near-zero latency.
I still don't know whether anybody's going to bother, given the circle of suffixes being good enough, so that's the only typing OSes use, so only crazy people make files without a suffix, so suffixes continue to be good enough. But when a mainstream desktop OS has the "file" command just sitting there doing nothing, it wouldn't be a large project to scan files with it automatically.
We've already got fast enough processors and large enough physical RAM to make complex comparisons against large file-ID databases practicable, when the computer's idle. All we need is for SSDs to replace spinning disks and we'll be all set, at least for local fixed drives.
We may even get this done before we use up all 46,656 alphanumeric three-character suffixes.