New ways to be wrong

Originally published 2002 in Atomic: Maximum Power Computing

Last modified 03-Dec-2011.

A popular objection to the use of the Internet as a research tool is that the information you find there isn't reliable. Neither are books, of course, but the Web has fewer editors and librarians, so there's some validity to the complaint.

Unreliability of Internet information takes two forms.

First, there's plain old wrongness - but it's often in the middle of a bunch of correct stuff that's lulled you into a gullible state.

There's a term for this phenomenon. It's "database pollution".

Take, for instance, Penn & Teller's Swedish Lemon Angels. The Angels are un-makeable biscuits from P&T's excellent book How To Play With Your Food. The recipe includes both baking soda and lemon juice, and when you're instructed to "add the lemon juice all at once and blend into the mixture", said mixture will foam merrily out of the bowl, for elementary chemical reasons. Hilarity ensues.

The Swedish Lemon Angels recipe can be found in various on-line recipe books. I found the deadly Angels lurking on RecipeLand, RecipeSource and Chef2Chef when I first wrote this column for Atomic magazine. Fizzy, lemony database pollution, kids.

Now, there seems to be a "volcano" disclaimer on the end of all three of those Angels recipes, which I could have sworn wasn't there when I first looked. Hold that thought.

Database pollution can be a protest gesture. If you object to some company's marketing behaviour, filling their user database with 107-year-old grandmothers from North Yemen who make more than a million US dollars a year and use the company's software 26 hours a day will probably reduce that database's value. It's getting to the point where automated tools to do this are turning up; consider the now-slightly-harder-to-use New York Times Random Login Generator, for instance.

The second kind of Internet info unreliability is actually good, in a way. It's mutability of information. A page that said one thing when you first looked at it may now have been corrected to say something else. Like the above Angels recipes. When this column showed up in print in Atomic, the recipe links above led to un-warninged versions of the Angels. Well, I think they did, anyway; now there are warnings on the end of all three of them, and there's no way to prove that they weren't always that way.

There's another fine example at - cover your ears, children - fuckmicrosoft.com.

That site contains some pretty good information about why you shouldn't like the Dark Lord Bill's empire.

But it also contains "Microsoft's Really Hidden Files", which has been linked from the site's front page for ages now.

This latter screed, written by a person glorying in the name "The Riddler", tells you about all sorts of apparently privacy-infringing secret data collection by Microsoft.

The first version of it was, to two significant digits, a bunch of misleading hooey.

It's not perfect now, but it's better.

But if you believed version 1.0, you're going to look like a doofus if you point to version 2.6b (November 3, 2001) as evidence for something that the page doesn't say any more. The good old Internet Archive Wayback Machine will provide you with an explanation for your mistake (though it only goes back to v2.0 of the page), but it won't give you an excuse.

By The Riddler's definition in all versions of the page so far, the computer I'm using at the moment has well over a gigabyte of stuff in "folders that Microsoft has tried hard to keep secret".

This is, in my view, a rather uncharitable way to describe temporary files, the swap file, the cookie file, the browser cache, URL auto-completion, and so on. There is quite a bit of Windows data that's not quite as deletable as you might think, and that may be a security risk for those of us with meth labs in our garage or inquisitive younger siblings. But for most people, it's more of a disk space wastage issue than a privacy one, and not much of a problem either way.

The Riddler still tells you that directories that have the System attribute must have it for nefarious reasons, rather than to strongly discourage the uninformed from blundering around in there "making space". And he still implies that Internet Explorer cache files live inside weird alphanumeric-named subfolders because Microsoft doesn't want you getting at them, rather than because of an exploit some years ago in which a l337 h4XX0r would take rapeyourpc.exe and rename it to bunny.jpg, then stick <IMG SRC="innocentfiles/bunny.jpg"> in a Web page. Try to load that page and the renamed program wouldn't display, but it would end up in your browser cache, from which it could be executed by other software. The random-named cache directories are a kludge to stop that sort of thing from happening.
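For the curious, the idea behind random-named cache directories can be sketched in a few lines of Python. This is only an illustration of the general trick, not how Internet Explorer actually does it: the function names, the four-directory count, the eight-character names and the "cache" root are all assumptions made up for the example. The point is just that someone who knows a file's name still can't predict the full path it will land on.

```python
import secrets
from pathlib import PurePosixPath

# Characters used for the made-up directory names (an assumption for this sketch).
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


def make_cache_subdirs(count=4, name_length=8):
    """Return `count` distinct random directory names (hypothetical helper)."""
    names = set()
    while len(names) < count:
        names.add("".join(secrets.choice(ALPHABET) for _ in range(name_length)))
    return sorted(names)


def cache_path(cache_root, subdirs, filename):
    """Choose one of the random subdirectories for a downloaded file."""
    return PurePosixPath(cache_root) / secrets.choice(subdirs) / filename


# The names are generated once per machine, so they differ from PC to PC.
subdirs = make_cache_subdirs()
path = cache_path("cache", subdirs, "bunny.jpg")
print(path)  # something like cache/<8 random characters>/bunny.jpg
```

A Web page author knows the "bunny.jpg" part, but not the directory name in the middle, so there's no fixed location for other software to be pointed at.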

The Riddler's also still unhappy about the fact that Outlook Express doesn't automatically compact mail folders after things are deleted, with the result that even after you delete e-mail from the Trash folder, the messages will still be there in the DBX file. OK, maybe Microsoft should have put in an auto-compact feature for when someone's just deleted the entire contents of a folder, because compaction should be quite fast then, instead of the usual lengthy drive-flog. But then the data wouldn't be recoverable in the event that the deletion was an accident, of the sort suffered strangely often by Outlook Express users.

The Riddler also still tells you that Cookies Are Bad, m'kay. But, like a bunch of other cookie-phobes, he doesn't tell you why. Lots of people seem to be under the impression that cookies let Web sites find out things about you that you haven't already told them. That's not the problem; this page gives a less alarmist explanation of what the problem really is.

Personally, I rather like not having to log in to various low-security Web sites. When I used to use SpamCop a lot (I don't, any more), I would have gone bananas without cookies turned on.

But all of these errors may well be tidied up in the near future, if The Riddler writes v3.0. In that case, this page of mine won't make me look stupid (well, no stupider than I look all the time, anyway), because I clearly state that I'm talking about v2.6b of Really Hidden Files. But most people aren't in the habit of specifying that, and many Web pages don't even have a last-updated date.

Heck, I'm not generally in the habit of clearly timestamping my off-site references, and it doesn't necessarily help anyway. When I put this page up on the Web, I failed to notice that the above-linked Swedish Lemon Angels recipes now had a disclaimer on the end of them, and used them as straight examples of database pollution, thereby making myself look as if I hadn't noticed in the first place. Well, maybe I hadn't; there's no way for me to be sure that the recipes weren't this way all along.
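If you did want to be rigorous about it, one approach is to store a fingerprint of the page along with the date you looked at it, so you can at least tell later whether the thing you cited has changed. Here's a minimal sketch in Python, assuming you've already saved the page text yourself. The Reference class, the function names, the example.com URL, the date and the recipe text are all hypothetical stand-ins invented for the example, not real data.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Reference:
    """A citation that remembers what the page looked like (hypothetical type)."""
    url: str
    version: str      # whatever version label the page gives, if any
    retrieved: date   # when you looked at it
    sha256: str       # fingerprint of the text you actually read


def fingerprint(page_text):
    """Hash the page text; any change to the text changes the hash."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()


def cite(url, version, page_text, retrieved):
    """Record a reference together with the fingerprint of its content."""
    return Reference(url, version, retrieved, fingerprint(page_text))


def has_changed(ref, current_text):
    """True if the page no longer matches what was cited."""
    return fingerprint(current_text) != ref.sha256


# Made-up page contents standing in for "before" and "after" versions.
old_text = "Add the lemon juice all at once and blend into the mixture."
new_text = old_text + " WARNING: this recipe is a joke and will foam over."

ref = cite("http://example.com/lemon-angels", "unversioned", old_text, date(2002, 1, 1))
print(has_changed(ref, old_text))  # False
print(has_changed(ref, new_text))  # True
```

This tells you that a page changed, not what changed, and a hash of raw page text will also trip on trivial edits like a reshuffled ad or a new footer. To show what the page used to say, you still need a saved copy.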

Often, if you have a problem with a link to a Web page that's been around for a while, it's just a broken link; the page has moved, or vanished entirely. But sometimes it's still there, but different.

Usually, this is pretty obvious. For instance, Penn and Teller's site, which I link to above, used to be www.sincity.com; as I write this, that's still the number one Google hit when you search for "penn and teller". But that URL is now owned, as you might expect just by looking at it, by a porn site.

If a couple of magicians sell their domain to a smut site (I don't know if that's what they did, but I don't see anyone complaining about their domain being stolen...), or an accounting firm goes broke and its domain is re-registered by a porn site (or, worse yet, the reverse happens), then you're unlikely to mistake the new site for the old one.

But if someone just rewrites their page, then your references to it can look as if you're deliberately misquoting them, or worse.

Startlingly, there's a point to this rant which doesn't involve grand sociological statements about two-edged swords and the slippery slope of revisionism.
