How Many Ways Can You Spell V1@gra?
Spam mutates, and the Internet community mounts an immune response
When I first noticed spam with aberrant spellings, I assumed that someone out there in the murky world of spam service providers had written a program to generate random variants. It's not hard to do so. At the core of the program is a grammar that defines all possible ways of forming an acceptable spelling. The illustration on this page gives an example of such a grammar, one that allows substitutions and a limited form of insertions. The grammar could be improved with a few context-sensitive rules. For example, the numeral "6" might be chosen as a substitute for "G" only when it is surrounded by upper-case characters; in a lower-case context, "9" would be a better choice.
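A generator of this kind can be sketched in a few lines of Python. What follows is only an illustration under stated assumptions, not the grammar shown in the illustration and certainly not any spammer's actual code: the table of alternatives, the filler characters, the 30 percent digit rate and the names ALTERNATIVES, FILLERS, digit_for_g and random_variant are all hypothetical. It shows the three ingredients described above: substitution, a limited form of insertion, and one context-sensitive rule for replacing "G" with a numeral.

```python
import random

# Hypothetical alternatives for each letter; the article's illustration
# defines its own grammar, which is not reproduced here.
ALTERNATIVES = {
    "v": ["v", "V"],
    "i": ["i", "I", "1", "!"],
    "a": ["a", "A", "@"],
    "g": ["g", "G"],
    "r": ["r", "R"],
}

# Limited insertion: most of the time nothing is inserted between letters.
FILLERS = ["", "", "", "-", "."]


def digit_for_g(prev_ch, next_ch):
    """Context-sensitive rule: '6' between upper-case letters, otherwise '9'."""
    if prev_ch.isupper() and next_ch.isupper():
        return "6"
    return "9"


def random_variant(word="viagra", rng=random, digit_rate=0.3):
    """Return one random spelling of `word` (assumed illustrative parameters)."""
    # Pass 1: pick one alternative for every letter.
    chars = [rng.choice(ALTERNATIVES.get(c, [c])) for c in word.lower()]
    # Pass 2: sometimes swap an interior G for a digit chosen from its neighbors.
    for i, c in enumerate(chars):
        if c in "gG" and 0 < i < len(chars) - 1 and rng.random() < digit_rate:
            chars[i] = digit_for_g(chars[i - 1], chars[i + 1])
    # Pass 3: optionally insert a filler between adjacent characters.
    out = [chars[0]]
    for c in chars[1:]:
        out.append(rng.choice(FILLERS))
        out.append(c)
    return "".join(out)
```

Calling random_variant() repeatedly yields a different spelling on most calls. Passing a seeded random.Random instance as rng makes the sequence reproducible, which is convenient for testing the rules.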
With a program generating random variants and inserting them into the text of a message, every recipient of a mailing could get a customized version. (What a thrill: your very own personalized spam!) Even if many mailings used the same algorithm, so that not all the spellings were unique, it's unlikely that any one recipient (or, more to the point, any one spam filter) would see the same spelling twice.
I still suspect that such random-spelling generators exist in the spam world, but the evidence of my own inbox suggests they are not widely used. The telltale mark of their use would be a peculiar abundance of hapax legomena—the lit-crit term for words that appear only once in a corpus. As a first step in looking for such throwaway spellings, I read through a sample of about 10,000 spam messages, extracting every variant of "Viagra" I could find. (I excluded cases where the term was obfuscated not by misspelling but by other techniques, such as embedding HTML tags between the letters.) At the end of this bleary-eyed task I had 113 spellings. I then ran a search program over all the spam I had received in the course of three years, counting the number of occurrences of each of the spellings.
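The counting step is simple enough to sketch. The code below is not the search program I actually ran; it is a minimal Python version that assumes the messages are already available as plain strings and that a spelling is matched as an exact, case-sensitive substring. The function names count_spellings and hapax_legomena are invented for the example.

```python
from collections import Counter


def count_spellings(messages, spellings):
    """Count occurrences of each listed spelling across all messages.

    Matching is by exact, case-sensitive substring (an assumption), so
    "Viagra", "viagra" and "VIAGRA" are tallied as separate spellings.
    """
    counts = Counter({s: 0 for s in spellings})
    for text in messages:
        for s in spellings:
            counts[s] += text.count(s)
    return counts


def hapax_legomena(counts):
    """Return the spellings that occur exactly once in the corpus."""
    return sorted(s for s, n in counts.items() if n == 1)
```

One caveat about this design: plain substring matching would count a spelling twice if one variant happened to be contained inside another, so a more careful version would match whole tokens.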
There were indeed a few hapax legomena, and some of them exhibited a family resemblance, suggesting that they might all be the output of the same algorithm. For example, VtAGGReA, VttAGGRA and VtAGGjRA each appeared exactly once in the corpus of 88,324 messages, all within a period of a few days in June 2006. Perhaps these strangely malformed words are the product of a Perl script running on some spammer's clandestine mail server. Or maybe they're just someone strumming on the keyboard. In either case, it appears the experiment was not a great success. Nothing similar has turned up since.
Certain other spellings also seem to come in families, such as VÌagra, VÏagra, V1agra and V1AGRA, but these examples are definitely not dynamically generated variants that are meant to be unique to each message. They arrived in my mailbox in big bunches, often several copies a day. The V1agra spelling appears more than 400 times in my three-year collection.
The spelling variations (shown in the illustration "Spelling 'Viagra'") might be likened to the spectrum of genetic mutations in a population of bacteria. The Viagra variants offer some tentative hints about the working habits of spammers. Often, once a spelling has been incorporated into a message, it is mailed out repeatedly over a period of a few months, then given a rest. The conspicuous vertical correlations within the diagram suggest that mailings with the various spellings are not independent. A plausible inference is that most of the pill-pushing spam comes from a few individuals.
Among all the spellings, the most popular by far were the plain, unadorned ones: Viagra, viagra and VIAGRA. There's a bit of a conundrum here. If mangled spellings are necessary to get a message through the spam filters, then what's the point of sending thousands of undisguised ones? And if the mangling isn't really needed, then why go to the bother of producing all those variants? For what it's worth, the spam filter running on my computer seems quite indifferent to all these shenanigans: With very few exceptions, all of the Viagra messages, no matter how they spell it, go directly to the junk folder.