Spam, Spam, Spam, Lovely Spam
In the early years of the Internet, dealing with miscreants was remarkably straightforward. Although Net lore and legend celebrate the lack of any central governing body, the network was actually run by a cohesive community with shared aims and values. Any serious breach of the rules could be punished by expulsion—by canceling the violator's account. The situation is different now. If Hotmail kicks you out, you just move over to AOL or MSN. Furthermore, Net pariahs with enough resources can set up their own Internet service provider.
Yet the power to isolate renegade sites has not been lost entirely. You cannot easily reach out and unplug an offending node of the network, but you can set up your own system to ignore any information coming from that node. In particular, a system administrator can configure an Internet router so that it refuses traffic from selected sites. Some years ago Paul Vixie, who conceived several early Internet protocols, began publishing a list of network nodes from which spam was emanating. This Realtime Blackhole List, or RBL, is now maintained by a nonprofit organization called MAPS. For subscribers to the RBL, the listed sites become black holes—no e-mail can get out, and in some cases traffic of all kinds is blocked. The weapon is quite blunt, in that it stops not only the spam but also other innocent communication. The rationale for this policy is that legitimate users of a blacklisted service will exert pressure to shut down the spammer so that they can again reach the outside world. It's rather like keeping the whole class after school until someone turns in the naughty child. Not everyone approves of this strategy, and there have been several lawsuits against Vixie and MAPS. Meanwhile, spammers cope by keeping on the move and by disguising their whereabouts.
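The blocking decision itself is simple, which is part of why the weapon is so blunt. Here is a minimal sketch in Python, assuming the blacklist is just a set of network ranges. The name `is_blackholed` and the listed ranges are hypothetical (the addresses come from blocks reserved for documentation), and this is not a description of how the RBL is actually distributed or queried.

```python
import ipaddress

# Hypothetical blacklist entries. These ranges are reserved for documentation
# and merely stand in for networks a real list might name.
BLACKLIST = [
    ipaddress.ip_network("192.0.2.0/24"),      # an entire listed provider
    ipaddress.ip_network("198.51.100.17/32"),  # a single listed host
]

def is_blackholed(source_ip: str) -> bool:
    """Return True if traffic from source_ip should be refused."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in network for network in BLACKLIST)
```

Note that every address inside a listed range is refused, whether or not that particular machine ever sent spam. The innocent neighbors are caught along with the offender.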
Other approaches to filtering out spam try to block only the offending mail. The filter can be installed on the individual user's computer, on a mail server or farther upstream. Much ingenuity has been brought to bear on designing filters. Also on evading them.
The simplest kind of filtering uses static criteria to sort incoming mail into various folders or directories. For example, a filter might reject any mail that comes from "Bargain Blizzard" or that has "inkjet" in the subject line. But the spammer's response is all too easy: The sender becomes "Blizzard of Bargains" and the subject becomes "i-n-k-j-e-t" or one of a thousand other variations. There is also the problem that a friend writing for advice on an inkjet printer may have a hard time getting through.
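A static filter of this kind takes only a few lines, and so does its failure. The following Python sketch uses the two rules mentioned above; the function name `static_filter` and the folder labels are made up for illustration.

```python
def static_filter(sender: str, subject: str) -> str:
    """Sort a message using fixed, hand-written rules."""
    if sender == "Bargain Blizzard":
        return "reject"
    if "inkjet" in subject.lower():
        return "reject"
    return "inbox"

# The original spam is caught.
print(static_filter("Bargain Blizzard", "Inkjet cartridges, cheap"))
# A trivially disguised version is not.
print(static_filter("Blizzard of Bargains", "i-n-k-j-e-t cartridges, cheap"))
# A friend's innocent question is rejected.
print(static_filter("A Friend", "Which inkjet printer should I buy?"))
```

The three calls show the three outcomes in turn: a hit, a miss caused by trivial disguise, and a false alarm on legitimate mail.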
Large-scale services such as Brightmail and Postini cannot hope to keep up with the evolving spam ecosystem by hand-crafting filter rules. The key to their methodology is to set up thousands of "honeypots"—e-mail accounts whose only purpose is to attract spam. Since these addresses should have no legitimate e-mail sent to them, messages collected there can serve as templates for filtering the stream of mail going to the service's subscribers. In this way one of the essential, defining characteristics of spam—the fact that it goes simultaneously to thousands of addresses—is turned into a weapon against it.
Another mechanism for building filters is based on the collaborative effort of thousands of people performing the routine daily chore of sorting their e-mail. If you are participating in such a cooperative network, then every time you mark a spam message for deletion, a copy of the e-mail is sent to a central repository; there many such reports are gathered and compiled into filter criteria. When the same message arrives again, addressed either to you or to another participant in the cooperative, the mail is automatically shunted to the spam bin. This idea originated with Vipul Ved Prakash, a San Francisco programmer, who created a public-domain program called Vipul's Razor. A commercial version called SpamNet has been in testing for the past year. The SpamNet cooperative has more than 300,000 members.
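The bookkeeping behind such a cooperative can be sketched roughly as follows. This is an illustration only, not the signature scheme used by Vipul's Razor or SpamNet: the class name `SpamRepository`, the use of a SHA-256 hash over whitespace-normalized text, and the report threshold are all assumptions made for the example.

```python
import hashlib
from collections import Counter

class SpamRepository:
    """A toy central repository that counts spam reports per message."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold  # reports needed before a message is blocked
        self.reports = Counter()    # fingerprint -> number of reports

    @staticmethod
    def fingerprint(message: str) -> str:
        # Assumed normalization: lowercase and collapse runs of whitespace.
        normalized = " ".join(message.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def report(self, message: str) -> None:
        """Called when a participant marks a message as spam."""
        self.reports[self.fingerprint(message)] += 1

    def is_spam(self, message: str) -> bool:
        """Called when mail arrives for any participant."""
        return self.reports[self.fingerprint(message)] >= self.threshold
```

Once two members have reported the same mailing, a third member never sees it. Because the fingerprint is an exact hash, though, a copy that differs by even one character produces a different fingerprint and slips through.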
Schemes that filter out only identical copies of a known exemplar message have a serious weakness: The spammer can overcome them by making each copy of a mailing slightly different, perhaps by adding a few random characters to the text. (Presumably, this explains subject headers such as "Married, Lonely, and home alone ! 2563SEpT0-115eltW64-18.") Brightmail reports that 90 percent of all spam messages are now unique, and so more-elaborate algorithms are needed to establish a match between a template and a target. The SpamNet technique is vulnerable to the same countermeasure, and so again the filter cannot rely on a simple, exact match. Given many copies to examine, however, an algorithm can determine which regions of the message are constant and which are variable, and thereafter focus only on the stable, identifying features. But a counter-countermeasure has already appeared: "scramblespam," in which random characters are not a minor addition but make up most of the message, typically with the actual content of the ad embedded in an image. Cloudmark, the company behind SpamNet, reports that it has recently devised an algorithm for identifying scramblespam.
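One plausible way to separate the constant regions of a mailing from the variable ones is to align several copies word by word and keep only what they share. The Python sketch below does this with the standard library's `difflib.SequenceMatcher`. It is a guess at the general idea, not the algorithm used by Brightmail or Cloudmark, and the helper names and the 0.8 matching fraction are arbitrary choices made for the example.

```python
from difflib import SequenceMatcher
from functools import reduce

def common_tokens(a: list, b: list) -> list:
    """Return the tokens of a that align with matching tokens of b, in order."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    shared = []
    for block in matcher.get_matching_blocks():
        shared.extend(a[block.a : block.a + block.size])
    return shared

def stable_template(copies: list) -> list:
    """Reduce many copies of a mailing to the tokens they all share."""
    return reduce(common_tokens, [copy.split() for copy in copies])

def matches_template(message: str, template: list, min_fraction: float = 0.8) -> bool:
    """True if the message contains most of the template's stable tokens."""
    if not template:
        return False
    found = common_tokens(template, message.split())
    return len(found) >= min_fraction * len(template)
```

Given three copies of a subject line that differ only in a trailing string of random characters, `stable_template` discards the random tail and keeps the words in common, so a fourth copy with yet another random tail still matches. The approach has nothing to hold onto when the random material makes up most of the message.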
Yet another filtering strategy abandons the whole idea of matching messages to templates and simply looks at the statistical properties distinguishing desirable e-mails from spams. This idea began to gain momentum last summer when Paul Graham, a computer scientist best known for his books on the Lisp programming language, circulated an article titled "A Plan for Spam." It soon emerged that similar principles were already familiar and well-developed in other fields, such as automated text analysis and computational learning theory. By the time of a conference on spam held at MIT in January, several variations of the algorithm were being actively explored.
The statistical filtering process requires a reasonably large corpus of e-mail messages, already divided into spam and nonspam categories so that they can serve as a training set. The program breaks messages down into individual words and other "tokens," recording the number of appearances of each token in the two groups of messages. The resulting frequency tables give the probability that any spam or nonspam message contains a specified token. Furthermore, from the same information it is also possible to calculate the inverse probability: Given any token, the tables determine the probability that a message containing the token is either spam or nonspam. New messages are classified by finding the most "interesting" tokens—those whose probabilities are closest either to 0 or to 1—and then computing an overall composite probability that the e-mail is spam. In Graham's experiments, the probability distribution turned out to be strongly bimodal: Most messages were either close to 0 or close to 1, with few in the middle. His reported error rate is about five per 1,000 for false negatives (spams that squeak by as legitimate mail), with no false positives (legitimate mail misidentified as spam).
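The procedure just described can be condensed into a short Python sketch. It is a simplified illustration rather than a reproduction of Graham's program: the tokenizer, the class name `StatisticalFilter`, the clamping of probabilities to the range 0.01 to 0.99, the neutral value of 0.4 for tokens never seen in training, and the limit of 15 "interesting" tokens are all choices made for the example. The inverse probability is computed on the assumption that spam and nonspam are equally likely beforehand.

```python
import re
from collections import Counter

def tokenize(text: str) -> set:
    """Split a message into a set of lowercase word-like tokens."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

class StatisticalFilter:
    def __init__(self, spam_corpus: list, ham_corpus: list):
        self.n_spam = len(spam_corpus)
        self.n_ham = len(ham_corpus)
        # For each token, count how many messages in each group contain it.
        self.spam_counts = Counter(t for m in spam_corpus for t in tokenize(m))
        self.ham_counts = Counter(t for m in ham_corpus for t in tokenize(m))

    def token_probability(self, token: str) -> float:
        """Probability that a message containing this token is spam."""
        s = self.spam_counts[token] / self.n_spam  # P(token | spam)
        h = self.ham_counts[token] / self.n_ham    # P(token | nonspam)
        if s + h == 0:
            return 0.4                             # unseen token: assumed neutral
        return min(0.99, max(0.01, s / (s + h)))

    def spam_probability(self, message: str, n_interesting: int = 15) -> float:
        """Combine the most extreme token probabilities into one score."""
        probs = [self.token_probability(t) for t in tokenize(message)]
        probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
        spam_product = ham_product = 1.0
        for p in probs[:n_interesting]:
            spam_product *= p
            ham_product *= 1.0 - p
        return spam_product / (spam_product + ham_product)
```

Even with a toy training set of a few messages per category, the combined score tends to land near 0 or near 1, because multiplying several extreme token probabilities drives the result toward one end of the range. That is consistent with the bimodal distribution reported above, though a handful of examples proves nothing about real-world error rates.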
Graham argues that a filter based on the entire content of a message cannot be evaded without altering the content itself. "It would not be enough for spammers to make their emails unique or to stop using individual naughty words," he writes. "They'd have to make their mails indistinguishable from your ordinary mail.... Spam is mostly sales pitches, so unless your regular mail is all sales pitches, spams will inevitably have a different character." The argument seems compelling and the results so far are impressive, but the real test will come when such filters are widely deployed, putting pressure on spam authors to invent countermeasures.