How Many Ways Can You Spell V1@gra?
Spam mutates, and the Internet community mounts an immune response
All the zany spellings discussed earlier are presumably part of the adversary's strategy for slipping through the anti-spam net. When a message arrives trying to sell you "VtAGGjRA," your spam filter probably doesn't have a score for that token. Some of the obfuscation techniques are even more insidious, in that they disrupt the low-level process of forming tokens. The spelling "V i a g r a" might well be taken as six separate tokens.
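A minimal sketch of the tokenizing problem, assuming a simple regular-expression tokenizer (the functions `tokenize` and `collapse_spaced_letters` are hypothetical names invented for this illustration, not part of any particular filter):

```python
import re

def tokenize(text):
    """Split text into lowercase tokens made of letters, digits, and a few symbols."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9@$'!-]+", text)]

def collapse_spaced_letters(text):
    """Heuristic: rejoin runs of three or more single letters separated by single spaces."""
    pattern = r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b"
    return re.sub(pattern, lambda m: m.group(0).replace(" ", ""), text)

message = "Buy V i a g r a now"

# The spaced-out spelling shatters into six one-letter tokens.
print(tokenize(message))
# ['buy', 'v', 'i', 'a', 'g', 'r', 'a', 'now']

# Rejoining the letters before tokenizing restores a single token.
print(tokenize(collapse_spaced_letters(message)))
# ['buy', 'viagra', 'now']
```

The rejoining step is only a heuristic: it would also glue together an innocent run of single letters such as "a b c", so a real filter would have to weigh that risk.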
Some of the wiles of spammers may be intended not just to evade detection but also to sabotage the filter. The idea is called Bayesian poison. Many spams include a long addendum of random words or irrelevant text, such as paragraphs lifted from Dickens or Defoe, or recent news items. One aim of this practice is to dilute the spamminess of the message, so that the overall score might fall below the rejection threshold. But there could be a more nefarious effect as well. If the message does wind up in the spam bin, then all its "innocent" words will have their spam scores increased slightly. Eventually, legitimate messages that happen to mention those words may be falsely condemned as spam. Returning to the immunological metaphor, Bayesian poison induces something like an allergy or an autoimmune disease.
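Both effects can be sketched in a few lines, assuming a naive Bayes rule for combining per-token probabilities and a deliberately simplified count-based token score. All the numbers and function names here are invented for illustration:

```python
import math

def combined_spam_probability(token_probs):
    """Naive Bayes combination of per-token spam probabilities (each strictly between 0 and 1)."""
    log_spam = sum(math.log(p) for p in token_probs)
    log_ham = sum(math.log(1.0 - p) for p in token_probs)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

def token_spam_score(spam_count, ham_count):
    """Simplified score: the fraction of a token's sightings that were in spam."""
    return spam_count / (spam_count + ham_count)

# Dilution: a few strongly spammy tokens, then the same tokens padded
# with eight innocent-looking words (hypothetical probabilities).
pitch = [0.95, 0.90, 0.90]
padding = [0.20] * 8
plain = combined_spam_probability(pitch)
padded = combined_spam_probability(pitch + padding)
print(plain > 0.99, padded < 0.5)  # True True

# Poisoning: an innocent word seen in 5 hams and no spams, then seen
# again after 5 padded spams have landed in the spam bin.
print(token_spam_score(0, 5), token_spam_score(5, 5))  # 0.0 0.5
```

Filters that weigh only the most extreme tokens in a message are less sensitive to this kind of padding, since mildly innocent words never enter the calculation.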
Opinions differ on whether Bayesian poisoning is a threat we need to worry about. An article by John Graham-Cumming, which reviews experiments by several groups, concludes that the attack is likely to succeed only if the spammer can get feedback on which messages succeed in fooling the filter. And frequent retraining of the filter is an effective defense.
Paul Graham argues that many attempts to disguise spam actually make it stand out more prominently. After all, few legitimate correspondents spell words with 1's and @'s in the middle, or quote long passages from Martin Chuzzlewit. These peculiarities become distinctive features that the filter can seize on. It remains to be seen whether the filters will cope as well with the latest spam fad, which puts the entire message in an image rather than text.
On the defensive side, the hottest fads of the moment are SVMs and HMMs. SVM stands for support vector machine; it is a learning algorithm used mainly for classifying data. In an e-mail corpus each message can be assigned to a point in a high-dimensional space, where each dimension represents one of the features that distinguish spams from hams. An SVM attempts to find the hyperplane in this space that best separates the two sets of points, namely the one that leaves the widest margin between them.
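The geometric idea can be sketched without any library, assuming just two made-up features per message and a linear soft-margin SVM trained by subgradient descent on the hinge loss. This is one simple way to approximate the separating hyperplane, not the method of any particular filter. The feature choice, the data, and the names `train_linear_svm` and `classify` are all hypothetical:

```python
def train_linear_svm(points, labels, epochs=200, rate=0.1, reg=0.01):
    """Approximate a soft-margin linear SVM by subgradient descent on the hinge loss."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization step: shrink the weights slightly.
            w = [wi * (1.0 - rate * reg) for wi in w]
            if margin < 1.0:
                # Point is inside the margin or misclassified: push the plane away from it.
                w = [wi + rate * y * xi for wi, xi in zip(w, x)]
                b += rate * y
    return w, b

def classify(w, b, x):
    """Return +1 (spam side of the hyperplane) or -1 (ham side)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical features: (number of obfuscated tokens, number of personal names).
messages = [(3, 0), (4, 1), (5, 0), (0, 2), (1, 3), (0, 4)]
labels = [1, 1, 1, -1, -1, -1]  # +1 = spam, -1 = ham

w, b = train_linear_svm(messages, labels)
print(classify(w, b, (6, 0)), classify(w, b, (0, 5)))  # 1 -1
```

In two dimensions the "hyperplane" is just the line w[0]*x0 + w[1]*x1 + b = 0; a real filter would work with thousands of features, but the geometry is the same.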
An HMM is a hidden Markov model, a device currently popular in bioinformatics and speech recognition. HMMs are useful for inferring the hidden rules that govern a sequence of signals or symbols, such as the letters in a text. One possible application of HMMs in spam filtering is undoing the deliberate misspellings of words.
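As a toy illustration of that last idea, here is a sketch in which the hidden states are the letters the spammer meant and the observed symbols are the characters actually typed. It assumes letter-to-letter transition probabilities estimated from a four-word vocabulary and made-up, unnormalized emission weights for look-alike characters; the Viterbi algorithm then recovers the most probable intended spelling. Every name and number below is an assumption for the sake of the example:

```python
import math
from collections import Counter

VOCAB = ["viagra", "pills", "online", "mail"]          # tiny hypothetical training vocabulary
DISGUISES = {"i": "1!", "l": "1|", "a": "@4", "o": "0", "e": "3", "s": "$5"}
STATES = sorted(set("".join(VOCAB)))                   # hidden states: intended letters

bigrams = Counter(pair for word in VOCAB for pair in zip(word, word[1:]))
firsts = Counter(word[0] for word in VOCAB)

def start(state):
    """Probability that a word begins with this letter (add-one smoothing)."""
    return (firsts[state] + 1) / (len(VOCAB) + len(STATES))

def transition(prev, cur):
    """Probability that letter cur follows letter prev (add-one smoothing)."""
    total = sum(bigrams[(prev, s)] for s in STATES)
    return (bigrams[(prev, cur)] + 1) / (total + len(STATES))

def emission(state, symbol):
    """Illustrative weight for an intended letter showing up as the typed symbol."""
    if symbol == state:
        return 0.7
    if symbol in DISGUISES.get(state, ""):
        return 0.3 / len(DISGUISES[state])
    return 1e-6                                        # floor for anything else

def viterbi(observed):
    """Return the most probable intended spelling of a non-empty observed string."""
    observed = observed.lower()
    # best[s] = (log-probability of the best path ending in state s, that path)
    best = {s: (math.log(start(s)) + math.log(emission(s, observed[0])), s)
            for s in STATES}
    for symbol in observed[1:]:
        step = {}
        for cur in STATES:
            score, path = max(
                (best[prev][0] + math.log(transition(prev, cur)), best[prev][1])
                for prev in STATES
            )
            step[cur] = (score + math.log(emission(cur, symbol)), path + cur)
        best = step
    return max(best.values())[1]

print(viterbi("v1@gra"))  # viagra
print(viterbi("p1lls"))   # pills
```

Note that the character "1" could stand for either "i" or "l"; it is the transition probabilities that settle the question in each word. This sketch handles only one-for-one substitutions. Spellings with inserted or deleted characters, such as "VtAGGjRA," would need a richer model with extra states.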