FEATURE ARTICLE
Knowledge Discovery and Data Mining
Computers taught to discern patterns, detect anomalies and apply decision algorithms can help secure computer systems and find volcanoes on Venus
Carla Brodley, Terran Lane, Timothy Stough
Computer Security
In June of 1994, Vladimir Leonidovich Levin, a computer expert in St. Petersburg, Russia, penetrated the CitiBank electronic funds-transfer network. Over the next five months, he funneled $10 million into accounts in California, Israel, Finland, Germany, the Netherlands and Switzerland. He was eventually apprehended, and most of the money was recovered, but the incident revealed the vulnerability of our modern information infrastructure.
As U.S. and international businesses depend more on computers and networks, so grows the threat of computer crime. Twenty years ago, when computer systems were relatively unavailable and fairly well insulated from one another, the major threats to system security were corporate insiders and the more or less innocuous "hackers," out for the thrill of illicit exploration. Not so today. The taxonomy of computer intruders has grown to include experts in industrial espionage, members of organized crime syndicates, information thieves, "cyberstalkers," and even hostile foreign nationals engaged in "cyberterrorism." Furthermore, the spread of "hacker tools"—programs designed to automate the process of computer intrusion—has allowed relatively inexperienced computer users to present a real threat.
The job of detecting and preventing such intrusions continues to grow more difficult. System administrators and security officers must monitor large networks, often comprised of thousands of computers and terabytes of storage space, in which a single security violation on one workstation might be a multimillion-dollar incident. According to the Computer Emergency Response Team, an organization of computer security professionals, only 5 percent of victim sites are even aware that they have been infiltrated. Although the raw information necessary to detect an intrusion is often available in the audit data recorded by each computer, there is far too much audit information each day (most of which records perfectly mundane and innocuous activities) for a person to inspect.
By detecting anomalous patterns in logs of computer usage, a data-mining system can flag suspicious events for system administrators, thus greatly reducing the burden on them. Such a system monitors a single user's computer or account and develops a profile of that user's typical behavior. It can detect anomalies (potentially harmful or abusive actions) as deviations from the known and expected patterns. At the same time, it must be flexible enough to accept differences resulting from "normal" changes—for example, alterations in behavior as a user learns to use a new program or undertakes new tasks.
In an anomaly-detection system that we have developed at Purdue, we represent user behaviors as sequences of commands and their arguments. The sequences are of fixed length (typically 10 elements or "words") and are intended to represent short groups of actions that the user performs frequently. For example, the system may notice that, when writing an article, a user often executes the following sequence of UNIX commands:
cd work/
ls -laF
cd article/
vi myfile.tex
latex myfile.tex
(meaning: move to the work directory; list all files with their sizes, owners, file permissions and last modifications; move to the article directory; edit "myfile.tex" with the editor vi; process "myfile.tex" as a LaTeX file). This sequence would then be recorded and stored in the user's profile. Once the profile is created, the anomaly detector continuously compares the current input command sequences to the known profile to obtain a "similarity" score. If the similarity falls below a certain threshold, the behavior is flagged as anomalous.

Just as the heart of a decision-tree algorithm is the evaluation of splitting rules, the heart of this anomaly-detection algorithm is the function used to compare current inputs to the profile. Our approach scores the similarity of two sequences both by the number of identical "words" and by the proximity of those matches. Intuitively, people tend to work at a task contiguously, rather than in fragments. Therefore, we expect that matches occurring in unbroken runs are more likely to represent similar behaviors, whereas matches separated by mismatches are more likely to have happened by chance. As shown in Figure 5b, the first element in an unbroken run of matches adds one point to the similarity score; the next adds two points, and so on. This biases the score toward adjacent matches, because it attributes more importance to each match in a run as the run grows longer. When a run is interrupted, the weight of an individual match is reset to one.
As rudimentary as this method is, it can differentiate users with surprising accuracy. In Figure 5c, the upper curve represents normal computer usage by the profiled user, whereas the lower curve is generated by a different user. Both users were graduate students in Purdue's Machine Learning research laboratory, and their behavioral data were collected as they performed their normal classwork and research tasks. The profile was generated from approximately a month and a half's data, whereas the tested data (displayed in the plot) was drawn from a later period of approximately three months. We have found that the methods discussed here can identify the profiled user with as high as 99 percent accuracy and differentiate the profiled user from an another user almost 94 percent of the time.
Because usage patterns change over time, we included one more stage in our anomaly-detection system. After the system identifies a user as valid, that user's activity is fed back into his or her profile, thereby enabling the system to learn new behavior. Although this improves the system's ability to adapt over time, it does leave the system open to a particularly insidious form of attack known as hostile training. In this attack, an intruder masquerades as a valid user, performing innocuous actions for a time, but gradually engaging in more dangerous activities in an effort to train the defense system to believe the hostile actions are normal. To date, this attack appears to be purely hypothetical, because real adaptive security systems are still rare. But hostile training is a clear danger to such systems, and it is better to design defenses now than wait for the problem to surface when it really matters.
By the way, we were not able to find out exactly how CitiBank caught Levin; they are understandably wary of publishing all the details. A little more is known about another data-mining system, IBM's Fraud and Abuse Management System (FAMS), which has been used to detect millions of dollars of health-care fraud. This system separates routine billing patterns from unusual ones by profiling providers against one another and checking for unusual patterns that have pointed to fraud in the past. For example, Empire Blue Cross and Blue Shield, a health-insurance company in New York, caught a doctor who billed them more than $1.4 million for bronchoscopies that were never performed. The program noticed that the doctor claimed to perform one operation per patient per week, and flagged it as unusual because bronchoscopy is a procedure that is ordinarily done no more than once or twice in a patient's lifetime.
» Post Comment