Knowledge Discovery and Data Mining
Computers taught to discern patterns, detect anomalies and apply decision algorithms can help secure computer systems and find volcanoes on Venus
One of the most important parts of a scientist's work is the discovery of patterns in data. Yet the databases of modern science are frequently so immense that they preclude direct human analysis. Inevitably, as their methods for gathering data have become automated, scientists have begun to search for ways to automate its analysis as well. Over the past five years, investigators in a new field called knowledge discovery and data mining have had notable successes in training computers to do what was once a unique activity of the human brain.
The study of climate change provides an excellent example of the difficulties of extracting useful information from modern databases. In recent years, many questions about the earth's climate have moved into the realm of public policy: How rapidly are we destroying tropical rain forests? Is the Sahara Desert growing? What might be the effects of global warming, and has it started already?
An important factor in any climate model is the distribution of different kinds of vegetative land cover. Plants play a huge and so far not completely understood role in moderating the earth's carbon cycle, including the levels of carbon dioxide in the atmosphere. Since the late 1970s, climatologists have been able to follow changes in global land cover by satellite, but the instruments they use were not really designed for that purpose. That will change next year, when NASA is scheduled to launch a new satellite network devoted exclusively to earth science, the Earth Observing System (EOS). But with greater opportunity comes greater challenge. EOS is expected to gather 46 megabytes of data per second; a single day's data will require the space it would take to store 1,500 copies of the 32-volume text of the Encyclopaedia Britannica. Such a volume of information threatens to overwhelm many of the tools available to analyze it.
Nor is sheer volume the only problem. To make sense of the satellite data, analysts will have to correlate it with existing global vegetation maps. These maps, produced by direct observation on the ground as well as by prediction of "potential vegetation" (the vegetation that is natural to a region given its climate and latitude), disagree with each other to an amazing extent. When they compared three reference maps, geographers Ruth DeFries and John Townshend of the University of Maryland found that they agreed on roughly 20 percent of the earth's land surface. One map's grasslands are another map's badlands.
A third problem is that of interpretation. If a particular pixel in the satellite data is "green" in January and again in June, does that mean the pixel represents a tropical rainforest? And just how "green" does it have to be? Some rules may be part of the existing base of scientific knowledge, but others may be hidden in the data and awaiting discovery. In applications of data mining to other sciences, even the relevant categories—the equivalent of "grassland" or "tropical rain forest"—may be unknown.
Because it offers a way to detect patterns in large and messy data sets, data mining has become fashionable in a very short time, especially in the business world. We shall present applications that range from finding volcanoes on Venus to making printing presses run better, and explain one of the more popular data-mining methods, called "decision trees." First we offer as a caveat one lesson we have learned along the way: However tempting it may seem to leave everything to the computer, knowledge discovery is still a collaborative process that involves human expertise in a fundamental way.