Digging Bit by Bit

Michael Berry

Principles of Data Mining. David Hand, Heikki Mannila and Padhraic Smyth. xxxii + 546 pp. The MIT Press, 2001. $50.

Data mining, the science of extracting useful information from large data sets, brings together such fields as statistics, machine learning, database management, pattern recognition and artificial intelligence. So it's fitting that the authors of Principles of Data Mining represent different viewpoints: David Hand is a statistician, Heikki Mannila specializes in databases, and Padhraic Smyth is a computer scientist. Their goal here is to provide not only the technical details (mathematical models especially) of approaches to data mining but also their own perspectives on the field. They choose to focus on fundamental topics rather than attempting to be exhaustive.

The authors purport to offer a "foundational view of data mining" and to take an interdisciplinary approach. This is in contrast to the more limited perspectives adopted in such recent books as Jiawei Han and Micheline Kamber's Data Mining: Concepts and Techniques, which focuses on databases, and Ian H. Witten and Eibe Frank's Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, which, as its subtitle implies, takes an applications-oriented, machine-learning approach (both were published by Morgan Kaufmann in 2000). Hand and his coauthors pursue both statistical and computational issues in dealing with large data sets, stressing the mathematical models and computational algorithms used to characterize and mine digital information.

The first four chapters of Principles of Data Mining focus on fundamentals, offering a general introduction to data mining and some discussion of measurement, summarization and visualization of data, and uncertainty and inference. In the next four chapters, the authors look at the "building-blocks" used to create and analyze data-mining algorithms: model representations, score functions for fitting models to data, and optimization and search techniques. The remaining seven chapters are more task-specific, covering particular data-mining tasks (density estimation and clustering, classification, regression, pattern discovery and content-based retrieval) and algorithms for addressing them.

The book, which would be appropriate as a text for courses at the senior undergraduate or first-year graduate level, assumes familiarity with basic concepts in probability, calculus, linear algebra and optimization. Most third- or fourth-year undergraduate majors in computer science, mathematics or statistics would have sufficient training to understand it.

The authors, keenly aware that many computer science students will have shallow backgrounds in statistics, have provided as a review a short appendix on basic probability and common distributions. For readers with a good working knowledge of statistics, they have included many details on algorithmic design and computational complexity, which would not be commonplace in a traditional statistics or applied statistics textbook. The chapters that discuss databases, pattern matching and information retrieval should be informative to statisticians in particular.

The authors use several "real world" applications reflecting their own interests or backgrounds, but they consistently use small (or toy) data sets to illustrate the algorithm or pattern in question. I found this approach refreshing, because the size and scale of actual data sets would probably discourage a novice in the field from grasping the fundamental design of an algorithm or a procedure for detecting the structure of a data collection.

The book's coverage and presentation of topics are excellent, but I found it lacking as a textbook. Suggestions for further reading conclude each chapter, but there are no exercises or programming assignments, nor are there any URLs or FTP sites for acquiring software or data sets. Perhaps the authors will consider easing the burden of potential instructors by developing a Web site to disseminate such information.

Michael W. Berry, Computer Science, University of Tennessee, Knoxville
