BOOK REVIEW

# Digging Bit by Bit

*Principles of Data Mining*. David Hand, Heikki Mannila and Padhraic Smyth. xxxii + 546 pp. The MIT Press, 2001. $50.

Data mining, the science of extracting useful information from large data sets, brings together such fields as statistics, machine learning, database management, pattern recognition and artificial intelligence. So it's fitting that the authors of *Principles of Data Mining* represent different viewpoints: David Hand is a statistician, Heikki Mannila specializes in databases, and Padhraic Smyth is a computer scientist. Their goal here is to provide not only the technical details (mathematical models especially) of approaches to data mining but also their own perspectives on the field. They choose to focus on fundamental topics rather than attempting to be exhaustive.

The authors purport to offer a "foundational view of data mining" and to take an interdisciplinary approach. This is in contrast to the more limited perspectives adopted in such recent books as Jiawei Han and Micheline Kamber's *Data Mining: Concepts and Techniques,* which focuses on databases, and Ian H. Witten and Eibe Frank's *Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations,* which, as its subtitle implies, takes an applications-oriented, machine-learning approach (both were published by Morgan Kaufmann in 2000). Hand and his coauthors pursue both statistical and computational issues in dealing with large data sets, stressing the mathematical models and computational algorithms used to characterize and mine digital information.

The first four chapters of *Principles of Data Mining* focus on fundamentals, offering a general introduction to data mining and some discussion of measurement, summarization and visualization of data, and uncertainty and inference. In the next four chapters, the authors look at the "building-blocks" used to create and analyze data-mining algorithms: model representations, score functions for fitting models to data, and optimization and search techniques. The remaining seven chapters are more task-specific, covering particular data-mining tasks (density estimation and clustering, classification, regression, pattern discovery and content-based retrieval) and algorithms for addressing them.

The book, which would be appropriate as a text for courses at the senior undergraduate or first-year graduate level, assumes familiarity with basic concepts in probability, calculus, linear algebra and optimization. Most third- or fourth-year undergraduate majors in computer science, mathematics or statistics would have sufficient training to understand it.

The authors, keenly aware that many computer science students will have shallow backgrounds in statistics, have provided as a review a short appendix on basic probability and common distributions. For readers with a good working knowledge of statistics, they have included many details on algorithmic design and computational complexity, which would not be commonplace in a traditional statistics or applied statistics textbook. The chapters that discuss databases, pattern matching and information retrieval should be informative to statisticians in particular.

The authors use several "real world" applications reflecting their own interests or backgrounds, but they consistently use small (or toy) data sets to illustrate the algorithm or pattern in question. I found this approach refreshing, because the size and scale of *actual* data sets would probably discourage a novice in the field from grasping the fundamental design of an algorithm or a procedure for detecting the structure of a data collection.

The book's coverage and presentation of topics are excellent, but I found it lacking as a textbook. Suggestions for further reading conclude each chapter, but there are no exercises or programming assignments, nor are there any URLs or FTP sites for acquiring software or data sets. Perhaps the authors will consider easing the burden of potential instructors by developing a Web site to disseminate such information.

*Michael W. Berry, Computer Science, University of Tennessee, Knoxville*
