It is very easy to collect huge volumes of data - social statistics,
bank records, biological data, and more - but very hard to pull useful
facts out of the heap. This book is about processing large volumes of
data in ways that let simple descriptions emerge.
This is an introductory-level book, aimed at someone with
reasonably good programming skills. A little facility with statistics
might help, but certainly isn't necessary. The book starts gently, with
some very basic questions: what is data mining exactly, when there seem
to be so many definitions for the term? What is a data warehouse, and
how does it differ from a database? Next, the authors address the data
itself in terms of quality, usability, and organization for efficient
access. The central chapters, 4 through 8, address various kinds of
query specification, kinds of relationships to extract, correlations,
clustering, and classification. None of the discussions is especially
deep. All, however, are presented in pseudocode or simple math that can
easily be translated into working code. The careful reader learns a few
basic principles that work well in many contexts: entropy maximization,
Bayesian analysis, and simple stats. It may be surprising to see how
little of normal statistical analysis is used. I suspect the authors
assume that stats-savvy readers will already know how to apply
significance testing, and that stats-naive readers don't need the
distraction. The last chapters discuss complex data, where neither the
best structure for the data nor the questions to ask of it are at all
obvious, and then survey the tools and applications used in data mining.
The book is nicely laid out as a textbook, with an orderly summary,
problem set, and bibliography at the end of each chapter. The
bibliography is more than just a list of titles and authors - it
actually helps the reader decide which references will give the best
description of each of the chapter's topics.
This is a clear, usable introduction to data mining: the data it
uses, the questions it answers, and the techniques for connecting them.
It gives codable detail for lots of techniques, and prepares the reader
for more advanced discussions. I recommend it very highly.