PART I
CHAPTER 1
WHAT IS KNOWLEDGE DISCOVERY?
Knowledge discovery is a semiautomated process of extracting useful information from collections of data that are too big to be investigated manually. By semiautomated we mean that we use computer-based tools for the discovery process but that guidance by an analyst is indispensable. The information retrieved by the discovery process usually takes the form of actionable or explanatory patterns, often referred to as models. There are many different types of models. For instance, we have models that are represented as if-then-else rules as well as models that implement artificial neural networks. All models have the desirable property that they tend to ignore unnecessary detail and summarize the major trends in data. A model can represent or summarize terabytes of data and therefore provides access to information or knowledge hidden in large amounts of data. In this book we deal with one particular type of model called a support vector machine. Support vector machines represent a powerful new class of models invented by Vladimir Vapnik in the early 1990s. They have been shown to be competitive with artificial neural networks and to outperform them in many cases.
A term that is often associated with knowledge discovery is data mining. Data mining can be considered a specific kind of knowledge discovery process that aims at extracting information from databases. Data mining is often referred to as knowledge discovery in databases (KDD).
Knowledge discovery is a highly interdisciplinary undertaking ranging from domain analysis, data cleansing, and visualization to model evaluation and deployment (see Figure 1.1). However, at the core of the knowledge discovery process is a discovery algorithm that performs some kind of pattern recognition and constructs models of the data encountered. The discovery algorithms we are concerned with in this book are based on machine learning. Let us start by defining what we mean by machine learning.
1.1 MACHINE LEARNING
Phenomena whose behavior we can observe exist all around us. Consider, for example, the orbits of the planets around the sun or the timing of the tides. The central question in machine learning is: Can we use computers to discover and describe patterns based on these behaviors? The answer to this question is a resounding “yes,” and it is the topic of the remainder of the book.
Perhaps the easiest way to describe phenomena is through classification. Here, a particular object either belongs to a class of objects or it does not. When we see a cat, we easily recognize that it belongs to the class of mammals, and when we see a crow, we recognize that it belongs to the class of birds. Abstractly speaking, we can imagine that there exists some process in connection with some phenomenon that labels objects as true if they belong to the class in question or false if they do not. In our case, we have mammal(cat) = true and mammal(crow) = false, as well as bird(cat) = false and bird(crow) = true. Here, mammal and bird are processes that provide the labels for any object according to the class of mammals and the class of birds, respectively. Typically, classifications are not as easy as mammals and birds, and in general we do not have access to the processes that label the objects. We can only observe the consequences of these processes: the observable labels for each object. The goal of machine learning, then, is to compute a suitable model for a labeling process that approximates the original process as closely as possible. The following definition states this more formally.
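The idea of a labeling process and its observable labels can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the toy objects, their traits, and the helper name observe are hypothetical and do not come from the text. In a real setting the labeling processes would be hidden from us, and only the pairs returned by observe would be available.

```python
# Hypothetical toy universe: each object is described by two traits.
# The objects, traits, and rules are illustrative assumptions only.
ANIMALS = {
    "cat": {"nurses_young": True, "has_feathers": False},
    "crow": {"nurses_young": False, "has_feathers": True},
}


def mammal(name):
    """Labeling process for the class of mammals."""
    return ANIMALS[name]["nurses_young"]


def bird(name):
    """Labeling process for the class of birds."""
    return ANIMALS[name]["has_feathers"]


def observe(process, objects):
    """Return the (object, label) pairs an observer would see.

    The observer sees only these pairs, not the process itself.
    """
    return [(obj, process(obj)) for obj in objects]


if __name__ == "__main__":
    print(observe(mammal, ["cat", "crow"]))
    print(observe(bird, ["cat", "crow"]))
```

A learning algorithm would receive only lists of pairs such as these and would have to construct a model that reproduces the labels as closely as possible.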
Definition 1.1 (Machine Learning)
Given:
- A data universe X
- A sample set S, where S ⊆ X
- Some target function (labeling process) f : X → {true, false}
- A labeled training set D, where D = {(x, y) | x ∈ S and y = f(x)}
Compute a function f̂ : X → {true, false}
using D such that