Chapter 1
Pattern Recognition: Evolution, Mining and Big Data
Amita Pal1 and Sankar K. Pal2
1 Interdisciplinary Statistical Research Unit Indian Statistical Institute, Kolkata, India
2Machine Intelligence Unit Indian Statistical Institute, Kolkata, India
This chapter traces the evolution of pattern recognition (PR) over the years, from its humble beginnings as an extension of statistical discriminant analysis, to the multidisciplinary approach that it has become now, on account of the continuous import of ideas from various scientific disciplines. It begins with an introduction to the discipline of PR, explaining the basic underlying concepts, different tasks involved, some conventional classification techniques and the subsequent development of various modern methodologies. The evolution has been nurtured and aided by the likes of statistical decision theory, the theory of formal languages (which led to the syntactic or structural approach), followed by the theories of fuzzy sets, artificial neural networks, genetic algorithms, rough sets, granular computing and support vector machines individually (leading to different modern approaches), and finally, their integration into the theory of soft computing. While tracing the journey of pattern recognition along this complex route, significant aspects are highlighted. The chapter also discusses the significance of data mining, which has drawn the attention of many PR researchers world-wide for the past couple of decades. Finally, the challenging issues of Big Data analysis are addressed along with the relevance of PR and machine learning.
1.1.Introduction
Pattern recognition is an activity that humans normally excel in. They do it almost all the time, and without conscious effort. Information received via the various sensory organs is processed almost instantaneously by the human brain so that it is possible to identify the source or content of the information, without any perceptible effort. What is even more impressive is the accuracy with which such recognition tasks can be performed even under non-ideal conditions, for instance, when the information that needs to be processed is vague, imprecise or even incomplete. In fact, most of our day-to-day activities are based on our success in performing various pattern recognition tasks. For example, when we read a book, we recognize the letters, words and, ultimately, concepts and notions, from the visual signals received by our brain, which processes them speedily and probably does a neurobiological implementation of template-matching!
The discipline of Pattern Recognition (PR) or Pattern Recognition by machine essentially deals with the problem of developing algorithms and methodologies/devices that can enable the computer-implementation of many of the recognition tasks that humans normally perform. The motivation is to perform these tasks more accurately, or faster, and perhaps, more economically than humans and, in many cases, to release them from drudgery resulting from performing routine recognition tasks repetitively and mechanically. The scope of PR also encompasses tasks humans are not good at, like reading bar codes. The goal of pattern recognition research is to devise ways and means of automating certain decision-making processes that lead to classification and recognition. PR has been a thriving field of research for the past few decades, as is amply borne out by the numerous books ([1]ā[17], for example) and journals devoted exclusively to it. In this regard, mention must be made of the seminal article by Kanal [18] which gives a comprehensive review of the advances made in the field till the early nineteen-seventies. A review article by Jain et al. [19] provides an engrossing survey of the advances made in statistical pattern recognition till the end of the twentieth century.
Though the subject has attained a very high level of maturity in the past five decades or so, it remains evergreen to the researchers due to continuous crossāfertilization of ideas from disciplines like computer science, physics, neurobiology, psychology, engineering, statistics, mathematics and cognitive science. Depending on the practical need and demand, various modern methodologies have come into being, which often supplement the classical techniques. The present article gives a birdās-eye view of the different methodologies that have evolved so far including the emergence of data mining. Since any discussion today on pattern recognition remains incomplete without the mention of Big Data, we have included a section describing the ABCs of Big data, challenging issues, and the relevance of pattern recognition, machine learning and data mining. Before we describe them, we explain briefly the basic concept of PR including supervised and unsupervised classification, and feature selection/extraction. Though these were mentioned in the leading chapter of the first edition of the volume [20], the authors repeat some of them here for the convenience of readers in following the remaining chapters.
1.2.The Pattern Recognition Problem
Let Ī© denote the universe of patterns that are of interest, and let X be a vector of p variables (called features) defined on objects in Ī©, which together provide some sort of numerical description for them. Let Ļ ā āp be the feature space, or the domain of variation of X corresponding to all patterns in Ī©, which contains K categories of objects, where K may or may not be known a priori. Let Ī©1, Ī©2, , . . . , Ī©K denote the corresponding categories or pattern classes. In this setup, the pattern recognition problem is to determine, for any pattern of unknown categorization (or label) from Ī© and having a corresponding feature vector x, which pattern class it belongs to. Essentially, the general approach to solving the problem is to find, in some way, a partition of Ļ into Ļ1, Ļ2, , . . . , ĻK so that if x ā Ļr we can infer that the unknown pattern comes from the pattern class Ī©r. Obviously, this is not possible unless some additional information is provided for each of the classes, say, in the form of a set of n patterns, called training samples.
1.2.1.Supervised vs. unsupervised classification
Human pattern recognition capability is mainly learnt from past experiences, though it is certainly not possible to describe the procedure by which the human brain accomplishes this. Thus learning is an indispensable component of pattern recognizers, both human and mechanical. The information contained in the training samples provides the basis for learning in pattern recognition systems. In some cases, learning is accomplished with the help of a teacher, that is, an external agency of some sort that provides the correct labels or classifications of the training samples for building the classifier. The training samples in such cases become representatives of the classes they belong to, and can be processed in a suitable manner so that the class-specific information they carry may be distilled from them. This is referred to as supervised pattern recognition. References [5], [7], [12]ā[17], [21] are a few of the many books in which detailed information on this is available.
On the other hand, if no teacher is available for a pattern classification task, that is, the training samples are not labeled, then we have a case of unsupervised pattern recognition. In such cases, learning essentially means discovery of the natural groupings inherent in the training set. The generic name for computational techniques applicable to unsupervised classification is clustering, for which there is no dearth of literature ([1], [2], [22]ā[25]).
1.2.2.Feature selection and extraction
Most of the approaches designed for solving PR problems presuppose the representation of patterns by a set of measurements, called features. A judicious selection of features for building classifiers is a very crucial aspect of classifier design, and deserves careful consideration. On one hand, there is certainly nothing to lose in using all available measurements in classifier design. On the other hand, too many features make the classifier increasingly complex (sometimes confusing too), in fact, unnecessarily so, in case some of the measurements are redundant. It is encouraging to see that this aspect of classifier design has indeed been given the importance it deserves, judging from the work reported. References may be found, for example, in [3, 8, 9, 26, 27]. Two broad approaches have been used traditionally. The first is called feature selection, and is essentially the selection of the subset of measurements that optimizes some criterion of separability of classes, since, intuitively, the best set of features should discriminate most efficiently among the classes, that is, enhance the separability ...