Data Analysis

What Can Be Learned From the Past 50 Years

Peter J. Huber

Book Information

This book explores the many provocative questions concerning the fundamentals of data analysis. It is based on the time-tested experience of one of the gurus of the subject matter. Why should one study data analysis? How should it be taught? What techniques work best, and for whom? How valid are the results? How much data should be tested? Which machine languages should be used, if used at all? Emphasis on apprenticeship (through hands-on case studies) and on anecdotes (through real-life applications) is the approach that Peter J. Huber takes in this volume. The concern is not with specific statistical techniques for their own sake, but with questions of strategy – when to use which technique. Central to the discussion is an understanding of the significance of massive (or robust) data sets, the implementation of languages, and the use of models. Each topic is illustrated with an ample number of examples and case studies. Personal practices, various pitfalls, and existing controversies are presented where applicable. The book serves as an excellent philosophical and historical companion to any present-day text in data analysis, robust statistics, data mining, statistical learning, or computational statistics.

Information

Publisher: Wiley
Year: 2012
ISBN: 9781118018262
Edition: 1
Category: Mathematics

CHAPTER 1
WHAT IS DATA ANALYSIS?
Data analysis is concerned with the analysis of data – of any kind, and by any means. If statistics is the art of collecting and interpreting data, as some have claimed, ranging from planning the collection to presenting the conclusions, then it covers all of data analysis (and some more). On the other hand, while much of data analysis is not statistical in the traditional sense of the word, it sooner or later will put to good use every conceivable statistical method, so the two terms are practically coextensive. But with Tukey (1962) I generally prefer “data analysis” over “statistics”, because the latter term is used by many in an overly narrow sense, covering only those aspects of the field that can be captured through mathematics and probability.
I had been fortunate to get an early head start. My involvement with data analysis (together with that of my wife, who was then working on her thesis in X-ray crystallography) goes back to the late 1950s, a couple of years before I thought of switching from topology to mathematical statistics. At that time we both began to program computers to assist us in the analysis of data – I got involved through my curiosity about the novel tool. In 1970, we were fortunate to participate in the arrival of non-trivial 3-d computer graphics, and at that time we even programmed a fledgling expert system for molecular model fitting. From the late 1970s onward, we got involved in the development and use of immediate languages for the purposes of data analysis.
Clearly, my thinking has been influenced by the philosophical sections of Tukey’s paper on “The Future of Data Analysis” (1962). While I should emphasize that in my view data analysis is concerned with data sets of any size, I shall pay particular attention to the requirements posed by large sets – data sets large enough to require computer assistance, and possibly massive enough to create problems through sheer size – and concentrate on ideas that have the potential to extend beyond small sets. For this reason there will be little overlap with books such as Tukey’s Exploratory Data Analysis (EDA) (1977), which was geared toward the analysis of small sets by pencil-and-paper methods.
Data analysis is rife with unreconciled contradictions, and it suffices to mention a few. Most of its tools are statistical in nature. But then, why is most data analysis done by non-statisticians? And why are most statisticians data shy and reluctant even to touch large data bases? Major data analyses must be planned carefully and well in advance. Yet data analysis is full of surprises, and the best plans will constantly be thrown off track. Any consultant concerned with more than one application is aware that there is a common unity to data analysis, hidden behind a diversity of language and stretching across the most diverse fields of application. Yet it does not seem feasible to learn it, and the general principles that span across applications, in the abstract from a textbook: you learn it on the job, by apprenticeship, and by trial and error. And if you try to teach it through examples, using actual data, you have to walk a narrow line between getting bogged down in details of the domain-specific background and presenting unrealistic, sanitized versions of the data and of the associated problems.
Very recently, the challenge posed by these contradictions has been addressed in a stimulating workshop discussion by Sedransk et al. (2010, p. 49), as Challenge #5 – To use available data to advance education in statistics. The discussants point out that a certain geological data base “has created an unforeseen enthusiasm among geology students for data analysis with the relatively simple internal statistical methodology.” That is, to use my terminology, the appetite of those geology students for exploring their data had been whetted by a simple-minded decision support system; see Section 2.5.9. The discussants wonder whether “this taste of statistics has also created a hunger for […] more advanced statistical methods.” They hope that “utilizing these large scientific databases in statistics classes allows primary investigation of interdisciplinary questions and application of exploratory, high-dimensional and/or other advanced statistical methods by going beyond textbook data sets.” I agree in principle, but my own expectations are much less sanguine. No doubt, the appetite grows with the eating (cf. Section 2.5.9), but you can spoil it by offering too much sophisticated and exotic food. It is important to leave some residual hunger! Instead of fostering creativity, you may stifle ingenuity and free improvisation by overwhelming the user with advanced methods. The geologists were in a privileged position: the geophysicists have a long-standing, strong relation with probability and statistics – just think of Sir Harold Jeffreys! – and the students were motivated by the data. The (insurmountable?) problem with statistics courses is that it is difficult to motivate statistics students to immerse themselves in the subject matter underlying those large scientific data bases.
But still, the best way to convey the principles, rather than the mere techniques, of data analysis, and to prepare the general mental framework, appears to be through anecdotes and case studies, and I shall try to walk this way. There are more than enough textbooks and articles explaining specific statistical techniques. There are not enough texts concerned with issues of overall strategy and tactics, with pitfalls, and with statistical methods (mostly graphical) geared toward providing insight rather than quantifiable results. So I shall concentrate on those, to the detriment of the coverage of specific techniques. My principal aim is to distill the most important lessons I have learned from half a century of involvement with data analysis, in the hope of laying the groundwork for a future theory. Indeed, originally I had been tempted to give this book the ambitious programmatic title: Prolegomena to the Theory and Practice of Data Analysis.
Some comments on Tukey’s paper and some speculations on the path of statistics may be appropriate.
1.1 TUKEY’S 1962 PAPER
Half a century ago, Tukey, in an ultimately enormously influential paper (Tukey 1962), redefined our subject; see Mallows (2006) for a retrospective review. It introduced the term “data analysis” as a name for what applied statisticians do, differentiating this from formal statistical inference. But actually, as Tukey admitted, he “stretched the term beyond its philology” to such an extent that it comprised all of statistics. The influence of Tukey’s paper was not immediately recognized. Even for me, who had been exposed to data analysis early on, it took several years before I assimilated its import and recognized that a separation of “statistics” and “data analysis” was harmful to both.
Tukey opened his paper with the words:
For a long time I have thought that I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their “dealing with fluctuations” aspects are, in many circumstances, of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.
Large parts of data analysis are inferential in the sample-to-population sense, but these are only parts, not the whole. Large parts of data analysis are incisive, laying bare indications which we could not perceive by simple and direct examination of the raw data, but these too are parts, not the whole. Some parts of data analysis […] are allocation, in the sense that they guide us in the distribution of effort […]. Data analysis is a larger and more varied field than inference, or incisive procedures, or allocation.
A little later, Tukey emphasized:
Data analysis, and the parts of statistics which adhere to it, must then take on the characteristics of a science rather than those of mathematics, specifically:
(1) Data analysis must seek for scope and usefulness rather than security.
(2) Data analysis must be willing to err moderately often in order that inadequate evidence shall more often suggest the right answer.
(3) Data analysis must use mathematical argument and mathematical results as bases for judgment rather than as bases for proofs or stamps of validity.
A few pages later he is even more explicit: “In data analysis we must look to a very heavy emphasis on judgment.” He elaborates that at least three different sorts or sources of judgment are likely to be involved in almost every instance: judgment based on subject matter experience, judgment based on broad experience of how particular techniques have worked out in a variety of fields of application, and judgment based on abstract results, whether obtained by mathematical proofs or empirical sampling.
In my opinion the main, revolutionary influence of Tukey’s paper indeed was that he shifted the primacy of statistical thought from mathematical rigor and optimality proofs to judgment. This was an astounding shift of emphasis, not only for the time (the early 1960s), but also for the journal in which his paper was published, and, last but not least, with regard to Tukey’s background – he had written a Ph.D. thesis in pure mathematics, and one variant of the axiom of choice had been named after him.
Another remark of Tukey also deserves to be emphasized: “Large parts of data analysis are inferential in the sample-to-population sense, but these are only parts, not the whole.” As of today, too many statisticians still seem to cling to the traditional view that statistics is inference from samples to populations (or: virtual populations). Such a view may serve to separate mathematical statistics from probability theory, but is much too exclusive otherwise.
1.2 THE PATH OF STATISTICS
This section describes my impression of how the state of our subject has developed in the five decades since Tukey’s paper. I begin with quotes lifted from conferences on future directions for statistics. As a rule, the speakers expressed concern about the sterility of academic statistics and recommended getting renewed input from applications. I am quoting two of the more colorful contributions. G. A. Barnard said at the Madison conference on the Future of Statistics (Watts, ed. (1968)):
Most theories of inference tend to stifle thinking about ingenuity and may indeed tend to stifle ingenuity itself. Recognition of this is one expression of the attitude conveyed by some of our brethren who are more empirical than thou and are always saying, ‘Look at the data.’ That is, their message seems to be, in part, ‘Break away from stereotyped theories that tend to stifle ingenious insights and do something else.’
And H. Robbins said at the Edmonton conference on Directions for Mathematical Statistics (Ghurye, ed. (1975)):
An intense preoccupation with the latest technical minutiae, and indifference to the social and intellectual forces of tradition and revolutionary change, combine to produce the Mandarinism that some would now say already characterizes academic statistical theory and is most likely to describe its immediate future. [… T]he statisticians of the past came into the subject from other fields – astronomy, pure mathematics, genetics, agronomy, economics etc. – and created their statistical methodology with a background of training in a specific scientific discipline and a feeling for its current needs. […]
So for the future I recommend that we work on interesting problems [and] avoid dogmatism.
At the Edmonton conference, my own diagnosis of the situation had been that too many of the activities in mathematical statistics belonged to the later stages of what I called ‘Phase Three’:
In statistics as well as in any other field of applied mathematics (taken in the wide sense), one can usually distinguish (at least) three phases in the development of a problem. In Phase One, there is a vague awareness of an area of open problems, one develops ad hoc solutions to poorly posed questions, and one gropes for the proper concepts. In Phase Two, the ‘right’ concepts are found, and a viable and convincing theoretical (and therefore mathematical) treatment is put together.
In Phase Three, the theory begins to have a life of its own, its consequences are developed further and further, and its boundaries of validity are explored by leading it ad absurdum; in short, it is squeezed dry.
A few years later, in a paper entitled “Data Analysis: in Search of an Identity” (Huber 1985a), I tried to identify the then current state of our subject. I speculated that statistics is evolving, in the literal sense of that word, along a widening spiral. After a while the focus of concern returns, although in a different track, to an earlier stage of the development and takes a fresh look at business left unfinished during the last turn (see Exhibit 1.1).
During much of the 19th century, from Playfair to Karl Pearson, descriptive statistics, statistical graphics and population statistics had flourished. The Student-Fisher-Neyman-Egon Pearson-Wald phase of statistics (roughly 1920–1960) can be considered a reaction to that period. It stressed those features in which its predecessor had been deficient and paid special attention to small sample statistics, to mathematical rigor, to efficiency and other optimality properties, and coincidentally, to asymptotics (because few finite sample problems allow closed form solutions).
I expressed the view that we had entered a new developmental phase. People would usually associate this phase with the computer, which without doubt was an important driving force, but there was more to it, namely another go-around at the features that had been in fashion a century earlier but had been neglected by 20th century mathematical statistics, this time under the banner of data analysis. Quite naturally, because this was a strong reaction to a great period, one would sometimes go overboard and create the false impression that probability models were banned from exploratory data analysis.
There were two hostile camps, the mathematical statisticians and the exploratory data analysts (I felt at home in both). I still remember an occasion in the late 1980s, when I lectured on high-interaction graphics and exploratory data analysis, and a prominent person rose and asked, in shocked tones, whether I was aware that what I was doing amounted to descriptive statistics, and whether I really meant it!
Still another few years later I elaborated my speculations on the path of statistics (Huber 1997a). In the meantime I had come to the conclusion that my analysis would have to be amended in two respects. First, the focus of attention only in part moves along the widening spiral. A large part of the population of statisticians remains caught in holding patterns corresponding to an earlier paradigm, presumably one imprinted on their minds at the time when they were doing their thesis work, and too many members of the respective groups are unable to see beyond the edge of the eddy they are caught in, thereby losing sight of the whole, of a dynamically evolving discipline. Unwittingly, a symptomatic example was furnished in 1988 by the editors of Statistical Science, when they re-published Harold Hotelling’s 1940 Presidential Address on “The Teaching of Statistics” but forgot all about Deming’s brief (one and a half pages), polite but poignant discussion. This discussion is astonishing. It puts the finger on deficiencies of Hotelling’s otherwise excellent and balanced outline, it presages Deming’s future role in quality control, and it also anticipates several of the sentiments voiced by Tukey more than twenty years later. Deming endorses Hotelling’s recommendations but says that he takes it “that they are not supposed to embody all that there is in the teaching of statistics, because there are many other neglected phases that ought to be stressed....
