Elements of Statistical Computing

NUMERICAL COMPUTATION


About This Book

Statistics and computing share many close relationships. Computing now permeates every aspect of statistics, from pure description to the development of statistical theory. At the same time, the computational methods used in statistical work span much of computer science. Elements of Statistical Computing covers the broad usage of computing in statistics. It provides a comprehensive account of the most important areas of computational statistics. Included are discussions of numerical analysis, numerical integration, and smoothing. The author gives special attention to floating point standards and numerical analysis; iterative methods for both linear and nonlinear equations, such as the Gauss-Seidel method and successive over-relaxation; and computational methods for missing data, such as the EM algorithm. Also covered are new areas of interest, such as the Kalman filter, projection-pursuit methods, density estimation, and other computer-intensive techniques.


Information

Author: R. A. Thisted
Publisher: Routledge
Year: 2017
ISBN: 9781351452748
Edition: 1
1 INTRODUCTION TO STATISTICAL COMPUTING
It is common today for statistical computing to be considered as a special subdiscipline of statistics. However, such a view is far too narrow to capture the range of ideas and methods being developed, and the range of problems awaiting solution. Statistical computing touches on almost every aspect of statistical theory and practice, and at the same time nearly every aspect of computer science comes into play. The purpose of this book is to describe some of the more interesting and promising areas of statistical computation, and to illustrate the breadth that is possible in the area. Statistical computing is truly an area on the boundary between disciplines, and the two disciplines themselves are increasingly in demand by other areas of science. This fact is really unremarkable, as statistics and computer science provide complementary tools for those exploring other areas of science. What is remarkable, and perhaps not obvious at first sight, is the universality of those tools.

Statistics deals with how information accumulates, how information is optimally extracted from data, how data can be collected to maximize information content, and how inferences can be made from data to extend knowledge. Much knowledge involves processing or combining data in various ways, both numerically and symbolically, and computer science deals with how these computations (or manipulations) can optimally be done, with measuring the inherent cost of processing information, with studying how information or knowledge can usefully be represented, and with understanding the limits of what can be computed. Both of these disciplines raise fundamental philosophical issues, which we shall sometimes have occasion to discuss in this book.
These are exciting aspects of both statistics and computer science, not often recognized by the lay public, or even by other scientists. This is partly because statistics and computer science — at least those portions which will be of interest to us — are not so much scientific as they are fundamental to all scientific enterprise. It is perhaps unfortunate that little of this exciting flavor pervades the first course in statistical methods, or the first course in structured programming. The techniques and approaches taught in these courses are fundamental, but there is typically such a volume of material to cover, and the basic ideas are so new to the students, that it is difficult to show the exciting ideas on which these methods are based.
1.1 Early, classical, and modern concerns
I asserted above that statistical computing spans all of statistics and much of computer science. Most people — even knowledgeable statisticians and computer scientists — might well disagree. One aim of this book is to demonstrate that successful work in statistical computation, broadly understood, requires a broad background both in classical and modern statistics and in classical and modern computer science.
We are using the terms “classical” and “modern” here with tongue slightly in cheek. The discipline of statistics, by any sensible reckoning, is less than a century old, and computer science less than half that. Still, there are trends and approaches in each which have characterized early development, and there are much more recent developments more or less orthogonal to the early ones. Curiously, these “modern” developments are more systematic and more formal approaches to the “primordial” concerns which led to the establishment of the field in the first place, and which were supplanted by the “classical” formulations.
In statistics, which grew out of the early description of collections of data and associations between different measurable quantities, the classical work established a mathematical framework, couched in terms of random variables whose mathematical properties could be described. Ronald A. Fisher set much of the context for future development, inventing the current notion of testing a null hypothesis against data, of modeling random variation using parameterized families, of estimating parameters so as to maximize the amount of information extracted from the data, and of summarizing the precision of these estimates with reference to the information content of the estimator. Much of this work was made more formal and more mathematical by Jerzy Neyman, Egon S. Pearson, and Abraham Wald, among others, leading to the familiar analysis of statistical hypothesis tests, interval estimation, and statistical decision theory.
More recently, there has been a “modern” resurgence of interest in describing sets of data and using data sets to suggest new scientific information (as opposed to merely testing prespecified scientific hypotheses). Indeed, one might say that the emphasis is on analyzing data rather than on analyzing procedures for analyzing data.
In computer science, too, there were primordial urges — to develop a thinking machine. Early computers were rather disappointing in this regard, although they quickly came to the point of being faster and more reliable multipliers than human arithmeticians. Much early work, then, centered on making it possible to write programs for these numerical computations (development of programming languages such as FORTRAN), and on maximizing the efficiency and precision of these computations necessitated by the peculiarities of machine arithmetic (numerical analysis). By the 1960s, computers were seen as number crunchers, and most of computer science dealt with this model of what computers do. More recently, computer science has come again to view computing machines as general processors of symbols, and as machines which operate most fruitfully for humans through interaction with human users (collaborators?). The fields of software engineering and of artificial intelligence are two “modern” outgrowths of this view.
To see how a broad understanding of statistics and a broad understanding of computer science are helpful to work in almost any area of statistical computing — and to see why there is such interest in this area — it is helpful to examine how and where computation enters into statistical theory and practice, how different aspects of computer science are relevant to these fields, and also how the field of statistics treats certain areas of broad interest in computer science. With this in hand it will be easier to see that computer science and statistics are, or could well be, closely intertwined at their common boundary.
1.2 Computation in different areas of statistics
It is difficult to divide the field of statistics into subfields, and any particular scheme for doing so is somewhat arbitrary and typically unsatisfactory. Regardless of how divided, however, computation enters into every aspect of the discipline. At the heart of statistics, in some sense, is a collection of methods for designing experiments and analyzing data. A body of knowledge has developed concerning the mathematical properties of these methods in certain contexts, and we might term this area theory of statistical methods. The theoretical aspects of statistics also involve generally applicable abstract mathematical structures, properties, and principles, which we might term statistical meta-theory. Moving in the other direction from the basic methods of statistics, we include applications of statistical ideas and methods to particular scientific questions.
1.2.1 Applications
Computation has always had an intimate connection with statistical applications, and often the available computing power has been a limiting factor in statistical practice. The applicable methods have been the currently computable ones. What is more, some statistical methods have been invented or popularized primarily to circumvent then-current computational limitations. For example, the centroid method of factor analysis was invented because the computations involved in more realistic factor models were prohibitive (Harman, 1967). The centroid method was supplanted by principal factor methods once it became feasible to extract principal components of covariance matrices numerically. These in turn have been partially replaced in common practice by normal-theory maximum likelihood computations which were prohibitively expensive (and numerically unstable) only a decade ago.
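For concreteness, the principal components mentioned here are simply the eigenvectors of the sample covariance matrix, ordered by the variance each explains. The sketch below is illustrative only (it is not code from the book); it assumes NumPy is available, and the function name is hypothetical.

    # Illustrative sketch, not from the book: principal components of a sample
    # covariance matrix via its eigendecomposition (assumes NumPy).
    import numpy as np

    def principal_components(data):
        """Eigenvalues and eigenvectors of the sample covariance matrix, largest first."""
        cov = np.cov(data, rowvar=False)          # p x p sample covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)    # eigh handles symmetric matrices
        order = np.argsort(eigvals)[::-1]         # sort components by variance explained
        return eigvals[order], eigvecs[:, order]

    rng = np.random.default_rng(1)
    data = rng.standard_normal((200, 3)) @ np.diag([3.0, 1.0, 0.2])  # unequal variances
    variances, components = principal_components(data)
    print(variances)   # decreasing: the first component captures the most variance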
Another example, in which statistical methods were advanced to circumvent computational limitations, is the invention of Wilcoxon’s test (1945), a substitute for the standard t-test based only on the ranks of the data rather than the numerical values of the data themselves. Wilcoxon invented the procedure as a quick approximate calculation easier to do by hand than the more complicated arithmetic involved in Student’s procedure. It was later discovered that in the case of data with a Gaussian (normal) distribution — where the t-test is optimal — Wilcoxon’s procedure loses very little efficiency, whereas in other (non-Gaussian) situations the rank-based method is superior to the t-test. This observation paved the way for further work in this new area of “nonparametric” statistics. In this case, computational considerations led to the founding of a new branch of statistical inquiry, and a new collection of statistical methods.
COMMENT. Kruskal (1957) notes that, as is often the case in scientific inquiry, the two-sample version of Wilcoxon’s procedure was anticipated by another investigator, in this case Gustav Adolf Deuchler (1883–1955), a German psychologist. Deuchler published in 1914; others working independently around 1945 also discovered the same basic procedure.
There are ironies in the story of nonparametric statistics. Wilcoxon’s idea was based on the assumption that it is easier for a human to write down, say, thirty data values in order (sorting) than it would be to compute the sum of their squares and later to take a square root of an intermediate result. Even with most hand calculators this remains true, although the calculator can’t help with the sorting process. It turns out, however, that even using the most efficient sorting algorithms, it is almost always faster on a computer to compute the sums of squares and to take square roots than it is to sort the data first. In other words, on today’s computers it is almost always more efficient to compute Student’s t than it is to compute Wilcoxon’s statistic! It also turns out that most other nonparametric methods require much more computation (if done on a computer, anyway) than do their Gaussian-theory counterparts.
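The computational contrast can be made concrete with a small sketch (illustrative only, not from the book; only the Python standard library is assumed, the function names are hypothetical, and tie handling is omitted): Student’s t needs only sums, sums of squares, and one square root, while Wilcoxon’s rank-sum statistic requires sorting the pooled sample.

    # Illustrative sketch, not from the book: the two-sample t statistic uses only
    # sums, sums of squares, and a square root, while Wilcoxon's rank-sum statistic
    # requires sorting the pooled sample (ties are ignored to keep the sketch short).
    import math

    def student_t(x, y):
        """Two-sample t statistic with a pooled variance estimate."""
        nx, ny = len(x), len(y)
        mx, my = sum(x) / nx, sum(y) / ny
        ssx = sum((v - mx) ** 2 for v in x)
        ssy = sum((v - my) ** 2 for v in y)
        pooled_var = (ssx + ssy) / (nx + ny - 2)
        return (mx - my) / math.sqrt(pooled_var * (1 / nx + 1 / ny))

    def wilcoxon_rank_sum(x, y):
        """Sum of the ranks of the x observations in the pooled, sorted sample."""
        pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])  # the sort is the work
        return sum(rank for rank, (_, label) in enumerate(pooled, start=1) if label == 0)

    x = [4.1, 5.3, 3.8, 6.0, 5.1]
    y = [4.9, 6.2, 5.8, 7.1, 6.6]
    print(student_t(x, y), wilcoxon_rank_sum(x, y))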
A third example is that of multiple linear regression analysis, possibly the most widely used single statistical method. Except for the simplest problems, the calculations are complicated and time-consuming to carry out by hand. Using the generally available desk calculators of twenty years ago, for instance, I personally spent two full days computing three regression equations and checking the results. Even in the early sixties, regression analyses involving more than two variables were hardly routine. A short ten years later, statistical computer packages (such as SPSS, BMD, and SAS) were widely available in which multiple linear regression was an easily (and inexpensively) exercised option. By the late seventies, these packages even had the capability of computing all 2^p regression models that can be formed from p candidate independent variables, and screening out the best fitting of these.
COMMENT. Actually, this last assertion is over-stated. Of the 2^p possible models, many are markedly inferior to the others. To screen this many regression models efficiently, algorithms have been developed which avoid computing most of these inferior models. Still, as the number of predictors p increases, the amount of computation increases at an exponential rate, even using the best known algorithms. It is not known whether less computation would suffice. We shall return to this problem in Chapter 22.
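The brute-force version of this screening is easy to state, which shows why the cost explodes: fit every non-empty subset of the p candidate predictors. The sketch below is illustrative only (assuming NumPy); it is not one of the efficient pruning algorithms just mentioned, and it becomes impractical quickly as p grows.

    # Illustrative brute-force sketch, not from the book: fit all 2**p - 1 non-empty
    # subsets of p candidate predictors and record each residual sum of squares.
    # Efficient screening algorithms avoid fitting most of these models.
    from itertools import combinations
    import numpy as np

    def all_subsets_rss(X, y):
        """Residual sum of squares for every non-empty subset of columns of X."""
        n, p = X.shape
        rss = {}
        for k in range(1, p + 1):
            for cols in combinations(range(p), k):
                _, resid, _, _ = np.linalg.lstsq(X[:, list(cols)], y, rcond=None)
                rss[cols] = float(resid[0]) if resid.size else 0.0
        return rss                      # 2**p - 1 entries: exponential in p

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 4))
    y = X @ np.array([1.0, 0.0, 2.0, 0.0]) + rng.standard_normal(50)
    fits = all_subsets_rss(X, y)
    best_by_size = {k: min((c for c in fits if len(c) == k), key=fits.get)
                    for k in range(1, 5)}
    print(len(fits), best_by_size)      # 15 models even for p = 4

Even at p = 20 this enumeration would require over a million least-squares fits, which is the scale referred to in the next paragraph.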
Thus, as computing power has increased, our notion of what a “routine regression analysis” is has changed from a single regression equation involving a single predictor variable, to screening possibly a million possible regression equations for the most promising (or useful) combinations of a large set of predictors.
The statistical methods that people use are typically those whose computation is straightforward and affordable. As computer resources have fallen dramatically in price even as they have accelerated in speed, statistical practice has become more heavily computational than at any time in the past. In fact, it is more common than not that computers play an important role in courses on statistical methods from the very start, and many students’ first introduction to computing at the university is through a statistical computer package such as Minitab or SPSS, and not through a course in computer science. Few users of statistics any longer expect to do much data analysis without the assistance of computers.
COMMENT. Personal computers are now quite powerful and inexpensive, so that many students coming to the university have already had experience computing, perhaps also programming, without any exposure to data analysis or computer science. As of this writing, software for statistical computing or for data analysis is primitive. It is likely, however, that personal computers will greatly influence the direction that statistics — and computer science — will take in the next two decades. Thisted (1981) discusses ways in which personal computers can be used as the basis for new approaches to data analysis.
1.2.2 Theory of statistical methods
Postponing for the moment a discussion of statistical methods themselves, let us turn to theoretical investigation of properties of statistical procedures. Classically, this has involved studying the mathematical behavior of procedures within carefully defined mathematical contexts.
COMMENT. The conditions under which a procedure is optimal, or has certain specified properties, are often referred to in common parlance as the “assumptions” under which the procedure is “valid,” or simply, the “assumptions of the procedure.” This is really a misnomer, in that it is often entirely permissible—even preferable—to use a procedure even if these conditions are known not to hold. The procedure may well be a valid one to apply in the particular context. To say that a procedure is not valid unless its assumptions hold is an awkward and misleading shorthand for saying that the standard properties which we attribute to the procedure do not hold exactly unless the assumptions (that is, the conditions) of certain theorems are satisfied. These properties may still hold approximately...

Table of contents

  1. Cover
  2. Half Title
  3. Title Page
  4. Copyright Page
  5. Dedication
  6. Table of Contents
  7. Preface
  8. Chapter 1. Introduction to Statistical Computing
  9. Chapter 2. Basic Numerical Methods
  10. Chapter 3. Numerical Linear Algebra
  11. Chapter 4. Nonlinear Statistical Methods
  12. Chapter 5. Numerical Integration and Approximation
  13. Chapter 6. Smoothing and Density Estimation
  14. Answers to Selected Exercises
  15. References
  16. Index