1
The Basics
This chapter discusses some fundamental statistical issues dealing with variation, statistical models, calculations of probability, and the connection between hypothesis testing and estimation. These are basic topics that need to be understood by statistical consultants and those who use statistical methods. The selection of these topics reflects the author's experience and practice.
There would be no need for statistical methods if there were no variation or variety. Variety is more than the spice of life; it is the bread and butter of statisticians and their expertise. Assessing, describing and sorting variation is a key statistical activity. But not all variation is the domain of statistical practice; statistics is restricted to variation that has an element of randomness to it.
Definitions of the field of statistics abound; see a sampling in van Belle et al. (2004). For purposes of this book the following characterization, based on a description by R.A. Fisher (1935), will be used: statistics is the study of populations, variation, and methods of data reduction. Fisher points out that "the same types of problems arise in every case." For example, a population implies variation, and since a population cannot be wholly ascertained, descriptions of the population depend on sampling. The samples must then be reduced to summarize information about the population, and this is a problem in data reduction.
1.1 FOUR BASIC QUESTIONS
Introduction
R.A. Fisher's definition provides a formal basis for statistics, but it presupposes a great deal that needs to be made explicit. For the researcher and the statistical colleague there is a broader program that puts the Fisher material in context.
Rule of Thumb
Any statistical treatment must address the following questions:
1. What is the question?
2. Can it be measured?
3. Where, when, and how will you get the data?
4. What do you think the data are telling you?
Illustration
Consider the question, "Does air pollution cause ill health?" This is a very broad question that was qualitatively answered with the London smog episodes of the 1940s and 1950s. Lave and Seskin (1970), among others, tried to assess the quantitative effect, and this question is still with us today. That raises the non-trivial questions of whether "air pollution" and "ill health" can be measured. Lave and Seskin review measures of the former, such as sulfur in the air and suspended particulates. In the latter category they list morbidity and mortality. The third question, data collection, was addressed by considering data from 114 Standard Metropolitan Statistical Areas in the U.S., which contained health information, together with government sources for pollution information. The fourth question was answered by running multiple regressions controlling for a variety of factors that might confound the effect, for example age and socioeconomic status.
A host of questions can be raised, but in the end this was a landmark study that anticipated and still guides research efforts today.
Basis of the Rule
The rule essentially mimics the scientific method with particular emphasis on the role of data collection and analysis.
Discussion and Extensions
The first question usually deals with broad scientific issues, which often have policy and regulatory implications. Another example is global warming and its cause(s). But not all questions are measurable; for example, how do we measure human happiness or wisdom? In fact, most of the important questions of life are not measurable (a reason for humility). "Measurability" implies that there are "endpoints" that address the basic question. Frequently we need to take a short-cut; for example, income serves as a summary of socio-economic status. Given measurable values of the question, we can then test whether one set of values differs from another. So testability implies measurability.
This raises the question of whether a difference in the endpoints reflects an important difference in the first question. An example of this kind of question is the difference between statistical significance and clinical significance. (It may be better to say clinical relevance; statistical significance may point to a very important mechanistic framework.) In this context there also needs to be careful consideration of measurements that are not taken. This issue will be addressed in more detail in the chapter on observational studies.
If it is agreed that the question is measurable, the issue of data selection or data creation comes up. The three subquestions focus the discussion: they locate the data selection in space, time, and context. The data can range from administrative databases to experimental data; they can be retrospective or prospective. The "how" subquestion deals with the process that will actually be used. If sampling is involved, the sampling mechanism must be carefully described. In studies involving animals and humans this especially requires careful attention to ethics (but it is not restricted to these, of course). Broadly speaking, there are two approaches to getting the data: observational studies and designed experiments.
The next step is analysis and interpretation of the data, which, it is hoped, answer questions 1 and 2. Questions 1-3 focus on design, ranging from collecting anecdotes to doing a survey sample to conducting a randomized experiment. Question 4 focuses on analysis, in which statisticians have developed particular expertise (and sometimes ignore questions 1-3 by saying, "Let X be a random variable..."). But it is clear that the answers to the questions are inextricably interrelated. Other issues implied by the questions include the statistical model that is used, the robustness of the model, missing data, and an assessment of the many sources of variability.
The ordering reflects the process of science. Data miners who address only question 4 do so at their own risk.
1.2 OBSERVATION IS SELECTION
Introduction
The title of this rule is from Whitehead (1925), so the idea is not new. This is perhaps the most obvious of rules, and yet it is not taken into account the majority of the time.
Rule of Thumb
Observation is selection.
Illustration
The observation may be straightforward but the selection process may not be. An example that should be better known (selection?) is the vulnerability analysis of planes returning from bombing missions during World War II. Aircraft returning from missions had been hit in various places. The challenge was to determine which parts of the plane to reinforce to decrease their vulnerability. The naive approach started with figuring out where the hits had occurred. A second, improved approach was to adjust the number of hits by the area of the plane. The third approach was recommended by the statistician Abraham Wald: reinforce the planes where they had not been hit! His point was that the observations were correct, but not the selection process. What was of primary interest were the planes that did not return. Using an insightful statistical model he showed that the engine area (showing the fewest hits in returning planes) was the most vulnerable. This is one of those aha! situations where we immediately grasp the key role of the selection process. See Mangel and Samaniego (1984) for the technical description and references.
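Wald's reasoning can be illustrated with a small simulation. The section names and survival probabilities below are invented for illustration, not figures from the original analysis; the point is only that the section where a hit is most often fatal shows the fewest hits among the planes that return.

```python
import random

random.seed(1)

# Hypothetical plane sections and the probability that a hit there
# downs the plane. These numbers are illustrative assumptions.
sections = {"engine": 0.8, "cockpit": 0.5, "fuselage": 0.1, "wings": 0.1}

all_hits = {s: 0 for s in sections}       # every sortie (never observed in full)
returned_hits = {s: 0 for s in sections}  # only planes that made it back

for _ in range(10_000):
    section = random.choice(list(sections))  # each sortie takes one hit
    all_hits[section] += 1
    if random.random() > sections[section]:  # the plane survives the hit
        returned_hits[section] += 1

# Every section is hit about equally often, yet among the RETURNING
# planes the engine shows the fewest hits: observation is selection.
print(all_hits)
print(returned_hits)
```

The naive analyst, seeing few engine hits in `returned_hits`, would reinforce elsewhere; the missing planes carry the real signal.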
Basis of the Rule
To observe one thing implies that another is not observed, hence there is selection. This implies that the observation is taken from a larger collective, the statistical "population."
Discussion and Extensions
Often the observation is of interest only insofar as it is representative of the population we are interested in. For example, in the vulnerability analysis, a plane that provided the information about hits might not have been used again but instead scrapped for parts.
Selection can be subconscious as when we notice Volvo cars everywhere after having bought one. Thus it is important to be able to recognize the selection process.
The selection process in humans is very complicated, as evidenced by contradictory evidence by witnesses of the same accident. Nisbett and Ross (1980) and Kahneman, Slovic, and Tversky (1982) describe in detail some of the heuristics we use in selecting information. The bottom line is "know your sample."
1.3 REPLICATE TO CHARACTERIZE VARIABILITY
Introduction
A fundamental challenge of statistics is to characterize the variation that we observe. We can distinguish between systematic variation and non-systematic variation, which sometimes can be characterized as random variation. An example of systematic variation is the mile-markers on a highway or the kilometer-markers on the autobahn. This kind of variation is predictable. Random variation cannot be described in this way. In this section we are concerned with random variation.
Rule of Thumb
Replicate to characterize random variation.
Illustration
Repeated sampling under constant conditions tends to produce replicate observations. For example, the planes in the previous illustration have the potential of being considered replicate observations. The reason for the careful wording is that many assumptions need to be made, such as that the planes have not been altered in some way that affects their vulnerability, the enemy has not changed strategy, and so on.
At a more mundane level, the baby aspirin tablets we take are about as close to replicates as we can imagine. But even here, there are storage requirements and expiration dates that may make the replications invalid.
Basis of the Rule
Characterizing variability requires repeated observation, since variability is not a property inherent in a single observation.
Discussion and Extensions
The concept of replication is intuitive but difficult to define precisely. The idea of constant conditions is technically impossible to achieve since time marches on. Marriott (1999) defines replication as "execution of an experiment or survey more than once so as to increase precision and to obtain a closer estimation of sampling error." He also makes a distinction between replication and repetition, reserving the former for repetition at the same time and place.
In agricultural research the basic replicate is called a plot. Treatments can be compared by assigning several plots to each treatment, so that the variability within a treatment is replicate variability.
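As a minimal sketch of how replicate variability is used, consider the hypothetical plot yields below (the numbers are invented). The within-treatment variances are the replicate variances, and a pooled estimate combines them, weighting by degrees of freedom.

```python
import statistics

# Hypothetical yields from replicate plots under two treatments
treatment_a = [4.1, 3.8, 4.4, 4.0, 3.9]
treatment_b = [5.0, 4.6, 5.3, 4.8, 5.1]

# Replicate (within-treatment) sample variances characterize random variation
var_a = statistics.variance(treatment_a)
var_b = statistics.variance(treatment_b)

# Pooled within-treatment variance, weighted by degrees of freedom
pooled = ((len(treatment_a) - 1) * var_a + (len(treatment_b) - 1) * var_b) / (
    len(treatment_a) + len(treatment_b) - 2
)
print(var_a, var_b, pooled)
```

This pooled replicate variance is the yardstick against which the difference between treatment means would be judged.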
There is one method that ensures replication: randomization of observational units to two or more treatments. More will be said about this in the chapter on design.
1.4 VARIABILITY OCCURS AT MULTIPLE LEVELS
Introduction
As soon as the concept of variability is grasped it becomes clear that there are many sources of variability. Again, here the sources may be systematic or random. The emphasis here is, again, on random variability.
Rule of Thumb
Variability occurs at multiple levels.
Example
In education there is clearly variation in talents from student to student, from classroom to classroom, from school to school, from district to district, and from country to country. In this example there is a hierarchy, with students nested within schools, and so on.
Basis of the Rule
The basis of the rule is the recognition that there are levels of variation.
Discussion and Extensions
Each level of an observational hierarchy has its own units and its own variation. Suppose that the variable is expenditure per student. This could be expanded to expenditure per classroom, school, or district. In order to standardize, expenditure per student could be used, but for other purposes it may be useful to compare expenditure at the district level. However, if districts are compared, then the number of students served must usually be considered. The number of students would be a confounder in a comparison of districts. More will be said about confounders in Chapter 3.
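The multiple levels of variation can be made concrete with a small simulation of the expenditure example. The dollar figures and standard deviations below are invented; the sketch simply generates district means that vary between districts and students that vary within each district, then estimates each variance component.

```python
import random
import statistics

random.seed(2)

# Hypothetical per-student expenditures: district means differ
# (between-district variation, sd 1000) and students differ around
# their district mean (within-district variation, sd 500).
districts = []
for _ in range(50):
    district_mean = random.gauss(10_000, 1_000)
    students = [random.gauss(district_mean, 500) for _ in range(30)]
    districts.append(students)

# Average within-district variance: roughly 500**2 = 250,000
within = statistics.mean(statistics.variance(d) for d in districts)

# Variance of district means: roughly 1000**2 plus a small within-term
between = statistics.variance(statistics.mean(d) for d in districts)

print(round(within), round(between))
```

The two estimates answer different questions, which is why an analysis must be explicit about the level at which comparisons are made.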
1.5 INVALID SELECTION IS THE PRIMARY THREAT TO VALID INFERENCE
Introduction
The challenge is to be able to describe the selection process, a fundamental problem for applied statisticians. Selection bias occurs when the sample is not representative of the population of interest; this usually occurs when the sampling is not random. For example, a telephone survey of voters excludes those without telephones. This becomes important when the survey deals with political affiliation, which may also be associated with owning a telephone (as a proxy for socio-economic status).
The selection process need not be simple random sampling; all that is required is that, in the end, the probability of selection of the units of interest can be described. Survey sampling is a good example of a field where very clever selection processes are used in order to minimize sampling effort and cost and yet have estimable probabilities of selection.
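To illustrate why estimable selection probabilities are enough, the sketch below uses inverse-probability weighting in the style of the Horvitz-Thompson estimator: each sampled unit is weighted by one over its known selection probability, which makes the estimator of the population total unbiased over repeated sampling. All values and probabilities are invented.

```python
# Hypothetical population of unit values and their known
# (unequal) selection probabilities.
values = [10.0, 20.0, 30.0, 40.0]
incl_prob = [0.5, 0.25, 0.5, 0.25]
true_total = sum(values)  # 100.0

# Suppose this realized sample selected units 2 and 4.
sample = [(20.0, 0.25), (40.0, 0.25)]

# Weight each sampled value by 1/probability to estimate the total.
ht_estimate = sum(y / p for y, p in sample)
print(ht_estimate)  # 240.0 for this particular sample

# Unbiasedness check: averaging over all units with their selection
# probabilities, sum of p_i * (y_i / p_i) recovers the true total.
expected = sum(p * (y / p) for y, p in zip(values, incl_prob))
print(expected)  # 100.0
```

Any single sample may miss the total, as here, but the selection process is fully described, so valid inference is possible.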
1.6 THERE IS VARIATION IN STRENGTH OF INFERENCE
Introduction
Everyone agrees that there are degrees of quality of information, but when asked to define the criteria there is a great deal of disagreement. The simple statistical rule that the inverse of the variance of a statistic is a measure of the information contained in the statistic provides a useful criterion for a point estimate, but it is clearly inadequate for comparing much bigger chunks of information such as a study. In the field of history, primary sources are deemed more informative than secondary sources. These, and other, considerations point to the need to scale the quality and robustness of information.
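The inverse-variance rule can be made concrete for the simplest statistic, the sample mean: with n independent observations of common variance sigma squared, the variance of the mean is sigma squared over n, so the information is n over sigma squared and grows linearly with sample size. The numbers below are arbitrary.

```python
# Information in a statistic = 1 / variance of the statistic.
# For the mean of n iid observations with variance sigma^2:
#   Var(mean) = sigma^2 / n,  so information = n / sigma^2.
sigma2 = 4.0
for n in (10, 40, 160):
    var_mean = sigma2 / n
    information = 1.0 / var_mean
    print(n, var_mean, information)
# Quadrupling n quadruples the information (halves the standard error twice).
```

This works well for comparing two point estimates, but as the text notes, it says nothing about the quality of whole studies.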
Rule of Thumb
Compared with experimental studies, observational studies provide less robust information.
Illustration
The Women's Health Initiative (WHI) (see Chapter 7), a large study involving both randomized clinical trials and parallel observational studies, uses the randomized clinical trial to evaluate the validity of the evidence from the observational component. In fact, the goal of the analysis of the observational arm is to come as close as possible to the results of the randomized trials.
Basis of the Rule
The primary reason for less robust inference from an observational study is that the probability framework linking the study to the population of inference is unknown.
Discussion and Extensions
A great deal more will be said about the strength of inference in the chapter on evidence-based medicine. Also, the Hill guidelines, to be discussed below, provide at least a rough guide for determining the strength of evidence.
Each field of research has its own criteria for strength of evidence. In genetics the strength of evidence is measured by the lod score, defined as the log (base 10) of the ratio of the probability of occurrence of the event given a hypothesized linkage to the probability assuming no linkage (log base 10 of the odds ratio). Lod scores of three or more are considered confirmatory; a lod score of -2 is taken as disproving a claimed association. These are stringent criteria. A lod score of 3 means that the probability of the event is 1000 times greater under the alternative hypothesis (e.g., linkage) than under the null hypothesis (no linkage); this implies a p-value of approximately 0.001.
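The lod arithmetic can be checked directly. The likelihood values below are made-up numbers used only to exercise the definition; the function name is ours, not standard genetics software.

```python
import math

def lod(likelihood_linkage: float, likelihood_no_linkage: float) -> float:
    """Lod score: log10 of the likelihood ratio for linkage vs. no linkage."""
    return math.log10(likelihood_linkage / likelihood_no_linkage)

# A likelihood ratio of 1000 gives a lod score of 3,
# the conventional threshold for confirming linkage.
print(round(lod(1000.0, 1.0), 6))

# Converting back: a lod score of k corresponds to a ratio of 10**k.
print(10 ** 3)   # lod 3: data 1000 times more likely under linkage
print(10 ** 2)   # lod -2: data 100 times more likely under NO linkage
```

Note that the lod scale is deliberately stringent: moving the threshold from 2 to 3 demands a tenfold larger likelihood ratio.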
In epidemiology, odds ratios of two or greater are looked at with more interest than smaller odds ratios. In a very different area, historiography, primary sources are considered more reliable than secondary sources. In each case there are qualifications, usually by experts who know the field thoroughly.
There is no explicit random mechanism in many observational studies. For example, there may not be randomization in linkage analysis. However, there will be an assumption of statistical independence which, together with independent observations, produces a situation equivalent to randomization. These underlying assumptions then need to be tested or evaluated. In such cases evaluation of the data is often done vi...