CHAPTER 1
INTRODUCTION
Data analysis of any kind, including a regression analysis, has the potential for far-reaching consequences. Conclusions drawn from small laboratory experiments or extensive sample surveys might only influence one's colleagues and associates or they could form the basis for policy decisions by governmental agencies which could conceivably affect millions of people. Data analysts must, therefore, have an adequate knowledge of and a healthy respect for the procedures they utilize.
As an illustration of the potential for far-reaching effects of a data analysis, consider one of the most massive research projects ever undertaken, the Salk polio vaccine trials (Meier, 1972). The conclusions drawn from the results of this study culminated in a nationwide polio immunization program and virtual elimination of this tragic disease in the United States. The foresight and competence of the principal investigators of the study prevented ambiguity of the results and possible criticism of the conclusions. The handling of this experiment provides valuable lessons in the overall role of data analysis and the care with which it must be approached.
Polio in the early 1950s was a mysterious disease. No one could predict where or when it would strike. It did not affect a large segment of any community but those it did strike, mostly children, were often left paralyzed. Its crippling effect on young children and the sporadic nature of its occurrence led to demands for a major effort in eradicating the disease. Salk's vaccine was one of the most promising ones available, but it had not been sufficiently tested.
Since the occurrence of polio in any specific community could not be predicted and only a small portion of the population actually contracted the disease in any year, a large-scale experiment including many communities was necessary. In the end, over one million children participated in the study, some receiving the vaccine and others just a placebo.
In allowing their children to participate, many parents insisted on knowing whether their child received the vaccine or the placebo. These children constituted the "observed-placebo" group (Meier, 1972). The planners of the experiment, realizing potential difficulties in the interpretation of the results, insisted that there be a large number of communities for which neither child, parent, nor diagnosing physician knew whether the child received the vaccine or the placebo. This group of children made up the "placebo-control" group.
For both groups of children the incidence of polio was lower for those vaccinated than for those who were not vaccinated. The conclusion was unequivocal: the Salk vaccine proved effective in preventing polio. This conclusion would have been compromised, however, had the planners of the study not insisted that the placebo-control group be included. Doubts that the observed-placebo group could reliably indicate the effectiveness of the vaccine were raised both before and after the experiment. The indicators of polio are so similar to those of some other diseases that the diagnosing physician might tend to diagnose polio if he knew the child had not been vaccinated and diagnose one of the other diseases if he knew the child had been vaccinated. After the experiment was conducted, analysis of the data for the observed-placebo group indicated that the vaccine was effective but the differences were not large enough to prevent charges of (unintentional) physician bias. Differences in the incidence of polio between vaccinated and nonvaccinated children in the placebo-control group were larger than those in the observed-placebo group, and the analysis of these data provided the definitive conclusion. Thus, owing to the careful planning and execution of this study, including the data collection and analysis, the immunization program that was later implemented has resulted in almost complete eradication of polio in the United States.
1.1 DATA COLLECTION
Data can be compiled in a variety of ways. For specific types of information, the U.S. Bureau of the Census can rely on nearly complete enumerations of the U.S. population or on data collected using sophisticated sample survey designs. The Bureau of the Census can ensure that all segments of the population are represented in most of the analyses that it wishes to perform. Many research endeavors, however, are conducted on a relatively small scale and are limited by time, manpower, or economics. Characteristic of these studies is a data base that is restricted by the data-collection techniques.
So important is the data base to a regression analysis that we begin our development of multiple linear regression with the data-collection phase. The emphasis of this section is on an understanding of the benefits associated with a good data collection effort and the influence on the interpretation of fitted models when the data base is restricted. While it may not always be possible to build a data base as large or as representative as one might desire, knowledge of the limitations of a data base can prevent many incorrect applications of regression methodology.
1.1.1 Data-Base Limitations
Regression analysis provides information on relationships between a response variable and one or more predictor variables but only to the degree that such information is contained in the data base. Whether the data are compiled from a complete enumeration of a population, an appropriate sample survey, a haphazard tabulation, or by simply inventing data, regression coefficients can be estimated and conclusions can be drawn from the fitted model. The quality of the fit and accuracy of conclusions, however, depend on the data used: data that are not representative or not properly compiled can result in poor fits and erroneous conclusions.
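The point that coefficients can always be estimated, whatever the origin of the data, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the helper names fit_line and r_squared, the straight-line model, and both data sets, which are simulated rather than taken from any study cited in this chapter.

```python
import random


def fit_line(x, y):
    """Return the least-squares (intercept, slope) for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Proportion of the variation in y accounted for by the fitted line."""
    y_bar = sum(y) / len(y)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_resid = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_resid / ss_total


rng = random.Random(0)  # fixed seed so the sketch is reproducible
x = [float(i) for i in range(1, 31)]

# Hypothetical "properly compiled" responses: a true line plus noise.
y_compiled = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]

# Hypothetical "invented" responses: numbers unrelated to the predictor.
y_invented = [rng.uniform(0, 20) for _ in x]

fit_compiled = fit_line(x, y_compiled)
fit_invented = fit_line(x, y_invented)
r2_compiled = r_squared(x, y_compiled, *fit_compiled)
r2_invented = r_squared(x, y_invented, *fit_invented)

# Both calls succeed and return coefficients; only the fit quality differs.
print("compiled data:", fit_compiled, r2_compiled)
print("invented data:", fit_invented, r2_invented)
```

The arithmetic raises no objection to the invented responses. Nothing in the estimation step itself signals that the second fitted line describes no real relationship, which is why the source of the data has to be examined separately.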
One of many studies that illustrate the problems that arise when one is forced to draw inferences from a potentially nonrepresentative sample is found in Crane (1965). In her attempt to assess the influence of graduate school prestige and current academic affiliation on productivity and peer recognition of university professors, she surveyed faculty members in three disciplines from three universities on the east coast of the United States. The responses were voluntary and presumably not all professors in these disciplines participated in the study. Although Crane's study did not call for a regression analysis, the interpretation problems that occur as a result of her data-collection effort are applicable regardless of the type of analysis performed.
Questions naturally arise concerning any conclusions that would be drawn from a study with the data-base limitations of this one. Do these three disciplines truly represent all academic disciplines? Can these three universities be said to be typical of all universities in the United States? If some professors chose not to participate in the study, are the responses thereby biased? These questions cannot be answered from Crane's data. Only if additional studies provide results similar to hers for other disciplines and other schools can global conclusions be drawn concerning the influence of graduate school and current academic affiliation on recognition and productivity of university professors. No amount of statistical analysis can compensate for these data-base limitations.
Criticisms of limited data bases and disagreements with conclusions drawn from the analysis of them are common. Nevertheless, the choice is often between conducting no investigations at all or analyzing restricted sets of data. We do not advocate the former position; however, it is the obligation of the data analyst to investigate the data-collection process, discover any limitations in the data collected, and restrict conclusions accordingly. Another example will stress these points and the consequences of underrating their importance.
A well-publicized study on male sexuality (Kinsey et al., 1948) evoked widespread criticism both because of its controversial subject matter and because of its data-collection procedures. Responses were solicited from males belonging to a large number of groups in order to make the sampling more feasible. About 5,300 males were interviewed in prisons, mental institutions, rooming houses, etc. By interviewing volunteers from groups such as these, a large sample of responses could be obtained without exhaustive effort and expense. The convenience of selecting responses in this fashion is the primary factor contributing to the debate over the results of the study.
Among the criticisms raised about the Kinsey report, most centered on the data-collection process. Some groups (such as college men) were overrepresented while others (such as Catholics) were underrepresented and still others (such as Blacks) were completely excluded. The subjects were all volunteers and this fact led to further charges of unrepresentativeness. Additional criticisms centered on the interview technique which relied solely on an individual's ability to recall events in his past.
The statistical methodology used in the Kinsey report was highly praised although it was descriptive and relatively simple (Cochran, Mosteller, and Tukey, 1954). In response to the criticisms of the Kinsey report, moreover, the investigators argued that this study was just a pilot study for a much larger sexual attitude survey. Nevertheless, in numerous instances the conclusions drawn from the study went beyond bounds that could be substantiated by the data. Actually, the conclusions are quite limited in generality.
The two examples just discussed demonstrate the problems that can arise from the absence of an adequately representative data base. Regardless of the sophistication of statistical analyses of the data, deficiencies in the data base can preclude valid conclusions. In particular, interpreting fitted regression models and comparing estimated model parameters in a regression analysis can lead to erroneous inferences if problems with the data go undetected or are ignored.
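One way in which a nonrepresentative data base can distort a fitted regression model is sketched below under stated assumptions. The population, the straight-line relationship, the noise level, and the rule that only units with a large response volunteer are all hypothetical. None of the numbers comes from the Crane or Kinsey studies, and the selection rule is only one of many ways in which voluntary participation might operate.

```python
import random


def fit_line(x, y):
    """Return the least-squares (intercept, slope) for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def slope_of(pairs):
    """Fitted slope for a list of (predictor, response) pairs."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    return fit_line(xs, ys)[1]


rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: response = 1 + 2 * predictor + noise.
population = []
for _ in range(5000):
    x = rng.uniform(0, 10)
    y = 1.0 + 2.0 * x + rng.gauss(0, 4)
    population.append((x, y))

# Hypothetical self-selection: only units whose response exceeds 12
# take part, so participation depends on the quantity being studied.
volunteers = [(x, y) for x, y in population if y > 12.0]

slope_population = slope_of(population)
slope_volunteers = slope_of(volunteers)

print("units in population:", len(population))
print("units volunteering: ", len(volunteers))
print("slope, full population:", slope_population)
print("slope, volunteers only:", slope_volunteers)
```

In this simulation the slope fitted to the volunteers alone is noticeably smaller than the slope fitted to the whole population, even though the same least-squares calculation is applied to both. An analyst who saw only the volunteer data would have no internal evidence of the discrepancy, which is the sense in which knowledge of the data-collection process must accompany the analysis.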
1.1.2 Data-Conditioned Inferences
Of particular relevance to a discussion of data-collection problems is the nature of the inferences that can be drawn once the data are collected. Data bases are generally compiled to be representative of a wide range of conditions but they can fail to be as representative as intended even when good data-collection techniques are employed. One can be led to believe that broad generalizations from the data are possible because of a good data-collection effort when a closer inspection of the data might reveal that deficiencies exist in the data base.
Equality o...