1 Introduction
SUMMARY
A very large number of clinical studies with human subjects have been, and are being, conducted in a wide range of settings. The design and analysis of such studies demand the use of statistical models. Describing such situations involves specifying the model, including defining population regression coefficients (the parameters), and then stipulating the way these are to be estimated from the data arising from the subjects (the sample) who have been recruited to the study. This chapter introduces the simple linear regression model to describe studies in which the measure made on the subjects can be assumed to be a continuous variable, the value of which is thought to depend on a single covariate that is either binary or continuous.
Associated statistical methods are also described: defining the null hypothesis, estimating means and standard deviations, comparing groups by use of a z- or t-test, and calculating confidence intervals and p-values. We give examples of how a statistical computer package facilitates the relevant analyses and also provides support for suitable graphical display.
Finally, examples from the medical and associated literature are used to illustrate the wide range of application of regression techniques: further details of some of these examples are included in later chapters.
INTRODUCTION
The aim of this book is to introduce those who are involved with medical studies, whether laboratory, clinic, or population based, to the wide range of regression techniques which are pertinent to the design, analysis, and reporting of the studies concerned. Our intended readership therefore ranges from health care professionals of all disciplines who are concerned with patient care to those more involved with non-clinical aspects such as medical support and research in the laboratory and beyond.
Even in the simplest of medical studies in which, for example, a single feature is recorded from a series of samples taken from individual patients, one may ask why the resulting values differ from each other. It may be that they differ between the genders and/or between the different ages of the patients concerned, or because of the severity of their illnesses. In more formal terms we examine whether or not the value of the observed variable, y, depends on one or more of the (covariate) variables, often termed the x's. Although the term covariate is used here in a generic sense, we will emphasize that individually they may play different roles in the design and hence analysis of the study of which they are a part. If one or more covariates do influence the outcome, then we are essentially claiming that part of the variation in y is a result of individual patients having different values of the x's concerned. In that case, any variation remaining after taking these covariates into consideration is termed the residual or random variation. If the covariates do not have influence, then we have not explained (strictly, not explained an important part of) the variation in y by the x's. Nevertheless, there may be other covariates of which we are not aware that would.
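The idea that part of the variation in y is attributable to a covariate, with the remainder being residual, can be made concrete with a small sketch. The Python code below is only an illustration under stated assumptions: the function name partition_variation and the six data values are hypothetical and are not taken from any study in this book. It splits the total sum of squares of y into a part explained by a single categorical covariate and a residual part.

```python
from statistics import mean


def partition_variation(y, x):
    """Split the total sum of squares of y into the part explained by a
    categorical covariate x (between groups) and the residual part
    (within groups)."""
    grand_mean = mean(y)

    # Collect the y values observed at each level of the covariate.
    groups = {}
    for yi, xi in zip(y, x):
        groups.setdefault(xi, []).append(yi)

    total_ss = sum((yi - grand_mean) ** 2 for yi in y)
    explained_ss = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    residual_ss = sum(
        (yi - mean(values)) ** 2
        for values in groups.values()
        for yi in values
    )
    return total_ss, explained_ss, residual_ss


# Hypothetical measurements, made up purely for illustration.
y = [1.0, 1.2, 1.4, 0.8, 1.0, 1.2]
x = ["F", "F", "F", "M", "M", "M"]

total_ss, explained_ss, residual_ss = partition_variation(y, x)
print(round(total_ss, 4), round(explained_ss, 4), round(residual_ss, 4))
```

With these made-up values the explained and residual parts add up to the total, and the covariate accounts for roughly 27% of the variation in y. The remainder is what the text calls residual or random variation, some of which an unmeasured covariate might still explain.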
Measurements made on human subjects rarely give exactly the same results from one occasion to the next. Even in adults, height varies a little during the course of the day. If one measures the cholesterol level of an individual on one particular day and then again the following day, under exactly the same conditions, greater variation would be expected than for height. Any variation that we cannot ascribe to one or more covariates is usually termed random variation, although, as we have indicated, an unknown covariate may account for some of this. The level of inherent variability may be so high that, perhaps in circumstances where a subject has an illness, the oscillations in these measurements may disguise, at least in the early stages, the beneficial effect of treatment given to improve the condition.
STATISTICAL MODELS
Whatever the type of study, it is usually convenient to think of the underlying structure of the design in terms of a statistical model. This model encapsulates the research question we intend to formulate and ultimately answer. Once the model is specified, the object of the corresponding study (and hence the eventual analysis) is to estimate the parameters of this model as precisely as is reasonable.
Comparing two means
Suppose a study is designed to investigate the relationship between high density lipoprotein (HDL) cholesterol levels and gender. Once the study has been conducted, the observed data for each gender may be plotted in a histogram format as in Figure 1.1.
These figures illustrate a typical situation in that there is considerable variation in the value of the continuous variable HDL, ranging from approximately 0.4 to 2.0 mmol/L. Further, both distributions tend to peak towards the centre of their ranges and there is a suggestion of a difference between males and females. In fact the mean value is higher at $\bar{y}_F = 1.2135$ for the females compared with $\bar{y}_M = 1.0085$ mmol/L for the males.
Formal comparisons between these two groups can be made using a statistical significance test. Thus, we can regard $\bar{y}_F$ and $\bar{y}_M$ as estimates of the true or population mean values $\mu_F$ and $\mu_M$. The corresponding standard deviations are given by $s_F = 0.3425$ and $s_M = 0.2881$ mmol/L, and these estimate the respective population values $\sigma_F$ and $\sigma_M$. To test the null hypothesis of no difference in HDL levels between males and females, the usual procedure is to assume HDL within each group has an app...