Chapter 1
Statistical Inference
1.1 What statistical inference is all about
The decisions of politicians, businesses, engineers, not-for-profit organizations, etc. typically have an influence on many people. Changes to child benefits by a government, for example, influence the financial position of many households. The government is interested (hopefully) in the effect of such measures on individual households. Of course, the government can’t investigate the effect on every individual household. That would simply take too much time and make it almost impossible to design general policies. Therefore, the government could restrict itself by focussing on the average effect on households in, say, the lowest income quartile. Even finding out this number is typically too difficult to do exactly. So, the government relies on information obtained from a small subset of these households. From this subset the government will then try to infer the effect on the entire population.
The above example gives, in a nutshell, the goal of statistics. Statistics is the study of collecting and describing data and drawing inferences from these data. Politicians worry about the impact of budgetary measures on the average citizen, a marketeer is concerned with median sales over the year, an economist worries about the variation in employment figures over a 5-year period, a social worker is concerned about the correlation between criminality and drug use, etc. Where do all these professionals get that information from? Usually from data about their object/subject of interest. However, a long list of numbers does not really help these professionals in analysing their subject and in making appropriate decisions accordingly. Therefore, the “raw data” (the list of responses you get if, for example, you survey 500 people) are condensed into manageable figures, tables, and numerical measures. How to construct these is the aim of descriptive statistics. How to use them as evidence to be fed into the decision-making process is the aim of inferential statistics and the subject of this book. This chapter introduces in an informal way some of the statistical jargon that you will encounter throughout the book.
Inferential statistics is the art and science of interpreting evidence in the face of uncertainty.
Example 1.1. Suppose that you want to know the average income of all university students in the country (for example, to develop a financial product for students). Then, obviously, you could simply go around the country and ask every student about their income. This would, however, be a very difficult thing to do. First of all, it would be extremely costly. Secondly, you may miss a few students who are not in the country at present. Thirdly, you have to make sure you don’t count anyone twice.
Alternatively, you could only collect data on a subgroup of students and compute their average income as an approximation of the true average income of all students in the country. But now you have to be careful. Because you do not observe all incomes, the average that you compute is an estimate. You will need to have some idea about the accuracy of your estimate. This is where inferential statistics comes in.
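The difference between the true average and an estimate of it can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the population is synthetic (incomes drawn uniformly between $0 and $20,000, not real student data), and the names `population`, `sample`, and `sample_size` are hypothetical.

```python
import random
import statistics

# Hypothetical population: 100,000 synthetic student incomes (not real data).
rng = random.Random(42)  # fixed seed so the sketch is reproducible
population = [rng.uniform(0, 20_000) for _ in range(100_000)]

# The parameter: the average income over the whole population.
population_mean = statistics.fmean(population)

# The estimate: the average income in a random sample of 500 students.
sample_size = 500
sample = rng.sample(population, sample_size)
sample_mean = statistics.fmean(sample)

# The gap between estimate and parameter is what inference tries to quantify.
estimation_error = sample_mean - population_mean

print(f"population mean: {population_mean:,.0f}")
print(f"sample mean:     {sample_mean:,.0f}")
print(f"estimation error: {estimation_error:,.0f}")
```

Drawing a different sample from the same population would give a different sample mean. In practice the population mean is unknown, so the estimation error cannot be computed directly; it can only be assessed with the methods of inferential statistics.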
Let’s rephrase the above example in more general terms: you wish to obtain information about a summary (called a parameter) of a measurement of a characteristic (called a variable) of a certain group of people/objects/procedures/… (called a population) based on observations from only a subset of the population (called a sample), taking into account the distortions that occur by using a sample rather than the population. All four notions (parameter, variable, population, and sample) will be made precise in this book. For now it suffices to have an intuitive idea.
The goal of inferential statistics is to develop methods that we can use to infer properties of a population based on sample information.
1.2 Why statistical inference is difficult
There is a great need for methods to gather data and draw appropriate conclusions from the evidence that they provide. The costs of making erroneous decisions can be very high indeed. Often people make judgements based on anecdotal evidence. That is, we tend to look at one or two cases and then project these experiences onto our world view. But
anecdotal evidence is not evidence.
At its most extreme, an inference based on anecdotal evidence would be to play the lottery because you heard that a friend of a friend’s grandmother once won it. A collection of anecdotes never forms an appropriate basis from which general conclusions can be drawn.
Example 1.2 (Gardner, 2008). On November 6, 2006, the Globe and Mail ran a story about a little girl, who, when she was 22 months old, developed an aggressive form of cancer. The story recounted her and her parents’ protracted battle against the disease. She died when she was 3 years old. The article came complete with photographs of the toddler showing her patchy hair due to chemotherapy. The paper used this case as the start for a series of articles about cancer and made the little girl, effectively, the face of cancer.
No matter how dreadful this story may be from a human perspective, it is not a good basis for designing a national health policy. The girl’s disease is extremely rare: she was a one-in-a-million case. Cancer is predominantly a disease of the elderly. Of course you could say: “any child dying of cancer is one too many,” but since we only have finite resources, how many grandparents should not be treated to fund treatment for one little girl? The only way to try and come up with a semblance of an answer to such questions is to quantify the effects of our policies. But in order to do that we need to have some idea about the effectiveness of treatment in the population as a whole, not just one isolated case.
The human tendency to create a narrative based on anecdotal evidence is very well documented and hard-wired into our brains.1 Our intuition drives us to make inferences from anecdotal evidence. That does not make those inferences any good. In fact, a case can be made that societies waste colossal amounts of money because of policies that are based on anecdotal evidence, propelled to the political stage via mob rule or media frenzy.
In order to control for this tendency, we need to override our intuition and use a formal framework to draw inferences from data. The framework that has been developed over the past century or so to do this is the subject of this book. The concepts, tools, and techniques that you will encounter are distilled from the efforts of many scientists, mathematicians, and statisticians over decades. Together they represent the core of what is generally considered to be the consensus view of how to deal with evidence in the face of uncertainty.
1.3 What kind of parameters are we interested in?
As stated above, statistics starts with the idea that you want to say something about a parameter, based on information pertaining to only a subgroup of the entire population. Keep in mind the example of average income (parameter) of all university students (population). Of course not every student has the same income (which is the variable that is measured). Instead there is a spread of income levels over the population. We call such a spread a distribution. The distribution tells you, for example, what percentage of students has an income between $5,000 and $6,000 per year. Or what percentage of students has an income above or below $7,000. The parameter of interest in a statistical study is usually a particular feature of this distribution. For example, if you want to have some idea about the center of the distribution, you may want to focus on the mean (or average) of the distribution. Because the mean is so often used as a parameter, we give it a specific symbol, typically the Greek2 letter μ.
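As a rough illustration of what a distribution tells you, the following Python sketch computes the share of a population within an income band and the population mean. The ten incomes are made up for the example, and the helper names `share_between` and `share_above` are hypothetical.

```python
import statistics

# Hypothetical yearly incomes (in dollars) for a tiny population of ten students.
incomes = [3_200, 4_800, 5_100, 5_500, 5_900, 6_400, 6_800, 7_200, 8_500, 9_600]


def share_between(values, low, high):
    """Fraction of values v with low <= v < high."""
    return sum(low <= v < high for v in values) / len(values)


def share_above(values, threshold):
    """Fraction of values strictly above threshold."""
    return sum(v > threshold for v in values) / len(values)


share_5k_to_6k = share_between(incomes, 5_000, 6_000)  # 3 of the 10 students
share_above_7k = share_above(incomes, 7_000)           # 3 of the 10 students
mu = statistics.mean(incomes)                          # the population mean

print(f"share earning $5,000 to $6,000: {share_5k_to_6k:.0%}")
print(f"share earning above $7,000:     {share_above_7k:.0%}")
print(f"population mean:                ${mu:,}")
```

The shares describe the spread of incomes, whereas the mean condenses the whole distribution into a single number.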
1.3.1 Center of a distribution
The mean, or average, of a population gives an idea about the center of the distribution. It is the most commonly used summary of a population. The average is often interpreted as describing the “typical” case. However, if you collapse an entire population into just one number, there is always the risk that you get results that are distorted. The first question that should be answered in any statistical analysis is: “Is the parameter I use appropriate for my purpose?”
In this book I don’t have much to say about this: we often deal with certain parameters simply because the theory is best developed for them. A few quick examples, though, should convince you that the question of which parameter to study is not always easy to answer.3 Imagine that you are sitting in your local bar with eight friends and suppose that each of the nine of you earns $40,000 per year. The average income of the group is thus $40,000. Now suppose that the local millionaire walks in who has an income of $1,500,000 per year. The average income of the group of ten is now (9 × $40,000 + $1,500,000)/10 = $186,000. I’m sure you’ll agree that average income in this case is not an accurate summary of the population.
This point illustrates that the mean is highly sensitive to outliers: extreme observations, either large or small. In the income case it might be better to look at the median. This is the income level such that half the population earns more and half the population earns less. In the bar example, no matter whether the local millionaire is present, the median income is $40,000. The difference between mean and median can be subtle and lead to very different interpretations of, say, the consequences of policy. For example, during the George W. Bush administration, it was at one point claimed that new tax cuts meant that 92 million Americans would, on average, get a tax reduction of over $1,000. Technically, this statement was correct: over 92 million Americans received a tax cut and the average value was $1,083. However, the median tax cut was under $100. In other words, a lot of people got a small tax cut, whereas a few Americans got a very large tax cut.
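The bar example can be checked with a short Python sketch using the standard `statistics` module. The variable names are chosen for illustration only; the incomes are those given in the text.

```python
import statistics

# Nine people in the bar (you and eight friends), each earning $40,000.
bar = [40_000] * 9

mean_before = statistics.mean(bar)
median_before = statistics.median(bar)

# The local millionaire walks in with an income of $1,500,000.
bar_with_millionaire = bar + [1_500_000]

mean_after = statistics.mean(bar_with_millionaire)
# With ten people the median is the average of the two middle incomes,
# both of which are $40,000.
median_after = statistics.median(bar_with_millionaire)

print(f"before: mean = {mean_before:,.0f}, median = {median_before:,.0f}")
print(f"after:  mean = {mean_after:,.0f}, median = {median_after:,.0f}")
```

One extreme observation moves the mean from $40,000 to $186,000 and leaves the median at $40,000.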
Not that the median is always a good measure to describe the “typical” case in a population either. For example, if a doctor tells you after recovery from a life-saving operation that the median life expectancy is 6 months, you may not be very pleased. If, however, you knew that the average life expectancy is 15 years, the picture looks a lot better. What is happening here is that a lot of patients (50%) die very soon (within 6 months) after the operation. Those who survive the first 6 months can look forward to much longer lives.
Both examples...