CHAPTER 1
Correlation Analysis
We begin preparing to learn about multiple regression by looking at correlation analysis. As you will see, the basic purpose of correlation analysis is to tell you whether two variables are related strongly enough to be included in a multiple regression model. As you will also see later, correlation analysis can help diagnose problems with a multiple regression model.
Take a look at the chart in Figure 1.1. This scatterplot shows 26 observations on 2 variables. These are actual data. Notice how the points seem to almost form a line. These data have a strong correlation; that is, you can imagine a line through the data that would be a close fit to the data points. While we will see a more formal definition of correlation shortly, thinking about correlation as data forming a straight line provides a good mental image. As it turns out, many variables in business have this type of linear relationship, although perhaps not this strong.
Figure 1.1 A scatterplot of actual data
Now take a look at the chart in Figure 1.2. This scatterplot also shows actual data. This time, it is impossible to imagine a line that would fit the data. In this case, the data have a very weak correlation.
Figure 1.2 Another scatterplot of actual data
Terms
Correlation is only able to find, and simple regression and multiple regression are only able to describe, linear relationships. Figure 1.1 shows a linear relationship. Figure 1.3 shows a scatterplot in which there is a perfect relationship between the X and Y variables, only not a linear one (in this case, a sine wave). While there is a perfect mathematical relationship between X and Y, it is not linear, and so there is little or no linear correlation between X and Y.
Figure 1.3 A scatterplot of nonlinear (fictitious) data
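To make the contrast concrete, the short Python sketch below compares a perfectly linear data set with a sine wave. It rests on several assumptions: the data are made up (they are not the data behind Figures 1.1 through 1.3), the function name pearson_r is ours, and the measure used is the standard Pearson correlation coefficient, which this chapter has not yet formally defined. How close the sine wave's coefficient comes to zero depends on how many cycles the data cover; here we assume about ten.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Fictitious X values from 0 to 62.8, roughly ten full sine cycles.
xs = [i * 0.1 for i in range(629)]

# A perfect linear relationship and a perfect nonlinear one.
line_ys = [2.0 * x + 1.0 for x in xs]
sine_ys = [math.sin(x) for x in xs]

r_line = pearson_r(xs, line_ys)
r_sine = pearson_r(xs, sine_ys)

print("linear data: r =", round(r_line, 3))
print("sine wave:   r =", round(r_sine, 3))
```

Both Y series are determined exactly by X, yet only the straight line produces a coefficient near 1. The sine wave's coefficient is small in magnitude because a single straight line cannot follow its repeated rises and falls.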
A positive linear relationship exists when a change in one variable is accompanied by a change in the same direction in another variable. For example, an increase in advertising will generally bring a corresponding increase in sales. When we describe this relationship with a line, that line will have a positive slope. The relationship shown in Figure 1.1 is positive.
A negative linear relationship exists when a change in one variable is accompanied by a change in the opposite direction in another variable. For example, an increase in competition will generally bring a corresponding decrease in sales. When we describe this relationship with a line, that line will have a negative slope.
Having a positive or negative relationship should not be seen as a value judgment. The terms “positive” and “negative” are not intended to be moral or ethical terms. Rather, they simply describe whether the slope coefficient is a positive or negative number—that is, whether the line slopes up or down as it moves from left to right.
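The sign of the slope can be checked with a few lines of Python. This is only a sketch under stated assumptions: the advertising, competition, and sales figures are invented for illustration, the function name slope is ours, and the slope is computed with the ordinary least-squares formula, which the text has not yet introduced.

```python
def slope(xs, ys):
    """Least-squares slope of the line fitted to paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: sales tend to rise as advertising rises.
advertising = [10, 20, 30, 40, 50]
sales_vs_advertising = [110, 135, 148, 170, 195]

# Hypothetical data: sales tend to fall as competitors increase.
competitors = [1, 2, 3, 4, 5]
sales_vs_competitors = [200, 180, 170, 150, 120]

positive_slope = slope(advertising, sales_vs_advertising)
negative_slope = slope(competitors, sales_vs_competitors)

print("advertising vs. sales slope:", positive_slope)
print("competition vs. sales slope:", negative_slope)
```

The first slope is a positive number and the second is a negative number, which is all that the terms "positive" and "negative" describe.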
While it does not matter for correlation, the variables we use with regression fall into one of two categories: dependent or independent variables. The dependent variable is a measurement whose value is controlled or influenced by another variable or variables. For example, someone’s weight is likely influenced by the person’s height and level of exercise, whereas company sales are likely greatly influenced by the company’s level of advertising. In scatterplots of data that will be used for regression later, the dependent variable is placed on the Y-axis.
An independent variable is just the opposite: a measurement whose value is not controlled or influenced by other variables in the study. Examples include a person’s height or a company’s advertising. That is not to say that nothing influences an independent variable. A person’s height is influenced by the person’s genetics and early nutrition, and a company’s advertising is influenced by its income and the cost of advertising. In the grand scheme of things, everything is controlled or influenced by something else. However, for our purposes, it is enough to say that none of the other variables in the study influences our independent variables.
While none of the other variables in the study should influence independent variables, it is not uncommon for the researcher to manipulate the independent variables. For example, a company trying to understand the impact of its advertising on its sales might try different levels of advertising in order to see what impact those varying values have on sales. Thus the “independent” variable of advertising is being controlled by the researcher. A medical researcher trying to understand the effect of a drug on a disease might vary the dosage and observe the progress of the disease. A market researcher interested in understanding how different colors and package designs influence brand recognition might perform research varying the packaging in different cities and seeing how brand recognition varies.
When a researcher is interested in finding out more about the relationship between an independent variable and a dependent variable, the researcher must measure both in situations where the independent variable is at differing levels. This can be done either by finding naturally occurring variations in the independent variable or by artificially causing those variations to occur.
When trying to understand the behavior of a dependent variable, a researcher needs to remember that it can have either a simple or multiple relationship with other variables. With a simple relationship, the value of the dependent variable is mostly determined by a single independent variable. For example, sales might be mostly determined by advertising. Simple relationships are the focus of Chapter 2. With a multiple relationship, the value of the dependent variable is determined by two or more independent variables. For example, weight is determined by a host of variables, including height, age, gender, level of exercise, food intake, and so on, and income could be determined by several variables, including raw material and labor costs, pricing, advertising, and competition. Multiple relationships are the focus of Chapters 3 and 4.
Scatterplots
Figures 1.1 through 1.3 are scatterplots. A scatterplot (which some versions of Microsoft Excel call an XY chart) places one variable on the Y-axis and the other on the X-axis. It then plots pairs of values as dots, with the X variable determining the position of each dot on the X-axis and the Y variable likewise determining the position of each dot on the Y-axis. A scatterplot is an excellent way to begin your investigation. A quick glance will tell you whether the relationship is linear or not. In addition, it will tell you whether the relationship is strong or weak, as well as whether it is positive or negative.
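If it helps to see the mechanics, the following Python sketch builds a crude scatterplot out of text characters. It is an illustration only, not how Excel or any charting package draws its charts. The function name text_scatter, the grid size, and the data are all assumptions of ours, and the sketch assumes the X values are not all identical and the Y values are not all identical.

```python
def text_scatter(xs, ys, width=40, height=12):
    """Return a character-grid scatterplot of paired values.

    Assumes xs and ys each contain at least two distinct values.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # The X value alone decides the column of the dot.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        # The Y value alone decides the row of the dot.
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        # Grid row 0 is the top of the plot, so flip the row index.
        grid[height - 1 - row][col] = "*"
    return "\n".join("".join(line) for line in grid)

# Fictitious paired observations with a strong positive relationship.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.0]

plot = text_scatter(xs, ys)
print(plot)
```

Each pair of values becomes one dot, and the dots climb from the lower left to the upper right, which is the visual signature of a strong, positive, linear relationship.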
Scatterplots are limited to exactly two variables: one to determine the position on the X-axis and another to determine the position on the Y-axis. As mentioned before, the dependent variable is p...