
Covariance and Correlation

Covariance and correlation are statistical measures used to quantify the relationship between two variables. Covariance measures how much two variables change together, while correlation standardizes this measure to a range of -1 to 1, indicating the strength and direction of the relationship. Both are important in analyzing data and understanding the associations between different technological and engineering factors.

Written by Perlego with AI-assistance

10 Key excerpts on "Covariance and Correlation"

  • Basic Computational Techniques for Data Analysis
    • D Narayana, Sharad Ranjan, Nupur Tyagi (Authors)
    • 2023 (Publication Date)
    • Routledge India (Publisher)
    7 CORRELATION COEFFICIENT
    DOI: 10.4324/9781003398127-7
    Learning Objectives
    After reading this chapter, the readers will be able to understand
    • The two measures of association – Covariance and Correlation
    • The concept of covariance
    • The concept of correlation and how it is different from covariance
    • Methods to estimate correlation coefficients – Pearson’s and Spearman’s methods
    • The use of Excel in calculating correlation coefficients
    In this chapter, we will introduce the concept, application, and computation of another set of descriptive measures that relate to an association between two or more variables. To illustrate, we might be interested in finding out how ice cream sales change as the temperature rises. The measures that attempt to quantify such a relationship between the variables, and help to assess its strength, are called measures of association. In this chapter, we will study two such measures, which are closely related to each other: covariance and correlation.

    7.1 Covariance

    Covariance is a statistical measure that analyzes the linear relationship between two random variables. It evaluates how the two variables vary together or covary. For instance, what happens to the other variable if one variable goes up, down, or remains constant? Accordingly, we can have the following types of linear relationships or covariances:
    • A positive covariance indicates a direct relationship between the variables; the two variables tend to move together in the same direction, either upward or downward. So, when one variable increases, the other variable increases or when one variable decreases, the other variable decreases.
    • A negative covariance indicates an inverse relationship between the variables; when one variable increases, the other tends to decrease (both cases are sketched below).
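As a minimal sketch of these two cases, echoing the ice-cream example from the chapter introduction, the data values below are invented and NumPy's np.cov stands in for the hand calculation:

```python
import numpy as np

# Hypothetical data: ice-cream sales rise with temperature,
# hot-drink sales fall with it.
temperature = np.array([18, 21, 24, 27, 30])   # degrees Celsius
ice_cream   = np.array([12, 15, 19, 24, 30])   # units sold
hot_drinks  = np.array([30, 26, 21, 15, 10])   # units sold

# np.cov returns a 2x2 covariance matrix; the off-diagonal entry is the
# sample covariance between the two inputs (n - 1 denominator).
print(np.cov(temperature, ice_cream)[0, 1])   # positive: direct relationship
print(np.cov(temperature, hot_drinks)[0, 1])  # negative: inverse relationship
```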
  • Statistical Applications for Environmental Analysis and Risk Assessment
    • Joseph Ofungwu (Author)
    • 2014 (Publication Date)
    • Wiley (Publisher)
    covariance also describes the relationship between two variables but does not provide a means of interpreting the magnitude of the relationship. The covariance is positive if the larger values of one variable largely correspond with the larger values of the other, and similarly for the lower values. Conversely, the covariance is negative if the higher (or lower) values of one variable mainly associate with the lower (or higher) values of the other. Correlation is related to covariance in that the linear (i.e., Pearson's) correlation coefficient between two variables can be obtained by dividing their covariance by the product of the standard deviations of the two variables, and both (i.e., correlation and covariance) always have the same algebraic sign. Correlation is more easily interpreted and more frequently used than covariance.
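The division described above is easy to verify numerically; the following sketch uses invented data and checks the hand computation against NumPy's corrcoef:

```python
import numpy as np

# Hypothetical paired data.
x = np.array([2.1, 3.4, 4.0, 5.2, 6.8, 7.1])
y = np.array([1.0, 2.2, 2.9, 4.1, 5.0, 6.3])

# Pearson's r is the covariance divided by the product of the
# standard deviations (matching ddof so the denominators agree).
cov_xy = np.cov(x, y)[0, 1]
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_builtin = np.corrcoef(x, y)[0, 1]

print(r_manual, r_builtin)  # identical, and both share the sign of cov_xy
```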
    Whereas autocorrelation is ordinarily undesirable because many statistical tests assume that the individual data values are uncorrelated, and may produce unreliable results if the data values are actually correlated, the field of geostatistics relies entirely on the presence of spatial correlation or covariance between contiguously located individual data values from the same variable (i.e., univariate data). Geostatistics provides methodologies for estimating data values at unsampled locations based on the modeled relationship between sampling locations and the data values. That model is usually referred to as a variogram or semivariogram, and the technique for estimating the values at the unsampled locations is usually referred to as kriging. Correlation, covariance and geostatistics are typically computed using software. Descriptions of the basic concepts are provided next, supplemented by example applications using software.

    11.2 Correlation and Covariance

    If a change in a variable is associated with a matching change in a second variable, both variables are said to be correlated, with the degree of association represented by the correlation coefficient. The correlation is said to be positive or direct if an increase in one variable is associated with a corresponding increase in the other, whereas negative or inverse correlation refers to the case where an increase in one variable is associated with a decrease in the other. The correlation can be linear, in which case the changes in both variables are proportional to each other (for instance, an increase by a factor of 3 in one variable is associated with a similar increase (or decrease) by a factor of 3 in the other variable), or nonlinear, in which case the changes are not proportional (for instance, a change by a factor of 2 in one variable is associated with a fourfold change in the other). Although, as described in Section 11.1, correlation does not imply causation, establishing correlation between variables can often provide useful clues in understanding the nature of site contamination. For instance, it may be helpful to know whether there is an association between pH and dissolved aluminum concentrations in groundwater at a contaminated site because, although pH does not cause aluminum contamination, lower pH values can result in increased dissolution of soil aluminum and hence increased concentrations of groundwater aluminum. Also, correlation is the basis for computing regression between variables, from which predictive relationships may be established. In the absence of substantial correlation between two variables, for instance, performing a regression analysis to predict one variable from the other would be pointless. It should also be noted that there should be some basis for explaining or expecting correlation between two variables; otherwise, the correlation would be described as spurious correlation
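To make the linear/nonlinear distinction concrete, here is a small sketch with invented data; note that Pearson's r measures only the linear part of an association:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_linear = 3.0 * x   # proportional: tripling x triples y
y_quad   = x ** 2    # nonlinear: doubling x quadruples y

print(np.corrcoef(x, y_linear)[0, 1])  # exactly 1.0 for a proportional change
print(np.corrcoef(x, y_quad)[0, 1])    # below 1.0, even though y is fully determined by x
```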
  • Understanding Quantitative Data in Educational Research
  • The data should be arranged as a correlation table or plotted as a scatter graph. The table or scatterplot should be carefully examined to compare the variables and to see whether the paired data points follow a straight line which indicates that the value of one variable is linearly associated with the value of the other variable.
  • If an association or a relationship exists between variables, the strength and direction of the relationship will be measured by a coefficient of correlation.
  • To see if the relationship occurs by chance, a null hypothesis is formulated, and then the p-value is computed from the data (a minimal sketch follows this list).
  • We cannot go directly from statistical correlation to causation, and further investigations are required.
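A minimal sketch of these steps, assuming SciPy and invented paired data (stats.pearsonr returns both the coefficient and the p-value for the null hypothesis of no correlation):

```python
from scipy import stats

# Hypothetical paired observations.
x = [1.2, 2.3, 2.9, 4.1, 4.8, 6.0, 7.2, 8.1]
y = [2.0, 2.8, 3.1, 4.9, 5.2, 6.8, 7.0, 8.9]

r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.4f}")
# A small p-value suggests the association is unlikely to be chance alone,
# but, as the final point above stresses, it still does not establish causation.
```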

13.1 Covariance and Correlation between two variables

Covariance and correlation describe the association (relationship) between two variables; the two statistics are closely related, but they are not the same. The covariance measures only the directional relationship between the two variables and reflects how they change together. A direct or positive covariance means that paired values of the two variables move in the same direction, while an indirect or negative covariance means they move in opposite directions.
The formula for covariance is:

$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

where $x_i$ is the ith x-value in the data set, $\bar{x}$ is the mean of the x-values, $y_i$ is the ith y-value in the data set, $\bar{y}$ is the mean of the y-values, and $n$ is the number of data values in each data set.
If cov(X, Y) > 0 there is a positive relationship between the dependent and independent variables, and if cov(X, Y) < 0 the relationship is negative.

Example 13.1 Computing the covariance

Data file: Ex13_1.csv
Suppose that a physics teacher would like to convince her students that the amount of time they spend studying for a written test is related to their test score. She asks seven of her students to study for 0.5, 1, …, 3.5 hours and records their test scores, which are displayed in Table 13.1.
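Table 13.1 is not reproduced in the excerpt, so the scores below are invented; the sketch simply applies the covariance formula from Section 13.1 to the seven study times:

```python
# Study times from the example; the test scores are hypothetical.
hours  = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
scores = [48, 55, 60, 63, 69, 72, 78]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# cov(X, Y) = sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores)) / (n - 1)
print(cov_xy)  # positive, supporting the teacher's claim
```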
  • Probability, Statistics and Other Frightening Stuff
    • Alan Jones (Author)
    • 2018 (Publication Date)
    • Routledge (Publisher)
    5 Measures of Linearity, Dependence and Correlation
    We have considered Measures of Central Tendency and Measures of Dispersion and Shape, but these are somewhat insular statistics or univariate statistics, meaning they are, in effect, one dimensional as they are expressing a view of a single value or range variable. Estimating is all about drawing relationships between variables that we hope will express insight into how the thing we are trying to estimate behaves in relation to something we already know, or at least, that we feel more confident in predicting.
    Ideally, we would like to be able to ascertain cause and effect between an independent variable or driver, and the dependent variable, or entity we are trying to estimate. However, the reality is that in many cases we cannot hope to understand the complex relationships of cause and effect and must content ourselves with drawing inferences from relationships that suggest that things tend to move in the same direction or in opposite directions, and therefore we can produce estimates by reading across changes in one variable (a driver) into changes in some other variable we want to estimate. In short, we want to have some bivariate or multivariate measures that can advise us when there appears to be a relationship between two or more variables. Correlation is a means of measuring the extent (if any) of any relationship.
    A word (or two) from the wise?
    'Statistician: A man who believes that figures don't lie, but admits that under analysis some of them won't stand up either'.
    Evan Esar (1899-1995) American humourist
    (I only say ‘appears to be a relationship’ because we are dealing with statistics here – heed the ‘A word [or two] from the wise’! We will always need to apply the sense check to the statistics.)
    Definition 5.1 Correlation
    Correlation is a statistical relationship in which the values of two or more variables exhibit a tendency to change in relationship with one another. These variables are said to be positively (or directly) correlated if the values tend to move in the same direction, and negatively (or inversely) correlated if they tend to move in opposite directions.
  • Practical Statistics for Geographers and Earth Scientists
    • Nigel Walford (Author)
    • 2011 (Publication Date)
    • Wiley (Publisher)
    8 Correlation
    Correlation is an overarching term used to describe those statistical techniques that explore the strength and direction of relationships between attributes or variables in quantitative terms. There are various types of correlation analysis that can be applied to variables and attributes measured on the ratio/interval, ordinal and nominal scales. The chapter also covers tests that can be used to determine the significance of correlation statistics. Correlation is often used by students and researchers in Geography, Earth and Environmental Science and related disciplines to help with understanding how variables are connected with each other.
    Learning outcomes This chapter will enable readers to:
    • carry out correlation analysis techniques with different types of variables and attributes;
    • apply statistical tests to find out the significance of correlation measures calculated for sampled data;
    • consider the application of correlation techniques when planning the analyses to be carried out in an independent research investigation in Geography, Earth Science and related disciplines.
    8.1 Nature of relationships between variables
    Several of the statistical techniques that we have explored in previous chapters have concentrated on one variable at a time and treated this in isolation from others that may have been collected for the same or different sets of sampled phenomena. However, generally the more interesting research questions are those demanding that we explore how different attributes and variables are connected with or related to each other. In statistics such connections are called relationships. Most of us are probably familiar, at least in simple terms, with the idea that relationships between people can be detected by means of analysing DNA. Thus, if the DNA of each person in a randomly selected group was obtained and analysed, it would be possible to discover if any of them were related to each other. If such a relationship were to be found, it would imply that the two people had something in common, perhaps sharing a common ancestor. In everyday terms, we might expect to see some similarity between those pairs of people having a relationship, in terms of such things as eye, skin or hair colour, facial features and other physiological characteristics. If these characteristics were to be quantified as attributes or variables, then we would expect people who were related to each other to possess similar values and those who are unrelated to have contrasting values.
  • Discovering Statistics Using SAS
    correlation coefficient. We then discover how to carry out and interpret correlations in SAS. The chapter ends by looking at more complex measures of relationships; in doing so it acts as a precursor to the chapter on multiple regression.

    6.2. Looking at relationships

    In Chapter 4 I stressed the importance of looking at your data graphically before running any other analysis on them. I just want to begin by reminding you that our first starting point with a correlation analysis should be to look at some scatterplots of the variables we have measured. I am not going to repeat how to get SAS to produce these graphs, but I am going to urge you (if you haven't done so already) to read section 4.7 before embarking on the rest of this chapter.

    6.3. How do we measure relationships?

    6.3.1. A detour into the murky world of covariance
    The simplest way to look at whether two variables are associated is to look at whether they covary. To understand what covariance is, we first need to think back to the concept of variance that we met in Chapter 2. Remember that the variance of a single variable represents the average amount that the data vary from the mean. Numerically, it is described by:

$$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$$

    The mean of the sample is represented by $\bar{x}$, $x_i$ is the data point in question and $N$ is the number of observations (see section 2.4.1). If we are interested in whether two variables are related, then we are interested in whether changes in one variable are met with similar changes in the other variable. Therefore, when one variable deviates from its mean we would expect the other variable to deviate from its mean in a similar way. To illustrate what I mean, imagine we took five people and subjected them to a certain number of advertisements promoting toffee sweets, and then measured how many packets of those sweets each person bought during the next week. The data are in Table 6.1 as well as the mean and standard deviation (s
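Table 6.1 is not reproduced in the excerpt, so the figures below are invented; the sketch mirrors the reasoning in the passage, first the variance of one variable, then the paired deviations of two:

```python
import numpy as np

# Hypothetical data: adverts watched and packets of sweets bought by five people.
adverts = np.array([5, 4, 4, 6, 8])
packets = np.array([8, 9, 10, 13, 15])

# Variance: average squared deviation from the mean (N - 1 denominator).
print(np.var(adverts, ddof=1))

# Covariance: do deviations from the two means move together?
cross_deviations = (adverts - adverts.mean()) * (packets - packets.mean())
print(cross_deviations.sum() / (len(adverts) - 1))  # equals np.cov(adverts, packets)[0, 1]
```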
  • Statistics for the Behavioural Sciences
    Y, and is calculated as:

$$\mathrm{COV}_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$$

    where n is the number of paired observations (in most cases this corresponds to the number of subjects sampled).
    Notice the similarity of the above formula to the formula used to calculate the population variance estimated from sample data, $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$. As in the case of the variance, to provide a better estimate of the population covariance using sample data, n − 1 is used instead of n as the denominator.
    For the data-set presented in Table 11.2 (see also Table 11.3, where computation details are presented) the covariance between degree mark (i.e., X) and monthly salary (i.e., Y) is:

$$\mathrm{COV}_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1} = \frac{72567.44}{42} = 1727.796$$

    Table 11.3 Data and computational details for calculating the Pearson correlation coefficient r to measure the strength of the linear relationship between degree mark and monthly income in a sample of 43 graduates

    The Pearson product-moment correlation coefficient r

    The magnitude of the covariance is a function of the scales used to measure X and Y (i.e., their standard deviations). Hence, the covariance is not appropriate to measure the strength of the relationship between two variables. An absolute covariance of a given size may reflect either a weak relationship, if the standard deviations of the two variables investigated are large, or a strong relationship if the standard deviations of the two variables are small. To avoid this problem we need an index of the strength of the linear relationship between two variables which is independent of the scales used to measure them. To obtain this index the covariance is divided by the product of the standard deviations of the variables. The standardised covariance between two variables is called the Pearson product-moment correlation coefficient r
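In symbols, the standardised covariance just described is

$$r = \frac{\mathrm{COV}_{xy}}{s_x \, s_y}$$

where $s_x$ and $s_y$ are the standard deviations of the two variables, which is why r is independent of the measurement scales.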
  • Business Statistics Using Excel
    A Complete Course in Data Analytics

    13 Correlation and Covariance
    DOI: 10.4324/9781032671703-13
    Learning Objectives
    After going through this chapter, you will be able to
    • Understand the importance of the correlation coefficient.
    • Analyse ungrouped data using the correlation coefficient (Pearson's coefficient of correlation).
    • Compute the correlation coefficient of grouped data (Pearson's coefficient of correlation).
    • Understand the working of the rank correlation coefficient with illustrations.
    • Analyse data using the auto-correlation coefficient.
    • Study the theory and applications of covariance.

    13.1 Introduction

    In reality, there are numerous circumstances in which two variables may be related, such as a country's economic growth and the growth in its stock market index, government spending and economic growth, and so on. Both covariance and the correlation coefficient, which are discussed in this chapter, can be used to estimate these relationships [1].

    13.2 Correlation

    The correlation coefficient is a metric used to express the strength of association between two variables. The correlation coefficient has a range of values from –1 to +1. When two variables have a correlation coefficient of –1, their inverse correlation is at its highest. If it is 0, there is no linear relationship between the two variables and a zero degree of linear association between them. The two variables have the highest possible positive correlation if it is 1 [4].
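The three landmark values can be illustrated with a small sketch (invented data); note in the last line that r = 0 rules out only a linear association, not any relationship at all:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

print(np.corrcoef(x,  2 * x + 1)[0, 1])    # +1: perfect positive correlation
print(np.corrcoef(x, -3 * x + 10)[0, 1])   # -1: perfect inverse correlation
print(np.corrcoef(x, (x - 3) ** 2)[0, 1])  #  0: no linear association, yet y depends on x
```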
    The different types of correlation coefficient are listed here.
    • Correlation coefficient of ungrouped data (Pearson’s coefficient of correlation)
    • Correlation coefficient of grouped data (Pearson’s coefficient of correlation)
    • Rank correlation coefficient
    • Auto-correlation coefficient
    • Covariance

    13.3 Correlation Coefficient of Ungrouped Data Using Excel Sheets and Correl Function

    In reality, the degree of correlation between two variables is a matter of interest. Consider the cost of living and the urbanisation index as two variables. Generally speaking, it is believed that as urbanisation increases, so will the cost of living. Such a notion may not hold true in direct proportion if the area of study is entirely surrounded by farming communities and other cities are located far away from it. Whatever the situation, the government or organisations will be motivated to commission studies that will aid in locating upcoming businesses so that the rise in the cost of living is kept within a tolerable range.
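In Excel the computation in this section reduces to the built-in CORREL worksheet function, e.g. =CORREL(A2:A7, B2:B7); a Python equivalent, with invented figures for the two variables discussed above, is:

```python
import numpy as np

# Hypothetical urbanisation index and cost-of-living figures for six areas.
urbanisation   = np.array([0.42, 0.55, 0.61, 0.68, 0.74, 0.81])
cost_of_living = np.array([ 100,  112,  118,  131,  129,  145])

print(np.corrcoef(urbanisation, cost_of_living)[0, 1])  # strength of the association
```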
  • Statistics for the Behavioural Sciences
    An Introduction to Frequentist and Bayesian Approaches

    • Riccardo Russo (Author)
    • 2020 (Publication Date)
    • Routledge (Publisher)
    Y, and is calculated as:

$$\mathrm{COV}_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$$

    where n is the number of paired observations (in most cases this corresponds to the number of subjects sampled).
    Notice the similarity of the above formula to the formula to calculate the population variance estimated from sample data, $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$. As in the case of the variance, to provide a better estimate of the population covariance using sample data, n − 1 is used instead of n as the denominator.
    For the data-set presented in Table 10.2 (see also Table 10.3, where computation details are presented) the covariance between degree mark (i.e., X) and monthly salary (i.e., Y) is:

$$\mathrm{COV}_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1} = \frac{72567.44}{42} = 1727.796$$

    10.5 The Pearson product-moment correlation coefficient r

    The magnitude of the covariance is a function of the scales used to measure X and Y
  • Statistical Rules of Thumb
    4 Covariation
    One of the key statistical tasks is the investigation of relationships among variables. Common terms used to describe them include agreement, concordance, contingency, correlation, association, and regression. In this chapter covariation will be used as the generic term indicating some kind of linkage between events or variables. Almost equivalent is the term association, but it tends to conjure up images of categorical data. Covariation is more neutral and also has the advantage of suggesting a statistical-scientific framework.
    The topic of covariation is immense; a whole book could be written. Books on this topic typically fall into two categories: those dealing with discrete variables, for example Agresti (2007), and those dealing with continuous variables, for example Kleinbaum et al. (1998). Additionally, specific subject areas, such as epidemiology and environmental studies, tend to have their own specialized books.
    The rules suggested here can be multiplied manyfold. The aim of this chapter is to discuss assumptions and some straightforward implications of ways of assessing covariation. One of the themes is that assessing covariation requires a good understanding of the research area and the purpose for which the measure of covariation is going to be assessed. Since the Pearson product-moment coefficient of correlation is the most widely used and abused, the chapter focuses on it.
    It is unprofitable to attempt to devise a complete taxonomy of all the terms used to describe covariation. It will suffice to outline some broad categories as summarized in Figure 4. The first category deals with the source of the data: Are they based on experimental studies or observational studies? In a broad sense this deals with the way the data are generated. This will be important in terms of the appropriate inferences. A second classification describes the nature of the variables; a rough split is categorical versus continuous. This leaves ordinal variables and rank statistics in limbo territory somewhere between these two, but this will do at first blush. Things become complicated enough if one variable is categorical and the other is not. The third classification is symmetric versus asymmetric. Correlation is implicitly based on symmetric covariation, whereas regression is asymmetric. Within symmetric measures a distinction is made between agreement and correlation. Within asymmetric measures, distinguish between regression, calibration, and prediction. A measure of causal covariation will, by necessity, be asymmetric. Together these classifications lead to 30 subcategories, indicating that selecting appropriate measures of covariation is a nontrivial task. Specific scientific areas tend to have their own favorite measures. For example, the kappa statistic is very popular in the social sciences. Some measures can be used in several of these classifications. A key point is that some measures are inappropriate, or at least less than optimal, for some situations, as will be indicated below.