Mathematics

Linear Regression

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the variables and aims to find the best-fitting line that minimizes the differences between the observed and predicted values. This technique is commonly used for prediction and forecasting in various fields.

Written by Perlego with AI-assistance

11 Key excerpts on "Linear Regression"

  • Statistics for Dental Clinicians
    • Michael Glick, Alonso Carrasco-Labra, Olivia Urquhart (Authors)
    • 2023 (Publication Date)
    • Wiley-Blackwell
      (Publisher)
    Chapter 12). These complex relationships can be addressed in the design and analysis phases using an appropriate study design and analysis method (i.e., regression analysis).

    Prediction

    Another practical application of regression is to derive a set of independent variables that best predicts a dependent variable. When the goal is to identify a subset of independent variables that explains a large proportion of the variability in the values of the dependent variable, regression analysis can be used to predict future events.
    Irrespective of the research goal (estimation or prediction), the independent and dependent variables in a regression equation can take on many forms (e.g., continuous, dichotomous, ordinal, time‐to‐event) (Chapter 1). The type of data will drive which regression equation to use to analyze observed data.

    Linear Regression

    Linear Regression is a subtype of regression analysis that describes the relationship between one or more independent variables and a continuous dependent variable. The most rudimentary form of Linear Regression is simple Linear Regression (SLR), which describes the relationship between two variables. The dependent variable is always continuous, while the independent variable can be continuous or categorical. For simplicity, we will address the relationship between two continuous variables (e.g., does an individual’s weight (continuous variable 1) depend on their height (continuous variable 2)?) herein, but note that the independent variable can take on other forms (e.g., categorical).
    The correlation between two continuous variables can be quantified with a correlation coefficient, which indicates the strength and direction of a linear association (Chapter 10). Going one step further, an equation for a line that best fits the data can be estimated—a line of best fit. Suppose a data set contains values for two continuous variables (data pairs) measured for all individuals in the data set, where x is the height variable and y is the body weight variable. A scatterplot of these data shows a direct linear correlation between height and weight (Figure 11.1). An equation for the line that best fits these data can be derived from the basic formula of a straight line, a linear formula (Appendix 1, Formula 11.1; Figure 11.2), where y, the value of the dependent variable, equals the y‐intercept (the value of y when x = 0, also referred to as a constant “a”) plus the product of the slope of the line (the amount that the value of y increases for every one‐unit increase in the value of x, “b”) and x (Figure 11.2
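    A minimal sketch of the line-of-best-fit idea described above, using invented height and weight values (not data from the excerpt) and NumPy's polyfit to estimate the constant “a” and the slope “b”:

    import numpy as np

    # Hypothetical height (cm) and body weight (kg) pairs, for illustration only.
    height = np.array([150.0, 158.0, 163.0, 170.0, 175.0, 181.0, 188.0])
    weight = np.array([52.0, 58.0, 61.0, 68.0, 72.0, 79.0, 85.0])

    # Degree-1 polyfit returns the least-squares slope b and intercept a
    # of the line weight = a + b * height.
    b, a = np.polyfit(height, weight, 1)
    print(f"line of best fit: weight = {a:.1f} + {b:.2f} * height")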
  • Theory of Linear Models
    CHAPTER 1

    Simple Linear Regression

    Linear models are used for studying the explanation of a given variable in terms of a linear combination of given explanatory variables. In the present chapter, we discuss the case of one explanatory variable, as a preparation for the general case, which is treated in Chapter 2 and onwards. Readers already familiar with simple Linear Regression may want to go directly to Chapter 2, with occasional reference to Chapter 1 as necessary.

    1.1 The Linear Regression model

    Consider an experiment in which we make simultaneous measurements of two variables x and y for a range of different experimental conditions. If we make n measurements, let (x1, y1), …, (xn, yn) denote the corresponding n pairs of observations. We construct a statistical model for the situation where the relationship between x and y is thought to be linear or approximately so.
    Often x represents the experimental conditions, and y represents the outcome of the experiment. The variable y is called the response variable or the dependent variable. We assume that y1, …, yn are realizations of independent random variables Y1, …, Yn. In contrast, the values x1, …, xn are considered constant (non-random), and x is called the explanatory variable or the independent variable. We must hence make a clear distinction between the response and explanatory variables in regression analysis. Even if x1, …, xn are realizations of random variables X1, …, Xn, we may think of x1, …, xn as fixed, in the sense that we consider the conditional distribution of Y1, …, Yn given X1 = x1, …, Xn = xn.
    The first step in the analysis is to make a scatterplot of y versus x. A typical scatterplot is shown in Figure 1.1
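    The first step the excerpt recommends, a scatterplot of y versus x, is easy to sketch in Python; the (x, y) pairs below are invented stand-ins for the n observation pairs:

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical observation pairs (x1, y1), ..., (xn, yn).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

    plt.scatter(x, y)                      # scatterplot of y versus x
    plt.xlabel("x (explanatory variable)")
    plt.ylabel("y (response variable)")
    plt.show()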
  • Simple Statistical Tests for Geography
    11 Regression Analysis

    11.1 Simple Linear Regression

    Simple Linear Regression is a method that allows a ‘best-fit’ line to be added to a set of points on an x-y plot or scatter-graph. There are many uses for regression in geography and related disciplines. For example, when you know the value on the horizontal axis it allows you to define the most likely value on the vertical axis. Where one of the axes represents space or time, the best-fit or regression line can be used, with care, to make predictions that go beyond the range of the measurements. Relationships defined using regression can also be extended into the past, allowing, for example, the reconstruction of past climate and environmental change.
    When I was a student, simple Linear Regression was a method that was just touched on at the end of a typical geography course on statistical methods, and the complexity of the mathematics made it very difficult to use. With modern computers, however, all of that has changed, and performing regression analysis is remarkably simple and requires no mathematics at all. In fact, you have already seen it performed in the last chapter, because the ‘trendline’ that is fitted to an x-y plot or scatter-graph in a spreadsheet is actually a ‘regression line’. It appears at the click of a mouse. The ease with which regression can now be conducted is both a blessing and a curse for geography students. It is very easy to do, but it is also very easy to do it wrong and produce absolute nonsense. If you want to use regression it is really important that you understand how it works (Figure 11.1). Only then will you be able to check that the assumptions have been met and sensibly interpret your results.

    11.2 The Straight Line Equation

    In the chapter on correlation analysis, x-y plots were used to illustrate the shape of the relationship between two parameters or variables. Each point on such a graph represents a pair of numbers, one representing the variable on the horizontal (x) axis and the other the variable on the vertical (y) axis. The shape of the relationship was used to decide whether it was reasonable to use parametric or non-parametric approaches to correlation. Where the data proved suitable, correlation analysis was used to quantify the strength of the relationship between the two variables. The aim of regression analysis, in essence, is not to quantify the strength but to define the nature or ‘shape’ of the relationship. The simplest shape is a straight line, and most applications of regression are essentially trying to define the straight line that best describes the relationship between the two variables. To understand regression it is essential that you understand how a straight line on an x-y
  • Applied Univariate, Bivariate, and Multivariate Statistics
    • Daniel J. Denis (Author)
    • 2015 (Publication Date)
    • Wiley
      (Publisher)
    The designation simple Linear Regression denotes the fact that the regression model features only a single explanatory variable. Models with two or more explanatory variables will be discussed in Chapters 9 and 10. More than simply making predictions, regression seeks to predict values on the response variable such that the average error in prediction is less than what would be the case had the explanatory variable not been used as a predictor. What this means statistically is that there must be a correlation between the response and explanatory variable for Linear Regression to be effective. Otherwise, in the absence of such a correlation, predictions would be generally no more accurate than if the explanatory variable were not used at all.
    Draper and Smith (1998) is a classic resource on regression analysis that also features topics on weighted least‐squares, ridge regression, nonlinear estimation, and robust regression. Fox (1997) is a definitive, thorough treatment of regression and related models, which includes generalized linear models. Fox also provides a rather in‐depth study of diagnostics for linear models, and also includes chapters on the geometry of such models. Cohen et al. (2002) is also a classic resource on applied regression with a focus toward the behavioral sciences. Pedhazur (1997) provides a thorough treatment targeted toward behavioral scientists. Neter et al. (1996) feature wide coverage of linear models in general. Wright and London (2009) is a useful resource for fitting regression models in R.

    8.1 BRIEF HISTORY OF REGRESSION

    Regression analysis has a very deep history. The techniques of correlation and regression, as applied to empirical observations, are generally attributed to Francis Galton (1822–1911), an English Victorian who made countless contributions to science in fields such as anthropology, geography, psychology, and statistics (Figure 8.1). For a discussion of Galton, see Fancher and Rutherford (2011)
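    One way to see the excerpt's point that regression needs correlation to be effective (a standard identity, not taken from this book): in simple Linear Regression the fitted slope equals r times the ratio of the standard deviations, so r = 0 forces a flat line at the mean of y. A quick numerical check with synthetic data:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = 3.0 + 2.0 * x + rng.normal(size=200)   # response correlated with x

    r = np.corrcoef(x, y)[0, 1]
    b = r * y.std() / x.std()                  # slope = r * (s_y / s_x)
    a = y.mean() - b * x.mean()                # the line passes through the means
    print(f"r = {r:.2f}, fitted line: y = {a:.2f} + {b:.2f} x")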
  • Quantitative Methods

    An Introduction for Business Management

    • Paolo Brandimarte (Author)
    • 2012 (Publication Date)
    • Wiley
      (Publisher)
    Granted, there are many practical cases in which this interpretation does make sense, but we should never forget that a regression model relies on association and not causation. The same caveats that we have pointed out when dealing with correlation apply to regression models. By a similar token, when referring to functions it is customary to call x the independent variable and y the dependent variable; obviously, these terms can be a bit misleading in a statistical framework. In the following we will refer to x by the terms explanatory variable or regressor; y will be called response or regressed variable. To build and use a Linear Regression model, we must accomplish the following steps: 1. We must devise a suitable way to choose the coefficients a and b. 2. We should check if the model makes sense and is reliable enough. 3. We should use the model, both for building knowledge to understand a phenomenon and for generating forecasts and scenarios for decision making under uncertainty. We accomplish the first step in Section 10.1, where we lay down the foundations of the least-squares method. Section 10.2 deals with the second step, which requires building a statistical framework for Linear Regression. This is needed to state precise assumptions behind our modeling endeavor, which should be thoroughly checked before using the model; we also need to draw statistical inferences and test hypotheses about the estimated coefficients in the model. We do so in Section 10.3, for the simpler case of a nonstochastic regressor, i.e., when the explanatory variable x is treated as a number rather than a random variable. Then, in Section 10.4, we tackle the third step. There are different uses of a Linear Regression model, and statistics in general. We might be interested in understanding a physical or social phenomenon; in such a case a model is used for knowledge discovery purposes and to ascertain the impact of explanatory variables
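    A hedged sketch of those three steps in Python, with invented data; statsmodels is one common choice of tool, not necessarily the book's:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x = rng.uniform(0.0, 10.0, size=50)
    y = 1.5 + 0.8 * x + rng.normal(scale=1.0, size=50)

    # Step 1: choose the coefficients a and b by least squares.
    model = sm.OLS(y, sm.add_constant(x)).fit()

    # Step 2: check whether the model makes sense and is reliable enough.
    print(model.rsquared)   # explanatory power
    print(model.pvalues)    # hypothesis tests on the estimated coefficients

    # Step 3: use the model, e.g., to generate forecasts.
    x_new = sm.add_constant(np.array([4.0, 7.5]), has_constant="add")
    print(model.predict(x_new))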
  • Analytic Methods in Sports

    Using Mathematics and Statistics to Understand Data from Baseball, Football, Basketball, and Other Sports

    6   Modeling Relationships Using Linear Regression  
      6.1  Introduction
    The correlation coefficient measures the extent to which data cluster around a line. In Linear Regression analysis, we determine that “best-fitting” line and use it to better understand the relationship between the variables under consideration. The results of a Linear Regression analysis include an equation that relates one variable to another.
    Simple Linear Regression, the subject of this chapter, applies when the data consist of two variables, commonly denoted by X and Y, and our goal is to model Y, called the response variable, in terms of X, called the predictor variable. Multiple regression is used when our goal is to model a response variable in terms of several predictor variables X1, …, Xp; models of this type are considered in Chapter 7.
      6.2  Modeling the Relationship between Two Variables Using Simple Linear Regression
    Consider the relationship between runs scored in a season and a team’s OPS (on-base plus slugging) value for that season for MLB (Major League Baseball) teams from the 2007–2011 seasons (Dataset 6.1). The plot in Figure 6.1 shows that these variables have a strong linear relationship, which is confirmed by the correlation coefficient of 0.96. Let Y denote runs scored and let X denote OPS. The conclusion that Y and X have a linear relationship can be expressed by saying that
    Y = a + b X
    for some constants a, b.
    FIGURE 6.1 Runs scored versus OPS for 2007–2011 MLB teams.
    However, it is evident from the plot of runs scored versus OPS that this linear relationship does not hold exactly. In statistics, this fact is often expressed by writing
    Y = a + b X + ε
    where ε represents “random error.” Therefore, this equation states that Y is equal to a linear function of X plus random error, or, simply, that Y is approximately a linear function of X. It should be noted that the term error does not mean “mistake” in this context but rather refers to the deviation from the regression line a + b X. In particular, we assume that the average value of ε
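    The model Y = a + b X + ε is easy to simulate, which makes the role of the random error concrete; the constants below are invented, not the runs-versus-OPS fit from the excerpt:

    import numpy as np

    rng = np.random.default_rng(42)
    a_true, b_true = 100.0, 850.0             # made-up constants a and b
    X = rng.uniform(0.65, 0.85, size=150)     # predictor values
    eps = rng.normal(scale=20.0, size=150)    # the random error term
    Y = a_true + b_true * X + eps             # Y = a + bX + error

    # The least-squares line roughly recovers a and b despite the error.
    b_hat, a_hat = np.polyfit(X, Y, 1)
    print(f"estimated line: Y = {a_hat:.1f} + {b_hat:.1f} X")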
  • Python: Advanced Predictive Analytics
    • Joseph Babcock, Ashish Kumar (Authors)
    • 2017 (Publication Date)
    • Packt Publishing
      (Publisher)
    Several complexities complicate this analysis in practice. First, the relationships we fit usually involve not one, but several inputs. We can no longer draw a two-dimensional line to represent this multi-variate relationship, and so must increasingly rely on more advanced computational methods to calculate this trend in a high-dimensional space. Second, the trend we are trying to calculate may not even be a straight line – it could be a curve, a wave, or even more complex patterns. We may also have more variables than we need, and need to decide which, if any, are relevant for the problem at hand. Finally, we need to determine not just the trend that best fits the data we have, but also the one that generalizes best to new data.
    In this chapter we will learn:
    • How to prepare data for a regression problem
    • How to choose between linear and nonlinear methods for a given problem
    • How to perform variable selection and assess over-fitting

    Linear Regression

    Ordinary Least Squares (OLS).
    We will start with the simplest model of Linear Regression, where we will simply try to fit the best straight line through the data points we have available. Recall that the formula for Linear Regression is:
    y = Xβ + ε
    where y is a vector of n responses we are trying to predict, X is a vector of our input variable, also of length n, and β is the slope response (how much the response y increases for each 1-unit increase in the value of X). However, we rarely have only a single input; rather, X will represent a set of input variables, and the response y is a linear combination of these inputs. In this case, known as multiple Linear Regression, X is a matrix of n rows (observations) and m columns (features), and β is a vector of slopes or coefficients which, when multiplied by the features, gives the output. In essence, it is just the trend line incorporating many inputs, but it will also allow us to compare the magnitude of the effect of different inputs on the outcome. When we are trying to fit a model using multiple Linear Regression, we also assume that the response incorporates a white noise error term ε, which is a normal distribution with mean 0 and a constant variance for all data points.
    To solve for the coefficients β in this model, we can perform the following calculation:
    β = (XᵀX)⁻¹Xᵀy
    This value of β is known as the ordinary least squares estimate of the coefficients. The result will be a vector of coefficients β for the input variables. We make the following assumptions about the data:
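    A minimal NumPy sketch of the OLS estimate just stated, solving the normal equations (XᵀX)β = Xᵀy on synthetic data rather than inverting XᵀX directly:

    import numpy as np

    rng = np.random.default_rng(7)
    n, m = 100, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])  # intercept + m features
    beta_true = np.array([2.0, 0.5, -1.2, 3.0])
    y = X @ beta_true + rng.normal(scale=0.3, size=n)           # white-noise error term

    # Ordinary least squares: solve (X'X) beta = X'y.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta_hat)   # close to beta_true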
  • A First Course in the Design of Experiments
    • John H. Skillings, Donald Weber (Authors)
    • 2018 (Publication Date)
    • CRC Press
      (Publisher)
    CHAPTER 2 LINEAR MODELS

    2.1 Definition of a Linear Model
    As noted in Chapter 1, the analysis of data obtained in designed experiments is facilitated by the use of a statistical model. Statistical models are used to describe an assumed structure for both the underlying population and the sample. In experimental designs the models we use are special cases of a very general model called the linear statistical model, or, simply, the linear model.
    Definition 2.1.1 A linear statistical model is a model that can be represented in the form
    Y = β0 x0 + β1 x1 + β2 x2 + ⋯ + βk xk + ε    (2.1.1)
    where Y is an observable random variable, x0, x1, …, xk are known mathematical (nonrandom) variables, β0, β1, …, βk are parameters and ε is an unobservable random variable.
    This model is used to represent a random variable Y as a linear combination of β0, β1, …, βk plus a random component. Since we are utilizing linear combinations, it is natural to call this model a linear model. The parameters can be thought of as unknown constants that need to be estimated using sample data. The random variable Y is called the response variable or dependent variable and x0, x1, …, xk are referred to as the independent variables.
    Linear models are widely used for applications in many disciplines. For example, in agriculture we can think of determining the yield of wheat, Y, based on various values of independent variables which could include the variety of wheat, the amount of fertilizer, and the type of soil. In social science we can think of the problem of predicting a student’s academic ability as measured by grade point average using independent variables such as IQ, social adjustment, and prior grades. Since one can seldom predict the value of a response variable perfectly based only on the independent variables, and since the response variable does not always yield the same value for identical independent variable values, one accounts for this anomaly by including the random “error” term, ε, in the model.
  • Regression Analysis

    A Practical Introduction

    • Jeremy Arkes (Author)
    • 2023 (Publication Date)
    • Routledge
      (Publisher)
    2 Regression analysis basics
    DOI: 10.4324/9781003285007-2
    1. 2.1 What is a regression?
    2. 2.2 The four main objectives for regression analysis
    3. 2.3 The Simple Regression Model
      1. 2.3.1 The components and equation of the Simple Regression Model
      2. 2.3.2 An example with education and income data
      3. 2.3.3 Calculating individual predicted values and residuals
    4. 2.4 How are regression lines determined?
      1. 2.4.1 Calculating regression equations
      2. 2.4.2 Total variation, variation explained, and remaining variation
    5. 2.5 The explanatory power of the regression
      1. 2.5.1 R-squared (R²)
      2. 2.5.2 Adjusted R-squared
      3. 2.5.3 Mean Square Error and Standard Error of the Regression
    6. 2.6 What contributes to slopes of regression lines?
    7. 2.7 Using residuals to gauge relative performance
    8. 2.8 Correlation vs. causation
    9. 2.9 The Multiple Regression Model
    10. 2.10 Assumptions of regression models
    11. 2.11 Everyone has their own effect
    12. 2.12 Causal effects can change over time
    13. 2.13 Why regression results might be wrong: inaccuracy and imprecision
    14. 2.14 The use of regression flowcharts
    15. 2.15 The underlying Linear Algebra in regression equations
    16. 2.16 Definitions and key concepts
      1. 2.16.1 Different types of data (based on the unit of observation and sample organization)
      2. 2.16.2 A model vs. a method
  • Machine Learning

    Theory and Practice

    Regression
    DOI: 10.1201/9781003002611-2
    Often we are given a dataset for supervised learning, where each example in the training set is described in terms of a number of features, and the label associated with the example is numeric. In terms of mathematics, we can think of the features as independent variables and the label as a dependent variable. For example, the independent variable can be the monthly income of a person, and the dependent variable can be the amount of money the person spends on entertainment per month. In this case, we can also say that the person is described in terms of one feature, the income; and the person's label is the amount of money spent on entertainment per month. In such a case, our training set will consist of a number of examples where each person is described only in terms of his or her monthly income. Corresponding to each person, we have a label, which corresponds to the person's entertainment expense per month. Usually, the example is written as x, and the label as y. If we have N examples in our training set, we can refer to the ith example as x^(i) and its corresponding label as y^(i). The goal in regression is to find a function f̂ of x that explains the training data the best. The ^ on top of f says that it is not the real function f that explains the relationship between y and x, but an empirical approximation of it, based on the few data points we have been given. In machine learning, this function f̂ is learned from the given dataset, so that it can be used to predict the value y, given an arbitrary value x of x, i.e., f̂(x) is the predicted value ŷ for y for a value x for x. The dataset from which the regression function is learned is called the training dataset.
    Of course, we can make the training examples more complex, if we describe each person in terms of several features, viz., the person's age, income per month, number of years of education, and the amount of money in the bank. Assume the label is still the same. In this case, each example is a vector of four features. In general, the example is a vector x (also written in boldface as x) where each component is a specific feature of the example, and the label is still called y. If we have a training set with a number of examples, the ith example can be referred to as x^(i) and the corresponding label is y^(i). The example x^(i)
    is described in terms of its features. If we have n features, the i
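    A small sketch of learning f̂ from a training set, in the spirit of the excerpt's income-and-entertainment example; the numbers and the use of scikit-learn are illustrative assumptions, not from the book:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical training set: monthly income (feature x) and monthly
    # entertainment spend (label y) for N = 5 people.
    X = np.array([[2500.0], [3200.0], [4100.0], [5000.0], [6800.0]])
    y = np.array([150.0, 210.0, 260.0, 330.0, 450.0])

    f_hat = LinearRegression().fit(X, y)   # the learned approximation f-hat

    # Predicted label y-hat for an arbitrary new value of x.
    print(f_hat.predict(np.array([[4500.0]])))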
  • Pragmatic Machine Learning with Python

    Learn How to Deploy Machine Learning Models in Production

    X.
    The main difference between a linear and a non-linear relationship is the gradient (constant vs. variable). And we can say that these relationships are always defined with respect to other variables, and so are relative in nature.

    Conversion between linear and non-linear relationships

    If we change/transform one variable, then the relationship changes drastically: a non-linear relationship becomes linear, and vice versa.
    For the previous equation, if we consider w as x², the equation becomes:
    y = 3 + 10w
    Plotting the above equation will produce a straight line, like the one below:
    Figure 3.3: Plot of a sample transformed non-linear relationship
    So, transforming one variable into another can change the nature of the relationship: a non-linear one can become linear, and vice versa.
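    The substitution is simple to verify numerically; a short sketch assuming the non-linear relationship really is y = 3 + 10x²:

    import numpy as np

    # Hypothetical non-linear relationship y = 3 + 10 * x^2.
    x = np.linspace(0.0, 5.0, 6)
    y = 3.0 + 10.0 * x**2

    # Substitute w = x^2; y is now exactly linear in w.
    w = x**2
    slope, intercept = np.polyfit(w, y, 1)
    print(intercept, slope)   # approximately 3.0 and 10.0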

    Building a Linear Regression model

    We saw the formal algebraic form of Linear Regression, which is an equation. Now, the question is, how can we build the model and find optimal values of the coefficients b0, b1, b2, …, bn? There are different techniques for doing it. We will discuss one of them in the next section.

    General approach to solving Linear Regression

    We can think of Linear Regression as an optimization problem. Our objective is to minimize the errors generated in the prediction. If ŷi is the predicted value, and yi is the actual value of the target variable, then the objective function, a.k.a. the cost function, can be written as:
    J = (1/N) Σi (ŷi − yi)²
    This error is nothing but an average of the square of the differences between the predicted and actual values of the continuous variable y
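    A minimal sketch of that cost function, assuming the usual mean squared error form implied by the text (the function name is mine):

    import numpy as np

    def mse_cost(y_pred, y_actual):
        # Average of the squared differences between predicted and actual values.
        y_pred = np.asarray(y_pred, dtype=float)
        y_actual = np.asarray(y_actual, dtype=float)
        return np.mean((y_pred - y_actual) ** 2)

    print(mse_cost([2.5, 0.0, 2.1], [3.0, -0.5, 2.0]))   # 0.17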