Psychology

Statistical Significance

Statistical significance in psychology refers to how unlikely a research finding would be if chance alone were operating. It is determined through statistical tests, whose p value gives the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true. If a finding is statistically significant, it suggests that there is a true relationship or difference in the population being studied, rather than one that arose through random sampling variation.
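
As an illustrative sketch only (not drawn from the excerpted texts), the idea of "not due to chance" can be made concrete with a permutation test: shuffle the group labels many times and count how often a difference as large as the observed one arises by chance. The data below are invented.

```python
# Minimal permutation-test sketch: how often would a difference as large
# as the observed one arise if group labels were assigned by chance?
# The data are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
control = np.array([4.1, 5.0, 4.8, 5.2, 4.6, 4.9])
treated = np.array([5.4, 5.9, 5.1, 6.0, 5.7, 5.5])
observed = treated.mean() - control.mean()

pooled = np.concatenate([control, treated])
n_perm, hits = 10_000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = pooled[len(control):].mean() - pooled[:len(control)].mean()
    if abs(diff) >= abs(observed):
        hits += 1

print(f"observed difference = {observed:.2f}, p ≈ {hits / n_perm:.4f}")
```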


8 Key excerpts on "Statistical Significance"

  • Statistics for Psychologists

    An Intermediate Course

    In some cases the results from this stage may contain such an obvious message that more detailed analysis becomes largely superfluous. Many of the methods used in this preliminary analysis of the data will be graphical, and it is some of these that are described in Chapter 2.

    1.2.2 Estimation and Significance Testing

    Although in some cases an initial examination of the data will be all that is necessary, most investigations will proceed to a more formal stage of analysis that involves the estimation of population values of interest and/or testing hypotheses about particular values for these parameters. It is at this point that the beloved significance test (in some form or other) enters the arena. Despite numerous attempts by statisticians to wean psychologists away from such tests (see, e.g., Gardner and Altman, 1986), the p value retains a powerful hold over the average psychology researcher and psychology student. There are a number of reasons why it should not. First, the p value is poorly understood. Although p values appear in almost every account of psychological research findings, there is evidence that the general degree of understanding of the true meaning of the term is very low. Oakes (1986), for example, put the following test to 70 academic psychologists: Suppose you have a treatment which you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say 20 subjects in each sample). Further suppose you use a simple independent means t-test and your result is t = 2.7, df = 18, P = 0.01
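To make the quoted result concrete: the p value in Oakes's example can be recovered from the t statistic and its degrees of freedom. A minimal sketch using scipy (not part of the excerpt); both tails are shown, since the excerpt does not state which the quoted P = 0.01 refers to.

```python
# Recovering the p value for t = 2.7 with df = 18 (Oakes's example).
from scipy import stats

t_value, df = 2.7, 18
p_one = stats.t.sf(t_value, df)   # one-tailed: P(T > 2.7) ≈ 0.007
p_two = 2 * p_one                 # two-tailed ≈ 0.015
print(f"one-tailed p = {p_one:.4f}, two-tailed p = {p_two:.4f}")
```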
  • Introductory Biostatistics
    • Chap T. Le, Lynn E. Eberly (Authors)
    • 2016 (Publication Date)
    • Wiley (Publisher)
    nonsignificance when something of great practical importance is going on. The conclusion is that the attainment of statistical significance in a study is just as affected by extraneous factors as it is by practical importance. It is essential to learn that statistical significance is not synonymous with practical importance.

    5.1 BASIC CONCEPTS

    From the introduction of sampling distributions in Chapter 4, it was clear that the value of a sample mean x̄ is influenced by:
    1. The population mean μ, because the sampling distribution of x̄ is centred on μ.
    2. Chance: x̄ and μ are almost never identical. The variance of the sampling distribution, σ²/n, is a combined effect of natural variation in the population (σ²) and sample size n.
    Therefore, when an observed value is far from a hypothesized value of μ (e.g., mean high blood pressures for a group of oral contraceptive users compared to a typical average for women in the same age group), a natural question would be: Was it just due to chance, or something else? To deal with questions such as this, statisticians have invented the concept of hypothesis tests, and these tests have become widely used statistical techniques in the health sciences. In fact, it is almost impossible to read a research article in public health or medical sciences without running across hypothesis tests!

    5.1.1 Hypothesis Tests

    When a health investigator seeks to understand or explain something, for example the effect of a toxin or a drug, he or she usually formulates his or her research question in the form of a hypothesis. In the statistical context, a hypothesis is a statement about a distribution (e.g., “the distribution is normal”) or its underlying parameter(s) (e.g., “μ = 10”), or a statement about the relationship between probability distributions (e.g., “there is no statistical relationship”) or its parameters (e.g., “μ₁ = μ₂”, equality of population means). The hypothesis to be tested is called the null hypothesis and will be denoted by H0; it is usually stated in the null form, indicating no difference or no relationship between distributions or parameters, similar to the constitutional guarantee that the accused is presumed innocent until proven guilty. In other words, under the null hypothesis, an observed difference (like the one between the sample means x̄1 and x̄2 of samples 1 and 2, respectively) just reflects chance variation. A hypothesis test is a decision-making process that examines a set or sets of data and, on the basis of expectation under H0, leads to a decision as to whether or not to reject H0. An alternative hypothesis, which we denote by HA, is a hypothesis that in some sense contradicts the null hypothesis H0, just as the charge by the prosecution does in a trial by jury. Under HA, the observed difference is real (e.g., not by chance but because
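A sketch of the decision process the authors describe, for H0: μ1 = μ2 against HA: μ1 ≠ μ2. The blood-pressure-style readings and the α = 0.05 threshold are illustrative assumptions, not values from the excerpt.

```python
# Hypothesis test sketch for H0: mu1 = mu2 vs. HA: mu1 != mu2,
# using invented systolic blood pressure readings (mmHg).
import numpy as np
from scipy import stats

group1 = np.array([128, 131, 125, 136, 130, 127, 133])  # e.g., exposed group
group2 = np.array([122, 119, 126, 121, 124, 118, 123])  # comparison group

alpha = 0.05                                  # conventional significance level
t_stat, p_value = stats.ttest_ind(group1, group2)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the difference is unlikely to be chance variation alone.")
else:
    print("Do not reject H0: the data are consistent with chance variation.")
```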
  • Introduction to Statistics for Nurses
    • John Maltby, Liz Day, Glenn Williams (Authors)
    • 2014 (Publication Date)
    • Routledge (Publisher)
    We have divided these considerations into two main areas, but they are, like many other statistical procedures, related. These two areas are: (1) statistical and clinical significance and (2) hypothesis testing and confidence intervals. Therefore at the end of this chapter you should be able to outline ideas that underlie statistical and clinical significance and how these relate to effect size and percentage improvement in a research participant’s condition. You will also be able to outline the ideas that form hypothesis testing and confidence intervals, and how these two concepts are used in the literature to provide context to statistical findings.

    Statistical versus clinical significance

    Within the statistical literature there is a distinction between statistical significance and clinical significance. Throughout this book we have concentrated on reporting statistical significance, because such findings relate primarily to the use of statistical tests. However, when we report the findings from statistical tests, a number of questions can arise about the practical importance of these findings. These questions are best summarised by one question: are the findings clinically (or practically) significant?
    Let us frame this distinction with the following examples. Researchers might have found that a drug treatment has had a statistically significant effect on a particular illness. To do this, doctors and researchers would have administered the drug to different groups and looked at changes in the symptoms of the illness of individuals in all groups. These groups usually include:
    • Experimental groups – groups that receive an intervention (for example, a drug, a counselling session).
    • Control groups
  • What If There Were No Significance Tests?
    • Lisa L. Harlow, Stanley A. Mulaik, James H. Steiger (Authors)
    • 2013 (Publication Date)
    • Psychology Press (Publisher)
    Lively debate on a controversial issue is often regarded as a healthy sign in science. Anomalous or conflicting findings generated from alternative theoretical viewpoints often precede major theoretical advances in the more developed sciences (Kuhn, 1970), but this does not seem to be the case in the social and behavioral sciences. As Meehl (1978) pointed out nearly 20 years ago, theories in the behavioral sciences do not emerge healthier and stronger after a period of challenge and debate. Instead, our theories often fade away as we grow tired, confused, and frustrated by the lack of consistent research evidence. The reasons are many, including relatively crude measurement procedures and the lack of strong theories underlying our research endeavors (Platt, 1964; Rossi, 1985, 1990). But not least among our problems is that the accumulation of knowledge in the behavioral sciences often relies upon judgments and assessments of evidence that are rooted in statistical significance testing.
    At the outset I should point out that I do not intend here to enumerate yet again the many problems associated with the significance testing paradigm. Many competent critiques have appeared in recent years (Cohen, 1994; Folger, 1989; Goodman & Royall, 1988; Oakes, 1986; Rossi, 1990; Schmidt, 1996; Schmidt & Hunter, 1995); in fact, such criticisms are almost as old as the paradigm itself (Berkson, 1938, 1942; Bolles, 1962; Cohen, 1962; Grant, 1962; Jones, 1955; Kish, 1959; McNemar, 1960; Rozeboom, 1960). However, one consequence of significance testing is of special concern here. This is the practice of dichotomous interpretation of p values as the basis for deciding on the existence of an effect. That is, if p < .05, the effect exists. If p > .05, the effect does not exist. Unfortunately, this is a common decision-making pattern in the social and behavioral sciences (Beauchamp & May, 1964; Cooper & Rosenthal, 1980; Cowles & Davis, 1982; Rosenthal & Gaito, 1963, 1964).
    The consequences of this approach are bad enough for individual research studies: All too frequently, publication decisions are contingent on which side of the .05 line the test statistic lands. But the consequences for the accumulation of evidence across studies are even worse. As Meehl (1978) has indicated, most reviewers simply tend to “count noses” in assessing the evidence for an effect across studies. Traditional vote-counting methods generally underestimate the support for an effect and have been shown to have low statistical power (Cooper & Rosenthal, 1980; Hedges & Olkin, 1980). At the same time, those studies that find a statistically significant effect (and that are therefore more likely to be published) are in fact very likely to overestimate the actual strength of the effect (Lane & Dunlap, 1978; Schmidt, 1996). Combined with the generally poor power characteristics of many primary studies (Cohen, 1962; Rossi, 1990), the prospects for a meaningful cumulative science seem dismal.
  • The Significance Test Controversy
    • Ramon E. Henkel (Author)
    • 2017 (Publication Date)
    • Routledge (Publisher)
    Indeed, we usually know in advance of testing that the typical null hypothesis is false—which is all the significance test is able to tell us. It is, thus, not the confusion of statistical and substantive significance per se which causes the difficulty at the present stage of social science so much as the fact that the confusion occurs in the context of undeveloped theory. Because we have few substantively significant hypotheses (i.e. deduced, precise, and condition-specified), findings or statistics come to be referred to as “significant,” which is particularly misleading. The only reasonable interpretation that can be made of such designations is that the statistic “signifies” a basis for rejecting the null hypothesis, given the probability level adopted. Also, the interpretation of a “significant” finding as a “precise” or “reliable” estimate of a parameter can be quite misleading. A correlation coefficient of .34 that is “significant” at the .05 level under the typical null hypothesis does not necessarily mean that the statistic itself is a reliable or precise estimate of the parameter value (it is your best estimate, and may be a good estimate if your sample is of the right design and large enough). A subsequent correlation of .12 or correlations that average .15 on repeated samples may also fully vindicate the decision to reject the null hypothesis
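The dependence of such a verdict on sample size can be seen directly: under H0: ρ = 0, a correlation r is tested with t = r√(n − 2)/√(1 − r²) on n − 2 degrees of freedom. The sample sizes below are illustrative; the excerpt does not state n.

```python
# Significance of a correlation coefficient under H0: rho = 0.
# Test statistic: t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 df.
# Sample sizes are illustrative; the excerpt does not give n.
import math
from scipy import stats

r = 0.34
for n in (20, 35, 100):
    t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    p = 2 * stats.t.sf(abs(t_stat), n - 2)
    print(f"n = {n:3d}: t = {t_stat:.2f}, two-tailed p = {p:.3f}")
```

With r = 0.34 the test fails the .05 criterion at n = 20 but passes it comfortably at n = 100, illustrating how the same coefficient can be “significant” or not depending only on n.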
  • First steps in research 3
    One only concludes “there is a significant difference” or “there is a significant correlation” at some level of confidence, if that is what the test suggests. It does not indicate whether the finding is of any practical significance. One of the reasons why this is an issue is the influence the size of the sample has on statistical significance: in small samples, relatively big differences may come out as statistically non-significant, while in very big samples, even tiny differences may turn out to be statistically significant. This problem has been overcome by calculating an effect size, in addition to the p-value: a standardised, scale-free measure of the magnitude of the difference or correlation being tested that is not affected by the size of the sample. It is not necessary to calculate an effect size in all situations. An example would be a study testing whether a certain type of medication reduces high blood pressure. It is found that an average reduction of 10 mm Hg is statistically significant at the 5% level, and in the field of medicine a reduction of this magnitude is seen as practically significant. Two situations in which the calculation of effect sizes would be of great value are the following:
    • It frequently happens that no familiar scale is available to measure the variables in which the researcher is interested. The researcher then has to use some unfamiliar or arbitrary scale, for example a few 7-point Likert scale items asked to measure some construct. In such a case, a difference between the means of two groups of 0.9 would have no meaning in terms of practical significance.
    • When a census is undertaken (i.e. every element in the population is part of the study), an effect size is the only way by which practical significance can be judged
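One common standardised effect size of the kind described here is Cohen's d, the difference between group means expressed in pooled-standard-deviation units. A minimal sketch with invented data; the excerpted text does not prescribe this particular measure.

```python
# Cohen's d: a standardised, scale-free measure of a mean difference.
# Conventional benchmarks: ~0.2 small, ~0.5 medium, ~0.8 large.
import numpy as np

def cohens_d(a, b):
    """Standardised mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

# Invented 7-point Likert scores for two groups
group_a = np.array([5.1, 4.8, 5.6, 5.9, 5.2, 5.0])
group_b = np.array([4.2, 4.5, 4.1, 4.8, 4.3, 4.4])
print(f"d = {cohens_d(group_a, group_b):.2f}")
```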
  • Research Methods and Statistics in Psychology
    • Hugh Coolican (Author)
    • 2018 (Publication Date)
    • Routledge (Publisher)
    However, Harcum (1990) and many others argue that the 5% significance level is conservative, and several conventions tend to favour the retention of the null hypothesis in order not to make Type I errors. This means that many actual effects are passed over, and some researchers are beginning to call for changes to the conventions of significance testing. One major step that has been taken, as mentioned above, was the British Psychological Society requesting, in research articles, a report of the actual value of p found rather than only stating that it was ‘less than .05’. Another more important one is to require researchers to report effect sizes.

    Significance ‘levels’

    Whatever the statisticians say, though, psychological researchers tend to use the following terms in reporting results that are significant at different levels:
    • p ≤ .05 (significant at 5%): ‘The difference was significant’.
    • p ≤ .01 (significant at 1%): ‘The difference was highly significant’.

    p ≤ .1 (the 10% level). A researcher cannot be confident of results, or publish them as an effect, if the level achieved is only p ≤ .1. But if the probability is in fact close to .05 (like the sex-guesser’s results if she gets eight predictions correct and p = .055), it may well be decided that the research is worth pursuing, and it may even be reported, along with other findings, as a result ‘approaching significance’. The procedure can be revisited, tightened or altered, the design may be slightly changed and sampling might be scrutinised, with perhaps an increase in sample size.

    p ≤ .01 (the 1% level). We said above that lowering α to .01 has the unfortunate effect of increasing the likelihood of Type II errors. However, sometimes it is necessary to be more certain of our results than usual. If we are about to challenge a well-established theory or research finding by publishing results that contradict it, the convention is to achieve p ≤ .01 before publication
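The sex-guesser figure quoted above can be checked with a binomial calculation: assuming ten guesses with a 50:50 chance each (the ten-trial setup is an assumption; the excerpt does not restate it), the probability of at least eight correct is about .055.

```python
# P(at least 8 correct out of 10 fair guesses) -- matches the quoted p = .055.
# The 10-guess, 50:50 setup is assumed, not restated in the excerpt.
from scipy import stats

p = stats.binom.sf(7, 10, 0.5)   # survival function: P(X > 7) = P(X >= 8)
print(f"P(X >= 8) = {p:.3f}")     # 0.055
```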
  • Quantitative Data Analysis with SPSS 12 and 13
    • Alan Bryman, Duncan Cramer (Authors)
    • 2004 (Publication Date)
    • Routledge (Publisher)
    We may reduce the probability of making this kind of error by lowering the significance level from 0.05 to 0.01, but this increases the probability of committing a Type II error, which is accepting that there is no difference when there is one (see Table 6.4). A Type II error is accepting the null hypothesis when it is false. Setting the significance level at 0.01 means treating the finding that four out of the five women were more perceptive than the men as due to chance, when it may be indicating a real difference.
    TABLE 6.4 Type I and Type II errors
    The probability of correctly assuming that there is a difference when there actually is one is known as the power of a test. A powerful test is one that is more likely to indicate a significant difference when such a difference exists. Statistical power is inversely related to the probability of making a Type II error and is calculated by subtracting beta from one (i.e. 1 - β).
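Power can also be estimated by simulation: generate many experiments in which a real difference exists and count how often the test detects it at the chosen α. The effect size, group size, and α below are illustrative assumptions, not values from the excerpt.

```python
# Estimating power (1 - beta) by simulation: the proportion of simulated
# experiments in which a true 0.5 SD difference is detected at alpha = .05.
# Effect size, n per group, and alpha are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, effect, alpha, trials = 30, 0.5, 0.05, 5_000

hits = 0
for _ in range(trials):
    a = rng.normal(0.0, 1.0, n)        # control group
    b = rng.normal(effect, 1.0, n)     # treated group, true shift of 0.5 SD
    if stats.ttest_ind(a, b).pvalue < alpha:
        hits += 1

print(f"estimated power ≈ {hits / trials:.2f}")   # roughly 0.5 at these settings
```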
    Finally, it is important to realise that the level of significance has nothing to do with the size or importance of a difference. It is simply concerned with the probability of that difference arising by chance. In other words, a difference between two samples or two treatments which is significant at the 0.05 level is not necessarily bigger than one which is significant at the 0.0001 level. The latter difference is only less probable than the former one.
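That point is easy to verify with synthetic data: a small difference measured on a large sample can reach a far smaller p value than a large difference measured on a small one. The shifts and sample sizes below are invented for illustration.

```python
# A smaller difference can carry a smaller p value purely because the
# sample is larger. Synthetic normal data; shifts and sizes are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
cases = {
    "small diff (0.1 SD), n = 2000 per group":
        (rng.normal(0.0, 1.0, 2000), rng.normal(0.1, 1.0, 2000)),
    "large diff (1.0 SD), n = 15 per group":
        (rng.normal(0.0, 1.0, 15), rng.normal(1.0, 1.0, 15)),
}
for label, (a, b) in cases.items():
    print(f"{label}: p = {stats.ttest_ind(a, b).pvalue:.4f}")
```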
    Inferring from samples to populations
    The section so far has raised the prospect of being able to generalise from a sample to a population. We can never know for sure whether a characteristic we find in a sample applies to the population from which the sample was randomly selected. As the discussion so far suggests, what we can do is to estimate the degree of confidence we can have in the characteristic we find. If we find, as we did in Chapter 5 , that the mean income in the Job Survey is £15,638.24, how confident can we be that this is the mean income for the population of workers in the firm as a whole?
    A crucial consideration in determining the degree of confidence that we can have in a mean based on a sample of the population is the standard error of the mean, which is the standard deviation of the sample means. This notion is based on the following considerations. The sample that we select is only one of an incredibly large number of random samples that could have been selected. Some of these samples would find a mean that is the same as the population mean, some will be very close to it (either above or below) and some will be further away from it (again, either above or below the population mean). If we have a population that is normally distributed, the distribution of all possible sample means will also be normally distributed. This suggests that most sample means will be the same as or close to the population mean, but that some will deviate from it by quite a large amount. The standard error of the mean expresses the degree of dispersion of these means. We know from the discussion in Chapter 5
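The relationship described here, standard error = σ/√n, can be checked by simulating many random samples from a population. The population standard deviation below is an invented figure; only the mean echoes the Job Survey value quoted above.

```python
# Checking that the spread of many sample means (the standard error)
# matches sigma / sqrt(n). The population SD of 4000 is invented; the
# mean echoes the Job Survey figure quoted in the text.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n = 15638.24, 4000.0, 100

sample_means = rng.normal(mu, sigma, size=(20_000, n)).mean(axis=1)
print(f"empirical SE   = {sample_means.std(ddof=1):.1f}")
print(f"theoretical SE = {sigma / np.sqrt(n):.1f}")   # 4000 / 10 = 400
```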