Psychology

Standardization and Norms

Standardization in psychology refers to the process of establishing uniform procedures for administering and scoring tests. Norms, on the other hand, are the established standards of performance based on the results of a standardized test. These norms provide a frame of reference for comparing an individual's performance to that of a larger group.

Written by Perlego with AI-assistance

9 Key excerpts on "Standardization and Norms"

  • The New Psychometrics
    Science, Psychology and Measurement

    • Paul Kline(Author)
    • 2014(Publication Date)
    • Routledge
      (Publisher)
    Before I do this, however, I shall discuss the standardisation of psychometric tests. This is not an intrinsic characteristic of tests but, as shall be seen, standardisation is necessary to make sense of them. It is discussed here because it is highly relevant to any comparison with the quantification of the natural sciences.

    Norms and the standardisation of psychometric tests

    The difficulty with psychometric tests is that scores, per se, have no meaning. Thus if I score 23 on the N scale of the EPQ, the significance of the score, whether it is high, low or average, is impossible to determine. For this reason, for every psychometric test norms have to be established; that is, the test has to be standardised. Norms are defined as the scores of a particular group which may then be used for comparison. It should be obvious that norms, if they are to be of any value, should be based on large and representative samples. Depending on the purposes of a test, norms may be set up for the general population or for special groups. For example, if a test is designed for clinical psychology, it makes sense to have norms for the relevant clinical or psychiatric groups, perhaps obsessionals or anxiety neurotics. If the test is an ability test for children, norms should be established for 6-month age-groups, or there will be serious inaccuracy.
    How standardisation is to be carried out, the numbers required for reliable norms, the methods to ensure a representative sample, the form the norms should take are not relevant to the arguments of this chapter. Details may be found in Kline (1995) or Nunnally and Bernstein (1994). Some points should be noted:
    • Normalisation is essentially a scaling procedure: raw scores on the test are converted to the normalised scores.
    • The scaling depends on the scores of the normative groups.
    • Intelligence tests are usually given norms which are normally distributed with a mean of 100 and a standard deviation of 15. From the area under the normal curve it can be seen that approximately 68 per cent of scores fall between 85 and 115 and 95 per cent between 70 and 130.
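    The areas under the normal curve for a scale with a mean of 100 and a standard deviation of 15 can be checked numerically. The following is a minimal sketch using Python's standard library; the helper name proportion_between is chosen here for illustration and does not come from the excerpt.

```python
from statistics import NormalDist

# IQ-style norm scale as described in the text: mean 100, SD 15.
iq = NormalDist(mu=100, sigma=15)

def proportion_between(low, high, dist=iq):
    """Share of the distribution falling between two scores."""
    return dist.cdf(high) - dist.cdf(low)

within_1_sd = proportion_between(85, 115)  # mean plus or minus 1 SD
within_2_sd = proportion_between(70, 130)  # mean plus or minus 2 SD

print(f"85-115: {within_1_sd:.1%}")  # 85-115: 68.3%
print(f"70-130: {within_2_sd:.1%}")  # 70-130: 95.4%
```

    One standard deviation either side of the mean covers roughly 68 per cent of scores, and two standard deviations either side covers roughly 95 per cent.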
  • An Assessment Guide To Geriatric Neuropsychology
    • Holly Tuokko, Thomas Hadjistavropoulos(Authors)
    • 2014(Publication Date)
    • Psychology Press
      (Publisher)
    chap. 1 ). Direct observation of the person’s performance during test administration provides an additional rich source of information about the individual’s approach to tasks, tolerance levels, personal style, and coping skills, as well as providing an opportunity for the examiner to note speech and language characteristics and abnormalities in movement that may be clinically significant. Standardized tests are administered to gather objective, readily replicable data that permit reliable interpretation and meaningful comparisons (Lezak, 1995). Both forms of information (i.e., observations and test scores) are essential to the neuropsychological assessment. In isolation, each form of information is subject to misinterpretation. Test scores may be objective but must be considered within the specific context of the individual. Observations lack objective comparability. Notably in geriatric assessment, behaviors associated with normal aging may be readily identified as deficits by someone used to working with a younger population or unfamiliar with the behavioral, cognitive, social, and/or physical correlates of aging.
    Psychological test scores are most commonly interpreted in relation to the performance of a standardization sample. A standardization sample is a representative group of individuals who are administered the measure in a standardized fashion. Standardization refers to uniformity in administering and scoring the test and is discussed later in the chapter. To accurately determine the individual’s performance in relation to the standardization sample, the performance of the standardization sample can be converted to a set of derived scores characterizing the distribution of scores for the sample (standard scores). This allows individual scores to be examined in relation to other persons and to performances on other tests (Anastasi, 1988). Thus, an individual’s performance is evaluated in relation to norms, or the performance of the standardization sample.
    Standard scores come in different forms (e.g., z scores, T-scores), but are all based on the mean and standard deviation of the scores in the standardization sample (Lezak, 1995). Comparability of the scores assumes that the underlying distributions of scores have essentially the same form (i.e., typically a normal distribution of scores around the mean or normal curve). The term normalized standard scores is used to identify standard scores that have been statistically transformed to fit a normal curve. Scores may also be presented as stanines, percentile equivalents, or merely as means and standard deviations. Wechsler Intelligence Quotients (IQs) are standard scores expressed with a mean of 100 and standard deviation of 15, whereas Wechsler subtest scores have a mean of 10 and standard deviation of 3. Figure 2.1
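    The standard score forms named above are all linear rescalings of the same z score, which can be sketched as follows. The normative mean and standard deviation used in the example (28 and 6) are hypothetical, the T-score scale (mean 50, standard deviation 10) is the conventional one rather than something stated in the excerpt, and the percentile line assumes the scores are normally distributed.

```python
from statistics import NormalDist

def standard_scores(raw, norm_mean, norm_sd):
    """Express a raw score relative to a standardization sample."""
    z = (raw - norm_mean) / norm_sd
    return {
        "z": z,                          # mean 0, SD 1
        "T": 50 + 10 * z,                # conventional T-score scale
        "iq_style": 100 + 15 * z,        # Wechsler IQ scale from the text
        "subtest_style": 10 + 3 * z,     # Wechsler subtest scale from the text
        # Percentile equivalent, valid only if scores are roughly normal.
        "percentile": 100 * NormalDist().cdf(z),
    }

# Hypothetical norms: sample mean 28, sample SD 6; examinee raw score 34.
scores = standard_scores(34, norm_mean=28, norm_sd=6)
for name, value in scores.items():
    print(f"{name}: {value:.1f}")
```

    A raw score one standard deviation above the sample mean comes out as z = 1, T = 60, 115 on the IQ-style scale, and 13 on the subtest-style scale, which is why scores on differently scaled tests can be compared once they are standardized.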
  • A Handbook of Test Construction (Psychology Revivals)
    • Paul Kline(Author)
    • 2015(Publication Date)
    • Routledge
      (Publisher)
    8 Standardizing the test
    In chapter 1 it was made clear that one of the advantages possessed by psychological tests in comparison with other forms of measurement is that tests are standardized. Hence it is possible to compare a subject’s score with that of the general population or other relevant groups, thus enabling the tester to make meaningful interpretations of the score.
    From this it follows that the standardization of tests is most important where scores of subjects are compared explicitly or implicitly – as in vocational guidance or educational selection. Norms may also be useful for mass-screening purposes. For the use of psychological tests in the scientific study of human attributes – the psychometrics of individual differences – norms are not as useful. For this the direct, raw test-scores are satisfactory. Thus norms meet the demand, in general, of the practical test user in applied psychology. Since norms are usually necessary for tests of ability, our discussion of how a test should be standardized will relate in the main to such tests.
    Sampling
    This is the crucial aspect of standardization: all depends upon the sample. In sampling there are two important variables: size and representativeness. The sample must accurately reflect the target population at which the test is aimed (of course, there may be several populations and consequently several samples), and it must be sufficiently large to reduce the standard errors of the normative data to negligible proportions.
    Size
    For the simple reduction of statistical error a sample size of 500 is certainly adequate. However, the representativeness of a sample is not independent of size. A general population norm, for example, of school-children would require in the region of 10,000 subjects. A sample from a limited population such as lion-tamers or fire-eaters would not have to be so large (indeed, the population would hardly be that large). Thus no statement about sample size can be made without relating it to the population from which it is derived. This discussion clarifies the point that more important than size is the representativeness of the sample. A small but representative normative sample is far superior to a large but biased sample. Some examples taken from actual tests will make this point obvious and will also indicate the best methods for test constructors of obtaining standardization samples.
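    The link between sample size and statistical error can be illustrated with the standard error of the mean, SD divided by the square root of n. This is a rough sketch, not the author's own calculation: the standard deviation of 15 is an assumed value, and the formula covers random sampling error only, so it says nothing about whether the sample is representative.

```python
import math

def standard_error_of_mean(sd, n):
    """Standard error of a sample mean under simple random sampling."""
    return sd / math.sqrt(n)

# Assumed score SD of 15; sample sizes chosen to bracket those in the text.
for n in (50, 500, 10_000):
    print(n, round(standard_error_of_mean(15, n), 2))
# 50 2.12
# 500 0.67
# 10000 0.15
```

    With 500 subjects the normative mean is already pinned down to within a fraction of a score point, which is consistent with the claim that larger samples are needed for coverage of the population rather than for error reduction alone.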
  • Handbook of Psychological Testing
    • Paul Kline(Author)
    • 2013(Publication Date)
    • Routledge
      (Publisher)
    If norms are to be used for the interpretation of the meaning of test scores it is obvious that they must be accurate. If, according to the norms, a score of y is obtained by the top 20 per cent of the general population, this must really be so and not a quirk of the normative sample. If the norms are inaccurate they can be completely misleading. This is particularly dangerous in clinical and psychiatric work where tests may be used for diagnosis. Thus if a child were to be consigned to a particular form of education on the basis of test scores, it is essential, in the interests of justice, that the norms are accurate. For this reason, in this chapter we set out the demands of good standardisation.
    Before describing the techniques of standardisation and the varieties of norms which can be used, there is a more general point which needs to be made. The fact that psychological tests can be so easily standardised, thus allowing accurate comparisons with normative groups, makes them particularly effective forms of assessment, with great advantages over other methods which cannot be so standardised, such as interviews or repertory grids (Kelly, 1955) which deliberately exclude comparison and are fully described in Chapter 18 of this handbook.
    Of course norms are necessary for psychological tests because, as was discussed in the previous chapter, for most types of psychological test there is no true zero, i.e. they are not ratio scales. Most psychometrists do not believe that this renders psychological tests invalid or unscientific. All it means is that norms are essential for understanding the meaning of the measurements.
    Nevertheless recent work in the theory of measurement, especially that of Michell (1990, 1997), has challenged the scientific validity of most psychological scales. He, indeed, argues that it is essentially delusional to regard psychological tests as scientific measures. They should be regarded as nothing more than pragmatic devices which are useful in various applied settings. Such problems, however, can be answered by certain forms of scaling, such as Rasch scales, which, as examples of conjoint measurement, do, it is claimed, retain the essential properties of scientific measurement and, therefore, strictly require no norms.
  • Handbook of Psychological Assessment
    • Gary Groth-Marnat, A. Jordan Wright(Authors)
    • 2016(Publication Date)
    • Wiley
      (Publisher)
    Three major questions that relate to the adequacy of norms must be answered. The first is whether the standardization group includes representation from the population on which the examiner would like to use the test. The test manual should include sufficient information to determine the representativeness of the standardization sample. If this information is insufficient or in any way incomplete, it greatly reduces the degree of confidence with which clinicians can use the test. The ideal and current practice is to use stratified random sampling. However, because this can be an extremely costly and time-consuming procedure, many tests do not meet this standard. The second question is whether the standardization group is large enough. If the group is too small, the results may not give stable estimates because of too much random fluctuation. Finally, a test may have specialized subgroup norms as well as broad national norms. Knowledge relating to subgroup norms gives examiners greater flexibility and confidence if they are using the test with similar subgroup populations (see Dana, 2005). This is particularly important when subgroups produce sets of scores that are significantly different from the normal standardization group. These subgroups can be based on factors such as ethnicity, sex, geographic location, age, level of education, socioeconomic status, urban versus rural environment, or even diagnostic history. Knowledge of each of these subgroup norms allows for a more appropriate and meaningful interpretation of scores.
    Standardization can also refer to administration procedures. A well-constructed test should have clear instructions that permit examiners to give the test in a manner similar to that of other examiners and also similar to themselves from one testing session to the next. Research has demonstrated that varying the instructions between one administration and the next can alter the types and quality of responses the examinee gives, thereby compromising the test's reliability. Standardization of administration should refer not only to consistent administration procedures but also to ensuring adequate lighting, quiet, no interruptions, and good rapport.

    Reliability

    The reliability of a test refers to its degree of stability, consistency, and predictability. It addresses the extent to which scores obtained by a person are or would be the same if the person is reexamined by the same test on different occasions. Underlying the concept of reliability is the possible range of error, or error of measurement, of a single score. This is an estimate of the range of possible random fluctuation that can be expected in an individual's score. Because psychological constructs cannot be measured directly (e.g., through measuring a level in blood), test scores are at best an approximation of these constructs, and thus error is always present in the system. It may arise from such factors as a misreading of the items, poor administration procedures, or the changing mood of the client. If there is a large degree of error, the examiner cannot place a great deal of confidence in an individual's scores. The goal of a test constructor is to reduce, as much as possible, the degree of measurement error. If this error reduction is achieved, the difference between one score and another for a measured characteristic is more likely to result from some true difference than from some chance fluctuation.
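    The "possible range of error" around a single score is commonly quantified with the standard error of measurement from classical test theory, SEM = SD × sqrt(1 − reliability). The excerpt does not give this formula, so the sketch below is an illustration under that assumption; the reliability of 0.91 and the observed score of 110 are hypothetical values.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def score_band(observed, sd, reliability, z=1.96):
    """Approximate 95% band of random fluctuation around an observed score."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem

# Hypothetical test: SD 15, reliability 0.91, observed score 110.
sem = standard_error_of_measurement(15, 0.91)
low, high = score_band(110, 15, 0.91)
print(f"SEM = {sem:.2f}, band = {low:.1f} to {high:.1f}")
```

    Here the SEM is about 4.5 points, giving a band of roughly 101 to 119. A less reliable test widens the band, which is why the examiner can place less confidence in any single score.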
  • Psychology in the Schools
    Addressing the Learning, Behavior, and Mental Health Needs of Students

    • Elena Diamond, Shelley R. Hart, Amy Jane Griffiths, Stephen E. Brock(Authors)
    • 2023(Publication Date)
    • Routledge
      (Publisher)
    Reynolds & Livingston, 2014). In this way, we attempt to ensure that no student has an advantage or disadvantage over another as testing conditions are as similar as possible; in theory, this means that the variable of interest (student ability or knowledge) is the only one that varies between students. Standardization regarding these types of tests may occur particularly with commercially produced, curriculum-focused measures.
    In general, professionals most commonly think of standardized testing in relation to norm-referenced tests. Norm-referenced tests allow the assessor to compare an individual's performance to the performance of other relevant test-takers (Reynolds & Livingston, 2014). Relevance of the sample is a crucial aspect of the development of these types of tests. Most commercially developed tests aim to acquire a nationally representative sample, meaning that subgroups (e.g., race, ethnicity, gender, region) are represented in the norming sample in proportion to their share of the population. This norming sample (also referred to as a standardization or reference sample/group) is administered the test in carefully controlled conditions, with results used to develop the normative tables (i.e., norms) and standard scores of a test.
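    One way to obtain a norming sample in which subgroups appear in proportion to their share of the population is proportional stratified sampling. The sketch below is an illustration only, not a description of how any particular test publisher samples; the population, the region labels, and the function name are all hypothetical.

```python
import random
from collections import Counter

def proportional_stratified_sample(population, stratum_of, n, seed=0):
    """Draw about n people so each stratum keeps its population share."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = {}
    for person in population:
        strata.setdefault(stratum_of(person), []).append(person)
    sample = []
    for members in strata.values():
        # Rounding means the total can differ slightly from n.
        k = round(n * len(members) / len(population))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 50% north, 30% south, 20% west.
regions = ["north"] * 500 + ["south"] * 300 + ["west"] * 200
population = [{"id": i, "region": r} for i, r in enumerate(regions)]

norming_sample = proportional_stratified_sample(
    population, lambda p: p["region"], n=100
)
counts = Counter(p["region"] for p in norming_sample)
print(counts)
```

    The sample of 100 contains 50, 30, and 20 people from the three regions, mirroring the population proportions. Real norming studies stratify on several variables at once, which this single-variable sketch does not attempt.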
    TIP
    Get your “blurb” describing standardization and norm-referenced testing down early in your assessment career—it is helpful to feel very comfortable explaining these concepts to others in IEP meetings, in conversations with families and teachers, or within your reports, as most assessments will contain at least one standardized, norm-referenced measure.
    Again, assessment is an iterative process. As we gather information, we may need to circle back to re-review records, interview with clarifying questions or conduct additional observations. Leung (1993)
  • The Human Resources Program-Evaluation Handbook
    In the context of program evaluation for human resources (HR), the establishment of standards plays a crucial role in determining how a program will be judged and ultimately what recommendations are made. Standard setting lays the foundation for determining whether a program provides a sufficient return on investment or meets key stakeholder objectives. If not, it needs to be modified to better meet these requirements. Standards also direct the methodology of the evaluation process and will therefore significantly impact the outcome of any program evaluation.
    The establishment of standards is inextricably linked to the development of criteria, which was discussed in Chapter 3. Once the evaluation criteria have been developed, it is necessary to establish the performance levels on these criteria against which the HR program will be evaluated. Depending upon the particular HR program being evaluated, these standards can be developed using a variety of techniques; in some cases (e.g., selection), the development of standards is governed by a set of professional and legal guidelines. One such set of professional guidelines is the Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychological Association [APA], & the National Council on Measurement in Education [NCME], 1999) which is intended to “promote the sound and ethical use of tests and to provide a basis for evaluating the quality of testing practices” (p. 1).
    A critical piece of information in any assessment procedure is the cut score, a minimum value that examinees or applicants must attain to be chosen for further consideration in a selection process. If this cut score or decision rule is used as intended, organizational decision makers will treat individuals in a fair and standardized manner and thereby avoid arbitrariness, preferential treatment, or capriciousness. This treatment of standard setting and its evaluation will be presented from a criterion-referenced examination perspective since standards should be absolute and not relative (norm-referenced). In the realm of licensing examinations, the most prevalent standard-setting procedures are those based on the Angoff method (Plake, 1998; Sireci & Biskin, 1992). Shepard (1980) captured the essence of criterion-referenced testing with the following statement:
  • The Psychometrics of Standard Setting
    Connecting Policy and Test Scores

    These definitions are useful, but more elaborated and nuanced definitions are needed for the purposes of this book. For example, definition 1 begins with “something considered by an authority.” This phrase indicates that in many cases standards do not spring into existence on their own; they are called for by “an authority.” For the applications considered here, the authority might be a certification or a licensure agency, or it might be the state board of education for a state, or a governmental agency such as the National Assessment Governing Board. In this book, the organization that calls for a standard will generically be labeled as the “agency.” Agencies differ in their characteristics and the way they initiate and interact with standard setting processes. The role of the agency is an important consideration in a theory of standard setting.
    The second part of definition 1 is that standards form the basis for comparisons, or an approved model that can be used for comparison. In most cases where educational or psychological tests are used, the basis for comparison is a score on a score scale, but it may be thought of more generally as a person with the desired characteristics. In standard setting studies, there is often reference to the “minimally qualified person” who is the “approved model” to which other individuals are compared. The same minimally qualified person can be thought of as the “authorized exemplar” in definition 7 and the “basis for judgment” in definition 3.
    Of those listed, definition 2 is related to what is called “norm-referenced” in the testing industry. That is, the performance of an individual is compared to the distribution of performance for a meaningful group of individuals. The comparison may be to the mean, median, or mode of the distribution, or some other percentage point, such as the 90th percentile. This may seem like an objective way to define a standard, but comparison to a distribution of performance still requires that someone (the agency?) decide the boundaries of who are included under the label “its kind” and the point in the distribution used as the standard.
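    A norm-referenced standard of this kind can be sketched in a few lines. The two choices the paragraph flags as judgments, namely who belongs in the reference group and which point in the distribution counts as the standard, appear here as explicit inputs. The reference scores are hypothetical, and percentile rank is defined as the percentage of the group scoring strictly below a score, which is one of several conventions in use.

```python
def percentile_rank(score, reference_scores):
    """Percentage of the reference group scoring strictly below `score`."""
    below = sum(1 for s in reference_scores if s < score)
    return 100 * below / len(reference_scores)

def meets_standard(score, reference_scores, required_percentile=90):
    """Norm-referenced decision: is the score at or above the chosen point?"""
    return percentile_rank(score, reference_scores) >= required_percentile

# Hypothetical reference group with scores 1 through 100.
reference_group = list(range(1, 101))

print(percentile_rank(91, reference_group))  # 90.0
print(meets_standard(91, reference_group))   # True
print(meets_standard(90, reference_group))   # False
```

    Changing either the reference group or required_percentile changes who meets the standard, even though no individual's score has changed.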
    In certain cases, the development of the standard is not initiated by a formal agency, but it is defined by the general sense of the use of words by a population of people. For example, in common usage, a person might be labeled as “tall.” The specific standard for “tall” is set by the impressions of people and generally would be called “norm-referenced” in the same sense as definition 2. However, the clothing industry may have a trade organization that can be considered the agency that calls for a formal standard for the concept “tall” that can be used to uniformly specify the way clothing sizes are labeled. Care must be taken to distinguish informal standards derived from common usage and formal standards that are called for by an agency. This book focuses on the development of formal standards.
  • Developing Norm-Referenced Standardized Tests
    • Lucy Jane Miller(Author)
    • 2020(Publication Date)
    • Routledge
      (Publisher)
    Chapter 7 for a detailed discussion of validity.

    NORMS DEVELOPMENT

    Norm-referenced tests allow us to compare the score of a particular individual with a given reference group. The reference group is determined by the sampling plan of the standardization which has already been discussed. Norms allow for the determination of where an individual stands on the ability or trait being measured compared to those in the reference group.
    In developing norms, several considerations need to be taken into account in order to determine how best to present the norms. For example, if developing a test of intellectual ability in children, the test developer would probably want to have age related norms so a child could be compared against same aged peers. If the ability being measured shows sex differences, the developer might want to have norms based on sex. For example, for height and weight, it is better to have separate norms for males and females than one set of norms for both sexes combined given the difference between males and females in height and weight. Other types of subgroup norms may be appropriate for given tests.
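    Age-related norms amount to looking up the reference group that matches the child before converting the raw score. The sketch below assumes a hypothetical norms table with six-month age bands and made-up means and standard deviations, and it reports scores on a scale with a mean of 100 and a standard deviation of 15; none of these values come from a real test.

```python
# Hypothetical norms: (min_age_months, max_age_months) -> (mean, SD) of raw scores.
NORMS = {
    (72, 78): (20.0, 4.0),
    (78, 84): (23.0, 4.0),
    (84, 90): (26.0, 5.0),
}

def age_standard_score(raw, age_months, norms=NORMS):
    """Convert a raw score to a mean-100, SD-15 score against same-age peers."""
    for (low, high), (mean, sd) in norms.items():
        if low <= age_months < high:
            return 100 + 15 * (raw - mean) / sd
    raise ValueError(f"no norms available for age {age_months} months")

# The same raw score of 23 means different things at different ages.
for age in (74, 80, 86):
    print(age, round(age_standard_score(23, age), 2))
# 74 111.25
# 80 100.0
# 86 91.0
```

    A raw score that is above average for a six-year-old is below average a year later, which is the reason for comparing a child against same-aged peers. Norms split by sex or by school district would follow the same pattern with a different lookup key.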
    At times, local norms may be more appropriate and informative. For example, a school district may not be interested in how a child performs compared to a national sample of children on a given test but only compared to children within that school district. When devising the sampling plan it is important to have a clear understanding of what type of norms will be most informative for the user. One could have an exemplary sampling plan, excellent reliability, and yet have a test of dubious utility if the norms are developed in such a way that they do not provide the type of information that is useful to the user. (See Chapter 5 and Anastasi [5] or Cronbach [9]