Frontiers of Test Validity Theory

Measurement, Causation, and Meaning
About This Book

This book examines test validity in the behavioral, social, and educational sciences by exploring three fundamental problems: measurement, causation, and meaning. Psychometric and philosophical perspectives receive attention along with unresolved issues. The authors explore how measurement is conceived from both the classical and modern perspectives. The importance of understanding the underlying concepts, as well as the practical challenges of test construction and use, receives emphasis throughout. The book summarizes the current state of the field of test validity theory. Necessary background on test theory and statistics is presented as a conceptual overview where needed.

Each chapter begins with an overview of key material reviewed in previous chapters, concludes with a list of suggested readings, and features boxes with examples that connect theory to practice. These examples reflect actual situations that occurred in psychology, education, and other disciplines in the US and around the globe, bringing theory to life. Critical thinking questions related to the boxed material engage and challenge readers. A few examples include:

What is the difference between intelligence and IQ?

Can people disagree on issues of value but agree on issues of test validity?

Is it possible to ask the same question in two different languages?

The first part of the book contrasts theories of measurement as applied to the validity of behavioral science measures. The next part considers causal theories of measurement in relation to alternatives such as behavior domain sampling, and then unpacks the causal approach in terms of alternative theories of causation. The final section explores the meaning and interpretation of test scores as it applies to test validity. Each set of chapters opens with a review of the key theories and literature and concludes with a review of related open questions in test validity theory.

Researchers, practitioners, and policy makers interested in test validity or in developing tests will appreciate the book's cutting-edge review of test validity. The book also serves as a supplement in graduate or advanced undergraduate courses on test validity, psychometrics, testing, or measurement taught in psychology, education, sociology, social work, political science, business, criminal justice, and other fields. The book does not assume a background in measurement.


Information

Authors: Keith A. Markus and Denny Borsboom
Publisher: Routledge
Year: 2013
ISBN: 9781135055851
Pages: 342
1 Introduction
Surveying the Field of Test Validity Theory
This book treats test validity theory from perspectives emphasizing measurement, causation, and meaning. This chapter provides a foundation for what follows with respect to key concepts used throughout the book. An introduction to key terminology appears in the next section. A brief overview of test validity theory and an overview of the key concepts of measurement, causation, and meaning as they relate to test validity occupy the following two sections.
1.1. Terminology
There was a time, toward the middle of the last century, when the demand that one define one’s terms could fluster even the most nimble theoretician. In the decades since, it has become clear that defining terms and developing their use do not separate so easily that one can fully complete the former before embarking on the latter. Nonetheless, merely using terms without introducing them—on the assumption that everyone else will understand them precisely as the author understands them—is the surest path to talking at cross purposes. Therefore, the present section will help the reader carve out some basic vocabulary used throughout the remainder of the book. Further refinement and elaboration of the concepts introduced here will await later chapters.
What follows may seem like a surplus of terms at this early stage. The reader need not keep all of these distinctions in memory. However, much of the terminology introduced here foreshadows important distinctions developed in later chapters, where each will serve an important purpose. Some terms and distinctions may seem novel, but the terminology presented conforms as much as possible to standard use of the terms within the testing literature. Our goal is to avoid misunderstandings while addressing a field in which authors often write about similar topics in very different terms.
1.1.1. Testing, Assessing, and Measuring
Authors often use the terms ‘testing’, ‘assessing’, and ‘measuring’ interchangeably. However, just as often, authors use these terms to distinguish closely related ideas. In this book, the term ‘measurement’ has a narrow sense that involves strict quantities or magnitudes measured on at least an interval scale (chapter 2). That is to say, the term ‘measuring’ only applies when one has standardized units that remain consistent across the range of possible values, as with lengths or temperatures. On rare occasions, however, we will also make use of the broader sense of the term because of its pervasive use in certain contexts (e.g., measurement models). In such cases, we will clearly mark this as an exceptional use of the term. The term ‘testing’ applies more broadly. It covers any technique that involves systematically observing and scoring elicited responses of a person or object under some level of standardization. In other contexts, the term ‘assessment’ might apply more broadly still, to include non-systematic or non-standardized methods. However, because the present book is restricted to test validity theory, ‘testing’ and ‘assessment’ will largely function interchangeably in the present context. The term ‘testing’ places more emphasis on systematic observation, whereas the term ‘assessment’ places more emphasis on scoring. We understand outcomes assessment as a special case distinguished primarily by the fact that the variable assessed by the test or assessment instrument serves as an outcome variable in some context. As illustrated in this paragraph, single quotes are used throughout the book to refer to terms.
1.1.2. Attributes, Constructs, and Latent Variables
Consider the sentences “The intelligence construct has spurred considerable controversy in the research community” and “The employer uses IQ scores to assess applicants’ standing on the intelligence construct.” In the first of these sentences, the term ‘construct’ denotes a word or concept used by intelligence researchers, sometimes referred to as a theoretical term. As such, the term ‘construct’ indicates the signifying term. In the second sentence, the term ‘construct’ instead denotes an attribute that the tested persons have. Philosophers have used the term ‘property’ in a similar manner, sometimes referring to the term (a predicate in logic) and sometimes referring to the attribute that the term signifies (Putnam, 1975a, chapter 19). Following the same pattern, much behavioral science literature uses the term ‘variable’ sometimes to refer to the attribute and sometimes to refer to the signifier of the attribute (Markus, 2008a). It is worth noting that dichotomous variables correspond to simple properties, such that each individual either has or lacks the property, in contrast to properties one can have to some degree or in some quantity. For example, having a height of two meters constitutes a simple property. Multi-valued variables, categorical or quantitative, correspond to complex properties involving mutually exclusive simple properties, such as height in general (Rozeboom, 1966). In some cases an attribute is associated with a particular test, such as a passing score on a particular certification test or a particular score on a college admissions test. In other cases, many different tests assess the same attribute, and people have this attribute independent of any particular test. For example, several tests could all assess the level of mastery of the same curriculum in college-level calculus.
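To make the distinction concrete, here is a minimal Python sketch (our illustration; the class and function names are hypothetical, not from the text) that renders a simple property as a yes/no predicate and a complex property as a variable ranging over mutually exclusive values:

```python
from dataclasses import dataclass

@dataclass
class Person:
    height_m: float  # 'height in general': a complex, multi-valued property

def has_height_two_meters(person: Person, tolerance: float = 0.005) -> bool:
    """A simple property: each individual either has it or lacks it."""
    return abs(person.height_m - 2.0) < tolerance

# Each value of the complex property corresponds to one of a set of
# mutually exclusive simple properties (being 1.80 m tall, 2.00 m tall, ...).
alice = Person(height_m=2.0)
print(has_height_two_meters(alice))  # True
```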
In this book, these terms always refer to the actual property tested or intended for testing. They never indicate the theoretical term, label, or symbol used to refer to this property. The term ‘construct’ always assumes a substantive interpretation of this property. The term ‘latent variable’ allows for content-free statistical models that capture probabilistic relationships with latent variables without specifying the property that the latent variable represents. Finally, the term ‘construct label’ refers to the label given to the construct—the theoretical term used in scientific discourse, for instance. In keeping with these distinctions, we always understand a latent variable as a simple or complex property of test takers. What makes the variable latent is the fact that researchers do not directly observe it, possibly but not necessarily because it cannot be directly observed. The formalism of a latent variable makes it possible to represent statistical relationships with some latent variable, whatever it may be, without specifying the substance of that variable. For example, a common factor model makes it possible to estimate shared variance between items without specifying or making any assumptions about the content of that shared variance.
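To illustrate this last point, the following Python sketch (our construction, assuming a single common factor with loadings chosen purely for illustration) simulates indicator scores from a latent variable of unspecified content and shows that the inter-item correlations recover the shared variance without any reference to what that variable substantively is:

```python
import numpy as np

# Simulate a one-factor model: theta is the latent variable; nothing
# in the model says what it is, only how it relates to the indicators.
rng = np.random.default_rng(0)
n = 10_000
theta = rng.normal(size=n)                 # latent variable, content unspecified
loadings = np.array([0.8, 0.7, 0.6])       # illustrative factor loadings
noise = rng.normal(size=(n, 3)) * np.sqrt(1 - loadings**2)
items = theta[:, None] * loadings + noise  # indicator (item) scores

# Off-diagonal correlations approximate the products of the loadings:
# the variance the items share because of the common factor.
r = np.corrcoef(items, rowvar=False)
print(np.round(r, 2))  # off-diagonals near 0.56, 0.48, 0.42
```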
1.1.3. Items, Indicators, and Indices
Each individual stimulus that is incorporated in a test with the goal of eliciting a response from the test taker constitutes a test item. For instance, an item may be “Spring stands to Season as Jupiter stands to …?” The response to the item is the item response; for instance, a response to the example item could be “Planet” or “Solar system.” The coding of the response, for instance as correct or incorrect, or as 0 or 1, is the item score and the process of coding responses into numbers is item scoring. Here, the response “Planet” would be coded as correct, and would typically be scored 1, while the response “Solar system” would be coded as incorrect, and be scored as 0.
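The coding step is mechanical enough to express directly. Here is a minimal Python sketch of item scoring, using the verbal-analogy example above (the scoring key and function name are ours):

```python
def score_item(response: str, key: str) -> int:
    """Code a response as correct (1) or incorrect (0) against a key."""
    return int(response.strip().lower() == key.strip().lower())

item = "Spring stands to Season as Jupiter stands to ...?"
key = "Planet"
print(score_item("Planet", key))        # 1: coded correct, scored 1
print(score_item("Solar system", key))  # 0: coded incorrect, scored 0
```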
In the context of models that relate item responses to latent variables, items are used as indicators. Item scores are then interpreted as indicator scores. These indicator scores may be used to assess or to measure a construct. Recall that the terms ‘assess’ and ‘measure’ apply to indicators only when the indicator scores are used to gauge a person’s standing on a construct, not merely to summarize performance. This means that, in the measurement model, the indicator variables are dependent variables, and the latent variable is an independent variable (chapter 6). In the present example, the test user may interpret the item score as an indicator of the construct verbal reasoning. In educational testing, the term ‘indicator’ can refer to a parcel of items combined for reporting purposes. We will not use the term in this sense.
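This direction of dependence can be written out in a standard linear factor form (our rendering, not a formula from the text):

$$X_i = \lambda_i \theta + \varepsilon_i, \qquad i = 1, \dots, k,$$

where the indicator scores $X_i$ sit on the dependent side of the equation, the latent variable $\theta$ on the independent side, $\lambda_i$ is a loading, and $\varepsilon_i$ an error term.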
In contrast, an item score can also function as an index. This applies to indicators when the model equation is reversed. This means that the indicators are used to form a composite score, which may be used as a summary of the item responses. For instance, in the current example, a number of similar items may be administered to the test taker. In this case, the total number of correctly answered items would be an index or summary of the person’s performance. In some cases the term ‘index’ may also apply to situations where a construct is predicted from the item scores. For instance, the item scores may in this case be used to predict a property that was not assessed, for instance future performance on a job. In the measurement model, the indicators then function as independent variables, whereas the latent variable (future performance on a job) is a dependent variable that is predicted from the items. When test scores are used in this way, i.e., as predictors of an external variable, then that external variable may also be designated a criterion variable or criterion. A criterion may be a construct or an observable, depending on context, where the term ‘observable’ refers to something that can be directly observed. In most cases of interest, the criterion carries surplus meaning over the observed score and in such cases the criterion is best thought of as a construct. For example, the choice of one observable over another typically has a theoretical rationale. The data alone do not force such choices upon the researcher.
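Schematically, reversing the model equation yields a form like the following (again a standard rendering, not the authors’ notation):

$$\eta = \sum_{i=1}^{k} w_i X_i + \zeta,$$

where the item scores $X_i$ now function as independent variables with weights $w_i$, and $\eta$ is the composite or predicted variable (for instance, future performance on a job), with $\zeta$ a disturbance term.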
In many contexts, test theory assumes that indicators come from homogeneous domains of items, whereas indices more typically reflect a small number of heterogeneous, theory-determined attributes that make up the composite variable. In still other contexts, indicators may serve as samples from heterogeneous domains. In this case, the response behavior is considered to be a sample from a larger domain of behaviors that characterizes a person. For instance, in the current example, a correct response to the item may be considered a sample from a person’s responses to all possible items of the verbal analogy type. The response is then used to infer, from the administered items, the person’s response behavior on items that were not actually administered. This inferential process is called generalization. Generalization is horizontal inference: one makes inferences from one set of properties (administered verbal analogies) to other properties of the same kind (other verbal analogies). This contrasts with assessment and measurement, which are vertical inference strategies: one makes an inference from one class of properties (administered verbal analogies) to properties of a different kind (constructs involved in analogical reasoning).
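A minimal Python sketch of generalization as horizontal inference (our illustration; the domain size and sampling scheme are hypothetical): performance on a sample of administered analogies estimates performance on the wider domain of all such items.

```python
import numpy as np

# A person's (unknown) correctness on every item in a large domain of
# verbal analogies; 0.7 is the person's true domain proportion correct.
rng = np.random.default_rng(1)
domain_size = 5_000
domain_correct = rng.random(domain_size) < 0.7

# Administer a random sample of 40 items and generalize from sample to
# domain: a horizontal inference from item behavior to item behavior,
# with no construct invoked.
administered = rng.choice(domain_size, size=40, replace=False)
print(f"sample proportion correct: {domain_correct[administered].mean():.2f}")
print(f"domain proportion correct: {domain_correct.mean():.2f}")
```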
1.1.4. Test Validity and Validation
When it appears on its own in the present book, the term ‘validity’ generally refers to test validity and not to broader applications such as the validity of inferences or research designs. The term ‘validation’ refers to the process of investigating and documenting test validity. Both validity and validation are evolving concepts that have taken on a variety of meanings in the past century. We briefly chart this historical development in the next section.
1.2. The Development of Test Validity Theory
Throughout the history of test validity theory, one can discern three interacting processes: expansion, unification, and partition. Expansion occurs when test developers encounter applied problems that existing validity theory does not cover. At such times, new concepts and validation procedures enter validity theory. Examples of expansion typically involve new types of validity evidence incorporated into the validation process. Unification occurs as a form of theoretical integration: theoretical innovation recasts aspects of validity formerly treated as disparate as special cases of a common overarching concept, emphasizing their commonalities as part of test validity. Examples of unification typically involve reinterpretation of existing validation strategies. Partition occurs when authors press distinctions between cases treated similarly under existing validity theory, emphasizing the differences between elements of validation treated together under existing theory. Examples of partition typically involve the introduction of typologies of test development and test validation activities.
Superimposed on this dynamic interplay among the three processes, one also finds a successive development in the underlying philosophical assumptions. The chapters in successive editions of Educational Measurement on validity (Cureton, 1951; Messick, 1989) and validation (Cronbach, 1971; Kane, 2006) offer useful guideposts to this development. Cureton (1951) developed test validity theory in a manner consistent with a form of descriptive empiricism, reflecting both behaviorism in psychology and positivism in the philosophy of science. Here, one finds an emphasis on the idea that claims are not meaningful unless operationalized in terms of observables. Correlations provide the currency of the realm. Sometimes correlations inform claims about the identity of the correlates, such as what the test measures and what the test developer intends to measure. Other times, correlations inform claims about prediction: claims that what the test measures at one time predicts what the test user wants to predict at a later time.

Cronbach (1971) developed test validity in a way that shifted toward an explanatory (logical) empiricism. The primary innovation here involves the idea that inferred theoretical variables (e.g., extroversion, reading ability, job knowledge) explain patterns of observed test behavior rather than simply summarizing such behavior. The relevant notion of explanation involves subsuming observed test behaviors under general scientific laws (Hempel, 1965).

Messick (1989) reworked a broad range of test validity concepts under a form of constructivist realism. The constructivist part was already well entrenched in validity theory, as earlier authors had long emphasized the underdetermination of theory by data: the idea that observed test behaviors do not fully determine the abstractions from them introduced by theory (Loevinger, 1957). Different theories can summarize the same data differently, and abstracting from data to theory thus involves choices that cannot be driven only by the data. The realist component, however, marked a break with earlier validity theory (congruent with Donald Campbell’s advocacy of critical realism during a similar period, both reflecting shifts in the philosophy of science). Messick emphasized psychological constructs as actually existing properties reflecting real dimensions of variability between people (or other objects of measurement) rather than just convenient summaries of observable test behaviors (cf. Norris, 1983).

In the most recent chapter in the series, Kane (2006) moved test validity theory in a direction consistent with philosophical pragmatism. Earlier approaches sought to articulate the universal nature of validity, validation, and truthful claims about tests and test use. Kane’s approach is pragmatic in the sense that he instead presents an approach to test validity that treats all of these things as highly context specific. One constructs an interpretive argument and validates relative to that interpretive argument. The interpretive argument leads to standards for validity and for validation evidence appropriate to the intended interpretation. The distinctive character of pragmatism is the view that there are no foundations that transcend individual or collective perspectives on the world; instead, knowledge of the world develops from within such perspectives. As such, a validation effort might prove adequate for one context but not for another. When the interpretive argument changes or expands, a need for new validation evidence can result.
Corresponding to this philosophical progression, one finds in the four Educational Measurement chapters an interesting narrowing of focus over the decades. Cureton (1951) devoted a substantial portion of his chapter to the problem of choosing the right attribute to measure. Cronbach (1971) more or less assumed that the test developer had decided what to measure and focused a good deal of attention on the myriad other factors that could also impact test responses. Messick (1989) assumed both that the test developer knew what he or she wanted to measure and that he or she had developed a standardized procedure that controlled for extraneous factors. Messick primarily focused on what was involved in collecting evidence that the standardized procedure measured the desired attribute. Kane (2006) more or less assumed that the test developer had selected an attribute and developed a standardized procedure that measures it with at least partial success. Kane focused on providing a detailed account of producing a persuasive argument to justify test use to outside audiences.
1.2.1. Descriptive Empiricism
It seems fitting that the acronym of this book’s subtitle forms the Roman numeral MCM, because the year 1900 is a good starting point for the history of test validity theory. Pearson introduced the correlation coefficient in 1896, and it was immediately applied to test scores (e.g., Wissler, 1901). The concept of test validity was first introduced to indicate the extent to which a test measures what it purports to measure (e.g., Buckingham, 1921). The correlation with a criterion measure (of what the test purports to measure) provided an index of this and was widely referred to as a validity coefficient (e.g., Hunkins & Breed, 1923). Favoring a more theory-neutral approach, Nunnally and Bernstein (1994) described the absolute value of the correlation coefficient as the validity coefficient.
Using the terminology outlined above, we would paraphrase this in terms of the test assessing what it purports to assess, to avoid assuming too much about the level of measurement or the quantitative attributes of the construct assessed. Correlations with other variables remain a basic tool in the test validator’s toolkit, although few users today would interpret such correlations as providing a complete summary of a test’s validity. Historically, however, the correlation between test score and criterion has been of great importance.
The validity coefficient applied most naturally in cases where a test was used to provide a prediction before the criterion measure became available. For instance, universities may use high school grade point average as a predictor of success in college; the army may use personality test scores as predictors of behavior in combat situations; and companies may use task performance ratings as predictors of performance on the job. Other uses involved situations in which a test provided a shorter, more economical alternative to a longer but definitive measure. For instance, in medicine, a symptom profile (e.g., coughing, fever, headache) is often the first choice to assess someone’s condition (e.g., influenza) even though it is inferior to other tests (e.g., blood tests).
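In these prediction contexts, the classical validity coefficient is simply a correlation. Here is a minimal Python sketch (our illustration with simulated data, not an empirical result):

```python
import numpy as np

# Simulate test scores and a later criterion that the test partially
# predicts (the 0.5 slope is arbitrary, chosen for illustration).
rng = np.random.default_rng(2)
n = 500
test = rng.normal(size=n)                    # e.g., admissions test scores
criterion = 0.5 * test + rng.normal(size=n)  # e.g., later college performance

# The validity coefficient: the test-criterion correlation.
validity_coefficient = np.corrcoef(test, criterion)[0, 1]
print(f"validity coefficient: {validity_coefficient:.2f}")  # ~0.45
```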
In still other contexts, test developers created tests based on detailed content specifications rather than to predict or approximate a specific criterion measure. For instance, educational tests, like exams, are traditionally put together according to content specifications rather than on the basis of test–criterion correlations. An example may be an arithmetic examination. Such a test requires adequate content coverage of several distinct arithmetic skills (addition, subtraction, multiplication, and division). Content often takes precedence over test–criterion correlations in these cases. Suppose that the answer to the question ‘Do you like beans?’ happened to do better than arithmetic items in predicting arithmetic ability. Even if this were so, few would propose replacing arithmetic exams by inquiries into one...

Table of contents

  1. Cover
  2. Half Title
  3. Title Page
  4. Copyright Page
  5. Table of Contents
  6. About the Authors
  7. Preface
  8. Acknowledgements
  9. 1. Introduction: Surveying the Field of Test Validity Theory
  10. Part I:
  11. Part II:
  12. Part III:
  13. Part IV:
  14. Notes
  15. References
  16. Author Index
  17. Subject Index
  18. Example Index