Computer-Based Testing

Building the Foundation for Future Assessments
About This Book

Although computer-based tests (CBT) have been administered for many years, improvements in the speed and power of computers coupled with reductions in their cost have made large-scale computer delivery of tests increasingly feasible. CBT is now a common form of test delivery for licensure, certification, and admissions tests. Many large-scale, high-stakes testing programs have introduced CBT either as an option or as the sole means of test delivery. Although this movement to CBT has, to a great extent, been successful, it has not been without problems. Advances in psychometrics are required to ensure that those who rely on test results can have at least the same confidence in CBTs as they have in traditional forms of assessment.

This volume stems from an ETS-sponsored colloquium in which more than 200 measurement professionals from eight countries and 29 states convened to assess the current and future status of CBT. The formal agenda for the colloquium was divided into three major segments: Test Models, Test Administration, and Test Analysis and Scoring. Each segment consisted of several presentations followed by comments from noted psychometricians and a break-out session in which presenters and discussants identified important issues and established priorities for a CBT research agenda. This volume contains the papers presented at the colloquium, the discussant remarks based on those papers, and the research agenda that was generated from the break-out sessions.

Computer-Based Testing: Building the Foundation for Future Assessments is must reading for professionals, scholars, and advanced students working in the testing field, as well as people in the information technology field who have an interest in testing.


Information

Editors
Craig N. Mills, Maria T. Potenza, John J. Fremer, William C. Ward
Publisher
Routledge
Year
2005
ISBN
9781135651640
Edition
1
Pages
338
Language
English

1
The Work Ahead: A Psychometric Infrastructure for Computerized Adaptive Tests

Fritz Drasgow
University of Illinois at Urbana-Champaign

Introduction

Computer-based tests (CBTs) and computerized adaptive tests (CATs) have finally become a reality. After many years of research by psychometricians and scholars in related fields, CBTs and CATs have been implemented by several major testing programs. Examples include the Department of Defense’s Computer Adaptive Test-Armed Services Vocational Aptitude Battery (CAT-ASVAB), the Graduate Record Examination (GRE)-CAT, State Farm Insurance’s selection test for computer programmers, and various licensing and credentialing exams. Clearly, large-scale, high-stakes testing is undergoing a paradigm shift.
The colloquium “Computer-Based Testing: Building the Foundation for Future Assessments,” held in Philadelphia September 25-26, 1998, underscores the significance of these new approaches to educational and psychological testing. Seven leading researchers presented papers reviewing current research and practice with respect to several critical aspects of CBTs and CATs. These papers, contained in this volume, integrated current research, including many studies presented recently at national conferences and not yet available in the published literature, and began the task of identifying an agenda for future research. At the colloquium, a panel of noted measurement scholars provided commentary on the papers and offered their ideas and proposals for the research agenda. Colloquium attendees also provided input for the research agenda by participating in small group discussions. A formal report, Completing the CBT Foundation: A Measurement Agenda, was written following the colloquium. It summarized key issues identified in the paper presentations, panel commentary, and small group discussions. The research agenda is included as the final chapter of this book.
The colloquium offered a unique opportunity for the psychometric community to shape the future of CBT. Wide-ranging and intense discussions of critical problems clarified various needs for programmatic research. Many ideas for addressing these problems were offered and approaches for theoretical and empirical research were proposed and evaluated. Perhaps most exciting was the opportunity for a community of psychometric scholars and measurement practitioners to lay out a broad framework for the research needed to ensure the success of CBT.

CAT and CBT: Past and Future

Since the 1970s there has been a tremendous amount of research on CAT and CBT. For example, the recently published book, Computerized Adaptive Testing: From Inquiry to Operation (Sands, Waters, & McBride, 1997), chronicles the work of the military psychometricians who created the CAT-ASVAB. Countless psychometrician-years of effort were required to produce a test that could be used seamlessly with the paper-based ASVAB. Many other testing programs have also devoted great effort to computerizing their assessment instruments.
The efforts devoted to CAT and CBT in the past two decades are reminiscent of the growth of large-scale standardized paper-based testing in the 1940s and 1950s. This growth led to extensive research by psychometricians to ensure that examinees received scores that were fair and equitable. For example, multiple test forms were needed by these large assessment programs so that practice effects and cheating would not be an issue. However, the use of multiple test forms creates the problem of placing all examinees’ scores on the same scale; thus began the field of test equating.
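To see concretely what an equating study must accomplish, consider classical mean-sigma linear equating, sketched below with invented numbers; this is an editorial illustration, not a procedure drawn from the colloquium papers.

```python
def linear_equate(x, mu_x, sigma_x, mu_y, sigma_y):
    """Map a raw score x from Form X onto the scale of Form Y.

    Classical mean-sigma linear equating: apply the linear
    transformation that matches the means and standard deviations
    of the two forms in a common population.
    """
    return sigma_y / sigma_x * (x - mu_x) + mu_y

# Illustrative numbers: Form X is slightly harder than Form Y.
print(linear_equate(x=30, mu_x=28.0, sigma_x=6.0, mu_y=31.0, sigma_y=6.5))
# A raw 30 on Form X corresponds to roughly 33.2 on Form Y's scale.
```

Equipercentile and IRT-based methods pursue the same goal while matching more than the first two moments of the score distributions.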
CBT is experiencing analogous growth today. This growth is creating new technical problems for psychometricians that must be solved for testing programs to operate smoothly. Issues include developing and calibrating large numbers of items, constructing item pools for operational tests, limiting exposure of items to ensure security, designing procedures for scoring tests, and selecting models for characterizing item responses. Furthermore, numerous basic issues have not been adequately studied, such as the conditions under which the fundamental unit of analysis should be an item or a set of items (i.e., a “testlet”). These issues were addressed in the colloquium; brief summaries are provided in the following discussion.
It is important to maintain one’s perspective in the face of the myriad technical challenges to CAT and CBT. Computerization enables a much broader range of measurement advances than just adaptive administration of traditional multiple-choice items. Three avenues for such work are briefly discussed. The first is visualization. True color images can be presented with remarkable resolution. Moreover, it is possible to allow users to zoom in and out, as well as rotate objects in space. An example is Ackerman, Evans, Park, Tamassia, and Turner’s (1999) dermatological disorder exam, which uses visualization to good effect. Audition provides a second avenue. Vispoel’s (1999) work on assessing musical aptitude provides a fascinating example. Simulations of phone conversations, such as those made to call centers, can also be used to glean information about examinees. What might be called interaction constitutes the third area: Full motion video can be presented by the computer, and thus it is possible to develop assessments of skills related to human interactions. Olson-Buchanan et al.’s (1998; Drasgow, Olson-Buchanan, & Moberg, 1999) Conflict Resolution Skills Assessment and Donovan, Drasgow, and Bergman’s (1998) Leadership Skills Assessment provide two examples. Several other examples of innovative uses of computerization for measurement are presented in Drasgow and Olson-Buchanan (1999).

Issues for Today's CBTs

Many challenges confront a testing program as it implements and operates a CBT. These challenges, which constitute the raison d’être for this colloquium, are summarized in this section.

Developing New Items

One of the great virtues of a CBT is the opportunity for walk-in testing where examinees schedule the test at times convenient for them. Walk-in testing is viable when test construction designs reduce the possibility that one examinee can provide information to other examinees about test items they might encounter. To this end, adaptive item selection or adaptive testlet selection can be used.
Way, Steffen, and Anderson (chap. 7, this volume) provide a thoughtful analysis of item writing requirements for a high-stakes testing program. They describe various constraints on item usage that should enhance security. The bottom line on their analysis is that item writing requirements are even more substantial than one might guess.

Calibrating Items

Item response theory (IRT) is the psychometric theory underlying CAT. IRT uses one or more item parameters to characterize each item; estimates of these item parameters must be obtained before an item can be added to the item pool. Perhaps the most common approach to obtaining these estimates before implementing a CAT has been to administer items to sizable samples of examinees using paper-based tests. Parshall (chap. 6, this volume) notes problems with this method.
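As a concrete reference point, the sketch below gives the three-parameter logistic (3PL) response function that such calibration typically targets; it is an editorial illustration, and the item parameter values shown are invented.

```python
import math

def p_correct_3pl(theta, a, b, c, D=1.7):
    """Probability that an examinee of ability theta answers correctly
    under the three-parameter logistic (3PL) model.

    a: discrimination, b: difficulty, c: lower asymptote (guessing),
    D: scaling constant (1.7 approximates the normal-ogive metric).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Calibration estimates a, b, and c for each item from response data;
# these invented values describe a moderately hard, discriminating item.
print(p_correct_3pl(theta=0.0, a=1.2, b=0.5, c=0.2))
```

The sample-size questions Parshall raises concern how much response data is needed before estimates of a, b, and c are stable enough for operational use.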
After a CAT is implemented, new items can be “seeded” into the operational form: New items are administered during the CAT but are not used to compute official test scores. Examinees would not ordinarily know which items are part of the operational test and which items are experimental. Sample size requirements and factors that influence these requirements are also described by Parshall.
Several years ago, the Office of Naval Research funded four psychometricians to work on estimating parameters of seeded items. Both parametric approaches (Darrell Bock; Frederic Lord) and nonparametric approaches (Michael Levine; Fumiko Samejima) were investigated. Despite this initial work, relatively little has appeared in the published literature on this topic. Consequently, we are left with questions: What seeding schemes are effective? What estimation frameworks are superior? What sample sizes are required? What subtle problems can be encountered?

Changes in the Item Pool: Evolution Versus Cataclysm

Way et al. (chap. 7, this volume) describe the set of available items as the item vat and characterize restrictions on item selection for a specific item pool as docking rules. They use system dynamics modeling to examine CAT forms assembly given new items flowing into the item vat, old items being removed from the vat, and various docking rules.
In Way et al.’s approach, item pools are taken as intact entities. When an item pool becomes too old, it is replaced by a new item pool. Beyond test security, a critical concern is how new forms are equated to the reference form. Military researchers, for example, conducted careful studies to equate CAT-ASVAB Forms 1 and 2 to their reference form, paper-based ASVAB Form 8A.
A beautiful old hickory tree in my back lawn died recently. I thought about taking down the tree lumberjack style, but instead had an arborist remove it. Rather than felling the tree, he started near the top and cut it down branch by branch. His explanation was, “Small cuts make small mistakes.” Way et al.’s treatment of item pools, which might be called item pool cataclysm, seems analogous to the lumberjack’s approach; it may allow large equating errors to occur and seriously disrupt a testing program. An alternative would be item pool evolution where only a few items are replaced at a time. Items that are overexposed or items that show evidence of changing characteristics would be candidates. Would such small changes limit mistakes in equating to being small in magnitude? How many items could be replaced before an equating study is required? Alternatively, can item pool cataclysms be designed in ways that guarantee small equating errors? Moreover, is it necessary to replace entire item pools to deter cheating conspiracies? These are questions of critical significance for CAT programs, and they provide opportunities for future research.

Delivering, Scoring, and Modeling

What is presented to examinees affects how the assessment can be scored and drives the psychometric modeling of examinee behavior. Four chapters on this interrelated set of topics are presented in this book.
Test delivery refers to how items are selected for presentation to examinees. In CAT-ASVAB, for example, the computer adaptively selects individual items for administration. Folk and Smith (chap. 2, this volume) describe a number of alternatives to this approach. For example, branching might be based on carefully constructed testlets rather than on individual items. Testlets are particularly prominent in computerized mastery testing where it is essential to satisfy content specifications.
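To make item-level adaptive selection concrete, the following minimal sketch chooses the unadministered item with maximum Fisher information at the current ability estimate, under an assumed two-parameter logistic (2PL) model; it illustrates the general idea rather than any specific delivery design Folk and Smith describe.

```python
import math

def p_2pl(theta, a, b, D=1.7):
    """Two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, D=1.7):
    """Fisher information of a 2PL item at ability theta."""
    p = p_2pl(theta, a, b, D)
    return (D * a) ** 2 * p * (1.0 - p)

def select_next_item(theta_hat, pool, administered):
    """Return the index of the most informative unused item."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates,
               key=lambda i: item_information(theta_hat, *pool[i]))

# Invented pool of (a, b) pairs; a real pool holds hundreds of items.
pool = [(1.0, -1.0), (1.4, 0.0), (0.8, 0.5), (1.6, 1.2)]
print(select_next_item(theta_hat=0.3, pool=pool, administered={1}))
```

Testlet-based delivery replaces the single-item step here with the selection of a preassembled, content-balanced block.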
Test delivery affects how a test is scored. Dodd and Fitzpatrick (chap. 11, this volume) first consider scoring CATs with standard item branching (although some of these scoring methods have wider applicability than this delivery method). They focus on scoring methods that may be easier to explain to examinees and other interested parties than the usual maximum likelihood or Bayesian estimates of ability. Dodd and Fitzpatrick then discuss automated scoring methods for performance-based tasks such as the architectural design exam for which Bejar (1991) developed an interesting scoring algorithm.
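For contrast with the more easily explained scoring schemes Dodd and Fitzpatrick examine, the sketch below computes the standard Bayesian expected a posteriori (EAP) ability estimate by quadrature, again under an assumed 2PL model with invented parameters.

```python
import numpy as np

def eap_estimate(responses, items, D=1.7, n_points=61):
    """Expected a posteriori (EAP) ability estimate under a 2PL model
    with a standard normal prior, evaluated on a quadrature grid.

    responses: list of 0/1 scored answers
    items: list of (a, b) item parameter pairs, one per response
    """
    theta = np.linspace(-4.0, 4.0, n_points)   # quadrature points
    prior = np.exp(-0.5 * theta ** 2)          # N(0, 1) up to a constant
    likelihood = np.ones_like(theta)
    for u, (a, b) in zip(responses, items):
        p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
        likelihood *= p if u else 1.0 - p
    posterior = likelihood * prior
    return float(np.sum(theta * posterior) / np.sum(posterior))

# Three invented items answered right, right, wrong.
items = [(1.0, -0.5), (1.2, 0.0), (0.9, 0.8)]
print(eap_estimate([1, 1, 0], items))
```

Part of Dodd and Fitzpatrick's concern is that an estimate like this, however well behaved statistically, is hard to explain to an examinee who wants to know how answers map to a score.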
Scoring, in turn, can be affected by how examinee responses are modeled. Luecht and Clauser (chap. 3, this volume) argue that we should not let currently popular psychometric models drive the type of data that are collected. Instead, they suggest considering more detailed information about the examinee-stimulus material interaction such as response time or the sequence of responses made by an examinee. They note that current IRT models are useful for estimating overall proficiency in a domain, but evaluating qualitative states of learning is beyond their scope.
Continuing in this direction, Schnipke and Scrams (chap. 12, this volume) provide a detailed review of approaches to modeling response time. It seems likely a reasonably large set of models (or considerable flexibility within a given model) is necessary to accurately characterize items that vary in difficulty, processing requirements, and timing constraints. Schnipke and Scrams point out a critical need in this area of research: “No validity studies have been offered that address the utility of the resulting scores.”
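As one entry point to this literature, the sketch below evaluates the density of a simple lognormal response-time model, in which log time is normal with an item-specific time intensity and precision and a person-specific speed parameter. The parameterization and values are illustrative assumptions, not a model endorsed by Schnipke and Scrams.

```python
import math

def log_rt_density(t, alpha, beta, tau):
    """Log-density of response time t under a lognormal model:
    log t ~ Normal(beta - tau, 1 / alpha**2), where
    beta: item time intensity, tau: examinee speed, and
    alpha: item-specific precision of the timing process.
    """
    z = alpha * (math.log(t) - (beta - tau))
    return (math.log(alpha) - math.log(t)
            - 0.5 * math.log(2 * math.pi) - 0.5 * z * z)

# A 45-second response to an item whose typical log-time is 4.0,
# from an examinee slightly faster than average (tau = 0.2).
print(log_rt_density(t=45.0, alpha=1.5, beta=4.0, tau=0.2))
```

Their caution applies directly: fitting such a model is straightforward, but no validity evidence yet tells us what the resulting speed scores are good for.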

Item Exposure and Test Security

The convenience of walk-in testing creates a great potential for test compromise; conspirators have many opportunities to steal items. A situation in which a relatively small number of examinees conspire to raise their test scores is unfair to the vast majority of honest examinees, and so testing programs are obliged to take steps to maintain test security. High-stakes testing programs in particular must assume evil intentions on the part of some examinees.
Davey and Nering (chap. 8, this volume) describe methods for item exposure control. Such methods require a delicate balance: They must not let the most discriminating items be overexposed, yet they must maintain the efficiency of adaptive testing. It is probably true in high-stakes testing programs that security needs impose greater requirements on an item pool’s size than do psychometric needs. More research is needed on item exposure control and examinees’ proclivities for cheating.
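One widely cited exposure-control method is the Sympson-Hetter procedure, in which a provisionally selected item is administered only with a pre-calibrated probability. The sketch below illustrates that screening idea under an assumed 2PL pool with invented exposure parameters; it is not presented as the specific set of methods Davey and Nering review.

```python
import math
import random

def info_2pl(theta, a, b, D=1.7):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1.0 - p)

def select_with_exposure_control(theta_hat, pool, k, administered, rng=random):
    """Sympson-Hetter-style screening: rank unused items by information
    at the current ability estimate, then administer the top candidate
    only with probability k[i]; on rejection, fall through to the next.
    The k[i] are fixed in advance by simulation so that no item's
    administration rate exceeds a target exposure ceiling.
    """
    ranked = sorted((i for i in range(len(pool)) if i not in administered),
                    key=lambda i: info_2pl(theta_hat, *pool[i]),
                    reverse=True)
    for i in ranked:
        if rng.random() < k[i]:
            return i
    return ranked[-1]  # degenerate fallback, sufficient for this sketch

pool = [(1.0, -1.0), (1.4, 0.0), (0.8, 0.5), (1.6, 1.2)]
k = [0.9, 0.4, 0.9, 0.5]   # invented exposure-control parameters
print(select_with_exposure_control(0.3, pool, k, administered=set()))
```

The delicate balance described above is visible here: lowering k[i] for the most discriminating items protects them from overexposure, but it diverts administrations to less informative items and so costs measurement efficiency.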

Item and Test Disclosure

New York state mandates disclosure of educational tests. The legal status of the New York law is in some question (it has been ruled to be in conflict with United States copyright law), but the educational measurement community seems in reasonable agreement that some degree of disclosure is appropriate.
Item and test disclosure in a CAT setting presents a challenge. If the items presented to individual examinees are publicly disclosed, the item pool will be quickly depleted. Moreover, it will be impossible to write and pretest items fast enough to replenish the item pool. The National Council on Measurement in Education (NCME) convened an ad hoc committee to consider the issues involved; this committee was composed of members from educational test publishers, universities, and the public schools.
After review and debate, the NCME ad hoc committee recommended that testing programs provide the option of what they termed secure item review. Here, examinees would have the opportunity to review the items they were administered, their own answers, and the answers scored as correct. This review would take place at a testing site and would be monitored by a test proctor. Examinees would not be allowed to take notes, but some mechanism for challenging items they believe to be miskeyed would be allowed. The committee’s recommendations were given in a white paper entitled Item and Test Disclosure for Computerized Adaptive Tests, which is contained in the Appendix to this chapter.

Stretching the Envelope

Computerization greatly increases the variety of stimulus materials that can be presented to examinees. As noted previously, we can use the terms visualization, audition, and interaction to classify new assessment tools.

Visualization

Computer monitors are rapidly converging to a true color standard that allows 16.7 million (2^24) colors to be presented; this represents the maximum number of colors that can be discriminated by the human eye. Moreover, the cost of monitors has fallen rapidly so that the new generation of large (19-inch and 21-inch) monitors can be purchased by many testing programs. Large screen sizes, true color, and...

Table of contents

  1. Cover
  2. Half Title
  3. Title
  4. Copyright
  5. Contents
  6. Preface
  7. 1 The Work Ahead: A Psychometric Infrastructure for Computerized Adaptive Tests
  8. PART I: TEST MODELS
  9. PART II: TEST ADMINISTRATION
  10. PART III: TEST ANALYSIS AND SCORING
  11. PART IV: RESEARCH AGENDA
  12. Author Index
  13. Subject Index