Introduction
Computer-based tests (CBTs) and computerized adaptive tests (CATs) have finally become a reality. After many years of research by psychometricians and scholars in related fields, CBTs and CATs have been implemented by several major testing programs. Examples include the Department of Defense's Computer Adaptive Test-Armed Services Vocational Aptitude Battery (CAT-ASVAB), the Graduate Record Examination (GRE)-CAT, State Farm Insurance's selection test for computer programmers, and various licensing and credentialing exams. Clearly, large-scale, high-stakes testing is undergoing a paradigm shift.
The colloquium "Computer-Based Testing: Building the Foundation for Future Assessments," held in Philadelphia September 25-26, 1998, underscores the significance of these new approaches to educational and psychological testing. Seven leading researchers presented papers reviewing current research and practice with respect to several critical aspects of CBTs and CATs. These papers, contained in this volume, integrated current research, including many studies presented recently at national conferences and not yet available in the published literature, and began the task of identifying an agenda for future research. At the colloquium, a panel of noted measurement scholars provided commentary on the papers and offered their ideas and proposals for the research agenda. Colloquium attendees also provided input for the research agenda by participating in small group discussions. A formal report, Completing the CBT Foundation: A Measurement Agenda, was written following the colloquium. It summarized key issues identified in the paper presentations, panel commentary, and small group discussions. The research agenda is included as the final chapter of this book.
The colloquium offered a unique opportunity for the psychometric community to shape the future of CBT. Wide-ranging and intense discussions of critical problems clarified various needs for programmatic research. Many ideas for addressing these problems were offered and approaches for theoretical and empirical research were proposed and evaluated. Perhaps most exciting was the opportunity for a community of psychometric scholars and measurement practitioners to lay out a broad framework for the research needed to ensure the success of CBT.
CAT and CBT: Past and Future
Since the 1970s there has been a tremendous amount of research on CAT and CBT. For example, the recently published book, Computerized Adaptive Testing: From Inquiry to Operation (Sands, Waters, & McBride, 1997), chronicles the work of the military psychometricians who created the CAT-ASVAB. Countless psychometrician-years of effort were required to produce a test that could be used seamlessly with the paper-based ASVAB. Many other testing programs have also devoted great effort to computerizing their assessment instruments.
The efforts devoted to CAT and CBT in the past two decades are reminiscent of the growth of large-scale standardized paper-based testing in the 1940s and 1950s. This growth led to extensive research by psychometricians to ensure that examinees received scores that were fair and equitable. For example, multiple test forms were needed by these large assessment programs so that practice effects and cheating were not an issue. However, the use of multiple test forms creates the problem of placing all examinees' scores on the same scale; thus began the field of test equating.
CBT is having analogous growth today. This growth is creating new technical problems for psychometricians that must be solved for testing programs to operate smoothly. Issues include developing and calibrating large numbers of items, constructing item pools for operational tests, limiting exposure of items to ensure security, designing procedures for scoring tests, and selecting models for characterizing item responses. Furthermore, numerous basic issues have not been adequately studied, such as the conditions under which the fundamental unit of analysis should be an item or a set of items (i.e., a "testlet"). These issues were addressed in the colloquium; brief summaries are provided in the following discussion.
It is important to maintain one's perspective in the face of the myriad technical challenges to CAT and CBT. Computerization enables a much broader range of measurement advances than just adaptive administration of traditional multiple-choice items. Three avenues for such work are briefly discussed. The first is visualization. True color images can be presented with remarkable resolution. Moreover, it is possible to allow users to zoom in and out, as well as rotate objects in space. An example is Ackerman, Evans, Park, Tamassia, and Turner's (1999) dermatological disorder exam, which uses visualization to good effect. Audition provides a second improvement. Vispoel's (1999) work on assessing musical aptitude provides a fascinating example. Simulations of phone conversations, such as those made to call centers, can also be used to glean information about examinees. What might be called interaction constitutes the third area: Full motion video can be presented by the computer and thus it is possible to develop assessments of skills related to human interactions. Olson-Buchanan et al.'s (1998; Drasgow, Olson-Buchanan, & Moberg, 1999) Conflict Resolution Skills Assessment and Donovan, Drasgow, and Bergman's (1998) Leadership Skills Assessment provide two examples. Several other examples of innovative uses of computerization for measurement are presented in Drasgow and Olson-Buchanan (1999).
Issues for Today's CBTs
Many challenges confront a testing program as it implements and operates a CBT. These challenges, which constitute the raison d'être for this colloquium, are summarized in this section.
Developing New Items
One of the great virtues of a CBT is the opportunity for walk-in testing where examinees schedule the test at times convenient for them. Walk-in testing is viable when test construction designs reduce the possibility that one examinee can provide information to other examinees about test items they might encounter. To this end, adaptive item selection or adaptive testlet selection can be used.
Way, Steffen, and Anderson (chap. 7, this volume) provide a thoughtful analysis of item writing requirements for a high-stakes testing program. They describe various constraints on item usage that should enhance security. The bottom line of their analysis is that item writing requirements are even more substantial than one might guess.
Calibrating Items
Item response theory (IRT) is the psychometric theory underlying CAT. IRT uses one or more item parameters to characterize each item; estimates of these item parameters must be obtained before an item can be added to the item pool. Perhaps the most common approach to obtaining these estimates before implementing a CAT has been to administer items to sizable samples of examinees using paper-based tests. Parshall (chap. 6, this volume) notes problems with this method.
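To fix ideas, the notion of item parameters can be illustrated with the three-parameter logistic (3PL) model, in which each item is characterized by a discrimination (a), a difficulty (b), and a lower asymptote or "guessing" parameter (c). The sketch below is a minimal illustration, not any program's operational code, and the parameter values are hypothetical:

```python
import math

def p_correct(theta, a, b, c):
    """3PL item response function: the probability that an examinee
    with ability theta answers the item correctly."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

# Hypothetical item: moderately discriminating (a = 1.2),
# average difficulty (b = 0.0), some guessing (c = 0.2).
print(round(p_correct(0.0, 1.2, 0.0, 0.2), 3))  # at theta == b, P = c + (1 - c)/2 = 0.6
```

Calibration is the task of estimating a, b, and c for each item from examinee response data; the sample sizes Parshall discusses are what make those estimates stable enough for operational use.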
After a CAT is implemented, new items can be "seeded" into the operational form: New items are administered during the CAT but are not used to compute official test scores. Examinees would not ordinarily know which items are part of the operational test and which items are experimental. Sample size requirements and factors that influence these requirements are also described by Parshall.
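The bookkeeping behind seeding can be sketched simply: responses to seeded items are logged for later calibration but excluded from the official score. The item identifiers below are hypothetical, chosen only to make the split concrete:

```python
def split_responses(responses, seeded_ids):
    """Separate an examinee's responses into those that count toward
    the official score and those logged only for item calibration."""
    scored = {i: r for i, r in responses.items() if i not in seeded_ids}
    calibration = {i: r for i, r in responses.items() if i in seeded_ids}
    return scored, calibration

# Hypothetical response record: item id -> 1 (correct) / 0 (incorrect).
responses = {"op1": 1, "op2": 0, "exp9": 1}
scored, calib = split_responses(responses, seeded_ids={"exp9"})
print(sorted(scored))  # only operational items enter the reported score
print(sorted(calib))   # the seeded item feeds the calibration sample
```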
Several years ago, the Office of Naval Research funded four psychometricians to work on estimating parameters of seeded items. Both parametric approaches (Darrell Bock; Frederic Lord) and nonparametric approaches (Michael Levine; Fumiko Samejima) were investigated. Despite this initial work, relatively little has appeared in the published literature on this topic. Consequently, we are left with questions: What seeding schemes are effective? What estimation frameworks are superior? What sample sizes are required? What subtle problems can be encountered?
Changes in the Item Pool: Evolution Versus Cataclysm
Way et al. (chap. 7, this volume) describe the set of available items as the item vat and characterize restrictions on item selection for a specific item pool as docking rules. They use system dynamics modeling to examine CAT forms assembly given new items flowing into the item vat, old items being removed from the vat, and various docking rules.
In Way et al.âs approach, item pools are taken as intact entities. When an item pool becomes too old, it is replaced by a new item pool. Beyond test security, a critical concern is how new forms are equated to the reference form. Military researchers, for example, conducted careful studies to equate CAT-ASVAB Forms 1 and 2 to their reference form, paper-based ASVAB Form 8A.
A beautiful old hickory tree in my back lawn died recently. I thought about taking down the tree lumberjack style, but instead had an arborist remove it. Rather than felling the tree, he started near the top and cut it down branch by branch. His explanation was, "Small cuts make small mistakes." Way et al.'s treatment of item pools, which might be called item pool cataclysm, seems analogous to the lumberjack's approach; it may allow large equating errors to occur and seriously disrupt a testing program. An alternative would be item pool evolution where only a few items are replaced at a time. Items that are overexposed or items that show evidence of changing characteristics would be candidates. Would such small changes limit mistakes in equating to being small in magnitude? How many items could be replaced before an equating study is required? Alternatively, can item pool cataclysms be designed in ways that guarantee small equating errors? Moreover, is it necessary to replace entire item pools to deter cheating conspiracies? These are questions of critical significance for CAT programs, and they provide opportunities for future research.
Delivering, Scoring, and Modeling
What is presented to examinees affects how the assessment can be scored and drives the psychometric modeling of examinee behavior. Four chapters on this interrelated set of topics are presented in this book.
Test delivery refers to how items are selected for presentation to examinees. In CAT-ASVAB, for example, the computer adaptively selects individual items for administration. Folk and Smith (chap. 2, this volume) describe a number of alternatives to this approach. For example, branching might be based on carefully constructed testlets rather than on individual items. Testlets are particularly prominent in computerized mastery testing where it is essential to satisfy content specifications.
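Adaptive selection of individual items, as in CAT-ASVAB, is typically driven by item information: at the current ability estimate, the unadministered item providing the most information is presented next. The following sketch uses the two-parameter logistic model with hypothetical item parameters; it is an illustration of the general idea, not any program's actual selection rule:

```python
import math

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta:
    I(theta) = (1.7 * a)^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))
    return (1.7 * a) ** 2 * p * (1.0 - p)

def next_item(theta_hat, pool, administered):
    """Pick the unadministered item maximizing information at theta_hat."""
    candidates = [item for item in pool if item["id"] not in administered]
    return max(candidates, key=lambda i: info_2pl(theta_hat, i["a"], i["b"]))

# Hypothetical three-item pool; item 1 has already been administered.
pool = [{"id": 1, "a": 1.0, "b": -1.0},
        {"id": 2, "a": 1.5, "b": 0.1},
        {"id": 3, "a": 0.8, "b": 1.2}]
print(next_item(0.0, pool, administered={1})["id"])  # item 2: well-targeted and discriminating
```

Testlet-based delivery replaces the single-item step above with selection among preassembled sets of items, which is what makes it easier to enforce content specifications.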
Test delivery affects how a test is scored. Dodd and Fitzpatrick (chap. 11, this volume) first consider scoring CATs with standard item branching (although some of these scoring methods have wider applicability than this delivery method). They focus on scoring methods that may be easier to explain to examinees and other interested parties than the usual maximum likelihood or Bayesian estimates of ability. Dodd and Fitzpatrick then discuss automated scoring methods for performance-based tasks such as the architectural design exam for which Bejar (1991) developed an interesting scoring algorithm.
Scoring, in turn, can be affected by how examinee responses are modeled. Luecht and Clauser (chap. 3, this volume) argue that we should not let currently popular psychometric models drive the type of data that are collected. Instead, they suggest considering more detailed information about the examinee-stimulus material interaction such as response time or the sequence of responses made by an examinee. They note that current IRT models are useful for estimating overall proficiency in a domain, but evaluating qualitative states of learning is beyond their scope.
Continuing in this direction, Schnipke and Scrams (chap. 12, this volume) provide a detailed review of approaches to modeling response time. It seems likely a reasonably large set of models (or considerable flexibility within a given model) is necessary to accurately characterize items that vary in difficulty, processing requirements, and timing constraints. Schnipke and Scrams point out a critical need in this area of research: "No validity studies have been offered that address the utility of the resulting scores."
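One family in this literature treats the logarithm of response time as normally distributed around an item-specific time intensity, offset by the examinee's speed. The sketch below evaluates such a lognormal density; the parameterization and values are illustrative assumptions, not a reproduction of any particular chapter's model:

```python
import math

def lognormal_rt_density(t, alpha, beta, tau):
    """Density of response time t (seconds) under a lognormal model:
    alpha = time-discrimination, beta = item time intensity,
    tau = examinee speed (larger tau means a faster examinee)."""
    z = alpha * (math.log(t) - (beta - tau))
    return (alpha / (t * math.sqrt(2.0 * math.pi))) * math.exp(-0.5 * z * z)

# The density peaks near t = exp(beta - tau): slower items (large beta)
# and slower examinees (small tau) shift typical response times upward.
print(round(lognormal_rt_density(1.0, 1.0, 0.0, 0.0), 4))
```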
Item Exposure and Test Security
The convenience of walk-in testing creates a great potential for test compromise; conspirators have many opportunities to steal items. A situation in which a relatively small number of examinees conspire to raise their test scores is unfair to the vast majority of honest examinees, and so testing programs are obliged to take steps to maintain test security. High-stakes testing programs in particular must assume evil intentions on the part of some examinees.
Davey and Nering (chap. 8, this volume) describe methods for item exposure control. Such methods require a delicate balance: They must not let the most discriminating items be overexposed, yet they must maintain the efficiency of adaptive testing. It is probably true in high-stakes testing programs that security needs impose greater requirements on an item poolâs size than do psychometric needs. More research is needed on item exposure control and examineesâ proclivities for cheating.
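One well-known probabilistic approach to this balance is Sympson-Hetter-style exposure control, in which the psychometrically best candidate is administered only with an item-specific probability, otherwise deferring to the next-best candidate. The sketch below illustrates the idea with hypothetical control parameters and a deterministic random stream for clarity:

```python
import random

def select_with_exposure_control(ranked_candidates, k, rng=random.random):
    """Walk the candidates in order of psychometric desirability,
    administering item i with exposure-control probability k[i];
    if every lottery fails, fall back to the last candidate."""
    for item in ranked_candidates:
        if rng() <= k.get(item, 1.0):
            return item
    return ranked_candidates[-1]

# Hypothetical parameters: the best item is throttled to 25% exposure.
k = {"best": 0.25, "second": 0.9, "third": 1.0}
rng_vals = iter([0.8, 0.5])  # fixed draws so the example is reproducible
item = select_with_exposure_control(["best", "second", "third"], k,
                                    rng=lambda: next(rng_vals))
print(item)  # 'second': the best item failed its lottery (draw 0.8 > 0.25)
```

Tuning the k values is where the tension appears: small values protect the most discriminating items from overexposure but force the test to rely more often on less informative ones.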
Item and Test Disclosure
New York state mandates disclosure of educational tests. The legal status of the New York law is in some question (it has been ruled to be in conflict with United States copyright law), but the educational measurement community seems in reasonable agreement that some degree of disclosure is appropriate.
Item and test disclosure in a CAT setting presents a challenge. If the items presented to individual examinees are publicly disclosed, the item pool will be quickly depleted. Moreover, it will be impossible to write and pretest items fast enough to replenish the item pool. The National Council on Measurement in Education (NCME) convened an ad hoc committee to consider the issues involved; this committee was composed of members from educational test publishers, universities, and the public schools.
After review and debate, the NCME ad hoc committee recommended that testing programs provide the option of what they termed secure item review. Here, examinees would have the opportunity to review the items they were administered, their own answers, and the answers scored as correct. This review would take place at a testing site and would be monitored by a test proctor. Examinees would not be allowed to take notes, but some mechanism for challenging items they believe to be miskeyed would be allowed. The committee's recommendations were given in a white paper entitled Item and Test Disclosure for Computerized Adaptive Tests, which is contained in the Appendix to this chapter.