A Handbook of Test Construction (Psychology Revivals)

Introduction to Psychometric Design

eBook - ePub

  1. 260 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About This Book

Psychological tests provide reliable and objective standards by which individuals can be evaluated in education and employment. Accurate judgements therefore depend on the reliability and quality of the tests themselves. Originally published in 1986, this handbook by an internationally acknowledged expert provided an introductory and comprehensive treatment of the business of constructing good tests.

Paul Kline shows how to construct a test and then how to check that it is working well. Covering most kinds of tests, including computer-presented tests of the time, Rasch scaling and tailored testing, this title offers: a clear introduction to this complex field; a glossary of specialist terms; an explanation of the objective of reliability; step-by-step guidance through the statistical procedures; a description of the techniques used in constructing and standardizing tests; guidelines with examples for writing the test items; and computer programs for many of the techniques.

Although computer testing will inevitably have moved on, students on courses in occupational, educational and clinical psychology, as well as in psychological testing itself, will still find this a valuable source of information, guidance and clear explanation.


Information

Publisher: Routledge
Year: 2015
ISBN: 9781317444596
Edition: 1

1 The characteristics of good tests in psychology
A psychological test may justly be described as a good test if it has certain characteristics. It should be at least an interval scale; it should be reliable, valid and discriminating; and it should either have good norms, fit a Rasch or similar model with high precision, or be expertly tailored to its subjects.
In this handbook I intend to demonstrate how these characteristics can be built into tests by sound and creative test construction. Before this can be done, however, it will be necessary to discuss and define all those terms which must be thoroughly understood if tests are not only to be properly constructed but properly used.
However, there is, as it were, a prior reason for requiring psychological tests to possess these characteristics. This is to improve the precision and accuracy of measurement. These qualities are themselves desirable because such measurement is a sine qua non of science. In the natural sciences progress has depended upon the development of good measures and, in my view, psychology is no exception to this rule. In brief, each of the characteristics which I shall describe below contributes to psychometric efficiency.
Types of scale
There are a number of levels of scales, hierarchically ordered. These are, beginning with the simplest, as follows:
(1) Nominal. This simply classifies subjects: male/female is a nominal classification.
(2) Ordinal. Here subjects are ranked, as by weight or height. This is clearly crude because differences between ranks are ignored.
(3) Interval. Here the differences between scale points are equal at all points of the scale. Equal-interval scales can be linearly transformed, allowing scores to be expressed on common scales and thus compared directly. Further, many statistical procedures assume an interval scale of measurement.
(4) Ratio scale. Ratio scales in addition have a meaningful zero point. This is clearly a problem for most psychological variables, although there are methods of test construction which allow for this possibility.
An examination of these four types of scale reveals clearly that, ideally, psychological test constructors should aim to produce ratio scales. Failing that, interval scales are desirable if the results are to be subjected to any form of statistical analysis. Since the study of the validity of tests almost inevitably involves such analysis, and since it is from the quantification of scores that psychological tests derive their advantages over other forms of assessment, the conclusion is obvious: nothing less than interval scales will do. In fact, as Brown (1976) points out, most psychometric tests approximate interval scales, and treating test scores as if they were interval scales produces useful results.
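To make the point about linear transformation concrete, here is a minimal Python sketch (not part of the original text; the raw scores are invented) which converts raw scores on an interval scale to z-scores and then to T-scores, a common standard scale with mean 50 and standard deviation 10, so that scores from different tests can be placed on a common footing.

import statistics

# Minimal sketch with invented raw scores: a linear transformation of
# interval-scale scores to z-scores and then to T-scores (mean 50, SD 10).
raw_scores = [12, 15, 9, 20, 17, 14, 11, 18]

mean = statistics.mean(raw_scores)
sd = statistics.stdev(raw_scores)                  # sample standard deviation

z_scores = [(x - mean) / sd for x in raw_scores]   # mean 0, SD 1
t_scores = [50 + 10 * z for z in z_scores]         # mean 50, SD 10

print([round(z, 2) for z in z_scores])
print([round(t, 1) for t in t_scores])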
Reliability
In psychometrics, reliability has two meanings. A test is said to be reliable if it is self-consistent. It is also said to be reliable if it yields the same score for each subject (given that subjects have not changed) on retesting. This reliability over time is known as test–retest reliability.
The meaning and importance of internal-consistency reliability
Psychometrists are eager to develop tests which are highly self-consistent, for the obvious reason that if part of a test is measuring a variable, then the other parts, if not consistent with it, cannot be measuring that variable. Thus it would appear that for a test to be valid (i.e. measure what it claims to measure), it must be consistent; hence the psychometric emphasis on internal-consistency reliability. Indeed, the general psychometric view is exactly this, that high reliability is a prerequisite of validity (e.g. Guilford, 1956; Nunnally, 1978). The only dissenting voice of any note is that of Cattell (e.g. Cattell and Kline, 1977). Cattell argues that high internal consistency is actually antithetical to validity, on the grounds that any single item must cover less ground, or be narrower, than the criterion we are trying to measure. Thus, if all items are highly consistent, they are also highly correlated, and hence a reliable test will only measure a narrow variable of little variance. As support for this argument it must be noted (1) that Cronbach's alpha does indeed increase with the item intercorrelations, and (2) that in any multivariate predictive study the maximum multiple correlation between tests and the criterion (in the case of tests, items and the total score) is obtained when the variables are uncorrelated. This is obviously the case, for if two variables were perfectly correlated, one would be providing no new information. Thus maximum validity, in Cattell's argument, is obtained where test items do not all correlate with each other, but where each correlates positively with the criterion. Such a test would have only low internal-consistency reliability. In my view, Cattell is theoretically correct. However, to my knowledge no test constructor has managed to write items that, while correlating with the criterion, do not correlate with each other. Barrett and Kline (1982) have examined Cattell's own personality test, the 16 PF test, where such an attempt has been made, but it appears not to be entirely successful. Despite these comments, the psychometric claim generally holds: in practice, valid tests are highly consistent.
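Because the argument above turns on the behaviour of Cronbach's alpha, the following Python sketch (not the book's program; the item scores are invented) computes coefficient alpha from its standard formula, alpha = (k/(k-1)) * (1 - sum of item variances / variance of total scores), for a small matrix of item responses. The more highly the items intercorrelate, the smaller the ratio of summed item variances to total-score variance, and the higher alpha becomes.

import statistics

# Minimal sketch with invented data: Cronbach's coefficient alpha for a
# small matrix of item scores (rows = subjects, columns = items).
item_scores = [
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]

k = len(item_scores[0])                          # number of items
items = list(zip(*item_scores))                  # transpose: one tuple per item
item_variances = [statistics.variance(item) for item in items]
total_scores = [sum(subject) for subject in item_scores]
total_variance = statistics.variance(total_scores)

# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
alpha = (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
print(round(alpha, 3))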
Test–retest reliability
Test–retest reliability is obviously essential. If a test fails to yield the same score for a subject (given that they have not changed) on different occasions, all cannot be well. The measurement of test–retest reliability is essentially simple. The scores from a set of subjects tested on two occasions are correlated. The minimum satisfactory figure for test reliability is 0.7. Below this, as Guilford (1956) points out, a test becomes unsatisfactory for use with individuals because the standard error of an obtained score becomes so large that interpretation of scores is dubious. The meaning and implications of this standard error of score are discussed later in this chapter, when I examine what has been called the classical model of test error (Nunnally, 1978), which is implicit in this discussion of reliability.
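The computation just described can be illustrated with a brief Python sketch (the scores are invented): the test–retest coefficient is simply the Pearson correlation between the two occasions, and the standard error of an obtained score, on the classical model referred to above, is SD * sqrt(1 - r), which grows as the reliability falls.

import math
import statistics

# Minimal sketch with invented scores: test-retest reliability as the
# Pearson correlation between two testing occasions, plus the classical
# standard error of measurement, SEM = SD * sqrt(1 - r).
occasion_1 = [23, 31, 28, 35, 19, 27, 30, 25]
occasion_2 = [25, 30, 27, 36, 20, 26, 31, 24]

r = statistics.correlation(occasion_1, occasion_2)   # Pearson r (Python 3.10+)
sd = statistics.stdev(occasion_1)
sem = sd * math.sqrt(1 - r)

print(round(r, 3), round(sem, 2))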
Although test–retest reliability is simple to compute, care must be taken not to raise it artefactually by having the sessions close together, and samples must be representative of the population for whom the test is intended.
Finally, in this connection I must mention parallel-form reliability. Here equivalent or parallel sets of items are constructed. Thus subjects take an entirely different test on subsequent occasions. However, there are difficulties here in demonstrating that the two forms are truly equivalent. Nevertheless, in practice parallel forms of test are found to be useful.
Validity
I shall now briefly examine the nature of validity, the second major characteristic of good tests. As with the treatment of reliability, the aim in this chapter is to enable readers to grasp the concept sufficiently to understand the problems of test construction with validity as the target. The actual methods of establishing validity will be fully presented later in the book.
A test is valid if it measures what it claims to measure. However, this does not sufficiently explicate the meaning of validity. Instead it raises the new question of how we know whether a test measures what it claims. In fact, there is a variety of ways of demonstrating test validity, and each contributes facets of its meaning. These are set out below:
Face validity
A test is said to be face valid if it appears to measure what it purports to measure, especially to subjects. Face validity bears no relation to true validity and is important only in so far as adults will generally not co-operate on tests that lack face validity, regarding them as silly and insulting. Children, used to school, are not quite so fussy. Face validity, then, serves simply to aid the co-operation of subjects.
Concurrent validity
This is assessed by correlating the test with other tests. Thus, if we are trying to establish the concurrent validity of an intelligence test, we would correlate it with other tests known to be valid measures. This example clearly illustrates the horns of the dilemma of concurrent validity. If there is already another valid test, good enough to act as a criterion, the new test, to be validated, may be somewhat otiose. Indeed, it will be so unless it has some valuable feature not possessed by other valid tests. Thus, if it were very short, easy to administer, quick to score, or particularly enjoyed by subjects, this would certainly justify the creation of a new test where other criterion tests exist. On the other hand, where no good criterion tests exist, where the new test breaks fresh ground, then clearly concurrent validity studies become difficult.
Sometimes, where no criterion tests exist, we can attempt to use ratings. Here, however, there are severe problems. The validity of the ratings may well be questioned and, in addition, if ratings are possible, there may be little need for a test.
Generally, concurrent validity is useful in that there are often poor tests of the same variable which the new test attempts to improve on. In cases such as these, concurrent-validity studies would be expected to yield significant but modest correlations. Clearly, though, concurrent validity is not an entirely satisfactory aspect of validity. To accept a test as valid we would need further and different evidence in addition to studies of concurrent validity. It is also useful to establish what the test does not measure: the test should have no correlation with tests measuring quite different variables.
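By way of illustration, the Python sketch below (the scores and tests are invented) correlates a new test both with an established test of the same variable, where a substantial positive correlation would support concurrent validity, and with a test of a quite different variable, where the correlation should be near zero.

import statistics

# Minimal sketch with invented scores: concurrent validity as the correlation
# of a new test with an established measure of the same variable, plus a check
# on what the test does NOT measure via a test of a quite different variable.
new_test       = [14, 22, 19, 27, 11, 25, 17, 30]
criterion_test = [15, 20, 21, 26, 13, 24, 18, 28]   # established test of the same variable
unrelated_test = [51, 48, 53, 52, 49, 47, 50, 50]   # test of a quite different variable

r_concurrent   = statistics.correlation(new_test, criterion_test)   # expected: high, positive (Python 3.10+)
r_discriminant = statistics.correlation(new_test, unrelated_test)   # expected: near zero

print(round(r_concurrent, 3), round(r_discriminant, 3))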
Predictive validity
To establish the predictive validity of a test, correlations are obtained between the test given on one occasion and some later criterion. The predictive validity of an intelligence test can be demonstrated, for example, by correlating scores at age 11 with performance at 16 years of age at 'O' level or even university degree classes. Many psychometrists (e.g. Cronbach, 1970) regard predictive validity as the most convincing evidence for the efficiency of a test.
A major difficulty with this approach to test validity lies in establishing a meaningful criterion. In the case of intelligence tests it makes sense, given our concept of intelligence, to use future academic success or even money earned in jobs. However, since there are clearly variables other than intelligence related to these criteria, such as persistence and the ability to get on with people, together with more random factors (good teaching and vacancies for jobs at the right time), correlations with intelligence test scores could be expected to be only moderate. Furthermore, intelligence is perhaps the easiest variable for which predictive validity studies can be designed. Neuroticism or anxiety also lend themselves to predictive-validi...

Table of contents

  1. Cover
  2. Half Title
  3. Title Page
  4. Copyright Page
  5. Original Copyright Page
  6. Table of Contents
  7. Preface
  8. Glossary of terms
  9. 1 The characteristics of good tests in psychology
  10. 2 Making tests reliable I: Intelligence and ability. Item writing
  11. 3 Making tests reliable II: Personality inventories. Item writing
  12. 4 Making tests reliable III: Constructing other types of test
  13. 5 Computing test-reliability
  14. 6 Item trials
  15. 7 Computing the discriminatory power and the validity of tests
  16. 8 Standardizing the test
  17. 9 Other methods of test construction
  18. 10 Computerized testing, tailored testing, Rasch scaling and cognitive process studies
  19. 11 Summary and conclusions
  20. Appendix 1: Item-analysis programs
  21. Appendix 2: Using the programs
  22. Bibliography
  23. Name index
  24. Subject index