PART 2
RANDOM SAMPLING THEORY
CHAPTER 3
Classical Theory
The classical theory (Gulliksen, 1950) is the earliest theory of measurement. Despite the development of the more comprehensive and sophisticated generalizability and item response theories in the past two to three decades, the classical theory of measurement maintains a strong influence among testing and measurement practitioners today. With the exception of some large-scale testing projects, many tests in existence today continue to provide evidence of data quality based on the classical approach.
The classical theory is also referred to as the classical reliability theory because its major task is to estimate the reliability of the observed scores of a test. That is, it attempts to estimate the strength of the relationship between the observed score and the true score. It is also sometimes referred to as the true score theory because its theoretical derivations are based on a mathematical model known as the true score model.
THE TRUE SCORE MODEL
When a test is administered to an individual, the observed score represents the ability of that individual on that particular sample of items, administered on that particular occasion under a particular set of conditions. Many factors may affect the subject's performance. The subject might have performed differently had a different set of items on the same content area been used, or had the test been given at a different time or under a different set of personal and environmental conditions.
If we were able to administer the test to the same subject under all possible conditions at different times using different possible items, we would have many different observed scores for that subject. The mean of all these observed scores would be the most unbiased estimate of the subject's ability. This mean is defined as the true score.
The observed score from any single administration of a test with a particular sample of items is most likely different from this true score. This difference is called the random error score, or simply error. Mathematically, this relationship can be expressed as:

x = t + e    (3.1)

where x is the observed score, t is the true score, and e is the error score. An interesting and somewhat tautological derivation of the true score model in Equation 3.1 is that, in the long run, the expected error is zero. Specifically, if we use the symbol E to represent "the average of" or "the expected value of," then, in repeated administrations of the test:

E(x) = E(t + e) = E(t) + E(e)    (3.2)
Because E(x) is by definition true score t and E(t) is t, the expected e is zero. Therefore, although the observed score from a single administration of a test contains error, the average over many administrations of the test contains little error.
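This long-run behavior is easy to illustrate numerically. The sketch below is a hypothetical simulation, not part of the theory itself: the true score of 50 and the error standard deviation of 5 are assumed purely for illustration. It "administers" a test to the same subject many times and shows that the mean observed score converges on the true score, so the average error is near zero.

```python
import random

random.seed(7)

TRUE_SCORE = 50.0   # hypothetical subject's true score (assumed for illustration)
N_ADMIN = 100_000   # number of hypothetical repeated administrations

# Each administration yields x = t + e, with random error e centered at zero.
observed = [TRUE_SCORE + random.gauss(0.0, 5.0) for _ in range(N_ADMIN)]

mean_observed = sum(observed) / N_ADMIN
mean_error = mean_observed - TRUE_SCORE   # average e over all administrations

print(round(mean_observed, 1))   # close to 50.0: the errors average out
```

Any single entry in `observed` may miss the true score by several points, but the mean over many administrations does not, which is exactly the sense in which E(e) = 0.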
RELIABILITY ESTIMATION
Reliability is the strength of the relationship between the observed score and the true score. This can be expressed as the Pearson's correlation between the observed score x and the true score t; that is, ρxt. This correlation is referred to as the reliability index (Crocker & Algina, 1986). The stronger the relationship, the better x reflects t. If this relationship is very strong, as indicated by a high Pearson's r, one can view x as a linear transformation of t. That is, x is essentially t expressed on a different scale. Unfortunately, we cannot estimate ρxt directly from observed data because t values are unknown. However, it is possible to estimate the squared value of ρxt.
ASSUMPTION OF INDEPENDENCE
If we were to use the italicized t to represent the true score expressed as a deviation from its mean, x to represent the observed score expressed as a deviation from its mean, and e to represent the error score expressed as a deviation from its mean, then Et² is the variance of t, or true score variance or simply true variance; Ex² is the variance of x, or observed score variance or simply observed variance; Ee² is the variance of e, or error score variance or simply error variance; Ext is the covariance between x and t; and Ete is the covariance between t and e. Because the Pearson's r between X and Y is:

ρXY = E(XY) / √(EX² · EY²)

ρxt can be expressed as:

ρxt = Ext / √(Ex² · Et²)    (3.3)
Given the true score model x = t + e, Equation 3.3 can be rewritten as:

ρxt = E[(t + e)t] / √(Ex² · Et²) = (Et² + Ete) / √(Ex² · Et²)    (3.4)
An assumption can be made that the true score is unrelated to the error score; that is, the amount of error made at any particular single administration of a test to a subject is independent of the true score for that subject. This is referred to as the assumption of independence. This assumption suggests that Ete = 0; that is, the covariance between t and e is zero. Given this assumption, the square of the reliability index ρxt as expressed in Equation 3.4 becomes:

ρ²xt = (Et²)² / (Ex² · Et²) = Et² / Ex²    (3.5)
In other words, the square of the reliability index is the proportion of observed variance that is true variance. This squared reliability index is referred to as the reliability coefficient. Although it is not possible to estimate ρxt directly from observed data, it is possible to estimate ρ²xt when a particular set of assumptions, known as the parallel tests assumptions, are met.
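A small simulation can make this variance-ratio interpretation concrete. In the sketch below, the true and error variances are hypothetical choices for illustration; observed scores are generated from the true score model, and the squared correlation between x and t comes out close to the proportion of observed variance that is true variance.

```python
import random

random.seed(42)

N = 200_000
VAR_T, VAR_E = 9.0, 4.0   # hypothetical true-score and error variances

t = [random.gauss(0.0, VAR_T ** 0.5) for _ in range(N)]
e = [random.gauss(0.0, VAR_E ** 0.5) for _ in range(N)]
x = [ti + ei for ti, ei in zip(t, e)]   # true score model: x = t + e

def pearson(a, b):
    """Pearson's r computed from deviation scores."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n
    va = sum((ai - ma) ** 2 for ai in a) / n
    vb = sum((bi - mb) ** 2 for bi in b) / n
    return cov / (va * vb) ** 0.5

rho_xt_sq = pearson(x, t) ** 2        # squared reliability index
ratio = VAR_T / (VAR_T + VAR_E)       # true variance / observed variance

print(round(rho_xt_sq, 2), round(ratio, 2))
```

Of course, this works only because the simulation generates t directly; with real data t is unknown, which is precisely why the parallel tests assumptions below are needed.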
PARALLEL TESTS ASSUMPTIONS
If two tests, A and B, designed to measure the same ability, are both given to the same group of subjects, the true score t for each subject remains the same on both tests. With xA = t + eA and xB = t + eB, the Pearson's r between the two sets of observed scores becomes:

ρAB = E(xAxB) / √(ExA² · ExB²) = (Et² + EteB + EteA + EeAeB) / √(ExA² · ExB²)    (3.6)
Given the assumption of independence, the second and third terms of the numerator in Equation 3.6 become zero and drop out of the equation. Hence:

ρAB = (Et² + EeAeB) / √(ExA² · ExB²)    (3.7)
This correlation can be used to estimate the reliability coefficient ρ²xt if we assume that the two tests, A and B, meet the parallel tests assumptions. The parallel tests assumptions refer to a set of assumed mathematical relationships between tests A and B. A complete set of these assumptions can be found in Nunnally (1978), and detailed derivations and proofs of these assumptions can be found in Lord and Novick (1968).

Of particular relevance to our discussion here are two specific assumptions: (a) scores on Tests A and B have the same variance, or ExA² = ExB², and (b) the errors in Tests A and B are mutually independent, or EeAeB = 0. Given these two assumptions, the second term in the numerator of Equation 3.7 becomes zero and drops out of the equation. Further, the denominator can be written as a general observed variance Ex². Equation 3.7 becomes:

ρAB = Et² / Ex² = ρ²xt    (3.8)
In other words, if we can identify two tests that can be assumed to meet the parallel tests assumptions, the Pearson's r between the observed scores on the two tests becomes the squared correlation between the observed and the true score.
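The parallel tests result can be illustrated with the same kind of hypothetical simulation (the variances are again assumed for illustration). Both tests share identical true scores but draw independent errors of equal variance, and the Pearson's r between the two observed-score sets approximates the proportion of observed variance that is true variance.

```python
import random

random.seed(1)

N = 200_000
VAR_T, VAR_E = 9.0, 4.0   # hypothetical true and error variances

def pearson(a, b):
    """Pearson's r computed from deviation scores."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n
    va = sum((ai - ma) ** 2 for ai in a) / n
    vb = sum((bi - mb) ** 2 for bi in b) / n
    return cov / (va * vb) ** 0.5

# Parallel tests: identical true scores, independent errors, equal error variance.
t = [random.gauss(0.0, VAR_T ** 0.5) for _ in range(N)]
xa = [ti + random.gauss(0.0, VAR_E ** 0.5) for ti in t]
xb = [ti + random.gauss(0.0, VAR_E ** 0.5) for ti in t]

rho_ab = pearson(xa, xb)                  # correlation between the parallel tests
reliability = VAR_T / (VAR_T + VAR_E)     # true variance / observed variance

print(round(rho_ab, 2), round(reliability, 2))
```

Unlike the previous sketch, nothing here requires knowing t once the data are generated: the correlation is computed from the two observed-score columns alone, which is what makes this estimator usable in practice.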
It is important to point out that when a Pearson's r between two parallel tests is used to estimate the reliability coefficient of either of the two essentially interchangea...