1
Assessing assessment
It is possible that intelligent tadpoles reconcile themselves to the inconveniences of their position, by reflecting that, though most of them will live and die as tadpoles and nothing more, the more fortunate of the species will one day shed their tails, distend their mouths and stomachs, hop nimbly on to dry land, and croak addresses to their former friends on the virtues by means of which tadpoles of character and capacity can rise to be frogs.
(The Tadpole Philosophy, R. H. Tawney, 1951)
Helping a few tadpoles to become frogs has been, from the Chinese Civil Service selection examinations a thousand years ago through to selective university entrance today, one of the key historical roles of assessment. And, across the years, those who were selected, and went on to occupy positions of power, have indeed croaked loudly about the power of assessment to identify ability and merit.
But formal assessments have played other historical roles too: establishing authenticity in pre-scientific times; certificating occupational competence through the guilds and professions; identifying learners in need of special schooling or provision; and serving as an accountability tool to judge the effectiveness of institutions.
The intention of this chapter is to make explicit some of our historically embedded assumptions about formal assessments. These taken-for-granted understandings are, in this case, largely the product of British culture and history – one that has impacted on many other cultures. The questions then become whether the original rationales still hold – and what we have learned since then. For example, the appeal of examinations has always been their fairness and their promise of meritocratic selection. What history reminds us is that, while they were certainly fairer than the patronage they replaced, they also reflected social and class assumptions about merit and ability. These social assumptions automatically excluded women from ‘open’ examinations until late into the nineteenth century, and ‘protected’ most British working-class children from examinations until the mid-twentieth century (so they would never leave the pond by this route). We will see in chapter two how similar cultural assumptions, including those of racial superiority, played a part in the development of intelligence testing in Britain and the USA.
Any evaluation of the uses and impacts of assessment has to begin with purpose; without knowing this we cannot judge whether the assessment has done what it set out to do. So I will begin with a classification of some key purposes, before moving on to see how they have been expressed historically.
Purposes
The three best questions to ask of any assessment are:
- What is the principal purpose of this assessment?
- Is the form of the assessment fit-for-purpose?
- Does it achieve its purpose?
Their simplicity is deceptive, since what lurks within them are the major theoretical issues of validity and reliability, and the spectre of unintended consequences. The first question implies there may be multiple, and sometimes competing, purposes. Fitness-for-purpose is concerned with how appropriate the form of assessment is. We do not want somebody to be given a driving licence solely on the basis of a theory test. The third question is about the impact of the assessment. This is not just about whether it does what it claims to do, but what the consequences are for the test-takers and others.
With these three questions, we will interrogate some current assessment practices. These are put in a historical context, since what we may take as self-evident may not always have been so, and it may be a consequence of an uncritical acceptance of a cultural legacy. For example, in England, the tradition of written examinations came out of the elite universities and spread to the professions and secondary schools. How did this shape what and how we examine today? Why is this different from the ‘multiple-choice’ tradition in the USA and other countries? This historical approach also illustrates that some of our contemporary concerns, for example the emphasis on using tests for accountability purposes, have precedents. The intention here is to provide a context in which to understand better where we are now – and how we got here.
The principal purpose
This is essentially a ‘why’ question. Why are we seeking this information? It is usually not difficult to come up with an answer for this, although some are feebler than others. Customer- or staff-satisfaction surveys may be a case in point: we are asked to fill them in even though we have no confidence that anything will come of them, since nothing seems to have happened as a consequence of previous ones. A robust examination of the purpose may lead us to conclude that this is less about finding out in order to improve, and more about complying with a requirement to consult the customer.
If there is a single purpose, then the question may be relatively easy to answer. ‘What is the principal purpose of the driving test?’ is hardly a brain teaser, although we could get deep and sociological about it. It is when multiple purposes develop, or a purpose mutates, that the question becomes more telling. Where there are multiple purposes, their balance is often in flux. A useful image here is that of foreground/background. The assumption is that several purposes are present, but there may be shifts which bring one into the foreground while another fades into the background. This all takes place within a social framework.
Take, for example, examination results at the end of compulsory schooling. The original purpose was as a means of both certification and progression for the individual student. In England, as in many other countries, good examination grades allowed progression to the next level of academic study, or the opening up of other employment avenues. UK readers of a certain age will remember local newspapers reporting results individually – so that the neighbours could see how you had done – but there was little comment about overall school performance. By the 1990s, the legal requirement for schools to publish their results, in increasingly standardised formats, began to accentuate the importance of the proportion of students with five grades A–C in the GCSE.1 This managerial reporting to parents has since hardened into a full-blown accountability system. This involves national performance tables to rank schools and local education authorities, based on the percentage of students gaining five GCSE grades A*–C, as well as monitoring whether national targets are being met.
This leads to my Principle of Managerial Creep: As assessment purposes multiply, the more managerial the purpose, the more dominant its role. ‘Managerial’ here is used to cover monitoring and accountability purposes. These concerns are essentially systemic, and social control looms large. So, while individual certification is still important to those who take the exams, in England it is the percentage of the students who got five GCSE A*–C grades that has now moved to the foreground. And, as we shall see later, schools may get up to all sorts of cynical strategies to maximise their percentages, with educational values sometimes subordinated to meeting targets. These concerns echo across all the sectors driven by targets, from hospital waiting lists (take the easy patients first) to train operators (timetable the journeys to take longer, so that punctuality improves). We will return to these accountability systems in chapter six.
Not all assessments are primarily managerial in purpose. Assessments of occupational competency may be seen as primarily individual, although the ‘licence to practise’ is socially regulated – for example, medical training has always sought to restrict the number entering and qualifying, in order to maintain its high status. My 20-metres swimming certificate is essentially personal, at least until the government decides that everybody must be able to swim 20 metres.
Some assessments may have a mainly professional purpose. Classroom assessments may be as much for the teacher to determine where a class is in its learning, as for the individual students. The assumption is that professional purposes will feed back into the teaching and learning processes rather than into bureaucratic monitoring. Formative assessment (‘assessment for learning’) is an example of this that I will develop in chapter seven. The assumption here is that the sole purpose of the informal assessments involved is to lead to further learning, and that any managerial use of these assessments would be inappropriate. The assessment of Learning Styles and Multiple Intelligences can also be seen as part of a professional assessment to aid the teaching and learning process.
To organise the varied uses of assessments, I have opted for three broad groupings, which reflect conventional classifications:
- selection and certification;
- determining and raising standards;
- formative assessment – assessment for learning.
These overlap massively, given that assessments serve several purposes. For example, if university selection is based on a school examination, then this examination will govern the curriculum and how it is taught. ‘Determining standards’ implies both what is taught and the level of performance that will be expected from students.
The inclusion of formative assessment as a separate purpose also raises problems of overlap. Why is it not just part of raising standards? The justification for separate treatment is that there is more to learning than examination grades, yet, for policy makers, ‘raising standards’ is often simply about better grades (see chapter six). Each of these broad groupings shelters a range of more specific purposes. These are set out in Table 1.1.
Origins
The intention in this section is to show how the selection and standards functions of assessment have historical pedigrees. It may also illuminate why we accept examinations as a natural, and therefore not-to-be-questioned, part of education. In line with the book’s argument, assessments have been used not only to shape the individual’s identity, but also to define the status of professions and schools. There has been a common belief that examinations are necessarily fair, even though most of the population used to be excluded from them, and that they can reveal underlying ability. It is this Victorian legacy which now seems so self-evident that it often goes unquestioned.
The unorthodox starting place for this historical review is the authenticity testing of folklore and myth, which here takes priority over the pride of place usually reserved for the Chinese civil-service selection examinations which have been in use for over a thousand years. In Britain, it was the universities that first introduced examinations to improve standards, and these then were introduced into the professions (by successful graduates) and subsequently ‘trickled down’ to secondary schools and then to primary schools. This distinctive role of the universities may, in comparison with the state-organised systems of nineteenth-century France and Prussia, account for some of the features of curriculum and assessment that UK readers take for granted (but others might question). These include the school curriculum, and the uneasy relationship of the academic with the vocational. Contemporary debates about the impact of using assessments for accountability purposes echo those of the nineteenth century.
Table 1.1 Assessment purposes
Identity and innocence
This unlikely departure point is the result of accepting anthropologist Allan Hanson’s claim that folklore shows that one of the purposes of tests in pre-scientific communities was to establish authenticity and innocence. This would be an unnecessary digression if it were not the case that this purpose is still around in the form of lie detectors, random drug testing and personality testing. While Hanson devotes most of his Testing Testing: Social Consequences of the Examined Life to these, this line of reasoning will only be briefly summarised here.
Authenticity tests included confirming identity and determining character, for example, King Arthur pulling the sword out of the rock, and the story of the princess and the pea (being of such royal sensibilities, she noticed the pea through her mattress, confirming her true identity). Dunking witches also established identity. Whether or not an individual warrants trust is the basis of stories about riddle-solving and heroic acts. King Solomon’s test of true motherhood, based on observing reactions to the suggestion of a 50–50 split of the child, is typical of these.
Tests of honesty, guilt and innocence were often used to help to decide legal cases. Trial by battle was one approach, with the assumption that victory would go to whoever was in the right. The assumption was that God was on the side of the just. The biblical story of David and Goliath represented such a stunning victory for the underdog that this was clear evidence of divine support. (If Goliath had won, then a very different explanation might have been in order – he was much bigger and better-armed after all.) The other approach was to ‘level the playing field’ – an assessment aspiration to which we will repeatedly return – so that those involved could not predict who would win and therefore victory was seen as expressing divine judgement. A Pythonesque extreme of this was the German law that ‘levelled’ the combat between a man and a woman:
The chances … were adjusted by burying the man to his waist, tying his left hand behind his back, and arming him only with a mace, while his fair opponent had the free use of her limbs and was provided with a heavy stone securely fastened with a piece of stuff.2
However, the ‘levelling’ between combatants from different social classes was a very different matter – another theme to which we will return. For example, if a noble met a commoner in judicial combat in France, then the former might enjoy the right of fighting on horseback with knightly weapons, while the commoner had to fight on foot using a shield and staff. He really would need God on his side.
Trial by ordeal might, for many, feel like the precursor of examinations or interviews. Historically, some of the favourite assessment instruments were hot and cold water and the hot iron. The best-known form of this was ‘swimming the witch’, a test used well into the seventeenth century, when the famous witch-hunter Matthew Hopkins practised in England. His technique was that the suspect would be tied, right thumb to left big toe, and then lowered into the water by means of a rope tied around the waist. The test was repeated three times, and if the individual floated, then it was proof of witchcraft. The logic has a no-win feel: if you were innocent then this would be signified by God through you sinking. The guilty float because the pure nature of water does not receive the deceitful. If we need any reminding of the social construction of such an interpretation, in southwest Germany during the same period it was the innocent who floated and the guilty who sank. God moves in mysterious, and regional, ways.
Before we distance ourselves too quickly from this suspect logic, it is worth reflecting on whether we have secular ‘soft’ equivalents now. Hanson argues that the current use of lie detectors shares many of these features. This time it is science rather than God that reveals what guilty individuals may wish to hide. It does not do this directly: the polygraph reading assumes a causal chain of prevarication leading to anxiety which leads to a measurable physiological response. There is a similar confidence that the lie detector tells the truth.
A good fictional example of these features occurs in the TV comedy Desperate Housewives, when Bree insists on being tested in front of her children, who have been suspicious about their mother after her husband Rex’s surprise ‘natural’ death. When asked whether she killed her husband, the line remains flat (innocent). However, when the interrogator then asks if she is in love with another man, the polygraph spikes – despite her protestations. Bree, never in touch with her feelings at the best of times, accepts that she must be, and therefore welcomes the advances of George. George, who did kill Rex, takes the lie detector test and passes – so we realise that he is a psychopath who shows no emotion or guilt.
This example will not bear too much interpretation other than to illustrate the same principles that drove pre-scientific authenticity tests: that we could get at the truth by going beyond an individual’s claims and behaviour. The deceit makes us spike rather than sink – a causal chain which assumes a particular body–mind relationship. It also represents a clear power relationship, where the interrogator is an operative on behalf of society and the justice system.
The use of authenticity tests will not be further developed in this book, although they are part of modern life.3 They do, however, share some features of the forms of assessment that will be considered:
- They involve the application of power. This resides in those who conduct the tests, but they in turn represent the social system of which they are part.
- They are concerned with socially constructed reality rather than with some independently existing reality (for example, nobility).
- They help to form the constructs that they measure. They can even create what it is they then claim to measure (for example, witches) – an argument to be developed in relation to intelligence testing.
Selection by merit
This is the conventional historical starting point. The purpose of formal written and practical assessments was to select individuals on the basis of merit rather than birth. This remains one of the permanent appeals of testing. Historical honours go to the Chinese civil-service selection tests.4 While, as early as the Chou dynasty (c.1122–256 BC), there were tests to identify the talented amongst the common people, it was the Sung dynasty (AD 960–1279) that opened the examination to nearly every male, and it became the passport to power and prestige. The exclusions included slave...