What every buyer should know about forms of assessment

In this post, I'll be describing and comparing three basic forms of assessment—surveys, tests of factual and procedural knowledge, and performative tests.

Surveys—measures of perception, preference, or opinion

What is a survey? A survey (a.k.a. inventory) is any assessment that asks the test-taker to choose from a set of options, such as "strongly agree" or "strongly disagree", based on opinion, preference, or perception. Surveys can be used by organizations in several ways. For example, opinion surveys can help maintain employee satisfaction by providing a "safe" way to express dissatisfaction before workplace problems have a chance to escalate.

Surveys have been used by organizations in a variety of ways. Just about everyone who's worked for a large organization has completed a personality inventory as part of a team-building exercise. The results stimulate lots of water cooler discussions about which "type" or "color" employees are, but their impact on employee performance is unclear. (Fair warning: I'm notorious for my discomfort with typologies!) Some personality inventories are even used in high stakes hiring and promotion decisions, a practice that continues despite evidence that they are very poor predictors of employee success [1].

Most survey developers don't pretend their assessments measure competence, but many do. The item on the left was used in a survey with the words "management skills" in its title.

Claims that surveys measure competence are most common when "malleable traits"—traits that are subject to change, learning or growth—are targeted. One example of a malleable trait is "EQ" or "emotional intelligence". EQ is viewed as a skill that can be developed, and there are several surveys that purport to measure its development. What they actually measure is attitude.

Another example of surveys masquerading as assessments of skill is in the measurement of "transformational learning". Transformational learning is defined as a learning experience that fundamentally changes the way a person understands something, yet the only way it appears to be measured is with surveys. Transformational learning surveys measure people's perceptions of their learning experience, not how much they are actually changed by it.

The only survey-type assessments that can be said to measure something like skill are assessments—such as 360s—that ask respondents about their perceptions of another person's behavior. Although 360s inadvertently measure other things, like how much a person is liked or whether or not a respondent agrees with that person, they may also document evidence of behavior change. If behavior change is what you're interested in, a 360 may be appropriate in some cases, but it's important to keep in mind that while a 360 may measure change in a target's behavior, it's also likely to measure change in a respondent's attitude that's unrelated to the target's behavior.

360-type assessments may, to some extent, serve as tests of competence, because behavior change may be an indication that someone has learned new skills. When an assessment measures something that might be an indicator of something else, it is said to measure a proxy. A good 360 may measure a proxy (perceptions of behavior) for a skill (competence).

There are literally hundreds of research articles that document the limitations of surveys, but I'll mention only one more of them here: All of the survey types I've discussed are vulnerable to "gaming"—smart people can easily figure out what the most desirable answers are.

Surveys are extremely popular today because, relative to assessments of skill, they are inexpensive to develop and cost almost nothing to administer. Lectica gives away several high quality surveys for free because they are so inexpensive, yet organizations spend millions of dollars every year on surveys, many of which are falsely marketed as assessments of skill or competence.

Tests of factual and procedural knowledge

A test of competence is any test that asks the test taker to demonstrate a skill. Tests of factual and procedural knowledge can legitimately be thought of as tests of competence.

The classic multiple choice test examines factual knowledge, procedural knowledge, and basic comprehension. If you want to know if someone knows the rules, which formulas to apply, the steps in a process, or the vocabulary of a field, a multiple choice test may meet your needs. Often, the developers of multiple choice tests claim that their assessments measure understanding, reasoning, or critical thinking. This is because some multiple choice tests measure skills that are assumed to be proxies for skills like understanding, reasoning, and critical thinking. They are not direct tests of these skills.

Multiple choice tests are widely used, because there is a large industry devoted to making them, but they are increasingly unpopular because of their (mis)use as high stakes assessments. They are often perceived as threatening and unfair because they are often used to rank or select people, and are not helpful to the individual learner. Moreover, their relevance is often brought into question because they don't directly measure what we really care about—the ability to apply knowledge and skills in real-life contexts.

Performative tests

Tests that ask people to directly demonstrate their skills in (1) the real world, (2) real-world simulations, or (3) as they are applied to real-world scenarios are called performative tests. These tests usually do not have "right" answers. Instead, they employ objective criteria to evaluate performances for the level of skill demonstrated, and often play a formative role by providing feedback designed to improve performance or understanding. This is the kind of assessment you want if what you care about is deep understanding, reasoning skills, or performance in real-world contexts.

Performative tests are the most difficult tests to make, but they are the gold standard if what you want to know is the level of competence a person is likely to demonstrate in real-world conditions—and if you're interested in supporting development. Standardized performative tests are not yet widely used, because the methods and technology required to develop them are relatively new, and there is not yet a large industry devoted to making them. But they are increasingly popular because they support learning.

Unfortunately, performative tests may initially be perceived as threatening because people's attitudes toward tests of knowledge and skill have been shaped by their exposure to high stakes multiple choice tests. The idea of testing for learning is taking hold, but changing the way people think about something as ubiquitous as testing is an ongoing challenge.

Lectical Assessments

Lectical Assessments are performative tests—tests for learning. They are designed to support robust learning—the kind of learning that optimizes the growth of essential real-world skills. We're the leader of the pack when it comes to the sophistication of our methods and technology, our evidence base, and the sheer number of assessments we've developed.

[1] Morgeson, F. P., et al. (2007). Are we getting fooled again? Coming to terms with limitations in the use of personality tests for personnel selection. Personnel Psychology, 60, 1029-1033.

The limitations of testing

It is important for those of us who use assessments to ensure that they (1) measure what we say they measure, (2) measure it reliably enough to justify claimed distinctions between and within persons, and (3) are used responsibly. It is relatively easy for testing experts to create assessments that are reliable enough (2) for individual assessment, and although it is harder to show that a test measures the construct of interest (1), there are reasonable methods for demonstrating that an assessment meets this standard. The hardest of the three to ensure is that assessments are used responsibly (3).

Few consumers of tests are aware of their inherent limitations. Even the best tests, those that are highly reliable and measure what they are supposed to measure, provide only a limited amount of information. This is true of all measures. The more we home in on a measurable dimension—in other words, the greater our precision becomes—the narrower the construct becomes. Time, weight, height, and distance are all extremely narrow constructs. This means that they provide a very specific piece of information extremely well. When we use a ruler, we can have great confidence in the measurement we make, down to very small lengths (depending on the ruler, of course). No one doubts the great advantages of this kind of precision. But we can’t learn anything else about the measured object. Its length usually cannot tell us what the object is, how it is shaped, its color, its use, its weight, how it feels, how attractive it is, or how useful it is. We only know how long it is. To provide an accurate account of the thing that was measured, we need to know many more things about it, and we need to construct a narrative that brings these things together in a meaningful way.

A really good psychological measure is similar. The LAS (Lectical Assessment System), for example, is designed to go to the heart of development, stripping away everything that does not contribute to the pure developmental “height” of a given performance. Without knowledge of many other things—such as the ways of thinking that are generally associated with this “height” in a particular domain, the specific ideas that are associated with this particular performance, information from other performances on other measures, qualitative observations, and good clinical judgment—we cannot construct a terribly useful narrative.

And this brings me to my final point: A formal measure, no matter how great it is, should always be employed by a knowledgeable mentor, clinician, teacher, consultant, or coach as a single item of information about a given client that may or may not provide useful insights into relevant needs or capabilities. Consider this relatively simple example: a given 2-year-old may be tall for his age, but if he is somewhat under weight for his age, the latter measure may seem more important. However, if he has a broken arm, neither measure may loom large—at least until the bone is set. Once the arm is safely in a cast, all three pieces of information—weight, height, and broken arm—may contribute to a clinical diagnosis that would have been difficult to make without any one of them.

It is my hope that the educational community will choose to adopt high standards for measurement, then put measurement in its place—alongside good clinical judgment, reflective life experience, qualitative observations, and honest feedback from trusted others.

What is a holistic assessment?

Thirty years ago, when I was a hippy midwife, the idea of holism began to slip into the counter-culture. A few years later, this much misunderstood notion was all the rage on college campuses. By the time I was in graduate school in the nineties, there was an impassable division between the trendy postmodern holists and the rigidly old-fashioned modernists. You may detect a slight mocking tone, and rightly so. People with good ideas on both sides made themselves look pretty silly by refusing, for example, to use any of the tools associated with the other side. One of the more tragic outcomes of this silliness was the emergence of the holistic assessment.

Simply put, the holistic assessment is a multidimensional assessment that is designed to take a more nuanced, textured, or rich approach to assessment. Great idea. Love it.

It’s the next part that’s silly. Having collected rich information on multiple dimensions, the test designers sum up a person’s performance with a single number. Why is this silly? Because the so-called holistic score becomes pretty much meaningless. Two people with the same score can have very little in common. For example, let’s imagine that a holistic assessment examines emotional maturity, perspective taking, and leadership thinking. Two people receive a score of 10, which may be accompanied by boilerplate descriptions of what emotional maturity, perspective taking, and leadership thinking look like at level 10. However, person one was actually weak in perspective taking and strongest in leadership, while person two was weak in emotional maturity and strongest in perspective taking. The score of 10, it turns out, means something quite different for these two people. I would argue that it is relatively meaningless, because there is no way to know, based on the single “holistic” score, how best to support the development of these two distinct individuals.
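To make the point concrete, here is a tiny illustration (the subscale names come from the example above; the numbers are invented): two very different profiles collapse into the same "holistic" score, and the score alone can't tell you where either person needs support.

```python
# Invented subscale scores for the two hypothetical people described above.
person_one = {"emotional_maturity": 3, "perspective_taking": 2, "leadership": 5}
person_two = {"emotional_maturity": 2, "perspective_taking": 5, "leadership": 3}

# Both profiles collapse to the same "holistic" score of 10...
assert sum(person_one.values()) == sum(person_two.values()) == 10

# ...but their growth edges are completely different.
print(min(person_one, key=person_one.get))  # perspective_taking
print(min(person_two, key=person_two.get))  # emotional_maturity
```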

Holism has its roots in system dynamics, where measurements are used to build rich models of systems. All of the measurements are unidimensional. They are never lumped together into “holistic” measures. That would be equivalent to talking about the “temperaturelength” of a day or the “lengthweight” of an object*. It’s essential to measure time, weight, and length with appropriate metrics and then to describe their interrelationships and the outcomes of these interrelationships. The language used to describe these relationships is the language of probability, which is sensitive to differences in the measurement of different properties.

In psychological assessment, dimensionality is a challenging issue. What constitutes a single dimension is a matter for debate. For DTS, the primary consideration is how useful an assessment will be in helping people learn and grow. So, we tend to construct individual assessments, each of which represents a fairly tightly defined content space, and we use only one metric to determine the level of a performance. The meaning of a given score is both universal (it is an order of hierarchical complexity and phase on the skill scale) and contextual (it is assigned to a performance in a particular domain and context, and is associated with particular content). We independently analyze the content of the performance to determine its strengths and weaknesses—relative to its level and the known range of content associated with that level—and provide feedback about these strengths and weaknesses, as well as targeted learning suggestions. We use the level score to help us tell a useful story about a particular performance, without claiming to measure “lengthweight”. This is accomplished by the rigorous separation of structure (level) and content.

*If we described objects in terms of their lengthweight, an object that was 10 inches long and 2 lbs could have a lengthweight of 12, but so could an object that was 2 inches long and 10 lbs.

What is a developmental assessment?

A developmental assessment is a test of knowledge and thinking that is based on extensive research into how students come to learn specific concepts and skills over time. All good developmental assessments require test-takers to show their thinking by making written or oral arguments in support of their judgments. Developmental assessments are less concerned about “right” answers and more concerned with how students use their knowledge and thinking skills to solve problems. A good developmental assessment should be educative in the sense that taking it is a learning experience in its own right, and each score is accompanied by feedback that tells students what they are most likely to benefit from learning next.

Testing as part of learning 2

I can’t help it, I’m a developmental psychologist. I’ve been lurking about, watching my granddaughter, Erwin, as she learns to master her environment. She’s about 8 months old now (real age; she was three months premature, so her birth age is 11 months).

Last week, Erwin figured out that complex actions can be used intentionally to make things happen in social situations. For example, she started reaching toward her Mom and Dad to indicate her intention to be picked up. At around the same time, she began pointing to objects to indicate interest or to draw them to the attention of others. And she has begun to imitate actions like waving, clapping, and head shaking. Today, when we were Skyping, she clapped her hands to get me to play pat-a-cake, and she shook her head to get her Mom to do the same—which she finds hilarious. To Mom’s dismay, Erwin is so excited by this new way of influencing her environment that she has stopped napping.

To see an example of Erwin’s attempts at verbal communication and her new reaching behavior, double-click on the picture below. Notice how emphatic her arm extension is, and how she makes eye contact as she reaches out.

A few months ago, most of Erwin’s actions were aimed toward physical mastery—learning to obtain objects and manipulate them in a variety of ways, learning to move herself toward things she wanted to manipulate, or playing with sound just to hear the results.

When she was learning to do physical things, the physical environment provided most of the feedback. Although her parents were there to give encouragement, we all had the sense that it was the physical feedback that she craved—getting an object to her mouth, inching toward a favorite toy, pulling herself to stand.

Now she craves feedback from her parents; she has shifted her focus from physical mastery to social mastery. She reaches for Mom and gets picked up. She shakes her head and Mom shakes her head back. She points to a banana, and Dad brings it to her. She claps her hands, and Grandma plays pat-a-cake. And every time she undertakes a new action, she is conducting a test.

Testing is part of learning.

Each time any infant tries out a new skill, she is conducting a test. Each attempt is part of an action-feedback loop. Repeated attempts to master a new skill form a series of these action-feedback loops. Each iteration is an exemplary test—in the sense that it is educative—that guides the infant incrementally toward a new level of mastery.

Interestingly, infants never tire of this kind of testing, even when the feedback is not instantly gratifying. In fact, much of the feedback is along the lines of, “almost, but not quite,” or “that didn’t work,” neither of which seems to get in the way of infant learning. For example, when Erwin first started reaching toward her parents to ask to be picked up, her action was not easy to read. It rarely got the desired response. She gradually learned that the reaching needed to be clearly directed toward the parent and accompanied by eye contact. Now the message is, “You’ve got it!” At this point, Erwin takes the skill for granted, and has shifted her attention to things she has not yet mastered, like figuring out how to get adults to do other interesting or gratifying things.

The natural action-feedback mechanism of infancy works perfectly, because the proverbial carrot is usually, due to the very nature of normal human environments, dangled at just the right distance. Good parents respond to early attempts at communication, rewarding them with interesting responses, but success isn’t the only reward; it’s always accompanied by a new “carrot”—another interesting possibility just beyond the infant’s reach. In this way, the action-feedback mechanism functions both as an aid to learning and as a motivator.

Aspects of this “carrot-and-stick” perspective on learning have been expanded and described in a variety of research traditions—e.g., as part of the notion of reinforcement feedback in social learning theory (Bandura, 1977), as the zone of proximal development in Vygotsky’s (1986) work, and as part of a complex process of assimilation and accommodation in Piaget’s (1985) work. It is important because it speaks both to how we learn and to our motivation for learning. Good feedback plays two essential roles. First, it helps the learner decide what to try next. Second, it motivates the learner to keep striving toward mastery. And, as the infant example suggests, feedback cannot be reduced to simple reward or punishment. Ideally, it is information that supports learning by being useful to the learner. Learners are not motivated by reward or punishment per se, but by an optimal combination of “not there yet,” “almost,” and “you’ve got it.”

DiscoTests are for learning

Most of today’s tests provide feedback in the form of rewards (good grades, advancement, or honors) or punishment (bad grades and failure). My colleagues and I don’t find this acceptable, so we’ve created a nonprofit called DiscoTest. The overarching objective of the DiscoTest Initiative is to contribute to the development of optimal learning environments by creating assessments that deliver the kind of educative feedback that learners need to learn optimally. DiscoTests determine where students are in their individual learning trajectories and provide feedback that points toward the next incremental step toward mastery.

I’ll be writing more about DiscoTest in future posts. For now, if you’d like to know more, please visit the DiscoTest web site.

Test reliability 2: How high should it be?

There is a great deal of confusion in the assessment community about the interpretation of test reliability. This confusion results in part from the different ways in which researchers and test developers approach the issue.

How test scores are usually presented

Researchers learn how to design research instruments, which they use to study population trends or compare groups. They evaluate the quality of their instruments with statistics. One of the statistics used is Cronbach's Alpha, an indicator of statistical reliability that ranges from 0 to 1. Researchers are taught that Alphas above .75 or so are acceptable for their instruments, because this level of reliability indicates that an instrument is measuring real differences between people.

Test developers use a special branch of statistics called psychometrics to build assessments. Assessments are designed to evaluate individuals. Like researchers, test developers are concerned about reliability, but for somewhat different reasons. From a psychometric point of view, it's not enough to know that an assessment measures real differences between people. Psychometricians need to be confident that the score awarded to an individual is a good estimate of that particular individual's true score. Because of this, most psychometricians set higher standards for reliability than those set by researchers.

The table below will help to clarify why it is important for assessments to have higher reliabilities than research instruments. It shows the relationship between statistical reliability and the number of distinct levels (strata) a test can be said to have. For example, an assessment with a reliability of .80 has 3 strata, whereas an assessment with a reliability of .94 has 5.

Reliability Strata
.70 2
.80 3
.90 4
.94 5
.95 6
.96 7
.97 8
.98 9
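If you'd like to check these numbers yourself, here is a minimal sketch in Python, assuming the separation and strata formulas from Wright (1996), listed in the references below. The rounded results track the table closely, though a value or two lands a step higher or lower depending on how you round.

```python
from math import sqrt

def strata(reliability: float) -> float:
    """Approximate number of statistically distinct score levels (strata)
    for a test with a given reliability, using Wright's (1996) formulas:
    separation G = sqrt(r / (1 - r)); strata H = (4G + 1) / 3."""
    g = sqrt(reliability / (1.0 - reliability))
    return (4.0 * g + 1.0) / 3.0

if __name__ == "__main__":
    for r in (0.70, 0.80, 0.90, 0.94, 0.95, 0.96, 0.97, 0.98):
        print(f"reliability {r:.2f} -> about {strata(r):.1f} strata")
```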

Strata have direct implications for the confidence we can have in a specific person's score on a given assessment, because they tell us about the range within which a person's true score would fall, given a particular score.

Imagine that you have just taken a test of emotional intelligence with a score range of 1 to 10 and a reliability of .95. The number of strata into which this assessment can be divided is about 6, which means that each stratum covers about 1.75 points on the 10-point scale. If your score on this test is 8, your true score is likely to fall within the range of 7.1 to 8.9*.
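Here is a rough sketch of that arithmetic, again assuming Wright's (1996) strata formula: divide the width of the score scale by the number of strata to get a stratum width, then take plus or minus half a stratum around the observed score. It is a simplification (as the footnote below explains, real ranges widen toward the ends of the scale), so its rounded output differs slightly from the figures quoted in this article.

```python
from math import sqrt

def true_score_range(score, scale_min, scale_max, reliability):
    """Rough true-score range: split the scale into strata (Wright, 1996),
    then take +/- half a stratum around the observed score. Simplified;
    actual ranges are wider near the ends of the scale."""
    g = sqrt(reliability / (1.0 - reliability))
    n_strata = (4.0 * g + 1.0) / 3.0
    half_stratum = (scale_max - scale_min) / n_strata / 2.0
    low = max(scale_min, score - half_stratum)
    high = min(scale_max, score + half_stratum)
    return round(low, 1), round(high, 1)

# A score of 8 on a 1-10 scale at three levels of reliability:
print(true_score_range(8, 1, 10, 0.95))  # about (7.3, 8.7) -- high-stakes quality
print(true_score_range(8, 1, 10, 0.85))  # about (6.7, 9.3) -- classroom quality
print(true_score_range(8, 1, 10, 0.75))  # about (6.3, 9.7) -- research quality
```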

The figure below shows the true score ranges for three test takers, CB, RM, and PR. The fact that these ranges don't overlap gives us confidence that the emotional intelligence of these test-takers is actually different**.

Tests with Alphas in the range of .94 or higher are considered suitable for high-stakes use (assuming that they meet other essential validity requirements).

Alpha equals .95

If the test you have taken has a score range of 1 to 10 and an Alpha (reliability) of .85, the number of strata into which this assessment can be divided is about 3.4, which means that each stratum covers about 2.9 points on the 10-point scale. A score of 8 now corresponds to a true score range of about 6.6 to 9.5*.

In the figure below, note that CB's true score range now overlaps RM's true score range and RM's true score range overlaps PR's true score range. This means we cannot say—with confidence—that CB's score is different from RM's score, or that RM's score is different from PR's score.

Assessments with Alphas in the .85 range are suitable for classroom use or other low-stakes contexts. Yet, every day, schools and businesses use tests with reliabilities in the .85 range to make high-stakes decisions—such as who will be selected for advancement or promotion.

Alpha equals .85

If the test you have taken has a score range of 1 to 10 and an Alpha (reliability) of .75, the number of strata into which this assessment can be divided is about 2.2, which means that each stratum covers about 4.5 points on the 10-point scale. A score of 8 now corresponds to a true score range of about 6 to 10*.

As shown in the figure below, scores would now have to differ by at least 4.5 points in order for us to distinguish between two people. CB's and PR's scores are different, but RM's score is uninterpretable.

Tests or subscales with alphas in the .75 range are considered suitable for research purposes. Yet, sad to say, schools and businesses now use tests with subscales that have Alphas in or below the .75 range, treating these scores as if they provide useful information, when in most cases the scores—like RM's—are uninterpretable.

Alpha equals .75

If your current test provider is not reporting true score ranges, ask for them. If they only provide Alphas (reliability statistics) you can use the table and figures in this article to figure out true score ranges for yourself.

Be particularly wary of test developers who claim to measure multiple dimensions with 10-15 minute tests. It is not possible to detect individual differences reliably under these conditions.

Statistical reliability is only one of the ways in which assessments should be evaluated. Test developers should also ask how well an assessment measures what it is intended to measure. And those who use an assessment should ask whether or not what it measures is relevant or important.


*This range will be wider at the top and bottom of the scoring range and a bit narrower in the middle of the range.

**It doesn't tell us if emotional intelligence is important. That is determined in other ways.


References

Guilford, J. P. (1965). Fundamental statistics in psychology and education (4th ed.). New York: McGraw-Hill.

Kubiszyn, T., & Borich, G. (1993). Educational testing and measurement. New York: Harper Collins.

Wright, B. D. (1996). Reliability and separation. Rasch Measurement Transactions, 9, 472.

A good test

In this post, I explore a way of thinking about testing that would lead to the design of tests that are very different from most of the tests students take today.

Two propositions, an observation, and a third proposition:

Proposition 1. Because adults who do not enjoy learning are at a severe disadvantage in a rapidly changing world, an educational system should do everything possible to nurture children's inborn love of learning.

Proposition 2. In K-12, the specific content of a curriculum is not as important as the development of broadly applicable skills for learning, reasoning, communicating, and participating in a civil society. (The content of the curriculum would be chosen to support the development of these skills and could—perhaps should—differ from classroom to classroom.)

Observation. Testing tends to drive instruction.

Proposition 3. Consequently, tests should evaluate relevant skills and be employed in ways that support students' natural love of learning.

Given these propositions, here is my favorite definition of a "good test."

A good test is part of the conversation between a "student" and a "teacher" that tells the teacher what the student is most likely to benefit from learning next.

I'll unpack this definition and show how it relates to the propositions listed above:

Anyone who has carefully observed an infant in pursuit of knowledge will understand the conversational nature of learning. A parent holds out a shiny spoon and an infant's arms wave wildly. Her hand makes contact with the spoon and a message is sent to her brain, "Something interesting happened!" The next day, her arm movements are a little less random. She makes contact several times, feeling the same sense of satisfaction. Her parents laugh with delight. She coos. In this way, her physical and social environment provide immediate feedback each time she succeeds (or fails). Over time, the infant uses this information to learn how to reach out and touch the spoon at will. Of course, she is not satisfied with merely touching the spoon, and, through the same kind of trial and error, supplemented with a little support from Mom and Dad, she soon learns to bring the spoon to her mouth. And the conversation goes on.

Every attempt to touch the spoon is a kind of test. Every success is an affirmation that the strategy just employed was an effective strategy, but the story does not end here. In her quest to master her environment, the infant keeps moving the bar. Once she can do so at will, touching the spoon is no longer satisfying. She moves on to the next skill—holding the spoon, and the next—bringing it to her mouth, etc. Having observed this process hundreds of times, I strongly suspect that a sense of mastery is the intrinsic reward that motivates learning, while conversation, including both social and physical interactions, acts as the fuel.

Conversation

A good educational test should have the same quality of conversation, in the form of performance and feedback, that is illustrated in the example above. In an ideal testing situation, the student shows a teacher how he or she understands new concepts and skills, then the teacher uses this information to determine what comes next.

Part of the conversation

However, a good test is part of the conversation—not the entire conversation. No single test (or kind of conversation) will do. For example, the infant reaches for the spoon because she finds it interesting, and she must be interested enough to reach out many dozens of times before she can grasp an object at will. Good parents recognize that she expresses more sustained interest if they provide her with a number of different objects—and don't try to force her to manipulate objects when she would rather be nursing or sleeping. Each act is a test embedded in a long conversation that is further embedded in a broader context.

What comes next?

In the story, I suggest that the spoon must be both interesting and within an infant's reach before it can become part of an ongoing conversation. In the same way, a good test should be both engaging and within a student's reach in order to play its role in the conversation between student and teacher.

An engaging test of appropriate skills can tell us how a student understands what he or she is learning, but this knowledge, by itself, does not tell the teacher (or the student) what comes next. To find out, researchers must study how particular concepts and skills are learned over time. Only when we have done a good job describing how particular skills and concepts are learned can we predict what a student is most likely to benefit from learning next.

So, a good test must not only capture the nature of a particular student's understanding, it must also be connected to knowledge about  the pathways through which students come to understand the concepts and skills of the knowledge area it targets.

Back to conversation

I argue above that, in infancy, a sense of mastery is the intrinsic reward that motivates learning, while conversation is the fuel. If conversation is the fuel, tests that do a good job serving the conversational function I outline here are likely to fuel students' natural pursuit of mastery and a lifelong love of learning.

Later: But what about accountability?

IQ and development

IQ is a dimension of ability that has been defined using a form of statistical modeling called psychometrics. It is based entirely on psychometric analysis of results from tests consisting of many items, each of which has one correct answer.

IQ scores are arranged along a scale that is based upon the performances of hundreds of people who have taken the same test.

IQ is considered to be a relatively fixed characteristic of a person. People who score higher on an IQ test are considered to be more intelligent than people who score lower.

Cognitive development is a theoretically defined, evidence based dimension. Developmental level is determined by asking individuals to engage in activities that expose their reasoning. Items on developmental assessments are typically open-ended and do not focus on correct answers. They focus on how people go about seeking answers.

A single developmental dimension has been shown to underlie development in a wide range of cognitive domains, making it possible to define a non-arbitrary scale along which development progresses. Individual performances can be placed within a range on this scale.

Cognitive developmental level is not viewed as a fixed trait and is known to vary within persons, depending on knowledge area and a range of contextual variables. Individuals who demonstrate higher levels of cognitive development are viewed as more cognitively developed than those demonstrating lower levels of cognitive development.

The relation between IQ and cognitive development

Children with higher IQs learn the kind of knowledge and skills represented in IQ tests earlier than children with lower IQs. There is some evidence that cognitive development is likely to be more rapid (and have a higher “endpoint”) in people who have higher IQs.

Limitations of testing

The subject matter of IQ tests is limited, and the skill sets that are tested are narrow, so we have to be careful about making generalizations about people based on test results—especially the results of single tests. The same is true for cognitive developmental assessments. Good cognitive developmental assessments are now providing scores with a level of precision similar to that of conventional assessments, but even the most precise and accurate scores apply to performance on a single assessment in a single subject area, and do not capture the full range of capabilities of a test-taker.

The inability of any single assessment (or type of assessment) to provide an accurate account of the capabilities of an individual suggests that the best (most ethical) use of assessments involves repeated measurements across a wide range of subject areas over time.

Testing as part of learning 1

Learning isn’t easy

Yet all healthy babies pursue it with dogged determination, spending hour after hour exploring—and learning to master—their own bodies, as well as their physical and social environments.

Natural testing

When infants and young children engage their environments, they receive constant feedback about what does and does not work. For example, babies spend months learning how to control the movements of their hands. An infant will spend several weeks just learning how to bring an object to her mouth. She’ll use what she learns from successes and failures to do better next time. Feedback is instant and accurate, and the results of each attempt tell her what to try next.

Babies often act like they are addicted to learning. They will tolerate an amazing amount of failure. But without prompt feedback from their external environment, they wouldn’t get far. The same is true for older children.

Testing in schools

Ideally, educational tests model natural testing by providing students with timely and accurate feedback that tells them (and their teachers) what to try next.