What is a holistic assessment?
Posted by Theo in educational testing, measurement, research, testing in general on December 11, 2009
Thirty years ago, when I was a hippy midwife, the idea of holism began to slip into the counter-culture. A few years later, this much-misunderstood notion was all the rage on college campuses. By the time I was in graduate school in the nineties there was an impassable division between the trendy postmodern holists and the rigidly old-fashioned modernists. You may detect a slight mocking tone, and rightly so. People with good ideas on both sides made themselves look pretty silly by refusing, for example, to use any of the tools associated with the other side. One of the more tragic outcomes of this silliness was the emergence of the holistic assessment.
Simply put, the holistic assessment is a multidimensional assessment that is designed to take a more nuanced, textured, or rich approach to assessment. Great idea. Love it.
It’s the next part that’s silly. Having collected rich information on multiple dimensions, the test designers sum up a person’s performance with a single number. Why is this silly? Because the so-called holistic score becomes pretty much meaningless. Two people with the same score can have very little in common. For example, let’s imagine that a holistic assessment examines emotional maturity, perspective taking, and leadership thinking. Two people receive a score of 10, which may be accompanied by boilerplate descriptions of what emotional maturity, perspective taking, and leadership thinking look like at level 10. However, person one was actually weak in perspective taking and strongest in leadership, and person two was weak in emotional maturity and strongest in perspective taking. The score of 10, it turns out, means something quite different for these two people. I would argue that it is relatively meaningless because there is no way to know, based on the single “holistic” score, how best to support the development of these distinct individuals.
Holism has its roots in system dynamics, where measurements are used to build rich models of systems. All of the measurements are unidimensional. They are never lumped together into “holistic” measures. That would be equivalent to talking about the temperaturelength of a day or the lengthweight of an object*. It’s essential to measure time, weight, and length with appropriate metrics and then to describe their interrelationships and the outcomes of these interrelationships. The language used to describe these is the language of probability, which is sensitive to differences in the measurement of different properties.
In psychological assessment, dimensionality is a challenging issue. What constitutes a single dimension is a matter for debate. For DTS, the primary consideration is how useful an assessment will be in helping people learn and grow. So, we tend to construct individual assessments, each of which represents a fairly tightly defined content space, and we use only one metric to determine the level of a performance. The meaning of a given score is both universal (it is an order of hierarchical complexity and phase on the skill scale) and contextual (it is provided to a performance in a particular domain in a particular context, and is associated with particular content). We independently analyze the content of the performance to determine its strengths and weaknesses, relative to its level and the known range of content associated with that level, and provide feedback about these strengths and weaknesses as well as targeted learning suggestions. We use the level score to help us tell a useful story about a particular performance, without claiming to measure “lengthweight”. This is accomplished by the rigorous separation of structure (level) and content.
*If we described objects in terms of their lengthweight, an object that was 10 inches long and 2 lbs could have a lengthweight of 12, but so could an object that was 2 inches long and 10 lbs.
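To make the footnote concrete, here is a minimal sketch in Python. The function name `lengthweight` and the variable names are purely illustrative; the numbers are the ones used in the footnote.

```python
def lengthweight(length_in, weight_lb):
    """Illustrative only: lump two unrelated measurements into one 'holistic' number."""
    return length_in + weight_lb

# The two objects from the footnote, as (length in inches, weight in pounds).
object_a = (10, 2)
object_b = (2, 10)

score_a = lengthweight(*object_a)
score_b = lengthweight(*object_b)

print(score_a, score_b)      # both scores are 12
print(object_a == object_b)  # False: the objects themselves are quite different
```

Once the two measurements are summed, there is no way to work back from 12 to the length and weight that produced it, which is the same problem a single "holistic" score creates.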
Teacher pay and standardized test results
Posted by Theo in educational testing, standardized testing, teaching on November 23, 2009
At the end of October, the Century Foundation released a paper entitled, Eight reasons not to tie teacher pay to standardized test results. I agree with their conclusions, and would add that even if all standardized tests were extremely reliable and measured exactly what they intended to measure, this would be a bad idea. This is because success in the adult world requires a multiplicity of skills and forms of knowledge, and tests focus on only some of these, one at a time. Until we can construct multifaceted longitudinal stories about the progress of individual students that are tied to a non-arbitrary standardized metric, we should not even consider linking student evaluations to teacher pay.
Promoting development
Posted by Theo in cognitive development, learning on October 4, 2009
There is a vast literature exploring ways to promote development. Much of this literature focuses on speeding up development; some of it focuses on optimizing development. Although both approaches are intended to support development, there is evidence that approaches focused on optimizing development are likely to do a better job. This is because development involves two intertwined processes, differentiation (broadening and deepening knowledge) and integration. In plain(er) English, you get more adequate integrations at each level if you accomplish rich differentiation at the prior level.
When we code an assessment, we pay close attention to the degree to which the test-taker elaborates each of the sub-skills it targets. In our personal feedback, we note areas of strength and areas that appear to require further growth. The basic idea is to bring all of the sub-skills up to an optimal level of elaboration to support the emergence of next-level integrations.
Most of the readings we suggest are targeted one to two phases (1/4 to 1/2 of a level) above the level of a given performance. This practice has been shown to provide the ideal level of challenge (scaffolding) for optimal growth. We also suggest activities like engaging in discourse with peers, journaling, cultivating a habit of reflection, and improving metacognitive skills, all of which provide support for growth.
We do not teach people to think at higher levels. Higher levels of performance emerge when knowledge is adequately elaborated and the environment supports higher levels of thinking and performance. We focus on helping people to think better at their current level and challenging them to elaborate their current knowledge and skills—including the not-so-sexy nuts-and-bolts knowledge required for success in any context.
Task demands and capabilities
Posted by Theo in cognitive development, decision making, leader development, leadership, measurement on September 11, 2009
Our developmental assessment system, called the Lectical Assessment System (LAS), can be used to score (a) the performances of persons and (b) the task demands of specific situations/contexts. For example, my colleagues and I have analyzed the task demands of levels of management in large organizations, and tested managers’ developmental level of performance in several skill areas—including reasoning about leadership, reflective judgment, and decision-making.

The figure above shows the relation between the task demands of 7 levels of management and the performance levels of managers occupying these management positions. In this oversimplified image, the task demands of most management positions increase in a linear fashion, spanning levels 10-13. The capabilities of managers do not, for the most part, match these task demands.
This pattern is pervasive—we see it everywhere we look—and it reflects a hard truth. None of us is capable of meeting the task demands of the most complex situations in today’s world. I’ve come to believe that in many situations our best hope for meeting these demands is to (1) work strategically on the development of our own skills and knowledge, (2) learn to work closely with others who represent a wide range of perspectives and areas of expertise, and (3) use the best tools available to scaffold our thinking.
About measurement
Posted by Theo in Lectical Assessment System, cognitive development, measurement on July 29, 2009
The story of how measurement permits scientific advance can be illustrated through any number of examples. One such example is the measurement of temperature and its effects on our understanding of the molecular structure of lead and other elemental substances.
The tale begins with an assortment of semi-mythical early scientists, who agreed in their observations that lead only melts when it is very hot—much hotter than the temperature at which ice melts, and quite a bit cooler than the temperature at which iron melts. These observations, made repeatedly, resulted in the hypothesis that lead melts at a particular temperature.
To test this theory it was necessary to develop a standard for measuring temperature. A variety of early thermometers were developed and implemented. Partly because these early temperature-measuring devices were poorly calibrated, and partly because different temperature-measuring devices employed different scales, the temperature at which lead melted seemed to vary from device to device and context to context.
Scientists divided into a number of ‘camps’. One group argued that there were multiple pathways toward melting, which explained why the melting seemed to occur at different temperatures. Another group argued that the melting of lead could not be understood apart from the context in which the melting occurs. Only when a measure of temperature had been adequately developed and widely accepted did it become possible to observe that lead consistently melts at about 327 °C.
Armed with this knowledge, scientists asked what it is about lead that causes it to melt at this particular temperature. They then developed hypotheses about the factors contributing to this phenomenon, observing that changes in altitude or air pressure seemed to result in small differences in its melting temperature. So, context did seem to play a role! In order to observe these differences more accurately, the measurement of temperature was further refined. The resulting observations provided information that ultimately contributed to an understanding of lead’s and other elements’ molecular structure.
While parts of this story are fictional, it is true that the thermometer has greatly contributed to our understanding of the properties of lead. Interestingly, the thermometer, like all other measures, emerged from what were originally qualitative observations about the effects of different amounts of heat that were quantified over time. The value of the thermometer, as we all know, extends far beyond its use as a measure of the melting temperature of lead. The thermometer is a measure of temperature in general, meaning that it can be employed to measure temperature in an almost limitless range of substances and contexts. It is this generality, in the end, that makes it possible to investigate the impact of context on the melting temperature of a substance, or to compare the relative melting temperatures of a range of elemental substances. This generality (or context-independence) is one of the primary features of a good measure.
Good measurement requires (1) the identification of a unidimensional, content- and context-independent trait (temperature, length, time); (2) a system for assessing the amount of the trait; (3) determinations of the reliability and validity of the assessments; and finally (4) the calibration of a measure. A good thermometer has all of the qualities of a good measure. It is a well-calibrated instrument that can be employed to accurately and reliably measure a general, unidimensional trait across a wide range of contexts.
It was this perspective on measurement that first inspired me to try to find a good general measure of the developmental dimension. To read more about how this way of thinking relates to the Lectical Assessment System (LAS), read About Measurement on the DTS site. Pay special attention to the list of things we can do with the LAS.
What is a developmental assessment?
Posted by Theo in cognitive development, educational testing, testing in general on July 29, 2009
A developmental assessment is a test of knowledge and thinking that is based on extensive research into how students come to learn specific concepts and skills over time. All good developmental assessments require test-takers to show their thinking by making written or oral arguments in support of their judgments. Developmental assessments are less concerned about “right” answers and more concerned with how students use their knowledge and thinking skills to solve problems. A good developmental assessment should be educative in the sense that taking it is a learning experience in its own right, and each score is accompanied by feedback that tells students what they are most likely to benefit from learning next.
Integrative complexity and the LAS
Posted by Theo in Lectical Assessment System, cognitive development on July 15, 2009
Suedfeld and Tetlock’s Integrative Complexity Scale is one of a number of developmental scales, most of which have been informed by Jean Piaget’s cognitive developmental theory, that subscribe to the notion of hierarchical integration. Piagetian and neo-Piagetian theorists view development as a process of differentiation (increasing knowledge) and integration (organizing knowledge). Rather than viewing learning as an additive process in which we simply accumulate bits of knowledge over time, integrative theories propose that learning is an active process through which we organize our knowledge in particular ways, depending on where we are in our development. Moving from one developmental level to another involves a reorganization of our knowledge that translates into a new way of thinking.
For example, when most 6-year-olds think about lying, they are likely to think about it in terms of a single consequence—keeping out of trouble, getting into trouble, or making Dad sad. An eight-year-old can think about lying in terms of multiple possible consequences—getting in trouble and keeping out of trouble, which makes it possible to decide which outcome is more likely given past experience. You can view a more detailed description of this process in an online article, The Lectical Assessment System.
Suedfeld and Tetlock’s Integrative Complexity Scoring System (ICSS), like the Lectical Assessment System (LAS) and the General Hierarchical Complexity Scoring System (HCSS), is a content-independent scoring system that can be used to score the level of integrative complexity in a wide range of texts. What differs between these scoring systems is the scoring rules. Here, I discuss the difference between the scoring rules of the LAS and the ICSS.
The LAS goes to the heart of differentiation and integration by asking analysts to examine the way arguments are explicitly structured (single elements, linear arguments, or systems) and the way the meanings of their elements are implicitly structured (single elements, linear arguments, or systems). We call this core structure. The LAS has been subjected to a number of psychometric studies and has been shown to be a valid and reliable measure of the cognitive-developmental dimension, reliably (in the statistical sense) distinguishing 20 developmental phases between age 5 and the highest levels of adulthood.
Domain-based developmental assessment systems generally target conceptual content and aspects of surface structure. The ICSS relies primarily upon indicators of surface structure. In other words, instead of directly examining core structures, the developers of this system focus on a number of indicators that point to these core structures, including things like perspective, compartmentalization, setting up “straw men”, inclusion/exclusion rules, conflict avoidance, recognizing “exceptions to the rule”, probability statements, etc. The reliability of this assessment is generally too low to justify its clinical use (i.e., to provide a score for an individual), and some forms of the assessment do not appear to meet the reliability requirements for group studies. (See Reliability 2: How high should it be?)
Testing as part of learning 2
Posted by Theo in cognitive development, learning, motivation, teaching, testing on July 14, 2009
I can’t help it; I’m a developmental psychologist. I’ve been lurking about, watching my granddaughter, Erwin, as she learns to master her environment. She’s about 8 months old now (real age; she was three months premature, so her birth age is 11 months).
Last week, Erwin figured out that complex actions can be used intentionally to make things happen in social situations. For example, she started reaching toward her Mom and Dad to indicate her intention to be picked up. At around the same time, she began pointing to objects to indicate interest or draw them to the attention of others. And she has begun to imitate actions like waving, clapping, and head shaking. Today, when we were Skyping, she clapped her hands to get me to play pat-a-cake, and she shook her head to get her Mom to do the same, which she finds hilarious. To Mom’s dismay, Erwin is so excited by this new way of influencing her environment that she has stopped napping.
To see an example of Erwin’s attempts at verbal communication and her new reaching behavior, double-click on the picture below. Notice how emphatic her arm extension is, and how she makes eye contact as she reaches out.
A few months ago, most of Erwin’s actions were aimed toward physical mastery: learning to obtain objects and manipulate them in a variety of ways, learning to move herself toward things she wanted to manipulate, or playing with sound just to hear the results.
When she was learning to do physical things, the physical environment provided most of the feedback. Although her parents were there to give encouragement, we all had the sense that it was the physical feedback that she craved—getting an object to her mouth, inching toward a favorite toy, pulling herself to stand.
Now she craves feedback from her parents; she has shifted her focus from physical mastery to social mastery. She reaches for Mom and gets picked up. She shakes her head and Mom shakes her head back. She points to a banana, and Dad brings it to her. She claps her hands, and Grandma plays pat-a-cake. And every time she undertakes a new action, she is conducting a test.
Testing is part of learning.
Each time any infant tries out a new skill, she is conducting a test. Each attempt is part of an action-feedback loop. Repeated attempts to master a new skill form a series of these action-feedback loops. Each iteration is an exemplary test—in the sense that it is educative—that guides the infant incrementally toward a new level of mastery.
Interestingly, infants never tire of this kind of testing, even when the feedback is not instantly gratifying. In fact, much of the feedback is along the lines of, “almost, but not quite,” or “that didn’t work,” neither of which seems to get in the way of infant learning. For example, when Erwin first started reaching toward her parents to ask to be picked up, her action was not easy to read. It rarely got the desired response. She gradually learned that the reaching needed to be clearly directed toward the parent and accompanied by eye contact. Now the message is, “You’ve got it!” At this point, Erwin takes the skill for granted, and has shifted her attention to things she has not yet mastered, like figuring out how to get adults to do other interesting or gratifying things.
The natural action-feedback mechanism of infancy works perfectly, because the proverbial carrot is usually, due to the very nature of normal human environments, dangled at just the right distance. Good parents respond to early attempts at communication, rewarding them with interesting responses, but success isn’t the only reward; it’s always accompanied by a new “carrot”—another interesting possibility just beyond the infant’s reach. In this way, the action-feedback mechanism functions both as an aid to learning and as a motivator.
Aspects of this “carrot-and-stick” perspective on learning have been expanded and described in a variety of research traditions: for example, as part of the notion of reinforcement feedback in social learning theory (Bandura, 1977), as the zone of proximal development in Vygotsky’s (1986) work, and as part of a complex process of assimilation and accommodation in Piaget’s (1985) work. This perspective is important because it speaks both to how we learn and to our motivation for learning. Good feedback plays two essential roles. First, it helps the learner decide what to try next. Second, it motivates the learner to keep striving toward mastery. And, as the infant example suggests, feedback cannot be reduced to simple reward or punishment. Ideally, it is information that supports learning by being useful to the learner. Learners are not motivated by reward or punishment per se, but by an optimal combination of “not there yet,” “almost,” and “you’ve got it.”
DiscoTests are for learning
Most of today’s tests provide feedback in the form of rewards (good grades, advancement, or honors) or punishment (bad grades and failure). My colleagues and I don’t find this acceptable, so we’ve created a nonprofit called DiscoTest. The overarching objective of the DiscoTest Initiative is to contribute to the development of optimal learning environments by creating assessments that deliver the kind of educative feedback that learners need to learn optimally. DiscoTests determine where students are in their individual learning trajectories and provide feedback that points toward the next incremental step toward mastery.
I’ll be writing more about DiscoTest in future posts. For now, if you’d like to know more, please visit the DiscoTest web site.
Reliability 2: How high should it be?
Posted by Theo in educational testing, standardized testing, testing in general on July 5, 2009
There is a great deal of confusion in the assessment community about the interpretation of statistical reliability. This confusion results in part from the different ways in which researchers and test developers approach the issue. Researchers learn how to design research instruments which they use to study population trends or compare groups. They evaluate the quality of their instruments with statistics. One of the statistics used is Cronbach’s Alpha, an indicator of statistical reliability that ranges from 0 to 1. Researchers are taught that Alphas above .77 or so are acceptable for their instruments, because this level of reliability ensures that their instrument is measuring real differences between people.
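For readers who have not met Cronbach’s Alpha before, here is a rough sketch of how it is usually computed, using the standard textbook formula: alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The function name and the response data are hypothetical, made up only to show the shape of the calculation.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """Cronbach's Alpha for a list of rows, one row of item scores per respondent."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # regroup scores by item
    item_variance_sum = sum(pvariance(item) for item in items)
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)

# Hypothetical data: 4 respondents answering 3 items.
example_responses = [
    [3, 4, 3],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
]
print(round(cronbach_alpha(example_responses), 2))

# Sanity check: items that agree perfectly give an Alpha of 1.
perfect = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(cronbach_alpha(perfect))
```

The more the items rise and fall together across respondents, the closer Alpha gets to 1.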
Test developers use a special branch of statistics called psychometrics to build assessments. Assessments are designed to evaluate individuals. Like researchers, test developers are concerned about reliability, but for somewhat different reasons. From a psychometric point of view, it is not enough to know that an assessment measures real differences between people. Psychometricians need to be confident that the score awarded to an individual is a good estimate of that particular individual’s true score. Because of this, most psychometricians set higher standards for reliability than those set by researchers.
The table below will help to clarify why it is important for assessments to have higher reliabilities than research instruments. It shows the relationship between statistical reliability and the number of distinct levels (strata) a test can be said to have. For example, an assessment with a reliability of .80 has 3 strata, whereas an assessment with a reliability of .94 has 5.
| Reliability | Strata |
| .70 | 2 |
| .80 | 3 |
| .90 | 4 |
| .94 | 5 |
| .96 | 7 |
| .97 | 8 |
| .98 | 9 |
Strata have direct implications for the confidence we can have in a specific person’s score on a given assessment, because they tell us something about the range within which a person’s true score would fall, given a particular score. Imagine that you have taken a test with a scoring range of 0 to 500 and a reliability of .94. The number of strata into which this assessment can be divided is 5, which means that each stratum spans about 100 points on the 500-point scale. If your score on this test is 350, your true score is likely to fall within the range of 300 to 400*.
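The arithmetic in this example can be sketched in a few lines of Python. Two things here are assumptions on my part rather than anything stated in the table: first, that strata are derived from reliability using the separation formulas commonly attributed to Wright, G = sqrt(R / (1 − R)) and H = (4G + 1) / 3; and second, that the likely range is simply one stratum width centered on the observed score. The formula produces fractional values that land near, but not always exactly on, the whole numbers in the table. All function names are illustrative.

```python
import math

def separation(reliability):
    """Separation index G (assumed formula): ratio of true spread to error spread."""
    return math.sqrt(reliability / (1 - reliability))

def strata(reliability):
    """Number of statistically distinct levels H (assumed formula), before rounding."""
    return (4 * separation(reliability) + 1) / 3

def likely_range(score, scale_min, scale_max, n_strata):
    """One stratum width centered on the score, as in the 0-to-500 example."""
    width = (scale_max - scale_min) / n_strata
    return score - width / 2, score + width / 2

print(strata(0.80))                   # very close to 3, matching the table
print(likely_range(350, 0, 500, 5))   # (300.0, 400.0)
```

The footnoted caveat still applies: a fixed-width band like this is only a rough guide, since the range is wider near the ends of the scale.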
Statistical reliability is only one of the ways in which assessments should be evaluated. Test developers should also ask how well an assessment measures what it is intended to measure. And those who use an assessment should ask whether or not what it measures is relevant or important.
*This range will be wider at the top and bottom of the scoring range and a bit narrower in the middle of the range.
References
Guilford J. P. (1965). Fundamental statistics in psychology and education. 4th Edn. New York: McGraw-Hill.
Kubiszyn T., Borich G. (1993). Educational testing and measurement. New York: Harper Collins.
Wright B. D. (1996). Reliability and separation. Rasch Measurement Transactions, 9, 472.
A good test
Posted by Theo in cognitive development, educational testing, learning, motivation, testing in general on April 29, 2009
In this post, I explore a way of thinking about testing that would lead to the design of tests that are very different from most of the tests students take today.
Two propositions, an observation, and a third proposition:
Proposition 1. Because adults who do not enjoy learning are at a severe disadvantage in a rapidly changing world, an educational system should do everything possible to nurture children’s inborn love of learning.
Proposition 2. In K-12, the specific content of a curriculum is not as important as the development of broadly applicable skills for learning, reasoning, communicating, and participating in a civil society. (The content of the curriculum would be chosen to support the development of these skills and could—perhaps should—differ from classroom to classroom.)
Observation. Testing tends to drive instruction.
Proposition 3. Consequently, tests should evaluate relevant skills and be employed in ways that support students’ natural love of learning.
Given these propositions, here is my favorite definition of a “good test.”
A good test is part of the conversation between a “student” and a “teacher” that tells the teacher what the student is most likely to benefit from learning next.
I’ll unpack this definition and show how it relates to the proposals listed above:
Anyone who has carefully observed an infant in pursuit of knowledge will understand the conversational nature of learning. A parent holds out a shiny spoon and an infant’s arms wave wildly. Her hand makes contact with the spoon and a message is sent to her brain, “Something interesting happened!” The next day, her arm movements are a little less random. She makes contact several times, feeling the same sense of satisfaction. Her parents laugh with delight. She coos. In this way, her physical and social environment provide immediate feedback each time she succeeds (or fails). Over time, the infant uses this information to learn how to reach out and touch the spoon at will. Of course, she is not satisfied with merely touching the spoon, and, through the same kind of trial and error, supplemented with a little support from Mom and Dad, she soon learns to bring the spoon to her mouth. And the conversation goes on.
Every attempt to touch the spoon is a kind of test. Every success is an affirmation that the strategy just employed was an effective strategy, but the story does not end here. In her quest to master her environment, the infant keeps moving the bar. Once she can do so at will, touching the spoon is no longer satisfying. She moves on to the next skill—holding the spoon, and the next—bringing it to her mouth, etc. Having observed this process hundreds of times, I strongly suspect that a sense of mastery is the intrinsic reward that motivates learning, while conversation, including both social and physical interactions, acts as the fuel.
Conversation
A good educational test should have the same quality of conversation, in the form of performance and feedback, that is illustrated in the example above. In an ideal testing situation, the student shows a teacher how he or she understands new concepts and skills, then the teacher uses this information to determine what comes next.
Part of the conversation
However, a good test is part of the conversation—not the entire conversation. No single test (or kind of conversation) will do. For example, the infant reaches for the spoon because she finds it interesting, and she must be interested enough to reach out many dozens of times before she can grasp an object at will. Good parents recognize that she expresses more sustained interest if they provide her with a number of different objects—and don’t try to force her to manipulate objects when she would rather be nursing or sleeping. Each act is a test embedded in a long conversation that is further embedded in a broader context.
What comes next?
In the story, I suggest that the spoon must be both interesting and within an infant’s reach before it can become part of an ongoing conversation. In the same way, a good test should be both engaging and within a student’s reach in order to play its role in the conversation between student and teacher.
An engaging test of appropriate skills can tell us how a student understands what he or she is learning, but this knowledge, by itself, does not tell the teacher (or the student) what comes next. To find out, researchers must study how particular concepts and skills are learned over time. Only when we have done a good job describing how particular skills and concepts are learned can we predict what a student is most likely to benefit from learning next.
So, a good test must not only capture the nature of a particular student’s understanding, it must also be connected to knowledge about the pathways through which students come to understand the concepts and skills of the knowledge area it targets.
Back to conversation
I argue above that, in infancy, a sense of mastery is the intrinsic reward that motivates learning, while conversation is the fuel. If conversation is the fuel, tests that do a good job serving the conversational function I outline here are likely to fuel students’ natural pursuit of mastery and a lifelong love of learning.
Later: But what about accountability?
