Maintaining inter-rater agreement

How we maintain inter-rater agreement and ensure high reliability at DTS/DiscoTest

First, we design assessments with 5-7 essay questions, partly because this many questions are needed to reach a level of reliability that lets us distinguish 4 phases per lectical level. This corresponds to a corrected alpha of .95 or greater.
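For readers who want to see how an alpha coefficient is computed, here is a minimal sketch of the standard (uncorrected) Cronbach's alpha. It is an illustration only: the correction behind the "corrected alpha" is not specified here, and the function name and the data are hypothetical.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Standard Cronbach's alpha.

    rows: one list per test-taker, each holding that person's k item scores.
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least 2 items and 2 test-takers")
    # Variance of each item (column) across test-takers.
    item_variances = [variance(column) for column in zip(*rows)]
    # Variance of each test-taker's total score.
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

# Hypothetical data: 4 test-takers, 5 essay questions, scores in levels.
scores = [
    [10.00, 10.25, 10.00, 10.25, 10.00],
    [10.50, 10.50, 10.75, 10.50, 10.50],
    [11.00, 11.25, 11.00, 11.00, 11.25],
    [11.50, 11.50, 11.75, 11.50, 11.50],
]
alpha = cronbach_alpha(scores)
```

With everything else equal, adding more items that agree with one another pushes alpha up, which is one reason an assessment would use several essay questions rather than one or two.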

Second, we engage in continuous learning. Certified analysts and trainees attend mandatory weekly scoring meetings (called scoring circles) where they discuss scoring and review challenging cases.

Third, when we begin working with data from a new subject area, the scoring circle always examines a diverse sample of protocols before starting to score in earnest. Then, when we begin scoring a new assessment, two Certified Analysts score every performance until agreement rates are consistently at or above 85% within 1/4 of a level.
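The 85% criterion can be sketched as a simple calculation. This is a minimal illustration, assuming scores are recorded in levels with one phase equal to 0.25; the function name and the sample scores are hypothetical, and a single batch like this one does not by itself show that agreement is "consistently" at or above the threshold.

```python
def agreement_rate(first_scores, second_scores, tolerance=0.25):
    """Share of performances on which two analysts' scores differ by
    no more than `tolerance` levels (0.25 = 1/4 of a level)."""
    if len(first_scores) != len(second_scores):
        raise ValueError("score lists must be the same length")
    if not first_scores:
        raise ValueError("need at least one pair of scores")
    within = sum(
        1 for a, b in zip(first_scores, second_scores)
        if abs(a - b) <= tolerance
    )
    return within / len(first_scores)

# Hypothetical scores from two Certified Analysts on ten performances.
analyst_a = [10.00, 10.25, 10.50, 11.00, 10.75, 11.25, 10.50, 10.00, 11.00, 10.25]
analyst_b = [10.25, 10.25, 10.50, 10.50, 10.75, 11.00, 10.50, 10.25, 11.00, 10.25]

rate = agreement_rate(analyst_a, analyst_b)   # 9 of 10 pairs are within 1/4 level
meets_threshold = rate >= 0.85
```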

Fourth, we second-score a percentage of all performances, some selected at random and some selected because the first analyst lacks confidence in his or her score.

  • 5%-10% of all assessments, selected at random, are second-scored by a blind analyst (a higher percentage on newer assessments or when the rate of inter-rater agreement is unacceptable).
  • A second, blind scorer is required to score an assessment any time the first scorer’s confidence level is below the level we call “confident”.

When the scores of the first and second scorers differ by more than 1 phase, the two scorers must reconcile through discussion. If they cannot reconcile, they must consult a third Certified Analyst.

Confidence levels
4 = very confident: exemplary, prototypical
3 = confident: no guesswork, not too much variation, no more than 2 responses where scorer wavers, no lack of coherence, no language problems, adequate explanation, no suspicion of plagiarism, not idiosyncratic
2 = less than confident: guesswork, too much variation, more than 2 responses where scorer wavers, lack of coherence, language problems, inadequate explanation, suspicion of plagiarism, idiosyncratic
1 = not confident at all: unscorable or almost unscorable, very idiosyncratic, very incoherent
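The second-scoring and reconciliation rules above can be sketched as two small decision functions. This is an illustration under stated assumptions, not a description of the actual DTS workflow software: the function names, the default 10% sampling rate, and the convention of counting scores in whole phases are all hypothetical.

```python
import random

CONFIDENT = 3  # level 3 on the 1-4 confidence scale above

def needs_second_score(confidence, random_rate=0.10, rng=random):
    """Decide whether a blind second analyst should score a performance.

    confidence: the first scorer's confidence level (1-4).
    random_rate: share of confident scores sampled at random
                 (assumed 10% here; the text gives a 5%-10% range).
    """
    if confidence < CONFIDENT:
        return True                      # below "confident": always second-score
    return rng.random() < random_rate    # otherwise sample at random

def needs_reconciliation(first_phase, second_phase):
    """Scores counted in phases (4 per level). A gap of more than
    1 phase means the two scorers must reconcile through discussion."""
    return abs(first_phase - second_phase) > 1
```

For example, `needs_second_score(2)` is always true, while `needs_reconciliation(42, 43)` is false because the two scores are only one phase apart.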

The limitations of testing

It is important for those of us who use assessments to ensure that they (1) measure what we say they measure, (2) measure it reliably enough to justify claimed distinctions between and within persons, and (3) are used responsibly. It is relatively easy for testing experts to create assessments that are adequately reliable (2) for individual assessment, and although it is more difficult to show that these tests measure the construct of interest (1), there are reasonable methods for showing that an assessment meets this standard. However, it is more difficult to ensure that assessments are used responsibly (3).

Few consumers of tests are aware of their inherent limitations. Even the best tests, those that are highly reliable and measure what they are supposed to measure, provide only a limited amount of information. This is true of all measures. The more we home in on a measurable dimension (in other words, the greater our precision becomes), the narrower the construct becomes. Time, weight, height, and distance are all extremely narrow constructs. This means that they provide a very specific piece of information extremely well. When we use a ruler, we can have great confidence in the measurement we make, down to very small lengths (depending on the ruler, of course). No one doubts the great advantages of this kind of precision. But we can’t learn anything else about the measured object. Its length usually cannot tell us what the object is, how it is shaped, its color, its use, its weight, how it feels, how attractive it is, or how useful it is. We only know how long it is. To provide an accurate account of the thing that was measured, we need to know many more things about it, and we need to construct a narrative that brings these things together in a meaningful way.

A really good psychological measure is similar. The LAS (Lectical Assessment System), for example, is designed to go to the heart of development, stripping away everything that does not contribute to the pure developmental “height” of a given performance. Without knowledge of many other things—such as the ways of thinking that are generally associated with this “height” in a particular domain, the specific ideas that are associated with this particular performance, information from other performances on other measures, qualitative observations, and good clinical judgment—we cannot construct a terribly useful narrative.

And this brings me to my final point: A formal measure, no matter how great it is, should always be employed by a knowledgeable mentor, clinician, teacher, consultant, or coach as a single item of information about a given client that may or may not provide useful insights into relevant needs or capabilities. Consider this relatively simple example: a given 2-year-old may be tall for his age, but if he is somewhat underweight for his age, the latter measure may seem more important. However, if he has a broken arm, neither measure may loom large, at least until the bone is set. Once the arm is safely in a cast, all three pieces of information (weight, height, and broken arm) may contribute to a clinical diagnosis that would have been difficult to make without any one of them.

It is my hope that the educational community will choose to adopt high standards for measurement, then put measurement in its place—alongside good clinical judgment, reflective life experience, qualitative observations, and honest feedback from trusted others.

What is a holistic assessment?

Thirty years ago, when I was a hippy midwife, the idea of holism began to slip into the counter-culture. A few years later, this much misunderstood notion was all the rage on college campuses. By the time I was in graduate school in the nineties there was an impassable division between the trendy postmodern holists and the rigidly old-fashioned modernists. You may detect a slightly mocking tone, and rightly so. People with good ideas on both sides made themselves look pretty silly by refusing, for example, to use any of the tools associated with the other side. One of the more tragic outcomes of this silliness was the emergence of the holistic assessment.

Simply put, the holistic assessment is a multidimensional assessment that is designed to take a more nuanced, textured, or rich approach to assessment. Great idea. Love it.

It’s the next part that’s silly. Having collected rich information on multiple dimensions, the test designers sum up a person’s performance with a single number. Why is this silly? Because the so-called holistic score becomes pretty much meaningless. Two people with the same score can have very little in common. For example, let’s imagine that a holistic assessment examines emotional maturity, perspective taking, and leadership thinking. Two people receive a score of 10 that may be accompanied by boilerplate descriptions of what emotional maturity, perspective taking, and leadership thinking look like at level 10. However, person one was actually weak in perspective taking and strongest in leadership, and person two was weak in emotional maturity and strongest in perspective taking. The score of 10, it turns out, means something quite different for these two people. I would argue that it is relatively meaningless because there is no way to know, based on the single “holistic” score, how best to support the development of these distinct individuals.

Holism has its roots in system dynamics, where measurements are used to build rich models of systems. All of the measurements are unidimensional. They are never lumped together into “holistic” measures. That would be equivalent to talking about the temperaturelength of a day or the lengthweight of an object*. It’s essential to measure time, weight, and length with appropriate metrics and then to describe their interrelationships and the outcomes of these interrelationships. The language used to describe these is the language of probability, which is sensitive to differences in the measurement of different properties.

In psychological assessment, dimensionality is a challenging issue. What constitutes a single dimension is a matter for debate. For DTS, the primary consideration is how useful an assessment will be in helping people learn and grow. So, we tend to construct individual assessments, each of which represents a fairly tightly defined content space, and we use only one metric to determine the level of a performance. The meaning of a given score is both universal (it is an order of hierarchical complexity and phase on the skill scale) and contextual (it is assigned to a performance in a particular domain in a particular context, and is associated with particular content). We independently analyze the content of the performance to determine its strengths and weaknesses (relative to its level and the known range of content associated with that level) and provide feedback about these strengths and weaknesses as well as targeted learning suggestions. We use the level score to help us tell a useful story about a particular performance, without claiming to measure “lengthweight”. This is accomplished by the rigorous separation of structure (level) and content.

*If we described objects in terms of their lengthweight, an object that was 10 inches long and 2 lbs could have a lengthweight of 12, but so could an object that was 2 inches long and 10 lbs.
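The footnote's arithmetic can be written out in a few lines. This is only an illustration of the point that a summed score hides the profile behind it; the `Thing` type and the `lengthweight` function are hypothetical names invented for the example.

```python
from typing import NamedTuple

class Thing(NamedTuple):
    length_in: float   # length in inches
    weight_lb: float   # weight in pounds

def lengthweight(obj):
    """The made-up composite from the footnote: length plus weight."""
    return obj.length_in + obj.weight_lb

long_and_light = Thing(length_in=10, weight_lb=2)
short_and_heavy = Thing(length_in=2, weight_lb=10)

# Both composites come out to 12, yet the two objects differ on
# every dimension that was actually measured.
same_composite = lengthweight(long_and_light) == lengthweight(short_and_heavy)
same_profile = long_and_light == short_and_heavy
```

Here `same_composite` is true and `same_profile` is false, which is the whole problem with the "holistic" score: the number alone cannot tell the two objects apart.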

Teacher pay and standardized test results

At the end of October, the Century Foundation released a paper entitled, Eight reasons not to tie teacher pay to standardized test results. I agree with their conclusions, and would add that even if all standardized tests were extremely reliable and measured exactly what they intended to measure, this would be a bad idea. This is because success in the adult world requires a multiplicity of skills and forms of knowledge, and tests focus on only some of these, one at a time. Until we can construct multifaceted longitudinal stories about the progress of individual students that are tied to a non-arbitrary standardized metric, we should not even consider linking student evaluations to teacher pay.

Promoting development

There is a vast literature exploring ways to promote development. Much of this literature focuses on speeding up development; some of it focuses on optimizing development. Although both approaches are intended to support development, there is evidence that approaches focused on optimizing development are likely to do a better job. This is because development involves two intertwined processes, differentiation (broadening and deepening knowledge) and integration (organizing that knowledge). In plain(er) English, you get more adequate integrations at each level if you accomplish rich differentiation at the prior level.

When we code an assessment, we pay close attention to the degree to which the test-taker elaborates each of the sub-skills it targets. In our personal feedback, we note areas of strength and areas that appear to require further growth. The basic idea is to bring all of the sub-skills up to an optimal level of elaboration to support the emergence of next-level integrations.

Most of the readings we suggest are targeted one to two phases (1/4 to 1/2 of a level) above the level of a given performance. This practice has been shown to provide the ideal level of challenge (scaffolding) for optimal growth. We also suggest activities like engaging in discourse with peers, journaling, cultivating a habit of reflection, and improving metacognitive skills, all of which provide support for growth.
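The targeting rule is simple arithmetic, sketched below. It assumes scores are expressed in levels with 4 phases per level, as described earlier; the function name is hypothetical.

```python
PHASES_PER_LEVEL = 4  # one phase = 1/4 of a level

def reading_target_range(performance_level):
    """Return the (low, high) level range for suggested readings:
    one to two phases above the level of the performance."""
    one_phase = 1 / PHASES_PER_LEVEL
    return (performance_level + one_phase, performance_level + 2 * one_phase)

# A performance scored at level 10.5 gets readings pitched
# between 10.75 and 11.0.
low, high = reading_target_range(10.5)
```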

We do not teach people to think at higher levels. Higher levels of performance emerge when knowledge is adequately elaborated and the environment supports higher levels of thinking and performance. We focus on helping people to think better at their current level and challenging them to elaborate their current knowledge and skills—including the not-so-sexy nuts-and-bolts knowledge required for success in any context.

Task demands and capabilities

Our developmental assessment system, called the Lectical Assessment System (LAS), can be used to score (a) the performances of persons and (b) the task demands of specific situations/contexts. For example, my colleagues and I have analyzed the task demands of levels of management in large organizations, and tested managers’ developmental level of performance in several skill areas—including reasoning about leadership, reflective judgment, and decision-making.

The figure above shows the relation between the task demands of 7 levels of management and the performance levels of managers occupying these management positions. In this oversimplified image, the task demands of most management positions increase in a linear fashion, spanning levels 10-13. The capabilities of managers do not, for the most part, match these task demands.

This pattern is pervasive—we see it everywhere we look—and it reflects a hard truth. None of us is capable of meeting the task demands of the most complex situations in today’s world. I’ve come to believe that in many situations our best hope for meeting these demands is to (1) work strategically on the development of our own skills and knowledge, (2) learn to work closely with others who represent a wide range of perspectives and areas of expertise, and (3) use the best tools available to scaffold our thinking.

We aren’t alone. Others have observed and remarked upon this pattern:

Jaques, E. (1976). A general theory of bureaucracy. London: Heinemann Educational.

Habermas, J. (1975). Legitimation crisis (T. McCarthy, Trans.). Boston: Beacon Press.

Kegan, R. (1994). In over our heads: The mental demands of modern life. Cambridge, MA: Harvard University Press.

Bell, D. (1973). The coming of post-industrial society. New York: Basic Books.

About measurement

The story of how measurement permits scientific advance can be illustrated through any number of examples. One such example is the measurement of temperature and its effects on our understanding of the molecular structure of lead and other elemental substances.

The tale begins with an assortment of semi-mythical early scientists, who agreed in their observations that lead only melts when it is very hot—much hotter than the temperature at which ice melts, and quite a bit cooler than the temperature at which iron melts. These observations, made repeatedly, resulted in the hypothesis that lead melts at a particular temperature.

To test this theory it was necessary to develop a standard for measuring temperature. A variety of early thermometers were developed and implemented. Partly because these early temperature-measuring devices were poorly calibrated, and partly because different temperature-measuring devices employed different scales, the temperature at which lead melted seemed to vary from device to device and context to context.

Scientists divided into a number of ‘camps’. One group argued that there were multiple pathways toward melting, which explained why the melting seemed to occur at different temperatures. Another group argued that the melting of lead could not be understood apart from the context in which the melting occurs. Only when a measure of temperature had been adequately developed and widely accepted did it become possible to observe that lead consistently melts at about 327 °C.

Armed with this knowledge, scientists asked what it is about lead that causes it to melt at this particular temperature. They then developed hypotheses about the factors contributing to this phenomenon, observing that changes in altitude or air pressure seemed to result in small differences in its melting temperature. So, context did seem to play a role! In order to observe these differences more accurately, the measurement of temperature was further refined. The resulting observations provided information that ultimately contributed to an understanding of lead’s and other elements’ molecular structure.

While parts of this story are fictional, it is true that the thermometer has greatly contributed to our understanding of the properties of lead. Interestingly, the thermometer, like all other measures, emerged from what were originally qualitative observations about the effects of different amounts of heat that were quantified over time. The value of the thermometer, as we all know, extends far beyond its use as a measure of the melting temperature of lead. The thermometer is a measure of temperature in general, meaning that it can be employed to measure temperature in an almost limitless range of substances and contexts. It is this generality, in the end, that makes it possible to investigate the impact of context on the melting temperature of a substance, or to compare the relative melting temperatures of a range of elemental substances. This generality (or context-independence) is one of the primary features of a good measure.

Good measurement requires (1) the identification of a unidimensional, content- and context-independent trait (temperature, length, time); (2) a system for assessing the amount of the trait; (3) determinations of the reliability and validity of the assessments; and finally (4) the calibration of a measure. A good thermometer has all of the qualities of a good measure. It is a well-calibrated instrument that can be employed to accurately and reliably measure a general, unidimensional trait across a wide range of contexts.

It was this perspective on measurement that first inspired me to try to find a good general measure of the developmental dimension. To read more about how this way of thinking relates to the Lectical Assessment System (LAS), read About Measurement on the DTS site. Pay special attention to the list of things we can do with the LAS.

What is a developmental assessment?

A developmental assessment is a test of knowledge and thinking that is based on extensive research into how students come to learn specific concepts and skills over time. All good developmental assessments require test-takers to show their thinking by making written or oral arguments in support of their judgments. Developmental assessments are less concerned about “right” answers and more concerned with how students use their knowledge and thinking skills to solve problems. A good developmental assessment should be educative in the sense that taking it is a learning experience in its own right, and each score is accompanied by feedback that tells students what they are most likely to benefit from learning next.

Integrative complexity and the LAS

Suedfeld and Tetlock’s Integrative Complexity Scale is one of a number of developmental scales, most of which have been informed by Jean Piaget’s cognitive developmental theory, that subscribe to the notion of hierarchical integration. Piagetian and neo-Piagetian theorists view development as a process of differentiation (increasing knowledge) and integration (organizing knowledge). Rather than viewing learning as an additive process in which we simply accumulate bits of knowledge over time, integrative theories propose that learning is an active process through which we organize our knowledge in particular ways, depending on where we are in our development. Moving from one developmental level to another involves a reorganization of our knowledge that translates into a new way of thinking.

For example, when most 6-year-olds think about lying, they are likely to think about it in terms of a single consequence—keeping out of trouble, getting into trouble, or making Dad sad. An eight-year-old can think about lying in terms of multiple possible consequences—getting in trouble and keeping out of trouble, which makes it possible to decide which outcome is more likely given past experience. You can view a more detailed description of this process in an online article, The Lectical Assessment System.

Suedfeld and Tetlock’s Integrative Complexity Scoring System (ICSS), like the Lectical Assessment System (LAS) and the General Hierarchical Complexity Scoring System (HCSS), is a content-independent scoring system that can be used to score the level of integrative complexity in a wide range of texts. What differs between these scoring systems are the scoring rules. Here, I discuss the difference between the scoring rules of the LAS and the ICSS.

The LAS goes to the heart of differentiation and integration by asking analysts to examine the way arguments are explicitly structured (single elements, linear arguments, or systems) and the way the meanings of their elements are implicitly structured (single elements, linear arguments, or systems). We call this core structure. The LAS has been subjected to a number of psychometric studies and has been shown to be a valid and reliable measure of the cognitive-developmental dimension, reliably (in the statistical sense) distinguishing 20 developmental phases between age 5 and the highest levels of adulthood.

Domain-based developmental assessment systems generally target conceptual content and aspects of surface structure. The ICSS relies primarily upon indicators of surface structure. In other words, instead of directly examining core structures, the developers of this system focus on a number of indicators that point to these core structures, including things like perspective, compartmentalization, setting up “straw men”, inclusion/exclusion rules, conflict avoidance, recognizing “exceptions to the rule”, probability statements, etc. The reliability of this assessment is generally too low to justify its clinical use (i.e., to provide a score for an individual), and some forms of the assessment do not appear to meet the reliability requirements for group studies. (See Reliability 2: How high should it be?)

Testing as part of learning 2

I can’t help it, I’m a developmental psychologist. I’ve been lurking about, watching my granddaughter, Erwin, as she learns to master her environment. She’s about 8 months old now (that’s her real age; she was three months premature, so her birth age is 11 months).

Last week, Erwin figured out that complex actions can be used intentionally to make things happen in social situations. For example, she started reaching toward her Mom and Dad to indicate her intention to be picked up. At around the same time, she began pointing to objects to indicate interest or draw them to the attention of others. And she has begun to imitate actions like waving, clapping, and head shaking. Today, when we were Skyping, she clapped her hands to get me to play pat-a-cake, and she shook her head to get her Mom to do the same, which she finds hilarious. To Mom’s dismay, Erwin is so excited by this new way of influencing her environment that she has stopped napping.

To see an example of Erwin’s attempts at verbal communication and her new reaching behavior, double-click on the picture below. Notice how emphatic her arm extension is, and how she makes eye contact as she reaches out.

A few months ago, most of Erwin’s actions were aimed toward physical mastery: learning to obtain objects and manipulate them in a variety of ways, learning to move herself toward things she wanted to manipulate, or playing with sound just to hear the results.

When she was learning to do physical things, the physical environment provided most of the feedback. Although her parents were there to give encouragement, we all had the sense that it was the physical feedback that she craved—getting an object to her mouth, inching toward a favorite toy, pulling herself to stand.

Now she craves feedback from her parents; she has shifted her focus from physical mastery to social mastery. She reaches for Mom and gets picked up. She shakes her head and Mom shakes her head back. She points to a banana, and Dad brings it to her. She claps her hands, and Grandma plays pat-a-cake. And every time she undertakes a new action, she is conducting a test.

Testing is part of learning.

Each time any infant tries out a new skill, she is conducting a test. Each attempt is part of an action-feedback loop. Repeated attempts to master a new skill form a series of these action-feedback loops. Each iteration is an exemplary test—in the sense that it is educative—that guides the infant incrementally toward a new level of mastery.

Interestingly, infants never tire of this kind of testing, even when the feedback is not instantly gratifying. In fact, much of the feedback is along the lines of, “almost, but not quite,” or “that didn’t work,” neither of which seems to get in the way of infant learning. For example, when Erwin first started reaching toward her parents to ask to be picked up, her action was not easy to read. It rarely got the desired response. She gradually learned that the reaching needed to be clearly directed toward the parent and accompanied by eye contact. Now the message is, “You’ve got it!” At this point, Erwin takes the skill for granted, and has shifted her attention to things she has not yet mastered, like figuring out how to get adults to do other interesting or gratifying things.

The natural action-feedback mechanism of infancy works perfectly, because the proverbial carrot is usually, due to the very nature of normal human environments, dangled at just the right distance. Good parents respond to early attempts at communication, rewarding them with interesting responses, but success isn’t the only reward; it’s always accompanied by a new “carrot”—another interesting possibility just beyond the infant’s reach. In this way, the action-feedback mechanism functions both as an aid to learning and as a motivator.

Aspects of this “carrot-and-stick” perspective on learning have been expanded and described in a variety of research traditions, e.g., as part of the notion of reinforcement feedback in social learning theory (Bandura, 1977), as the zone of proximal development in Vygotsky’s (1986) work, and as part of a complex process of assimilation and accommodation in Piaget’s (1985) work. This perspective is important because it speaks both to how we learn and to our motivation for learning. Good feedback plays two essential roles. First, it helps the learner decide what to try next. Second, it motivates the learner to keep striving toward mastery. And, as the infant example suggests, feedback cannot be reduced to simple reward or punishment. Ideally, it is information that supports learning by being useful to the learner. Learners are not motivated by reward or punishment per se, but by an optimal combination of “not there yet,” “almost,” and “you’ve got it.”

DiscoTests are for learning

Most of today’s tests provide feedback in the form of rewards (good grades, advancement, or honors) or punishment (bad grades and failure). My colleagues and I don’t find this acceptable, so we’ve created a nonprofit called DiscoTest. The overarching objective of the DiscoTest Initiative is to contribute to the development of optimal learning environments by creating assessments that deliver the kind of educative feedback that learners need to learn optimally. DiscoTests determine where students are in their individual learning trajectories and provide feedback that points toward the next incremental step toward mastery.

I’ll be writing more about DiscoTest in future posts. For now, if you’d like to know more, please visit the DiscoTest web site.