Jaques’ Strata and Lectical Levels

We often receive queries about the relation between Lectical Levels and the Strata defined by Jaques. The following table shows the relation between Lectical Levels and Strata as Jaques defined them in Requisite Organization (Jaques, 1996). These relations were determined by using the Lectical Assessment System to score Jaques’ definitions. We have not yet had an opportunity to compare the results of scoring the same material with both the Lectical Assessment System and a scoring system based on Jaques’ definitions, as we have done in other comparisons of scoring systems. Our interpretation of Jaques’ Strata definitions may differ from the interpretations of other researchers, so this theoretical comparison may differ from an empirical one.

Strata by Lectical Level

References

Jaques, E. (1996). Requisite organization (2nd ed.). Arlington, VA: Cason Hall.

Maintaining inter-rater agreement

How we maintain inter-rater agreement and ensure high reliability at DTS/DiscoTest

First, we design assessments with 5-7 essay questions, partly because this number of items is required to achieve the level of reliability that allows us to distinguish 4 phases per Lectical Level. This corresponds to a corrected alpha of .95 or greater.
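To see why item count matters, consider the Spearman-Brown prophecy formula, a standard psychometric result that projects how reliability grows as parallel items are added. The sketch below is illustrative only: the single-item reliabilities are hypothetical values chosen for the example, not DTS figures, and the text above states only that 5-7 items correspond to a corrected alpha of .95 or greater.

```python
def spearman_brown(rho_item: float, n_items: int) -> float:
    """Spearman-Brown prophecy: projected reliability of a test made of
    n_items parallel items, given the reliability of a single item."""
    return n_items * rho_item / (1 + (n_items - 1) * rho_item)

# Hypothetical single-item reliabilities; with values in this range,
# projected reliability crosses .95 somewhere between 5 and 7 items.
for rho in (0.75, 0.80):
    for n in (5, 6, 7):
        print(f"single-item reliability {rho}, {n} items: "
              f"{spearman_brown(rho, n):.3f}")
```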

Second, we engage in continuous learning. Certified analysts and trainees attend mandatory weekly scoring meetings (called scoring circles) where they discuss scoring and review challenging cases.

Third, when we begin working with data from a new subject area, the scoring circle always examines a diverse sample of protocols before starting to score in earnest. Then, when we begin scoring a new assessment, two Certified Analysts score every performance until agreement rates are consistently at or above 85% within 1/4 of a level.

Fourth, we second score a percentage of all performances, some selected at random and some selected because the first analyst lacks confidence in his or her score.

  • 5%-10% of all assessments, selected at random, are second-scored by a blind analyst (a higher percentage on newer assessments or when the rate of inter-rater agreement is unacceptable).
  • A second, blind scorer is required to score an assessment any time the first scorer’s confidence level is below the level we call “confident”.

When the scores of the first and second scorers differ by more than 1 phase, the two scorers must reconcile through discussion. If they cannot reconcile, they must consult a third Certified Analyst.
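Here is a minimal sketch of the bookkeeping these rules imply, assuming scores are expressed as decimals on the Lectical scale with 4 phases per level (so 1 phase equals 1/4 level). The function names and sample scores are hypothetical:

```python
QUARTER_LEVEL = 0.25  # 4 phases per Lectical Level, so 1 phase = 1/4 level

def agreement_rate(pairs: list[tuple[float, float]],
                   tol: float = QUARTER_LEVEL) -> float:
    """Fraction of double-scored performances whose two scores
    fall within `tol` of each other (here, a quarter of a level)."""
    within = sum(1 for first, second in pairs if abs(first - second) <= tol)
    return within / len(pairs)

def needs_reconciliation(first: float, second: float) -> bool:
    """Scores differing by more than 1 phase must be reconciled by
    discussion, or referred to a third Certified Analyst."""
    return abs(first - second) > QUARTER_LEVEL

# Hypothetical paired scores from a first and a blind second scorer:
pairs = [(10.25, 10.25), (11.0, 11.25), (10.5, 11.25)]
print(f"{agreement_rate(pairs):.0%}")     # 67% -> below the 85% target
print(needs_reconciliation(10.5, 11.25))  # True -> discuss, then escalate
```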

Confidence levels
4 = very confident: exemplary, prototypical
3 = confident: no guesswork, not too much variation, no more than 2 responses where the scorer wavers, no lack of coherence, no language problems, adequate explanation, no suspicion of plagiarism, not idiosyncratic
2 = less than confident: guesswork, too much variation, more than 2 responses where the scorer wavers, lack of coherence, language problems, inadequate explanation, suspicion of plagiarism, idiosyncratic
1 = not confident at all: unscorable or almost unscorable, very idiosyncratic, very incoherent
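One way to encode this rubric, together with the second-scoring rule above (a blind second score is required whenever confidence falls below “confident”), is sketched here. This is an illustration, not DTS’s actual tooling:

```python
from enum import IntEnum

class Confidence(IntEnum):
    NOT_CONFIDENT = 1        # unscorable or almost unscorable
    LESS_THAN_CONFIDENT = 2  # guesswork, incoherence, language problems
    CONFIDENT = 3            # no guesswork, coherent, adequately explained
    VERY_CONFIDENT = 4       # exemplary, prototypical

def requires_blind_second_score(confidence: Confidence) -> bool:
    """A blind second scorer is required whenever the first scorer's
    confidence is below the level called 'confident'."""
    return confidence < Confidence.CONFIDENT
```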

The limitations of testing

It is important for those of us who use assessments to ensure that they (1) measure what we say they measure, (2) measure it reliably enough to justify the distinctions we claim to make between and within persons, and (3) are used responsibly. It is relatively easy for testing experts to create assessments that are adequately reliable (2) for individual assessment. It is more difficult to show that a test measures the construct of interest (1), but there are reasonable methods for demonstrating that an assessment meets this standard. Ensuring that assessments are used responsibly (3) is the most difficult of the three.

Few consumers of tests are aware of their inherent limitations. Even the best tests, those that are highly reliable and measure what they are supposed to measure, provide only a limited amount of information. This is true of all measures. The more we home in on a measurable dimension—in other words, the greater our precision becomes—the narrower the construct becomes. Time, weight, height, and distance are all extremely narrow constructs. This means that they provide a very specific piece of information extremely well. When we use a ruler, we can have great confidence in the measurement we make, down to very small lengths (depending on the ruler, of course). No one doubts the great advantages of this kind of precision. But we can’t learn anything else about the measured object. Its length usually cannot tell us what the object is, how it is shaped, its color, its use, its weight, how it feels, how attractive it is, or how useful it is. We only know how long it is. To provide an accurate account of the thing that was measured, we need to know many more things about it, and we need to construct a narrative that brings these things together in a meaningful way.

A really good psychological measure is similar. The LAS (Lectical Assessment System), for example, is designed to go to the heart of development, stripping away everything that does not contribute to the pure developmental “height” of a given performance. Without knowledge of many other things—such as the ways of thinking that are generally associated with this “height” in a particular domain, the specific ideas that are associated with this particular performance, information from other performances on other measures, qualitative observations, and good clinical judgment—we cannot construct a terribly useful narrative.

And this brings me to my final point: A formal measure, no matter how great it is, should always be employed by a knowledgeable mentor, clinician, teacher, consultant, or coach as a single item of information about a given client that may or may not provide useful insights into relevant needs or capabilities. Consider this relatively simple example: a given 2-year-old may be tall for his age, but if he is somewhat underweight for his age, the latter measure may seem more important. However, if he has a broken arm, neither measure may loom large—at least until the bone is set. Once the arm is safely in a cast, all three pieces of information—weight, height, and broken arm—may contribute to a clinical diagnosis that would have been difficult to make without any one of them.

It is my hope that the educational community will choose to adopt high standards for measurement, then put measurement in its place—alongside good clinical judgment, reflective life experience, qualitative observations, and honest feedback from trusted others.

What is a holistic assessment?

Thirty years ago, when I was a hippy midwife, the idea of holism began to slip into the counter-culture. A few years later, this much-misunderstood notion was all the rage on college campuses. By the time I was in graduate school in the nineties, there was an impassable division between the trendy postmodern holists and the rigidly old-fashioned modernists. You may detect a slight mocking tone, and rightly so. People with good ideas on both sides made themselves look pretty silly by refusing, for example, to use any of the tools associated with the other side. One of the more tragic outcomes of this silliness was the emergence of the holistic assessment.

Simply put, the holistic assessment is a multidimensional assessment that is designed to take a more nuanced, textured, or rich approach to assessment. Great idea. Love it.

It’s the next part that’s silly. Having collected rich information on multiple dimensions, the test designers sum up a person’s performance with a single number. Why is this silly? Because the so-called holistic score becomes pretty much meaningless. Two people with the same score can have very little in common. For example, let’s imagine that a holistic assessment examines emotional maturity, perspective taking, and leadership thinking. Two people receive a score of 10 that may be accompanied by boilerplate descriptions of what emotional maturity, perspective taking, and leadership thinking look like at level 10. However, person one was actually weak in perspective taking and strongest in leadership, and person two was weak in emotional maturity and strongest in perspective taking. The score of 10, it turns out, means something quite different for these two people. I would argue that it is relatively meaningless because there is no way to know, based on the single “holistic” score, how best to support the development of these distinct individuals.
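The problem is easy to demonstrate. In the sketch below, the dimension names come from the example above, while the sub-scores themselves are hypothetical; both profiles collapse to the same “holistic” 10:

```python
# Hypothetical sub-scores for the two people in the example above.
person_one = {"emotional_maturity": 3, "perspective_taking": 2, "leadership": 5}
person_two = {"emotional_maturity": 2, "perspective_taking": 5, "leadership": 3}

# Both collapse to the same "holistic" score...
assert sum(person_one.values()) == sum(person_two.values()) == 10

# ...but the profiles that should drive feedback are entirely different:
print(min(person_one, key=person_one.get))  # perspective_taking
print(min(person_two, key=person_two.get))  # emotional_maturity
```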

Holism has its roots in system dynamics, where measurements are used to build rich models of systems. All of the measurements are unidimensional. They are never lumped together into “holistic” measures. That would be equivalent to talking about the temperaturelength of a day or the lengthweight of an object*. It’s essential to measure time, weight, and length with appropriate metrics and then to describe their interrelationships and the outcomes of these interrelationships. The language used to describe these is the language of probability, which is sensitive to differences in the measurement of different properties.

In psychological assessment, dimensionality is a challenging issue. What constitutes a single dimension is a matter for debate. For DTS, the primary consideration is how useful an assessment will be in helping people learn and grow. So, we tend to construct individual assessments, each of which represents a fairly tightly defined content space, and we use only one metric to determine the level of a performance. The meaning of a given score is both universal (it is an order of hierarchical complexity and phase on the skill scale) and contextual (it is assigned to a performance in a particular domain in a particular context, and is associated with particular content). We independently analyze the content of the performance to determine its strengths and weaknesses—relative to its level and the known range of content associated with that level—and provide feedback about these strengths and weaknesses as well as targeted learning suggestions. We use the level score to help us tell a useful story about a particular performance, without claiming to measure “lengthweight”. This is accomplished by the rigorous separation of structure (level) and content.

*If we described objects in terms of their lengthweight, an object that was 10 inches long and 2 lbs could have a lengthweight of 12, but so could an object that was 2 inches long and 10 lbs.
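To make the separation of structure (level) and content concrete, here is a hypothetical report shape. It is a sketch of the idea described above, not DTS’s actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class LecticalReport:
    """Illustrative only: level is the single structural metric;
    everything else is content, analyzed and reported separately."""
    level: float                       # universal: complexity order + phase
    domain: str                        # contextual: the assessed content space
    strengths: list[str] = field(default_factory=list)
    weaknesses: list[str] = field(default_factory=list)
    learning_suggestions: list[str] = field(default_factory=list)
```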

Task demands and capabilities (the complexity gap)

Our developmental assessment system, called the Lectical Assessment System (LAS), can be used to score (a) the performances of persons and (b) the task demands of specific situations/contexts. For example, my colleagues and I have analyzed the task demands of levels of management in large organizations, and tested managers' developmental level of performance in several skill areas—including reasoning about leadership, reflective judgment, and decision-making.

The figure on the left shows the relation between the task demands of 7 levels of management and the performance levels of managers occupying these management positions. In this oversimplified image, the task demands of most management positions increase in a linear fashion, spanning levels 10-13. The capabilities of managers do not, for the most part, match these task demands.
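The pattern in the figure can be sketched in code, though the numbers below are hypothetical stand-ins rather than the actual data: task demands rise roughly linearly across the 7 management levels, while capabilities lag increasingly behind.

```python
# Hypothetical illustration of the "complexity gap"; not the real data.
task_demands = [10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0]  # linear, 10-13
capabilities = [10.0, 10.4, 10.7, 10.9, 11.1, 11.3, 11.4]  # made-up values

for tier, (demand, skill) in enumerate(zip(task_demands, capabilities), 1):
    print(f"management level {tier}: gap = {demand - skill:+.1f}")
```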

This pattern is pervasive—we see it everywhere we look—and it reflects a hard truth. None of us is capable of meeting the task demands of the most complex situations in today's world. I've come to believe that in many situations our best hope for meeting these demands is to (1) work strategically on the development of our own skills and knowledge, (2) learn to work closely with others who represent a wide range of perspectives and areas of expertise, and (3) use the best tools available to scaffold our thinking.

We aren't alone. Others have observed and remarked upon this pattern:

Jaques, E. (1976). A general theory of bureaucracy. London: Heinemann Educational.

Habermas, J. (1975). Legitimation crisis (T. McCarthy, Trans.). Boston: Beacon Press.

Kegan, R. (1994). In over our heads: The mental demands of modern life. Cambridge, MA: Harvard University Press.

Bell, D. (1973). The coming of post-industrial society. New York: Basic Books.

About measurement

The story of how measurement permits scientific advance can be illustrated through any number of examples. One such example is the measurement of temperature and its effects on our understanding of the molecular structure of lead and other elemental substances.

The tale begins with an assortment of semi-mythical early scientists, who agreed in their observations that lead melts only when it is very hot—much hotter than the temperature at which ice melts, and quite a bit cooler than the temperature at which iron melts. These observations, made repeatedly, resulted in the hypothesis that lead melts at a particular temperature.

To test this theory it was necessary to develop a standard for measuring temperature. A variety of early thermometers were developed and implemented. Partly because these early temperature-measuring devices were poorly calibrated, and partly because different temperature-measuring devices employed different scales, the temperature at which lead melted seemed to vary from device to device and context to context.

Scientists divided into a number of ‘camps’. One group argued that there were multiple pathways toward melting, which explained why the melting seemed to occur at different temperatures. Another group argued that the melting of lead could not be understood apart from the context in which the melting occurs. Only when a measure of temperature had been adequately developed and widely accepted did it become possible to observe that lead consistently melts at about 327° C.

Armed with this knowledge, scientists asked what it is about lead that causes it to melt at this particular temperature. They then developed hypotheses about the factors contributing to this phenomenon, observing that changes in altitude or air pressure seemed to result in small differences in its melting temperature. So, context did seem to play a role! In order to observe these differences more accurately, the measurement of temperature was further refined. The resulting observations provided information that ultimately contributed to an understanding of lead’s and other elements’ molecular structure.

While parts of this story are fictional, it is true that the thermometer has greatly contributed to our understanding of the properties of lead. Interestingly, the thermometer, like all other measures, emerged from qualitative observations (in this case, observations about the effects of different amounts of heat) that were quantified over time. The value of the thermometer, as we all know, extends far beyond its use as a measure of the melting temperature of lead. The thermometer is a measure of temperature in general, meaning that it can be employed to measure temperature in an almost limitless range of substances and contexts. It is this generality, in the end, that makes it possible to investigate the impact of context on the melting temperature of a substance, or to compare the relative melting temperatures of a range of elemental substances. This generality (or context-independence) is one of the primary features of a good measure.

Good measurement requires (1) the identification of a unidimensional, content- and context-independent trait (temperature, length, time); (2) a system for assessing the amount of the trait; (3) determinations of the reliability and validity of the assessments; and finally (4) the calibration of a measure. A good thermometer has all of the qualities of a good measure. It is a well-calibrated instrument that can be employed to accurately and reliably measure a general, unidimensional trait across a wide range of contexts.
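Step (4) can be illustrated with the thermometer itself. The sketch below performs a simple two-point calibration against known reference temperatures (melting ice and boiling water); the raw readings are hypothetical:

```python
def calibrate_two_point(raw_low: float, raw_high: float,
                        ref_low: float = 0.0, ref_high: float = 100.0):
    """Return a function mapping raw thermometer readings to degrees
    Celsius, using two reference points (melting ice, boiling water)."""
    scale = (ref_high - ref_low) / (raw_high - raw_low)
    return lambda raw: ref_low + (raw - raw_low) * scale

# Hypothetical raw readings taken in ice water and boiling water:
to_celsius = calibrate_two_point(raw_low=4.0, raw_high=96.0)
print(to_celsius(4.0))   # 0.0
print(to_celsius(96.0))  # 100.0
print(to_celsius(50.0))  # 50.0
```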

It was this perspective on measurement that first inspired me to try to find a good general measure of the developmental dimension. To read more about how this way of thinking relates to the Lectical Assessment System (LAS), read About Measurement on the DTS site. Pay special attention to the list of things we can do with the LAS.