Adaptive learning, big data, and the meaning of learning

Knewton defines adaptive learning as "A teaching method premised on the idea that the curriculum should adapt to each user." In a recent blog post, Knewton's COO, David Liu, expanded on this definition. Here are some extracts:

You have to understand and have real data on content… Is the instructional content teaching what it was intended to teach? Is the assessment accurate in terms of what it’s supposed to assess? Can you calibrate that content at scale so you’re putting the right thing in front of a student, once you understand the state of that student? 

On the other side of the equation, you really have to understand student proficiency… understanding and being able to predict how that student is going to perform, based upon what they’ve done and based upon that content that I talked about before. And if you understand how well the student is performing against that piece of content, then you can actually begin to understand what that student needs to be able to move forward.

The idea of putting the right thing in front of a student is very cool. That's part of what we do here at Lectica. But what does Knewton mean by learning?

Curiosity got the better of me, so I set out to do some investigating. 

What does Knewton mean by learning?

In Knewton's white paper on adaptive learning, the authors do a great job of describing how their technology works.

To provide continuously adaptive learning, Knewton analyzes learning materials based on thousands of data points — including concepts, structure, difficulty level, and media format — and uses sophisticated algorithms to piece together the perfect bundle of content for each student, constantly. The system refines recommendations through network effects that harness the power of all the data collected for all students to optimize learning for each individual student.

They go on to discuss several impressive technological innovations. I have to admit, the technology is cool, but what is their learning model and how is Knewton's technology being used to improve learning and teaching?

Unfortunately, Knewton does not seem to operate with a clearly articulated learning model in mind. In any case, I couldn't find one. But based on the sample items and feedback examples shown in their white paper and on their site, what Knewton means by learning is the ability to consistently get right answers on tests and quizzes, and the way to learn (get more answers right) is to get more practice on the kind of items students are not yet consistently getting right.

In fact, Knewton appears to be a high-tech application of the content-focused learning model that's dominated public education since No Child Left Behind—another example of what it looks like when we throw technology at a problem without engaging in a deep enough analysis of that problem.

We're in the middle of an education crisis, but it's not because children aren't getting enough answers right on tests and quizzes. It's because our efforts to improve education consistently fail to ask the most important questions, "Why do we educate our children?" and "What are the outcomes that would be genuine evidence of success?"

Don't get me wrong. We love technology, and we leverage it shamelessly. But we don't believe technology is the answer. The answer lies in a deep understanding of how learning works and what we need to do to support the kind of learning that produces outcomes we really care about. 

 


A new kind of report card

When I was a kid, the main way school performance was measured was with letter grades. We got letter grades on almost all of our work. Getting an A meant you knew it all, a B meant you didn't quite know it all, a C meant you knew enough to pass, a D meant you knew so little you were on the verge of failing, and an F meant you failed. If you always got As, you were one of the really smart kids, and if you always got Ds and Fs, you were one of the dumb kids. Unfortunately, that's how we thought about it, plain and simple.

If I got a B, my teacher and parents told me I could do better and that I should work harder. If I got a C, I was in deep trouble, and was put on restriction until I brought my grade up. This meant more hours of homework. I suspect this was a common experience. It was certainly what happened on Father Knows Best and The Brady Bunch.

The best teachers also commented on our work, telling us where we could improve our arguments or where and how we had erred, and suggesting actions we could take to improve. In terms of feedback, this was the gold standard. It was the only way we got any real guidance about what we, as individuals, needed to work on next. Letter grades represented rank, punishment, and reward, but they weren't very useful indicators of where we were in our growth as learners. Report cards were for parents. 

Usher in Lectica and DiscoTest

One of our goals here at Lectica has been to make possible a new kind of report card—one that:

  1. delivers scores that have rich meaning for students, parents, and decision-makers,
  2. provides the kind of personal feedback good teachers offer, and
  3. gives students an opportunity to watch themselves grow.

This new report card—illustrated on the right—uses a single learning "ruler" for all subjects, so student growth in different subjects can be shown on the same scale. In the example shown here, each assessment is represented by a round button that links to an explanation of the student's learning edge at the time the assessment was taken.

This new report card also enables direct comparisons between growth trajectories in different subject areas. 
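
For readers who like to see the mechanics, here is a minimal sketch of the underlying idea: assessment results from different subjects stored as points on one shared scale, each with its own learning-edge note. The field names and values below are purely illustrative, not our actual data model.

```python
from dataclasses import dataclass

@dataclass
class GrowthPoint:
    """One assessment result, placed on the shared developmental scale."""
    subject: str        # e.g., "science" or "history" (illustrative labels)
    date: str           # ISO date the assessment was taken
    score: float        # position on the shared learning "ruler"
    learning_edge: str  # what the student appeared ready to learn next

def growth_trajectory(points, subject):
    """Return one subject's assessments in time order, ready to plot on the shared scale."""
    return sorted((p for p in points if p.subject == subject), key=lambda p: p.date)

# Because every subject uses the same scale, trajectories can be compared directly.
report_card = [
    GrowthPoint("science", "2015-10-01", 10.25, "beginning to coordinate two variables"),
    GrowthPoint("history", "2015-10-15", 10.10, "starting to weigh competing accounts"),
    GrowthPoint("science", "2016-05-01", 10.45, "testing simple causal explanations"),
]
for point in growth_trajectory(report_card, "science"):
    print(point.date, point.score, "-", point.learning_edge)
```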

An additional benefit of this new report card is that it delivers a rich portfolio-like account of student growth that can be employed to improve admissions and advancement decisions. 

And finally, we're very curious about the potential psychological benefits of allowing students to watch how they grow. We think it's going to be a powerful motivator.

 


Lectical (CLAS) scores are subject to change

We incorporate feedback loops called virtuous cycles in everything we do. And I mean everything. Our governance structure is fundamentally iterative. (We're a Sociocracy.) Our project management approach is iterative. (We use Scrum.) We develop ideas iteratively. (We use Design Thinking.) We build our learning tools iteratively. (We use developmental maieutics.) And our learning model is iterative. (We use the virtuous cycle of learning.) One important reason for using all of these iterative processes is that we want every activity in our organization to reward learning. Conveniently, all of the virtuous cycles we iterate through do double duty as virtuous cycles of learning.

All of this virtuous cycling has an interesting (and unprecedented) side effect. The score you receive on one of our assessments is subject to change. Yes, because we learn from every single assessment taken in our system, what we learn could cause your score on any assessment you take here to change. Now, it's unlikely to change very much, probably not enough to affect the feedback you receive, but the fact that scores change from time to time can really shake people up. Some people might even think we've lost the plot!

But there is method in our madness. Allowing your score to fluctuate a bit as our knowledge base grows is our way of reminding everyone that there's uncertainty in any test score, and of reminding ourselves that there's always more to learn about how learning works.
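
As a toy illustration (and not our actual scoring model), you can think of a reported score as an estimate that gets recomputed, along with its uncertainty, whenever the pool of comparable performances grows. The numbers here are invented.

```python
import statistics

def estimate_score(calibration_ratings):
    """Toy re-estimation: the reported score is the mean of the calibrated ratings,
    with a standard error that shrinks as the knowledge base grows."""
    mean = statistics.mean(calibration_ratings)
    se = statistics.stdev(calibration_ratings) / len(calibration_ratings) ** 0.5
    return round(mean, 2), round(se, 3)

# The same performance, re-estimated after more comparable performances have been scored.
early = [10.3, 10.4, 10.2, 10.5]                  # small knowledge base
later = early + [10.35, 10.3, 10.4, 10.45, 10.3]  # the knowledge base has grown

print(estimate_score(early))  # (10.35, 0.065)
print(estimate_score(later))  # a slightly different value, with smaller uncertainty
```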


Jaques’ Strata and Lectical Levels

We often receive queries about the relation between Lectical Levels and the Strata defined by Jaques. The following table shows the relation between Lectical Levels and Strata as they were defined by Jaques in Requisite Organization. These relations were determined by using the Lectical Assessment System to score Jaques’ definitions. We have not yet had an opportunity to compare the results of scoring the same material with both the Lectical Assessment System and a scoring system based on Jaques’ definitions, as we have done in other comparisons of scoring systems. Our interpretation of Jaques’ Strata definitions may differ from the interpretations of other researchers, leading to differences between theoretical and actual comparisons.

Strata by Lectical Level

References

Jaques, E. (1996). Requisite organization (2nd ed.). Arlington, VA: Cason Hall.


Maintaining inter-rater agreement

How we maintain inter-rater agreement and ensure high reliability at DTS/DiscoTest

First, we design assessments with 5–7 essay questions, partly because this number is required to achieve the level of reliability that allows us to identify 4 phases per Lectical Level. This corresponds to a corrected alpha of .95 or greater.

Second, we engage in continuous learning. Certified analysts and trainees attend mandatory weekly scoring meetings (called scoring circles) where they discuss scoring and review challenging cases.

Third, when we begin working with data from a new subject area, the scoring circle always examines a diverse sample of protocols before starting to score in earnest. Then, when we begin scoring a new assessment, two Certified Analysts score every performance until agreement rates are consistently at or above 85% within 1/4 of a level.

Fourth, we second-score a percentage of all performances, some selected at random and some selected because the first analyst lacks confidence in his or her score.

  • 5%-10% of all assessments, selected at random, are second-scored by a blind analyst (a higher percentage on newer assessments or when the rate of inter-rater agreement is unacceptable).
  • A second, blind scorer is required to score an assessment any time the first scorer’s confidence level is below the level we call “confident”.

When the scores of the first and second scorers are different by more than 1 phase, first and second scorers must reconcile through discussion. If they cannot reconcile, they must consult a third Certified Analyst.

Confidence levels
4 = very confident: exemplary, prototypical
3 = confident: no guesswork, not too much variation, no more than 2 responses where the scorer wavers, no lack of coherence, no language problems, adequate explanation, no suspicion of plagiarism, not idiosyncratic
2 = less than confident: guesswork, too much variation, more than 2 responses where the scorer wavers, lack of coherence, language problems, inadequate explanation, suspicion of plagiarism, idiosyncratic
1 = not confident at all: unscorable or almost unscorable, very idiosyncratic, very incoherent
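
For the curious, the decision rules above can be summarized in a short sketch. This is a hypothetical rendering of the logic described in this post, not our production scoring workflow; the thresholds come from the text, while the encodings (scores as decimal levels, phases as quarters of a level) and function names are mine.

```python
import random

SECOND_SCORE_RATE = 0.10  # 5%-10% of performances are randomly second-scored; upper bound used here

def agree_within_quarter_level(score_a, score_b):
    """Two analysts count as agreeing when their scores fall within 1/4 of a level."""
    return abs(score_a - score_b) <= 0.25

def needs_second_score(confidence_level):
    """A blind second score is required when the first analyst's confidence is
    below 'confident' (level 3), or when the performance is randomly sampled."""
    return confidence_level < 3 or random.random() < SECOND_SCORE_RATE

def resolve(first_score, second_score, phase_size=0.25):
    """If two scores differ by more than one phase, the analysts reconcile through
    discussion; failing that, a third Certified Analyst is consulted."""
    if abs(first_score - second_score) <= phase_size:
        return "accept"
    return "reconcile; consult a third Certified Analyst if needed"

print(agree_within_quarter_level(10.50, 10.75))  # True
print(needs_second_score(confidence_level=2))    # True: below 'confident'
print(resolve(10.25, 10.75))                     # two phases apart -> reconcile
```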


The limitations of testing

It is important for those of us who use assessments to ensure that they (1) measure what we say they measure, (2) measure it reliably enough to justify claimed distinctions between and within persons, and (3) are used responsibly. It is relatively easy for testing experts to create assessments that are adequately reliable (2) for individual assessment, and although it is more difficult to show that these tests measure the construct of interest (1), there are reasonable methods for showing that an assessment meets this standard. However, it is more difficult to ensure that assessments are used responsibly (3).

Few consumers of tests are aware of their inherent limitations. Even the best tests, those that are highly reliable and measure what they are supposed to measure, provide only a limited amount of information. This is true of all measures. The more we home in on a measurable dimension—in other words, the greater our precision becomes—the narrower the construct becomes. Time, weight, height, and distance are all extremely narrow constructs. This means that they provide a very specific piece of information extremely well. When we use a ruler, we can have great confidence in the measurement we make, down to very small lengths (depending on the ruler, of course). No one doubts the great advantages of this kind of precision. But we can’t learn anything else about the measured object. Its length usually cannot tell us what the object is, how it is shaped, its color, its use, its weight, how it feels, how attractive it is, or how useful it is. We only know how long it is. To provide an accurate account of the thing that was measured, we need to know many more things about it, and we need to construct a narrative that brings these things together in a meaningful way.

A really good psychological measure is similar. The LAS (Lectical Assessment System), for example, is designed to go to the heart of development, stripping away everything that does not contribute to the pure developmental “height” of a given performance. Without knowledge of many other things—such as the ways of thinking that are generally associated with this “height” in a particular domain, the specific ideas that are associated with this particular performance, information from other performances on other measures, qualitative observations, and good clinical judgment—we cannot construct a terribly useful narrative.

And this brings me to my final point: A formal measure, no matter how great it is, should always be employed by a knowledgeable mentor, clinician, teacher, consultant, or coach as a single item of information about a given client that may or may not provide useful insights into relevant needs or capabilities. Consider this relatively simple example: a given 2-year-old may be tall for his age, but if he is somewhat underweight for his age, the latter measure may seem more important. However, if he has a broken arm, neither measure may loom large—at least until the bone is set. Once the arm is safely in a cast, all three pieces of information—weight, height, and broken arm—may contribute to a clinical diagnosis that would have been difficult to make without any one of them.

It is my hope that the educational community will choose to adopt high standards for measurement, then put measurement in its place—alongside good clinical judgment, reflective life experience, qualitative observations, and honest feedback from trusted others.


What is a holistic assessment?

Thirty years ago, when I was a hippy midwife, the idea of holism began to slip into the counter-culture. A few years later, this much-misunderstood notion was all the rage on college campuses. By the time I was in graduate school in the nineties, there was an impassable division between the trendy postmodern holists and the rigidly old-fashioned modernists. You may detect a slight mocking tone, and rightly so. People with good ideas on both sides made themselves look pretty silly by refusing, for example, to use any of the tools associated with the other side. One of the more tragic outcomes of this silliness was the emergence of the holistic assessment.

Simply put, the holistic assessment is a multidimensional assessment designed to take a more nuanced, textured, or rich approach to assessment. Great idea. Love it.

It’s the next part that’s silly. Having collected rich information on multiple dimensions, the test designers sum up a person’s performance with a single number. Why is this silly? Because the so-called holistic score becomes pretty much meaningless. Two people with the same score can have very little in common. For example, let’s imagine that a holistic assessment examines emotional maturity, perspective taking, and leadership thinking. Two people receive a score of 10 that may be accompanied by boilerplate descriptions of what emotional maturity, perspective taking, and leadership thinking look like at level 10. However, person one was actually weak in perspective taking and strongest in leadership, and person two was weak in emotional maturity and strongest in perspective taking. The score of 10, it turns out, means something quite different for these two people. I would argue that it is relatively meaningless because there is no way to know, based on the single “holistic” score, how best to support the development of these distinct individuals.
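
Here is a small illustration of the problem, with hypothetical dimension names and numbers:

```python
# Two hypothetical profiles on three dimensions, each summed into a single "holistic" score.
person_one = {"emotional_maturity": 3, "perspective_taking": 2, "leadership": 5}
person_two = {"emotional_maturity": 2, "perspective_taking": 5, "leadership": 3}

print(sum(person_one.values()))  # 10
print(sum(person_two.values()))  # 10

# Both people receive a "10," yet one most needs support with perspective taking and the
# other with emotional maturity. The sum hides exactly the information a mentor would need.
```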

Holism has its roots in system dynamics, where measurements are used to build rich models of systems. All of the measurements are unidimensional. They are never lumped together into “holistic” measures. That would be equivalent to talking about the temperaturelength of a day or the lengthweight of an object*. It’s essential to measure time, weight, and length with appropriate metrics and then to describe their interrelationships and the outcomes of these interrelationships. The language used to describe these is the language of probability, which is sensitive to differences in the measurement of different properties.

In psychological assessment, dimensionality is a challenging issue. What constitutes a single dimension is a matter for debate. For DTS, the primary consideration is how useful an assessment will be in helping people learn and grow. So, we tend to construct individual assessments, each of which represents a fairly tightly defined content space, and we use only one metric to determine the level of a performance. The meaning of a given score is both universal (it is an order of hierarchical complexity and phase on the skill scale) and contextual (it is assigned to a performance in a particular domain in a particular context, and is associated with particular content). We independently analyze the content of the performance to determine its strengths and weaknesses—relative to its level and the known range of content associated with that level—and provide feedback about these strengths and weaknesses as well as targeted learning suggestions. We use the level score to help us tell a useful story about a particular performance, without claiming to measure “lengthweight”. This is accomplished by the rigorous separation of structure (level) and content.

*If we described objects in terms of their lengthweight, an object that was 10 inches long and 2 lbs could have a lengthweight of 12, but so could an object that was 2 inches long and 10 lbs.


Task demands and capabilities (the complexity gap)

For decades, my colleagues and I have been working with and refining a developmental assessment system called the Lectical Assessment System (now also an electronic scoring system called CLAS). It can be used to score (a) the complexity level of people’s arguments and (b) the complexity level—“task demands”—of specific situations or roles. For example, we have analyzed the task demands of levels of work in large organizations and assessed the complexity level of employees’ thinking in several skill areas — including reflective judgment/critical thinking and leadership decision-making.

The figure on the left shows the relation between the task demands of 7 management levels and the complexity level scores received on an assessment of decision-making skills taken by leaders occupying these positions. The task demands of most positions increase in a linear fashion, spanning levels 10–13 (a.k.a. 1000–1399).

After work level 2 (entry level management), the capabilities of leaders do not, for the most part, rise to these task demands.

This pattern is pervasive—we see it everywhere we look—and it reflects a hard truth. None of us is capable of meeting the task demands of the most complex situations in today's world. I've come to believe that in many situations our best hope for meeting these demands is to (1) recognize our human limitations, (2) work strategically on the development of our own skills and knowledge, (3) learn to work closely with others who represent a wide range of perspectives and areas of expertise, and (4) use the best tools available to scaffold our thinking.


We aren't alone. Others have observed and remarked upon this pattern:

Jaques, E. (1976). A general theory of bureaucracy. London: Heinemann Educational.

Habermas, J. (1975). Legitimation crisis (T. McCarthy, Trans.). Boston: Beacon Press.

Kegan, R. (1994). In over our heads: The mental demands of modern life. Cambridge, MA: Harvard University Press.

Bell, D. (1973). The coming of post-industrial society. New York: Basic Books.


About measurement

The story of how measurement permits scientific advance can be illustrated through any number of examples. One such example is the measurement of temperature and its effects on our understanding of the molecular structure of lead and other elemental substances.

The tale begins with an assortment of semi-mythical early scientists, who agreed in their observations that lead only melts when it is very hot—much hotter than the temperature at which ice melts, and quite a bit cooler than the temperature at which iron melts. These observations, made repeatedly, resulted in the hypothesis that lead melts at a particular temperature.

To test this theory it was necessary to develop a standard for measuring temperature. A variety of early thermometers were developed and implemented. Partly because these early temperature-measuring devices were poorly calibrated, and partly because different temperature-measuring devices employed different scales, the temperature at which lead melted seemed to vary from device to device and context to context.

Scientists divided into a number of ‘camps’. One group argued that there were multiple pathways toward melting, which explained why the melting seemed to occur at different temperatures. Another group argued that the melting of lead could not be understood apart from the context in which the melting occurs. Only when a measure of temperature had been adequately developed and widely accepted did it become possible to observe that lead consistently melts at about 327 °C.

Armed with this knowledge, scientists asked what it is about lead that causes it to melt at this particular temperature. They then developed hypotheses about the factors contributing to this phenomenon, observing that changes in altitude or air pressure seemed to result in small differences in its melting temperature. So, context did seem to play a role! In order to observe these differences more accurately, the measurement of temperature was further refined. The resulting observations provided information that ultimately contributed to an understanding of lead’s and other elements’ molecular structure.

While parts of this story are fictional, it is true that the thermometer has greatly contributed to our understanding of the properties of lead. Interestingly, the thermometer, like all other measures, emerged from what were originally qualitative observations about the effects of different amounts of heat that were quantified over time. The value of the thermometer, as we all know, extends far beyond its use as a measure of the melting temperature of lead. The thermometer is a measure of temperature in general, meaning that it can be employed to measure temperature in an almost limitless range of substances and contexts. It is this generality, in the end, that makes it possible to investigate the impact of context on the melting temperature of a substance, or to compare the relative melting temperatures of a range of elemental substances. This generality (or context-independence) is one of the primary features of a good measure.

Good measurement requires (1) the identification of a unidimensional, content- and context-independent trait (temperature, length, time); (2) a system for assessing the amount of the trait; (3) determinations of the reliability and validity of the assessments; and finally (4) the calibration of a measure. A good thermometer has all of the qualities of a good measure. It is a well-calibrated instrument that can be employed to accurately and reliably measure a general, unidimensional trait across a wide range of contexts.

It was this perspective on measurement that first inspired me to try to find a good general measure of the developmental dimension. To learn more about how this way of thinking relates to the Lectical Assessment System (LAS), read About Measurement on the DTS site. Pay special attention to the list of things we can do with the LAS.
